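Today’s arXiv Papers | 2025-03-19

This post lists the latest papers retrieved from arXiv.org on 2025-03-19. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: The daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.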

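Quick read: This paper addresses accuracy in verifying mathematical reasoning, specifically in the task of identifying errors in mathematical processes. It proposes a temporal consistency method in which a verifier iteratively refines its judgment over multiple rounds based on its previous assessment, using consistency across a sequence of self-reflection actions to improve verification accuracy. Unlike single-round verification or multi-model debate, the method yields consistent improvements over baselines on Mathcheck, ProcessBench, and PRM800K. Applied to the DeepSeek R1 distilled models, it enables the 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench, and the distilled 14B model with this method performs comparably to DeepSeek-R1.

Link: https://arxiv.org/abs/2503.14495
Authors: Jiacheng Guo, Yue Wu, Jiahao Qiu, Kaixuan Huang, Xinzhe Juan, Ling Yang, Mengdi Wang
Affiliations: Department of Electrical & Computer Engineering, Princeton University; AI Lab, Princeton University; Department of Computer Science & Engineering, University of Michigan
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: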

Abstract:Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to Deepseek-R1. Our codes are available at this https URL
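
The iterative refinement described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Verifier` signature, the majority vote across several verifiers, and the stopping rule (`patience` consecutive rounds with an unchanged majority) are all assumptions made for the example, and the toy verifiers stand in for LLM calls.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical interface: a verifier maps (solution, its own previous verdict)
# to a verdict, here the index of the first wrong step or -1 for "no error".
Verifier = Callable[[str, Optional[int]], int]


def majority(verdicts: List[int]) -> int:
    """Most common verdict; ties go to the verdict encountered first."""
    return Counter(verdicts).most_common(1)[0][0]


def temporal_consistency(solution: str, verifiers: List[Verifier],
                         patience: int = 2, max_rounds: int = 10) -> int:
    """Re-verify until the majority verdict is unchanged for `patience` rounds."""
    verdicts = [verify(solution, None) for verify in verifiers]
    current, stable_rounds = majority(verdicts), 1
    for _ in range(max_rounds - 1):
        if stable_rounds >= patience:
            break
        # Each verifier reflects on its own previous assessment.
        verdicts = [verify(solution, prev)
                    for verify, prev in zip(verifiers, verdicts)]
        new = majority(verdicts)
        stable_rounds = stable_rounds + 1 if new == current else 1
        current = new
    return current


def make_toy_verifier(first: int, later: int) -> Verifier:
    """Toy stand-in for an LLM: one verdict at first, another after reflection."""
    def verify(solution: str, previous: Optional[int]) -> int:
        return first if previous is None else later
    return verify


verifiers = [make_toy_verifier(-1, 2), make_toy_verifier(-1, 2),
             make_toy_verifier(2, 2)]
single_round = majority([v("toy solution", None) for v in verifiers])
final = temporal_consistency("toy solution", verifiers)
```

In this toy run a single round of voting misses the error, while the iterated version settles on step 2 once two of the verifiers revise their verdicts.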

[NLP-1] Gricean Norms as a Basis for Effective Collaboration AAMAS2025

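Quick read: This paper addresses the limited ability of AI agents, in human-AI collaboration, to handle ambiguous, incomplete, invalid, or irrelevant instructions. Conventional approaches rely only on explicit instructions and overlook the uncertainty inherent in communication. The paper proposes a normative framework that integrates Gricean conversational norms (the maxims of quantity, quality, relation, and manner, along with inference) and cognitive frameworks (common ground, relevance theory, and theory of mind) into large language model (LLM) based agents so that they can interpret unclear instructions. The key idea is that these norms let the agent infer intent according to cooperative principles, enabling more effective human-AI collaboration. In the experiments, the Lamoid with Gricean norms achieves higher task accuracy and produces clearer, more accurate, and more contextually relevant responses than the Lamoid without norms.

Link: https://arxiv.org/abs/2503.14484
Authors: Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh
Affiliations: North Carolina State University (NC State); Delft University of Technology (TU Delft)
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to AAMAS 2025. 8 pages (excl. references), 9 figures/tables. (Appendix: 5 pages, 6 figures/tables). Code available at: this https URL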

Abstract:Effective human-AI collaboration hinges not only on the AI agent’s ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks – common ground, relevance theory, and theory of mind – into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms to interpret unclear instructions, which are: ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT-4 powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human-AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent’s pragmatic reasoning, fostering effective human-AI collaboration and enabling context-aware communication in LLM-based agents.

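[NLP-2] Don’t lie to your friends: Learning what you know from collaborative self-play

Quick read: This paper addresses the need for AI agents that assist humans to be aware of their own capabilities and limitations: knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such meta-knowledge is hard to teach through supervised fine-tuning because it requires constructing examples that reflect the agent’s specific capabilities. The key to the proposed solution is collaborative self-play: multi-agent collaborations in which the group is rewarded for collectively arriving at correct answers, so that the desired meta-knowledge emerges from the incentives built into the interaction. The study focuses on small societies of agents with access to heterogeneous tools (corpus-specific retrieval), which must collaborate to maximize their success while minimizing their effort. Experiments show that group-level rewards for multi-agent communities can induce policies that transfer to settings where individual agents are deployed in isolation, improving tool use and selective prediction.

Link: https://arxiv.org/abs/2503.14481
Authors: Jacob Eisenstein, Reza Aghajani, Adam Fisch, Dheeru Dua, Fantine Huot, Mirella Lapata, Vicky Zayats, Jonathan Berant
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: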

Abstract:To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outputs, and when to abstain or hedge. Such capabilities are hard to teach through supervised fine-tuning because they require constructing examples that reflect the agent’s specific capabilities. We therefore propose a radically new approach to teaching agents what they know: collaborative self-play. We construct multi-agent collaborations in which the group is rewarded for collectively arriving at correct answers. The desired meta-knowledge emerges from the incentives built into the structure of the interaction. We focus on small societies of agents that have access to heterogeneous tools (corpus-specific retrieval), and therefore must collaborate to maximize their success while minimizing their effort. Experiments show that group-level rewards for multi-agent communities can induce policies that transfer to improve tool use and selective prediction in settings where individual agents are deployed in isolation.

[NLP-3] Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations

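Quick read: This paper addresses the tendency of large language models (LLMs) to use an assertive language style even when making false claims (“overconfident hallucinations”), which misleads users and erodes trust. The goal is for a model to express in language the actual degree of uncertainty around a claim. The authors find that “verbal uncertainty” is governed by a single linear feature in the model’s representation space and has only moderate correlation with the model’s actual “semantic uncertainty”. Building on this, they use the mismatch between semantic and verbal uncertainty to predict hallucinations, and they intervene on verbal uncertainty at inference time to reduce hallucinations on short-form answers, achieving an average relative reduction of 32%.

Link: https://arxiv.org/abs/2503.14477
Authors: Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, Nicola Cancedda
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: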

Abstract:LLMs often adopt an assertive language style also when making false claims. Such “overconfident hallucinations” mislead users and erode trust. Achieving the ability to express in language the actual degree of uncertainty around a claim is therefore of great importance. We find that “verbal uncertainty” is governed by a single linear feature in the representation space of LLMs, and show that this has only moderate correlation with the actual “semantic uncertainty” of the model. We apply this insight and show that (1) the mismatch between semantic and verbal uncertainty is a better predictor of hallucinations than semantic uncertainty alone and (2) we can intervene on verbal uncertainty at inference time and reduce hallucinations on short-form answers, achieving an average relative reduction of 32%.
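
If a quantity is governed by a single linear feature, intervening on it amounts to moving a hidden state along one direction. The sketch below illustrates that general idea on plain Python lists; it is not the paper's procedure. The `hidden` vector, the `direction`, and the `target` value are hypothetical, and in a real model the direction would have to be estimated from activations.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def feature_value(hidden: List[float], direction: List[float]) -> float:
    """Read the linear feature: projection of the hidden state on the direction."""
    return dot(hidden, normalize(direction))


def steer(hidden: List[float], direction: List[float], target: float) -> List[float]:
    """Shift the hidden state along the direction so its projection equals target.

    Components orthogonal to the direction are left unchanged.
    """
    unit = normalize(direction)
    shift = target - dot(hidden, unit)
    return [h + shift * u for h, u in zip(hidden, unit)]


# Hypothetical 3-dimensional example.
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, 1.0]   # assumed "verbal uncertainty" direction
steered = steer(hidden, direction, target=0.2)
```

Setting `target` from an independent estimate of semantic uncertainty would be one way to close the mismatch the abstract describes, though how the authors choose it is not stated here.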

[NLP-4] DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Quick read: This paper addresses the difficulty of reproducing the reinforcement learning (RL) training behind state-of-the-art reasoning LLMs. Inference scaling relies on RL to elicit complex reasoning, but key technical details of these models are concealed, so the community struggles to reproduce their RL training results. The paper proposes the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and fully open-sources a large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. It introduces four key techniques that make large-scale LLM RL succeed, and it open-sources the training code, built on the verl framework, together with a carefully curated and processed dataset, improving reproducibility and supporting future research in large-scale LLM RL.

Link: https://arxiv.org/abs/2503.14476
Authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract:Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
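
The abstract names the techniques without specifying them, so the following is only a hedged sketch of what “decoupled clip” and “dynamic sampling” could look like: a PPO-style clipped surrogate with separate lower and upper clip ranges, and a filter that drops prompt groups whose sampled rewards are all identical. The function names and the values 0.2 and 0.28 are illustrative assumptions, not details taken from the text above.

```python
from typing import List


def decoupled_clip_term(ratio: float, advantage: float,
                        eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style clipped surrogate with independent lower and upper ranges.

    With eps_low == eps_high this reduces to the usual symmetric PPO clip.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def keep_group(rewards: List[float]) -> bool:
    """Keep a prompt only if its sampled responses disagree in reward.

    If every response gets the same reward, a group-relative advantage is
    zero for all of them and the prompt contributes no learning signal.
    """
    return len(set(rewards)) > 1


# A ratio of 1.25 survives the wider upper range but would be cut back to
# 1.2 under a symmetric 0.2 clip.
wide = decoupled_clip_term(1.25, advantage=1.0)
symmetric = decoupled_clip_term(1.25, advantage=1.0, eps_high=0.2)

groups = {"prompt_a": [1.0, 1.0, 1.0], "prompt_b": [1.0, 0.0, 1.0]}
kept = [name for name, rewards in groups.items() if keep_group(rewards)]
```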

[NLP-5] RWKV-7 “Goose” with Expressive Dynamic State Evolution

Quick read: This paper aims to improve sequence modeling performance on large-scale multilingual tasks while keeping compute requirements low. The key is RWKV-7 “Goose”, a new sequence modeling architecture that introduces a generalized formulation of the delta rule with vector-valued gating and in-context learning rates, together with a relaxed value replacement rule. With only constant memory usage and constant inference time per token, the pre-trained models set a new state of the art in downstream performance at the 3 billion parameter scale on multilingual tasks and match current state-of-the-art English language performance, despite being trained on dramatically fewer tokens than other top 3B models. RWKV-7 can also perform state tracking and recognize all regular languages while retaining parallelizability of training.

Link: https://arxiv.org/abs/2503.14456
Authors: Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Haowen Hou, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, Christian Zhou-Zheng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
ACM classes: I.2.0; I.2.7
Comments:

Abstract:We present RWKV-7 “Goose”, a new sequence modeling architecture, along with pre-trained language models that establish a new state-of-the-art in downstream performance at the 3 billion parameter scale on multilingual tasks, and match current SoTA English language performance despite being trained on dramatically fewer tokens than other top 3B models. Nevertheless, RWKV-7 models require only constant memory usage and constant inference time per token. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC^0. To demonstrate RWKV-7’s language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at this https URL, and our training and inference code at this https URL all under the Apache 2.0 License.
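
For orientation, here is a minimal sketch of the classic delta rule that the abstract says RWKV-7 generalizes. It uses a single scalar learning rate `beta` and no gating, so it is deliberately not the RWKV-7 formulation; the function names and the toy dimensions are assumptions for illustration. The point it shows is that the recurrent state is a fixed-size matrix, which is why memory and per-token time stay constant.

```python
from typing import List

Matrix = List[List[float]]


def read(state: Matrix, key: List[float]) -> List[float]:
    """Retrieve the value currently associated with `key` (state @ key)."""
    return [sum(s * k for s, k in zip(row, key)) for row in state]


def delta_rule_step(state: Matrix, key: List[float], value: List[float],
                    beta: float) -> Matrix:
    """Classic delta rule: state += beta * (value - state @ key) key^T.

    The state keeps its shape (len(value) x len(key)) no matter how many
    tokens have been processed.
    """
    predicted = read(state, key)
    return [[s + beta * (v - p) * k for s, k in zip(row, key)]
            for row, v, p in zip(state, value, predicted)]


state = [[0.0, 0.0], [0.0, 0.0]]
# beta = 1 with a unit-norm key fully replaces the stored value.
state = delta_rule_step(state, key=[1.0, 0.0], value=[3.0, 4.0], beta=1.0)
first_read = read(state, [1.0, 0.0])
# beta = 0.5 moves the stored value halfway towards the new one.
state = delta_rule_step(state, key=[1.0, 0.0], value=[5.0, 6.0], beta=0.5)
second_read = read(state, [1.0, 0.0])
```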

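[NLP-6] LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers

Quick read: This paper addresses two limitations in automated feature engineering. Traditional methods rely on pre-defined transformations within fixed, manually designed search spaces and often neglect domain knowledge. Existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, so they fail to leverage insights from prior feature discovery experiments or to connect feature generation with data-driven performance. The key to the solution is LLM-FE, a framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem in which LLMs iteratively propose new feature transformation programs and data-driven feedback guides the search.

Link: https://arxiv.org/abs/2503.14434
Authors: Nikhil Abhyankar, Parshin Shojaee, Chandan K. Reddy
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: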

Abstract:Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.
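
The propose-and-score loop described above can be illustrated with a toy sketch. Everything here is an assumption made for the example: `propose` is a stub standing in for an LLM call (it simply returns the next untried program from a fixed pool), the score is the absolute Pearson correlation with the target rather than a validation metric from a trained model, and the data is synthetic.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[float, float]
Program = Callable[[Row], float]

# Hypothetical pool of feature transformation programs.
POOL: Dict[str, Program] = {
    "x1+x2": lambda r: r[0] + r[1],
    "x1-x2": lambda r: r[0] - r[1],
    "x1*x2": lambda r: r[0] * r[1],
}


def pearson(xs: List[float], ys: List[float]) -> float:
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def propose(history: List[Tuple[str, float]]) -> Optional[str]:
    """Stub for the LLM: a real proposer would condition on scored history."""
    tried = {name for name, _ in history}
    return next((name for name in POOL if name not in tried), None)


def search(rows: List[Row], target: List[float], iterations: int = 5):
    history: List[Tuple[str, float]] = []
    for _ in range(iterations):
        name = propose(history)
        if name is None:
            break
        feature = [POOL[name](row) for row in rows]
        # Data-driven feedback: score the candidate and record it.
        history.append((name, abs(pearson(feature, target))))
    return max(history, key=lambda item: item[1])


rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
target = [x1 * x2 for x1, x2 in rows]   # synthetic target
best_name, best_score = search(rows, target)
```

In a fuller version the proposer would be prompted with the highest-scoring programs so far, which is where the evolutionary element comes from; the stub above only preserves the shape of the loop.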

[NLP-7] Splintering Nonconcatenative Languages for Better Tokenization

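Quick read: This paper addresses the limitations of common subword tokenization algorithms such as BPE and UnigramLM on languages with nonconcatenative morphology. These algorithms assume that text can be split into meaningful units by concatenative measures alone, which does not hold for languages such as Hebrew and Arabic, where morphology is encoded in root-template patterns, or Malay and Georgian, where split affixes are common. The key to the solution is SPLINTER, a pre-processing step that rearranges text into a linear form that better represents such nonconcatenative morphologies, so that the tokenizer can find meaningful contiguous segments.

Link: https://arxiv.org/abs/2503.14433
Authors: Bar Gazit (1), Shaltiel Shmidman (2), Avi Shmidman (2), Yuval Pinter (1) ((1) Ben-Gurion University of the Negev, (2) DICTA)
Affiliations: Ben-Gurion University of the Negev; DICTA
Subjects: Computation and Language (cs.CL)
Comments: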

Abstract:Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in root-template patterns, or Malay and Georgian, where split affixes are common. We present SPLINTER, a pre-processing step which rearranges text into a linear form that better represents such nonconcatenative morphologies, enabling meaningful contiguous segments to be found by the tokenizer. We demonstrate SPLINTER’s merit using both intrinsic measures evaluating token vocabularies in Hebrew, Arabic, and Malay; as well as on downstream tasks using BERT-architecture models trained for Hebrew.

[NLP-8] PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

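Quick read: This paper addresses the challenge of zero-shot use of specialized external tools by large language models (LLMs) when tool documentation is minimal or noisy. Existing solutions rely on manual rewriting or labeled data for validation, which makes them inapplicable in true zero-shot settings. The paper proposes PLAY2PROMPT, an automated framework whose key idea is to systematically “play” with each tool to explore its input-output behaviors, iteratively refining the tool documentation and generating usage examples without any labeled data. These examples guide LLM inference and also serve as validation to further improve tool utilization. Experiments on real-world tasks show that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.

Link: https://arxiv.org/abs/2503.14432
Authors: Wei Fang, Yang Zhang, Kaizhi Qian, James Glass, Yada Zhu
Affiliations: MIT CSAIL; IBM
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: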

Abstract:Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically “plays” with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.

[NLP-9] ExDDV: A New Dataset for Explainable Deepfake Detection in Video

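Quick read: This paper addresses the explainability of deepfake video detection as generated videos become increasingly realistic. Existing deepfake detectors can flag fake videos automatically, but they are prone to errors and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. The paper introduces ExDDV, the first dataset and benchmark for explainable deepfake detection in video. It comprises around 5.4K real and deepfake videos, manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). The key finding is that text and click supervision are both required to develop robust explainable models that can localize and describe the artifacts in deepfake videos.

Link: https://arxiv.org/abs/2503.14421
Authors: Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu
Affiliations: University of Bucharest, Romania; West University of Timişoara, Romania
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: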

Abstract:The ever growing realism and quality of generated videos makes it increasingly harder for humans to spot deepfake content, who need to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at this https URL.

[NLP-10] Unifying Text Semantics and Graph Structures for Temporal Text-attributed Graphs with Large Language Models ICML2025

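Quick read: This paper addresses the challenge of modeling temporal graphs together with textual information. When handling temporal text-attributed graphs (TTAGs), existing temporal graph neural networks (TGNNs) embed texts statically and rely heavily on encoding mechanisms that prioritize structural information, overlooking the temporal evolution of text semantics and the interplay between semantics and structures that allows them to reinforce each other. The paper proposes Cross, a framework whose key idea is to use large language models (LLMs) to extract dynamic semantics in text space and then generate representations that unify semantics and structures. A Temporal Semantics Extractor lets the LLM understand the evolving contexts of a node’s textual neighborhoods, and a Semantic-structural Co-encoder works with the Extractor to produce representations that jointly consider semantic and structural information while encouraging their mutual reinforcement. Experiments on four public datasets and one industrial dataset demonstrate the effectiveness and robustness of Cross.

Link: https://arxiv.org/abs/2503.14411
Authors: Siwei Zhang, Yun Xiong, Yateng Tang, Xi Chen, Zian Jia, Zehao Gu, Jiarong Xu, Jiawei Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to ICML2025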

Abstract:Temporal graph neural networks (TGNNs) have shown remarkable performance in temporal graph modeling. However, real-world temporal graphs often possess rich textual information, giving rise to temporal text-attributed graphs (TTAGs). Such combination of dynamic text semantics and evolving graph structures introduces heightened complexity. Existing TGNNs embed texts statically and rely heavily on encoding mechanisms that biasedly prioritize structural information, overlooking the temporal evolution of text semantics and the essential interplay between semantics and structures for synergistic reinforcement. To tackle these issues, we present Cross, a novel framework that seamlessly extends existing TGNNs for TTAG modeling. The key idea is to employ the advanced large language models (LLMs) to extract the dynamic semantics in text space and then generate expressive representations unifying both semantics and structures. Specifically, we propose a Temporal Semantics Extractor in the Cross framework, which empowers the LLM to offer the temporal semantic understanding of node’s evolving contexts of textual neighborhoods, facilitating semantic dynamics. Subsequently, we introduce the Semantic-structural Co-encoder, which collaborates with the above Extractor for synthesizing illuminating representations by jointly considering both semantic and structural information while encouraging their mutual reinforcement. Extensive experimental results on four public datasets and one practical industrial dataset demonstrate Cross’s significant effectiveness and robustness.

[NLP-11] Large Language Models for Virtual Human Gesture Selection AAMAS2025

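Quick read: This paper addresses the challenge of automatically selecting co-speech gestures that are meaningful and appropriate to context. Prior techniques range from fully automated, data-driven methods, which often struggle to produce contextually meaningful gestures, to manual approaches that are time-consuming and lack generalizability. The key to the solution is to leverage the semantic capabilities of large language models, using prompting to suggest appropriate co-speech gestures. The paper first describes how information on gestures is encoded into GPT-4, then evaluates alternative prompting approaches for their ability to select meaningful, contextually relevant gestures and align them with the co-speech utterance, and finally shows how the approach is implemented within a virtual agent system, automating gesture selection and animation to enhance human-agent interaction.

Link: https://arxiv.org/abs/2503.14408
Authors: Parisa Ghanad Torshizi, Laura B. Hensel, Ari Shapiro, Stacy C. Marsella
Affiliations: Northeastern University; University of Glasgow
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 9 pages, 6 figures, Accepted at the AAMAS 2025 conference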

Abstract:Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions. These gestures significantly influence the addressee’s engagement, recall, comprehension, and attitudes toward the speaker. Similarly, they impact interactions between humans and embodied virtual agents. The process of selecting and animating meaningful gestures has thus become a key focus in the design of these agents. However, automating this gesture selection process poses a significant challenge. Prior gesture generation techniques have varied from fully automated, data-driven methods, which often struggle to produce contextually meaningful gestures, to more manual approaches that require crafting specific gesture expertise and are time-consuming and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures. We first describe how information on gestures is encoded into GPT-4. Then, we conduct a study to evaluate alternative prompting approaches for their ability to select meaningful, contextually relevant gestures and to align them appropriately with the co-speech utterance. Finally, we detail and demonstrate how this approach has been implemented within a virtual agent system, automating the selection and subsequent animation of the selected gestures for enhanced human-agent interactions.

[NLP-12] From “Hallucination” to “Suture”: Insights from Language Philosophy to Enhance Large Language Models

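Quick read: This paper examines hallucination in large language models (LLMs) through the lens of language philosophy and psychoanalysis. Its key contribution is the Anchor-RAG framework, which builds on Lacan’s concepts of the “chain of signifiers” and “suture points” and returns to the fundamental principles of linguistics to analyze the root causes of hallucinations in LLMs. In contrast to the predominant reliance on trial-and-error experiments, constant adjustment of mathematical formulas, or resource-intensive methods, the approach derives algorithms and models from theoretical foundations, with the stated aims of reducing hallucinations, enhancing LLM performance, and improving output quality. The paper seeks to establish a comprehensive theoretical framework for understanding hallucinations in LLMs, challenge the prevalent “guess-and-test” approach, and advance interpretable LLMs.

Link: https://arxiv.org/abs/2503.14392
Authors: Qiantong Wang
Affiliations: Vanderbilt University
Subjects: Computation and Language (cs.CL)
Comments: 7 pages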

Abstract:This paper explores hallucination phenomena in large language models (LLMs) through the lens of language philosophy and psychoanalysis. By incorporating Lacan’s concepts of the “chain of signifiers” and “suture points,” we propose the Anchor-RAG framework as a novel approach to mitigate hallucinations. In contrast to the predominant reliance on trial-and-error experiments, constant adjustments of mathematical formulas, or resource-intensive methods that emphasize quantity over quality, our approach returns to the fundamental principles of linguistics to analyze the root causes of hallucinations in LLMs. Drawing from robust theoretical foundations, we derive algorithms and models that are not only effective in reducing hallucinations but also enhance LLM performance and improve output quality. This paper seeks to establish a comprehensive theoretical framework for understanding hallucinations in LLMs and aims to challenge the prevalent “guess-and-test” approach and rat race mentality in the field. We aspire to pave the way for a new era of interpretable LLMs, offering deeper insights into the inner workings of language-based AI systems.

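[NLP-13] How much do LLMs learn from negative examples?

Quick read: This paper examines how negative examples can be used effectively when training large language models (LLMs). The key is a likelihood-ratio (Likra) model, applied to multiple-choice question answering benchmarks, that allows precise control over the influence and volume of negative examples. The study finds that, during a critical phase of training, Likra with negative examples shows a significantly larger improvement per training example than supervised fine-tuning (SFT) using only positive examples, producing a sharp jump in the learning curve rather than smooth, gradual improvement. Plausible but incorrect negative examples (near-misses) exert a greater influence, and training with negative examples identifies such plausible but incorrect answers more accurately. The results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations.

Link: https://arxiv.org/abs/2503.14391
Authors: Shadi Hamdan, Deniz Yuret
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 6 figures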

Abstract:Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples – incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.
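
The abstract does not spell out how the likelihood-ratio model scores answers, so the sketch below is an assumption about the general shape of such a scorer: each candidate is ranked by the log-likelihood under a model fitted to correct answers minus a weighted log-likelihood under a model fitted to incorrect answers. The probabilities, the `weight` parameter, and the function name are hypothetical.

```python
import math
from typing import Dict, Tuple


def likelihood_ratio_choice(pos_logp: Dict[str, float],
                            neg_logp: Dict[str, float],
                            weight: float = 1.0) -> Tuple[str, Dict[str, float]]:
    """Pick the answer with the highest log-likelihood ratio.

    pos_logp: log-probabilities from a model of correct answers.
    neg_logp: log-probabilities from a model of incorrect answers.
    weight:   how strongly the negative model counts (0 ignores it).
    """
    scores = {answer: pos_logp[answer] - weight * neg_logp[answer]
              for answer in pos_logp}
    return max(scores, key=scores.get), scores


# Toy multiple-choice question where "A" is a plausible near-miss.
pos_logp = {"A": math.log(0.45), "B": math.log(0.40), "C": math.log(0.15)}
neg_logp = {"A": math.log(0.60), "B": math.log(0.10), "C": math.log(0.30)}

positive_only, _ = likelihood_ratio_choice(pos_logp, neg_logp, weight=0.0)
with_negatives, scores = likelihood_ratio_choice(pos_logp, neg_logp)
```

Here the positive model alone prefers the near-miss, and subtracting the negative model’s log-likelihood demotes it, which mirrors the third finding in the abstract.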

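[NLP-14] Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation

Quick read: This paper examines whether large language models (LLMs) can understand good and evil well enough to judge the good/evil reputation of celebrities. Specifically, it evaluates how effective LLMs are at judging the good/evil reputation of the “aspects” of each celebrity and their descriptions.

The key to the solution is the retrieval augmented generation (RAG) framework. An LLM (namely, ChatGPT) first collects sentences that mention the target celebrity from Web articles and then categorizes them by content, assigning each category a name (an “aspect”). Using RAG, the paper then shows that the LLM is quite effective at judging the good/evil reputation of these aspects and descriptions, and that the proposed method significantly outperforms an existing service incorporating RAG functions.

Link: https://arxiv.org/abs/2503.14382
Authors: Rikuto Tsuchida, Hibiki Yokoyama, Takehito Utsuro
Affiliations: Deg. Prog. Sys.&Inf. Eng., Grad. Sch. Sci.&Tech., University of Tsukuba
Subjects: Computation and Language (cs.CL)
Comments: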

Abstract:The purpose of this paper is to examine whether large language models (LLMs) can understand what is good and evil with respect to judging good/evil reputation of celebrities. Specifically, we first apply a large language model (namely, ChatGPT) to the task of collecting sentences that mention the target celebrity from articles about celebrities on Web pages. Next, the collected sentences are categorized based on their contents by ChatGPT, where ChatGPT assigns a category name to each of those categories. Those assigned category names are referred to as “aspects” of each celebrity. Then, by applying the framework of retrieval augmented generation (RAG), we show that the large language model is quite effective in the task of judging good/evil reputation of aspects and descriptions of each celebrity. Finally, also in terms of proving the advantages of the proposed method over existing services incorporating RAG functions, we show that the proposed method of judging good/evil of aspects/descriptions of each celebrity significantly outperform an existing service incorporating RAG functions.

[NLP-15] VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

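Quick read: This paper addresses the challenge of handling instructional video editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. It proposes VEGGIE (Video Editor with Grounded Generation from Instructions), a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. The key is to use a multimodal large language model (MLLM) to interpret user intentions and ground them to the video context, generating frame-specific grounded task queries, which a diffusion model then renders into edited videos that align with user intent. The model is trained with a curriculum learning strategy, and a novel data synthesis pipeline generates paired instructional video editing data, supporting diverse tasks and complex instructions. VEGGIE shows strong performance across different video editing skills and outperforms other baselines in video object grounding and reasoning segmentation.

Link: https://arxiv.org/abs/2503.14350
Authors: Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal
Affiliations: Adobe Research; University of Michigan; UNC Chapel Hill
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: First three authors contributed equally. Project page: this https URL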

Abstract:Recent video diffusion models have enhanced video editing, but it remains challenging to handle instructional editing and diverse tasks (e.g., adding, removing, changing) within a unified framework. In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. Specifically, given a video and text query, VEGGIE first utilizes an MLLM to interpret user intentions in instructions and ground them to the video contexts, generating frame-specific grounded task queries for pixel-space responses. A diffusion model then renders these plans and generates edited videos that align with user intent. To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data. Additionally, we introduce a novel data synthesis pipeline to generate paired instructional video editing data for model training. It transforms static image data into diverse, high-quality video editing samples by leveraging Image-to-Video models to inject dynamics. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model, while other models struggle with multi-tasking. VEGGIE also excels in video object grounding and reasoning segmentation, where other baselines fail. We further reveal how the multiple tasks help each other and highlight promising applications like zero-shot multimodal instructional and in-context video editing.

[NLP-16] Spatio-Temporal Graph Neural Networks for Infant Language Acquisition Prediction

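Quick read: This paper addresses the prediction of children’s vocabulary acquisition. Existing neural network and graph model approaches capture, respectively, changes in the vocabulary state over time and relationships between words, but used in isolation they do not fully capture the complexity of an infant’s language learning process. The paper examines how a model of language acquisition can be constructed and adapted for use in a Spatio-Temporal Graph Convolutional Network (STGCN), taking into account the different types of linguistic relationships that occur during child language learning. The key is a novel prediction approach evaluated across these relationship types, which yields observations on model calibration and norm selection. Models using sensorimotor relationships (mean accuracy 0.733) and semantic relationships (0.729) predicted new words more accurately than a 2-layer feed-forward neural network, and the high recall for some relationships (e.g., visual) suggests they identify a larger proportion of relevant words than others (such as auditory).

Link: https://arxiv.org/abs/2503.14341
Authors: Andrew Roxburgh, Floriana Grasso, Terry R. Payne
Affiliations: University of Liverpool
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: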

Abstract:Predicting the words that a child is going to learn next can be useful for boosting language acquisition, and such predictions have been shown to be possible with both neural network techniques (looking at changes in the vocabulary state over time) and graph model (looking at data pertaining to the relationships between words). However, these models do not fully capture the complexity of the language learning process of an infant when used in isolation. In this paper, we examine how a model of language acquisition for infants and young children can be constructed and adapted for use in a Spatio-Temporal Graph Convolutional Network (STGCN), taking into account the different types of linguistic relationships that occur during child language learning. We introduce a novel approach for predicting child vocabulary acquisition, and evaluate the efficacy of such a model with respect to the different types of linguistic relationships that occur during language acquisition, resulting in insightful observations on model calibration and norm selection. An evaluation of this model found that the mean accuracy of models for predicting new words when using sensorimotor relationships (0.733) and semantic relationships (0.729) were found to be superior to that observed with a 2-layer Feed-forward neural network. Furthermore, the high recall for some relationships suggested that some relationships (e.g. visual) were superior in identifying a larger proportion of relevant words that a child should subsequently learn than others (such as auditory).

[NLP-17] PENCIL: Long Thoughts with Short Memory

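[Quick Read]: This paper targets a difficulty in scaling long Chain-of-Thought (CoT) reasoning at test time: memory is used inefficiently because intermediate computations accumulate in the context indefinitely, even when later thoughts no longer need them. The proposed method, PENCIL, incorporates a reduction mechanism into autoregressive generation, letting the model recursively clean up intermediate thoughts based on patterns learned from training. This substantially reduces the maximal context length required during generation, so the model can produce longer thoughts with limited memory and solve larger problems given more thinking time. For example, PENCIL reaches 97% accuracy on the challenging Einstein's puzzle, a task that even large models such as GPT-4 struggle with, using only a 25M-parameter transformer with a context length of 2048. On the theory side, the authors prove that PENCIL can perform universal space-efficient computation by simulating Turing machines with optimal time and space complexity, and can therefore solve computational tasks that would otherwise be intractable under context window constraints.

Link: https://arxiv.org/abs/2503.14337
Authors: Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
Affiliations: Toyota Technological Institute at Chicago
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: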

Abstract:While recent works (e.g. o1, DeepSeek R1) have demonstrated great promise of using long Chain-of-Thought (CoT) to improve reasoning capabilities of language models, scaling it up during test-time is challenging due to inefficient memory usage: intermediate computations accumulate indefinitely in context even when no longer needed for future thoughts. We propose PENCIL, which incorporates a reduction mechanism into the autoregressive generation process, allowing the model to recursively clean up intermediate thoughts based on patterns learned from training. With this reduction mechanism, PENCIL significantly reduces the maximal context length required during generation, and thus can generate longer thoughts with limited memory, solving larger-scale problems given more thinking time. For example, we demonstrate PENCIL achieves 97% accuracy on the challenging Einstein's puzzle, a task even large models like GPT-4 struggle with, using only a small 25M-parameter transformer with 2048 context length. Theoretically, we prove PENCIL can perform universal space-efficient computation by simulating Turing machines with optimal time and space complexity, and thus can solve arbitrary computational tasks that would otherwise be intractable given context window constraints.
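
The reduction mechanism can be pictured as a rewrite rule applied to the context while tokens are generated. The sketch below is only an illustration: the marker tokens `[CALL]`, `[SEP]` and `[RETURN]`, and the helpers `reduce_context` and `run`, are assumptions made here and are not given in the abstract. It replays a scripted token stream in place of a trained model.

```python
# Hypothetical marker tokens; the abstract does not specify them.
CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_context(tokens):
    """Rewrite  C [CALL] T [SEP] A [RETURN]  into  C A.

    The intermediate thoughts T are dropped; only the answer A is kept.
    Contexts that do not end with a well-formed pattern are returned unchanged.
    """
    if not tokens or tokens[-1] != RETURN or SEP not in tokens:
        return tokens
    sep = len(tokens) - 1 - tokens[::-1].index(SEP)   # last [SEP]
    if CALL not in tokens[:sep]:
        return tokens
    call = sep - 1 - tokens[:sep][::-1].index(CALL)   # nearest [CALL] before it
    answer = tokens[sep + 1:-1]
    return tokens[:call] + answer

def run(stream):
    """Feed tokens one by one, reducing eagerly; return (context, peak length)."""
    context, peak = [], 0
    for tok in stream:
        context.append(tok)
        peak = max(peak, len(context))
        context = reduce_context(context)
    return context, peak

stream = ["Q", CALL, "t1", "t2", "t3", SEP, "a", RETURN,
          CALL, "u1", "u2", SEP, "b", RETURN]
final_context, peak = run(stream)
```

In this toy run the stream has 14 tokens, but the context never grows beyond 8, because each finished sub-computation is collapsed to its answer before the next one starts.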

[NLP-18] DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

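[Quick Read]: This paper tackles the challenge of unifying visual understanding and generation within the autoregressive paradigm of large language models, given that the two require different representation spaces. A vision tokenizer trained for reconstruction captures low-level perceptual detail well, which suits generation but lacks the high-level semantic representations needed for understanding; a vision encoder trained with contrastive learning aligns well with language but struggles to decode back into pixel space for generation. To bridge the gap, the paper proposes DualToken, which unifies representations for both tasks within a single tokenizer. Its key idea is to introduce separate codebooks for high-level and low-level features, turning the otherwise conflicting reconstruction and semantic objectives into a synergistic relationship. DualToken achieves state-of-the-art performance on both reconstruction and semantic tasks, is effective in downstream multimodal large language model (MLLM) understanding and generation, and outperforms a naive combination of two distinct types of vision encoders.

Link: https://arxiv.org/abs/2503.14324
Authors: Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, Kaicheng Yu
Affiliations: Baichuan Inc.; Westlake University; Zhejiang University; Shanghai AI Laboratory; Shanghai Innovation Institute; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: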

Abstract:The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level perceptual details, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives in a single tokenizer creates conflicts, leading to degraded performance in both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high and low-level features, effectively transforming their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks while demonstrating remarkable effectiveness in downstream MLLM understanding and generation tasks. Notably, we also show that DualToken, as a unified tokenizer, surpasses the naive combination of two distinct types of vision encoders, providing superior performance within a unified MLLM.
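
To make the idea of separate codebooks concrete, here is a minimal sketch of quantizing each image patch twice, once per codebook. Everything in it is an assumption for illustration: the function names, the toy 2-d features, and the use of plain nearest-neighbour lookup are not taken from the paper, which does not describe its quantizer in the abstract.

```python
def nearest_code(codebook, vec):
    """Index of the codebook entry closest to `vec` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def dual_tokenize(low_feats, high_feats, pixel_codebook, semantic_codebook):
    """Quantize low-level features against one codebook and high-level
    features against another, giving a pair of token ids per patch."""
    return [
        (nearest_code(pixel_codebook, low), nearest_code(semantic_codebook, high))
        for low, high in zip(low_feats, high_feats)
    ]

# Toy data: two patches, each with a low-level and a high-level feature vector.
pixel_codebook = [[0.0, 0.0], [1.0, 1.0]]
semantic_codebook = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
low_feats = [[0.9, 0.8], [0.1, 0.2]]
high_feats = [[0.4, 0.6], [0.9, 0.1]]

tokens = dual_tokenize(low_feats, high_feats, pixel_codebook, semantic_codebook)
```

Because the two lookups are independent, neither codebook has to trade perceptual detail against semantics, which is the conflict the abstract attributes to single-codebook designs.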

[NLP-19] DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal

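[Quick Read]: This paper addresses sub-optimal decision-making in coding agents powered by large language models (LLMs) when automating complex software development tasks, a problem that existing approaches handle only through extensive manual intervention or inefficient compute scaling strategies. It proposes Dynamic Action Re-Sampling (DARS), a new inference-time compute scaling approach. DARS branches out a trajectory at certain key decision points by taking an alternative action, conditioned on the trajectory history and the execution feedback from the previous attempt at that point. Compared with agents that follow linear trajectories or rely on random sampling, it recovers from sub-optimal decisions faster and more effectively. On SWE-Bench Lite, the approach achieves a pass@k score of 55% with Claude 3.5 Sonnet V2 and a pass@1 rate of 47%.

Link: https://arxiv.org/abs/2503.14269
Authors: Vaibhav Aggarwal, Ojasv Kamal, Abhinav Japesh, Zhijing Jin, Bernhard Schölkopf
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: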

Abstract:Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development, by enabling automation. In software engineering, LLM-powered coding agents have garnered significant attention due to their potential to automate complex development tasks, assist in debugging, and enhance productivity. However, existing approaches often struggle with sub-optimal decision-making, requiring either extensive manual intervention or inefficient compute scaling strategies. To improve coding agent performance, we present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents, that is faster and more effective at recovering from sub-optimal decisions compared to baselines. While traditional agents either follow linear trajectories or rely on random sampling for scaling compute, our approach DARS works by branching out a trajectory at certain key decision points by taking an alternative action given the history of the trajectory and execution feedback of the previous attempt from that point. We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2. Our framework achieves a pass@1 rate of 47%, outperforming state-of-the-art (SOTA) open-source frameworks.
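
The branching step described above can be sketched as a small tree structure. This is a toy illustration under stated assumptions, not the authors' implementation: `Node`, `history`, `branch` and `toy_propose` are hypothetical names, and `toy_propose` stands in for an LLM call that would see the trajectory history and the feedback from earlier attempts.

```python
class Node:
    """One step of an agent trajectory."""
    def __init__(self, action, parent=None):
        self.action = action
        self.parent = parent
        self.feedback = ""    # execution feedback observed after this action
        self.children = []

def history(node):
    """Actions from the root down to `node`."""
    path = []
    while node is not None:
        path.append(node.action)
        node = node.parent
    return path[::-1]

def branch(node, propose, max_branches=2):
    """Add an alternative child at a decision point.

    `propose` receives the history up to `node` and the (action, feedback)
    pairs of the attempts already made from it. Returns None once the
    branching budget for this node is used up.
    """
    if len(node.children) >= max_branches:
        return None
    attempts = [(c.action, c.feedback) for c in node.children]
    child = Node(propose(history(node), attempts), parent=node)
    node.children.append(child)
    return child

def toy_propose(hist, attempts):
    """Stand-in for an LLM call: pick the first candidate not tried yet."""
    tried = {action for action, _ in attempts}
    for candidate in ("edit parser.py", "edit lexer.py", "add regression test"):
        if candidate not in tried:
            return candidate
    return "give up"

root = Node("reproduce bug")
first = branch(root, toy_propose)
first.feedback = "tests still fail"
second = branch(root, toy_propose)   # re-sampled alternative at the same point
third = branch(root, toy_propose)    # budget exhausted
```

The contrast with a linear agent is that `second` starts from the same decision point as `first` instead of continuing a failing trajectory, and it is chosen with knowledge of what was already tried there.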

[NLP-20] JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System

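[Quick Read]: This paper introduces a benchmark for evaluating judgment document generation in the Chinese legal system. The task is defined as generating a complete legal judgment document from the factual description of a case. The authors construct a dataset of factual descriptions from real legal cases paired with their full judgment documents, which serve as ground truth. The dataset is augmented with two external legal corpora, one of statutes and regulations and one of past judgment documents, and an automated evaluation framework is established in collaboration with legal professionals. Baselines evaluated on both general and legal-domain LLMs include few-shot in-context learning, fine-tuning, and a multi-source retrieval-augmented generation (RAG) approach. The results show that RAG approaches improve performance effectively, but substantial room for improvement remains.

Link: https://arxiv.org/abs/2503.14258
Authors: Weihang Su, Baoqing Yue, Qingyao Ai, Yiran Hu, Jiaqi Li, Changyue Wang, Kaiyuan Zhang, Yueyue Wu, Yiqun Liu
Affiliations: DCST, Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: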

Abstract:This paper introduces JuDGE (Judgment Document Generation Evaluation), a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system. We define the task as generating a complete legal judgment document from the given factual description of the case. To facilitate this benchmark, we construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents, which serve as the ground truth for evaluating the quality of generated documents. This dataset is further augmented by two external legal corpora that provide additional legal knowledge for the task: one comprising statutes and regulations, and the other consisting of a large collection of past judgment documents. In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents across various dimensions. We evaluate various baseline approaches, including few-shot in-context learning, fine-tuning, and a multi-source retrieval-augmented generation (RAG) approach, using both general and legal-domain LLMs. The experimental results demonstrate that, while RAG approaches can effectively improve performance in this task, there is still substantial room for further improvement. All the codes and datasets are available at: this https URL.

[NLP-21] Benchmarking Failures in Tool-Augmented Language Models

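[Quick Read]: This paper studies the imperfections of tool-augmented language models (TaLMs) in real-world use, focusing on two failure cases: under-specified user queries and non-available tools. It introduces the FAIL-TALMS benchmark, which contains 1,749 examples using 906 tools across 21 categories, and evaluates top-performing proprietary and open-source models. All current models except Claude struggle to recognize missing tools or information.

To mitigate these failures, the paper proposes Ask-and-Help (AAH), a real-time human interaction method that supplies missing information or replaces non-functional tools. AAH helps models solve tasks more correctly when queries are under-specified, but brings minimal benefit when complex tools are broken. Human-AI collaboration through AAH therefore compensates only partially for the shortcomings of TaLMs in the real world.

Link: https://arxiv.org/abs/2503.14227
Authors: Eduardo Treviño, Hugo Contant, James Ngai, Graham Neubig, Zora Zhiruo Wang
Affiliations: Carnegie Mellon University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: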

Abstract:The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume ‘perfect’ information access and tool availability, which may not hold in the real world. To systematically study TaLMs’ imperfections, we introduce the FAIL-TALMS benchmark, featuring two major failures: under-specified user queries and non-available tools. FAIL-TALMS contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find all current models except for Claude struggle to recognize missing tools or information. Further, to study possible mitigation of the failures, we enable real-time human interaction, named the Ask-and-Help (AAH) method, to provide missing information or replace non-functional tools. While AAH can help models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.

[NLP-22] Towards Harmless Multimodal Assistants with Blind Preference Optimization

[Quick Read]: This paper addresses the lack of safety-related preference data for Multimodal Large Language Models (MLLMs) and the question of how to optimize them for safety. Preference optimization aligns MLLMs with human preferences effectively, but safety-specific preference data has been missing. The authors construct the MMSafe-PO preference dataset, featuring multimodal instructions, a conversational format, and ranked paired responses from human feedback. They report two observations, modality co-defense and modality cheating, which show that MLLMs have some inherent defense yet still pose unique safety challenges. Building on these, they propose Blind Preference Optimization (BPO). In experiments on three benchmarks, BPO improves the safety rate of the base MLLM by 45.0%, outperforming DPO, and reduces the base MLLM's unsafe rate on other safety benchmarks by 14.5% on MM-SafetyBench and 82.9% on HarmEval.

Link: https://arxiv.org/abs/2503.14189
Authors: Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, Liqiang Nie
Affiliations: The Hong Kong Polytechnic University; Wuhan University; Harbin Institute of Technology (Shenzhen)
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM’s unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach. We release code and data at this https URL.

[NLP-23] AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation ACL2021

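[Quick Read]: In conventional end-to-end speech translation, the acoustic representations learned by the encoder are fixed and static from the decoder's perspective, which is undesirable for handling the cross-modal and cross-lingual challenges of the task. This paper proposes an adaptive speech-to-text translation model that dynamically adapts acoustic states in the decoder. The acoustic states are concatenated with the target word embedding sequence and fed into the subsequent decoder blocks, and a speech-text mixed attention sublayer replaces the conventional cross-attention network to model deep interaction between acoustic states and target hidden states. Experiments on two widely used datasets show that the method significantly outperforms state-of-the-art neural speech translation models.

Link: https://arxiv.org/abs/2503.14185
Authors: Wuwei Huang, Dexin Wang, Deyi Xiong
Affiliations: College of Intelligence and Computing, Tianjin University
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: ACL 2021 Findings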

Abstract:In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static, from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenge in speech translation. In this paper, we show the benefits of varying acoustic states according to decoder hidden states and propose an adaptive speech-to-text translation model that is able to dynamically adapt acoustic states in the decoder. We concatenate the acoustic state and target word embedding sequence and feed the concatenated sequence into subsequent blocks in the decoder. In order to model the deep interaction between acoustic states and target hidden states, a speech-text mixed attention sublayer is introduced to replace the conventional cross-attention network. Experiment results on two widely-used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.

[NLP-24] NERCat: Fine-Tuning for Enhanced Named Entity Recognition in Catalan

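[Quick Read]: This paper addresses the weak performance of Named Entity Recognition (NER) in low-resource languages such as Catalan, which stems largely from the lack of high-quality annotated datasets. It introduces NERCat, a fine-tuned version of the GLiNER model designed for Catalan text. The model was trained on manually annotated Catalan television transcriptions covering domains such as politics, sports, and culture. Evaluation shows significant improvements in precision, recall, and F1-score, particularly for underrepresented entity categories such as Law, Product, and Facility, which demonstrates the value of domain-specific fine-tuning and high-quality annotated data for low-resource languages.

Link: https://arxiv.org/abs/2503.14173
Authors: Guillem Cadevall Ferreres, Marc Serrano Sanz, Marc Bardeli Gámez, Pol Gerdt Basullas, Francesc Tarres Ruiz, Raul Quijada Ferrero
Affiliations: Ugiat Technologies
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 1 table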

Abstract:Named Entity Recognition (NER) is a critical component of Natural Language Processing (NLP) for extracting structured information from unstructured text. However, for low-resource languages like Catalan, the performance of NER systems often suffers due to the lack of high-quality annotated datasets. This paper introduces NERCat, a fine-tuned version of the GLiNER[1] model, designed to improve NER performance specifically for Catalan text. We used a dataset of manually annotated Catalan television transcriptions to train and fine-tune the model, focusing on domains such as politics, sports, and culture. The evaluation results show significant improvements in precision, recall, and F1-score, particularly for underrepresented named entity categories such as Law, Product, and Facility. This study demonstrates the effectiveness of domain-specific fine-tuning in low-resource languages and highlights the potential for enhancing Catalan NLP applications through manual annotation and high-quality datasets.

[NLP-25] Synthetic Clarification and Correction Dialogues about Data-Centric Tasks – A Teacher-Student Approach

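[Quick Read]: Real dialogues with AI assistants on data-centric tasks follow dynamic, unpredictable paths because the information provided by the user or contained in the data is imperfect. This paper proposes a framework for synthetically generating controlled, multi-turn conversations for table-based question answering; the conversations can be generated from an existing dataset of fully specified table QA examples for any target domain. Each conversation solves a question collaboratively and models one of two real-world scenarios: an AI-initiated clarification or a user-initiated correction. A strong teacher LLM verifies the correctness of the synthetic conversations to ensure dataset quality.

Link: https://arxiv.org/abs/2503.14167
Authors: Christian Poelitz, Nick McKenna
Affiliations: Microsoft Research
Categories: Computation and Language (cs.CL)
Comments: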

Abstract:Real dialogues with AI assistants for solving data-centric tasks often follow dynamic, unpredictable paths due to imperfect information provided by the user or in the data, which must be caught and handled. Developing datasets which capture such user-AI interactions is difficult and time-consuming. In this work, we develop a novel framework for synthetically generating controlled, multi-turn conversations between a user and AI assistant for the task of table-based question answering, which can be generated from an existing dataset with fully specified table QA examples for any target domain. Each conversation aims to solve a table-based reasoning question through collaborative effort, modeling one of two real-world scenarios: (1) an AI-initiated clarification, or (2) a user-initiated correction. Critically, we employ a strong teacher LLM to verify the correctness of our synthetic conversations, ensuring high quality. We demonstrate synthetic datasets generated from TAT-QA and WikiTableQuestions as benchmarks of frontier LLMs. We find that even larger models struggle to effectively issue clarification questions and accurately integrate user feedback for corrections.

[NLP-26] Speculative Decoding for Verilog: Speed and Quality All in One

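[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in generating code for languages such as Verilog, which have specific syntax and low representation in training data, so conventional tokenization and decoding approaches fail to capture the language's logical constructs. The key contribution is a novel application of speculative decoding that aligns decoding stops with syntactically significant tokens, which addresses inherent tokenization issues and improves output quality. The method achieves up to a 5.05x speedup in Verilog code generation and raises pass@10 functional accuracy on RTLLM by up to 17.19% compared with conventional training strategies.

Link: https://arxiv.org/abs/2503.14153
Authors: Changran Xu, Yi Liu, Yunhao Zhou, Shan Huang, Ningyi Xu, Qiang Xu
Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; National Technology Innovation Center for EDA
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Comments: Accepted by the 62nd Design Automation Conference (DAC 2025)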

Abstract:The rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific syntax and lower representation in training datasets, pose significant challenges for conventional tokenization and decoding approaches. In this paper, we introduce a novel application of speculative decoding for Verilog code generation, showing that it can improve both inference speed and output quality, effectively achieving speed and quality all in one. Unlike standard LLM tokenization schemes, which often fragment meaningful code structures, our approach aligns decoding stops with syntactically significant tokens, making it easier for models to learn the token distribution. This refinement addresses inherent tokenization issues and enhances the model’s ability to capture Verilog’s logical constructs more effectively. Our experimental results show that our method achieves up to a 5.05x speedup in Verilog code generation and increases pass@10 functional accuracy on RTLLM by up to 17.19% compared to conventional training strategies. These findings highlight speculative decoding as a promising approach to bridge the quality gap in code generation for specialized programming languages.
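
For readers unfamiliar with speculative decoding, the sketch below shows the generic greedy draft-and-verify loop on a toy Verilog-like token sequence. It does not implement the paper's syntax-aligned stopping; the two `*_next` functions are hypothetical stand-ins for a large target model and a cheaper draft model, and the token lists are invented for illustration.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding.

    The draft proposes k tokens; the target keeps the longest prefix it
    agrees with, replaces the first mismatch with its own token, and adds
    one bonus token when the whole draft is accepted.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        accepted = []
        for tok in draft:
            # A real system scores all draft positions in one batched pass.
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)   # target's correction
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(out + accepted))   # bonus token
        out.extend(accepted)
    return out[:len(prompt) + max_new]

TARGET = "module m ; assign y = a & b ; endmodule".split()
DRAFT = [t if t != "&" else "|" for t in TARGET]   # draft errs on one token

def target_next(tokens):
    return TARGET[len(tokens)] if len(tokens) < len(TARGET) else "<eos>"

def draft_next(tokens):
    return DRAFT[len(tokens)] if len(tokens) < len(DRAFT) else "<eos>"

result = speculative_decode(target_next, draft_next, ["module"], max_new=10)
```

With greedy verification the output is identical to what the target model alone would produce; the saving comes from accepting several draft tokens per target pass.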

[NLP-27] CARE: A QLoRA-Fine Tuned Multi-Domain Chatbot With Fast Learning On Minimal Hardware

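[Quick Read]: This paper aims to make domain-specific use of large language models efficient and inexpensive. Fine-tuning normally requires significant training time and considerable hardware; CARE (Customer Assistance and Response Engine) is instead a lightweight model built by fine-tuning Phi3.5-mini on very minimal hardware and data. It handles customer queries across three domains: telecommunications support, medical support, and banking support. In the medical domain it offers preliminary support in the form of basic diagnoses and suggestions a user might act on before consulting a healthcare professional. Because it is built on Phi3.5-mini, it can run even on mobile devices.

Link: https://arxiv.org/abs/2503.14136
Authors: Ankit Dutta, Nabarup Ghosh, Ankush Chatterjee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: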

Abstract:Large Language models have demonstrated excellent domain-specific question-answering capabilities when finetuned with a particular dataset of that specific domain. However, fine-tuning the models requires a significant amount of training time and a considerable amount of hardware. In this work, we propose CARE (Customer Assistance and Response Engine), a lightweight model made by fine-tuning Phi3.5-mini on very minimal hardware and data, designed to handle queries primarily across three domains: telecommunications support, medical support, and banking support. For telecommunications and banking, the chatbot addresses issues and problems faced by customers regularly in the above-mentioned domains. In the medical domain, CARE provides preliminary support by offering basic diagnoses and medical suggestions that a user might take before consulting a healthcare professional. Since CARE is built on Phi3.5-mini, it can be used even on mobile devices, increasing its usability. Our research also shows that CARE performs relatively well on various medical benchmarks, indicating that it can be used to make basic medical suggestions.

[NLP-28] Frac-Connections: Fractional Extension of Hyper-Connections

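[Quick Read]: This paper addresses the seesaw effect between gradient vanishing and representation collapse in training deep networks, together with the increased memory access cost of an existing remedy, Hyper-Connections, which expand the width of hidden states. The proposed Frac-Connections instead divide hidden states into multiple parts, retaining part of the benefit of Hyper-Connections while reducing memory consumption. Large-scale experiments on language tasks, the largest being a 7B MoE model trained on up to 3T tokens, show that Frac-Connections significantly outperform residual connections.

Link: https://arxiv.org/abs/2503.14125
Authors: Defa Zhu, Hongzhi Huang, Jundong Zhou, Zihao Huang, Yutao Zeng, Banggu Wu, Qiyang Min, Xun Zhou
Affiliations: ByteDance
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: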

Abstract:Residual connections are central to modern deep learning architectures, enabling the training of very deep networks by mitigating gradient vanishing. Hyper-Connections recently generalized residual connections by introducing multiple connection strengths at different depths, thereby addressing the seesaw effect between gradient vanishing and representation collapse. However, Hyper-Connections increase memory access costs by expanding the width of hidden states. In this paper, we propose Frac-Connections, a novel approach that divides hidden states into multiple parts rather than expanding their width. Frac-Connections retain partial benefits of Hyper-Connections while reducing memory consumption. To validate their effectiveness, we conduct large-scale experiments on language tasks, with the largest being a 7B MoE model trained on up to 3T tokens, demonstrating that Frac-Connections significantly outperform residual connections.

[NLP-29] Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia

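[Quick Read]: This paper addresses quantitative data extraction in natural language processing, in particular the lack of datasets for identifying quantities and their context in text. It presents two large datasets based on Wikipedia and Wikidata: Wiki-Quantities, with over 1.2 million annotated quantities, and Wiki-Measurements, with 38,738 annotated quantities together with their measured entity, property, and optional qualifiers. Both support pipeline approaches to measurement extraction, in which quantities are identified first and their measurement context second. The code used to create the datasets is published with the data so that the work can be reproduced with newer or different versions of Wikipedia and Wikidata.

Link: https://arxiv.org/abs/2503.14090
Authors: Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, Detlef Stolten
Affiliations: Forschungszentrum Jülich GmbH, Institute of Climate and Energy Systems, Jülich Systems Analysis; RWTH Aachen University, Chair for Fuel Cells, Faculty of Mechanical Engineering
Categories: Computation and Language (cs.CL)
Comments: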

Abstract:To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack of datasets for identifying quantities and their context in text. To address this issue, we present two large datasets based on Wikipedia and Wikidata: Wiki-Quantities is a dataset consisting of over 1.2 million annotated quantities in the English-language Wikipedia. Wiki-Measurements is a dataset of 38,738 annotated quantities in the English-language Wikipedia along with their respective measured entity, property, and optional qualifiers. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements found 100% and 84-94% correct, respectively. The datasets can be used in pipeline approaches to measurement extraction, where quantities are first identified and then their measurement context. To allow reproduction of this work using newer or different versions of Wikipedia and Wikidata, we publish the code used to create the datasets along with the data.

[NLP-30] Growing a Twig to Accelerate Large Vision-Language Models

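[Quick Read]: This paper addresses the high computational overhead that hinders practical deployment of large vision-language models (VLMs). Existing methods prune redundant visual tokens guided by the attention maps of early layers, but have two major shortcomings: (i) a considerable accuracy drop caused by insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). The proposed TwigVLM grows a lightweight twig on an early layer of the base VLM. It retains accuracy better through a twig-guided token pruning (TTP) strategy and generates faster through a self-speculative decoding (SSD) strategy. With LLaVA-1.5-7B as the base VLM, TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves a 154% speedup in generating long responses.

Link: https://arxiv.org/abs/2503.14075
Authors: Zhenwei Shao, Mingyang Wang, Zhou Yu, Wenwen Pan, Yan Yang, Tao Wei, Hongyuan Zhang, Ning Mao, Wei Chen, Jun Yu
Affiliations: School of Computer Science, Hangzhou Dianzi University; Li Auto Inc.; School of Intelligence Science and Engineering, Harbin Institute of Technology (Shenzhen)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 17 pages, 8 figures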

Abstract:Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM’s early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM – a simple and general architecture by growing a lightweight twig upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods. Code will be made publicly available.
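
Attention-guided token pruning, the family of methods this paper builds on, reduces to keeping the highest-scoring visual tokens. The sketch below is a generic illustration, not TwigVLM's code: `prune_visual_tokens` is a hypothetical helper, and the scores would in practice come from attention maps (in TwigVLM, from the twig). The figure of 576 visual tokens per image for LLaVA-1.5 is background knowledge brought in here, not stated in the abstract.

```python
def prune_visual_tokens(scores, keep_ratio):
    """Return the indices of the visual tokens to keep, in original order.

    `scores` holds one importance score per visual token; the tokens with
    the highest scores are kept.
    """
    keep = max(1, round(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(top)

# Small example: keep the best half of four tokens.
kept = prune_visual_tokens([0.1, 0.9, 0.3, 0.8], keep_ratio=0.5)

# Pruning 88.9% of an assumed 576 visual tokens leaves 64 of them.
remaining = len(prune_visual_tokens(list(range(576)), keep_ratio=1 - 0.889))
```

Returning the kept indices in their original order preserves the spatial ordering of the image patches for the layers that follow.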

[NLP-31] Synthetic Data Generation Using Large Language Models: Advances in Text and Code

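[Quick Read]: This survey covers the use of large language models (LLMs) to generate synthetic training data for natural language and code, which can augment or even replace real-world datasets when labeled data is scarce or sensitive. It emphasizes prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. These methods enrich low-resource tasks such as classification and question answering, and code-centric applications such as instruction tuning, code translation, and bug repair, where functional correctness can be verified automatically. Alongside benefits such as cost-effectiveness, broad coverage, and controllable diversity, the survey discusses challenges including factual inaccuracies in generated text, lack of stylistic realism, and the risk of bias amplification, with mitigations such as filtering and weighting outputs and reinforcement learning with execution feedback for code. It closes with open directions such as automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, while stressing ethical and quality safeguards.

Link: https://arxiv.org/abs/2503.14023
Authors: Mihai Nadas, Laura Diosan, Andreea Tomescu
Affiliations: Babeș-Bolyai University; KlusAI Labs
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 3 tables, 64 references, preprint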

Abstract:Large language models (LLMs) have unlocked new possibilities for generating synthetic training data in both natural language and code. By producing artificial but task-relevant examples, these models can significantly augment or even replace real-world datasets, especially when labeled data is scarce or sensitive. This paper surveys recent advances in using LLMs to create synthetic text and code, emphasizing prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We show how these methods enrich low-resource tasks such as classification and question answering, as well as code-centric applications such as instruction tuning, code translation, and bug repair, by enabling automated verification of functional correctness. Alongside potential benefits like cost-effectiveness, broad coverage, and controllable diversity, we address challenges such as factual inaccuracies in generated text, lack of stylistic realism, and the risk of bias amplification. Proposed mitigations include filtering and weighting outputs and reinforcement learning with execution feedback for code. We conclude with open research directions like automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, highlighting the importance of LLM-generated synthetic data in advancing AI while emphasizing ethical and quality safeguards.
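
Automated verification of functional correctness is what makes synthetic code data easy to filter: a generated sample is kept only if it passes its tests. The following is a minimal sketch of that idea, with hypothetical helper names and hand-written candidates standing in for LLM output. It uses `exec` directly for brevity; a real pipeline would run untrusted code in a sandbox with a timeout.

```python
def passes_tests(source, tests):
    """Run a candidate and its assert-based tests in a fresh namespace."""
    namespace = {}
    try:
        exec(source, namespace)   # unsafe for untrusted code; sandbox in practice
        exec(tests, namespace)
    except Exception:
        return False              # syntax error, runtime error, or failed assert
    return True

def filter_candidates(candidates, tests):
    """Keep only the candidates that execute and satisfy every test."""
    return [c for c in candidates if passes_tests(c, tests)]

# Stand-ins for model-generated solutions to "write add(a, b)".
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",     # wrong result
    "def add(a, b) return a + b",             # does not parse
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

verified = filter_candidates(candidates, tests)
```

The same pass/fail signal can serve as the reward in the execution-feedback reinforcement learning setups the survey mentions, although that use is not shown here.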

[NLP-32] The KoLMogorov Test: Compression by Code Generation

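[Quick Read]: This paper asks how compression can be used to evaluate and improve code-generating large language models (LLMs). The theoretically optimal compression of a sequence, the shortest program that outputs it and halts ('Kolmogorov compression'), is uncomputable, and approximating it demands reasoning, planning, and search capabilities beyond those of current models. The paper introduces the KoLMogorov-Test (KT), in which a model is given a data sequence at inference time and asked to generate the shortest program that produces it. KT offers an essentially infinite supply of problem instances of varying difficulty, existing strong baselines, an evaluation metric that cannot be gamed, and a very low risk of pretraining data contamination. Evaluated on audio, text, DNA data, and sequences from random synthetic programs, flagship models such as GPT4-o and Llama-3.1-405B perform poorly. Gains obtained on synthetic data generalize poorly to real data, suggesting that new innovations are needed.

Link: https://arxiv.org/abs/2503.13992
Authors: Ori Yoran, Kunhao Zheng, Fabian Gloeckle, Jonas Gehring, Gabriel Synnaeve, Taco Cohen
Affiliations: Meta AI (FAIR); Tel Aviv University
Categories: Computation and Language (cs.CL)
Comments: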

Abstract:Compression is at the heart of intelligence. A theoretically optimal way to compress any sequence of data is to find the shortest program that outputs that sequence and then halts. However, such ‘Kolmogorov compression’ is uncomputable, and code generating LLMs struggle to approximate this theoretical ideal, as it requires reasoning, planning and search capabilities beyond those of current models. In this work, we introduce the KoLMogorov-Test (KT), a compression-as-intelligence test for code generating LLMs. In KT a model is presented with a sequence of data at inference time, and asked to generate the shortest program that produces the sequence. We identify several benefits of KT for both evaluation and training: an essentially infinite number of problem instances of varying difficulty is readily available, strong baselines already exist, the evaluation metric (compression) cannot be gamed, and pretraining data contamination is highly unlikely. To evaluate current models, we use audio, text, and DNA data, as well as sequences produced by random synthetic programs. Current flagship models perform poorly - both GPT4-o and Llama-3.1-405B struggle on our natural and synthetic sequences. On our synthetic distribution, we are able to train code generation models with lower compression rates than previous approaches. Moreover, we show that gains on synthetic data generalize poorly to real data, suggesting that new innovations are necessary for additional gains on KT.
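
The scoring idea behind a compression-by-code test can be sketched in a few lines: a program counts only if it reproduces the data exactly, and its score is its length relative to the data. The conventions below are assumptions for illustration, not the paper's protocol: the program is Python source that must bind the sequence to a variable named `output`, and the rate is program bytes divided by data bytes, so lower is better.

```python
def compression_rate(program, data):
    """Program size divided by data size, or None if the program is invalid.

    A program is valid only if it runs without error and binds `output`
    to exactly `data` (a bytes object).
    """
    namespace = {}
    try:
        exec(program, namespace)   # sandbox untrusted programs in practice
    except Exception:
        return None
    if namespace.get("output") != data:
        return None
    return len(program.encode("utf-8")) / len(data)

data = bytes(range(256)) * 4                      # 1024 bytes with obvious structure
good_program = "output = bytes(range(256)) * 4"   # exploits the structure
literal_program = "output = " + repr(data)        # just spells the data out
wrong_program = "output = bytes(range(255)) * 4"  # runs, but output differs

good_rate = compression_rate(good_program, data)
literal_rate = compression_rate(literal_program, data)
wrong_rate = compression_rate(wrong_program, data)
```

Requiring an exact match is what makes the metric hard to game: a short program that produces the wrong sequence earns nothing, and a literal dump of the data is valid but achieves no compression.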

[NLP-33] Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks

【速读】：该论文旨在解决大型语言模型在低资源环境下处理推理密集型任务（如标准化教育测试）的能力不足问题，同时关注小规模或紧凑型开放权重语言模型在支持欠代表语言（如乌克兰语）时存在的性能差距。论文的关键解决方案在于通过参数高效微调方法优化紧凑型模型（如LLaMA 3.1、LLaMA 3.2和Gemma 2），利用逐步解题方案（chain-of-thought solutions）显著提升了复杂匹配任务上的测试分数，最高可达17.4%，并在整体性能上提高了1.6%。此外，结合任务主题与逐步解题生成的联合微调方法进一步增强了模型的解释性和鲁棒性，特别是在匹配任务中超越了标准链式思维微调，并实现了比最佳LLaMA 3.2模型高出5.4%的增益。该方法通过引导模型回忆和应用领域相关知识来提升性能。研究还表明，使用少量可训练参数（20至50百万）和量化适配器对LLaMA和Gemma模型进行微调，使其在单个A100 GPU上即可超越一些领先的开源和专有模型（如GPT-4o mini、Mistral Large等）。

Link: https://arxiv.org/abs/2503.13988
Authors: Mykyta Syromiatnikov, Victoria Ruvinskaya, Nataliia Komleva
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 tables, 2 figures

Abstract:Leading large language models have demonstrated impressive capabilities in reasoning-intensive tasks, such as standardized educational testing. However, they often require extensive training in low-resource settings with inaccessible infrastructure. Small or compact models, though more efficient, frequently lack sufficient support for underrepresented languages, leaving a performance gap in critical domains. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks in the underrepresented Ukrainian language, building on the findings of the ZNO-Eval benchmark. Parameter-efficient fine-tuning of LLaMA 3.1 (8 billion parameters), LLaMA 3.2 (3 billion parameters), and Gemma 2 (9 billion parameters) models on chain-of-thought solutions resulted in a modest test score improvement of up to 17.4% on complex matching tasks and 1.6% overall compared to tuning on answer letters alone, offering enhanced interpretability and robustness. In addition, the proposed tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning in matching tasks and provides a 5.4% gain over the best LLaMA 3.2 model due to guiding the model to recall and apply domain-relevant information. Contrasting the obtained results with zero-shot evaluations of leading open-weight and proprietary models such as Qwen, DeepSeek R1, OpenAI o1 and o3, Gemini, and Claude highlights that fine-tuning LLaMA and Gemma models with 2,032 step-by-step solutions and 20 to 50 million trainable parameters on a single A100 GPU lets them outperform GPT-4o mini, Mistral Large, and larger open-weight models. This research also evaluates how merging the quantized adapter with the base model influences the generation quality. Source code and tuned models are available at this https URL.
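
To give a feel for the "20 to 50 million trainable parameters" figure, here is a rough count for low-rank adapters. It is a sketch under assumptions: the abstract mentions parameter-efficient fine-tuning and a quantized adapter but does not name LoRA, the ranks are hypothetical, and the layer shapes are those of the public LLaMA 3.1 8B configuration (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128).

```python
# Shapes (in_features, out_features) of the linear layers in one LLaMA 3.1 8B block.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32

LAYER_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(rank: int, shapes=LAYER_SHAPES, num_layers: int = NUM_LAYERS) -> int:
    """A rank-r adapter on a (d_in, d_out) layer adds r * (d_in + d_out) weights."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers


params_r8 = lora_param_count(8)    # about 21 million
params_r16 = lora_param_count(16)  # about 42 million
```

Under these assumptions, adapting every linear layer at rank 8 to 16 lands inside the range quoted in the abstract, against roughly 8 billion frozen base weights.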

[NLP-34] Navigating Rifts in Human-LLM Grounding: Study and Benchmark

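【Quick read】: The paper addresses the grounding challenge that language models (LMs) face in collaborative dialogue: in human-LLM interaction, a lack of effective grounding behavior leads to miscommunication or outright failure. Poor grounding not only frustrates users but can have serious consequences in high-stakes scenarios. To study the problem systematically, the authors analyze three human-assistant dialogue datasets (WildChat, MultiWOZ, and Bing Chat), propose a taxonomy of grounding acts, and build models to annotate and forecast grounding behavior. They find that LLMs are markedly less likely than humans to initiate clarification or make follow-up requests. Building on these observations, the paper introduces RIFTS, a benchmark for evaluating LLMs in situations where grounding fails, and shows that current state-of-the-art models perform poorly on it. It then proposes a preliminary intervention to mitigate grounding failures. The key to the approach is the grounding-act taxonomy and the accompanying annotation and forecasting tools, which identify where LLM grounding falls short.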

Link: https://arxiv.org/abs/2503.13975
Authors: Omar Shaikh, Hussein Mozannar, Gagan Bansal, Adam Fourney, Eric Horvitz
Affiliations: Stanford University; Microsoft Research
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 16 pages, 5 figures

Abstract:Language models excel at following instructions but often struggle with the collaborative aspects of conversation that humans naturally employ. This limitation in grounding – the process by which conversation participants establish mutual understanding – can lead to outcomes ranging from frustrated users to serious consequences in high-stakes scenarios. To systematically study grounding challenges in human-LLM interactions, we analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. We develop a taxonomy of grounding acts and build models to annotate and forecast grounding behavior. Our findings reveal significant differences in human-human and human-LLM grounding: LLMs were three times less likely to initiate clarification and sixteen times less likely to provide follow-up requests than humans. Additionally, early grounding failures predicted later interaction breakdowns. Building on these insights, we introduce RIFTS: a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. We note that current frontier models perform poorly on RIFTS, highlighting the need to reconsider how we train and prompt LLMs for human interaction. To this end, we develop a preliminary intervention that mitigates grounding failures.

[NLP-35] ConSCompF: Consistency-focused Similarity Comparison Framework for Generative Large Language Models

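【Quick read】: The paper addresses the growing difficulty of telling large language models (LLMs) apart as new models keep appearing. Its proposed solution is the Consistency-focused Similarity Comparison Framework (ConSCompF), which compares the texts generated by two LLMs and outputs a score for the overall similarity of their responses. The framework needs only a small amount of unlabeled data (such as chatbot instruction prompts) and does not require model developers to disclose anything about their product. The authors validate it by correlating its similarity scores with differences in the outputs of other benchmarking techniques such as ROUGE-L, and by visualizing multi-model similarity matrices with principal component analysis (PCA). The output may also offer insight into LLM training data and help detect investment fraud.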

Link: https://arxiv.org/abs/2503.13923
Authors: Alexey Karev, Dong Xu
Affiliations: School of Computer Engineering and Science, Shanghai University, Shanghai, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have been one of the most important discoveries in machine learning in recent years. LLM-based artificial intelligence (AI) assistants, such as ChatGPT, have consistently attracted attention from researchers, investors, and the general public, driving the rapid growth of this industry. With the frequent introduction of new LLMs to the market, it becomes increasingly difficult to differentiate between them, creating a demand for new LLM comparison methods. In this research, the Consistency-focused Similarity Comparison Framework (ConSCompF) for generative large language models is proposed. It compares texts generated by two LLMs and produces a similarity score, indicating the overall degree of similarity between their responses. The main advantage of this framework is that it can operate on a small amount of unlabeled data, such as chatbot instruction prompts, and does not require LLM developers to disclose any information about their product. To evaluate the efficacy of ConSCompF, two experiments aimed at identifying similarities between multiple LLMs are conducted. Additionally, these experiments examine the correlation between the similarity scores generated by ConSCompF and the differences in the outputs produced by other benchmarking techniques, such as ROUGE-L. Finally, a series of few-shot LLM comparison experiments is conducted to evaluate the performance of ConSCompF in a few-shot LLM comparison scenario. The proposed framework can be used for calculating similarity matrices of multiple LLMs, which can be effectively visualized using principal component analysis (PCA). The ConSCompF output may provide useful insights into data that might have been used during LLM training and help detect possible investment fraud attempts.
Journal reference: Journal of Artificial Intelligence Research 82 (2025) 1325-1347. Related DOI: https://doi.org/10.1613/jair.1.17028. Submitted by Alexey Karev (v1): Tue, 18 Mar 2025 05:38:04 UTC.
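
The idea of scoring two models by the similarity of their answers to the same prompts, and collecting the scores into a matrix, can be sketched as below. This is not ConSCompF itself: the bag-of-words cosine is a stand-in for whatever text representation the framework uses, the consistency weighting suggested by the framework's name is omitted, and the model names and responses are invented.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity(responses_a, responses_b) -> float:
    """Mean similarity over the two models' responses to the same prompts."""
    pairs = [
        (Counter(x.lower().split()), Counter(y.lower().split()))
        for x, y in zip(responses_a, responses_b)
    ]
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def similarity_matrix(outputs: dict):
    """Pairwise similarity scores for every pair of models."""
    names = list(outputs)
    return names, [[similarity(outputs[m], outputs[n]) for n in names] for m in names]


outputs = {  # hypothetical responses to the same two prompts
    "model_a": ["paris is the capital of france", "two plus two is four"],
    "model_b": ["the capital of france is paris", "two plus two equals four"],
    "model_c": ["i cannot answer that", "unknown"],
}
names, matrix = similarity_matrix(outputs)
```

A matrix like this is the kind of input the abstract describes projecting to two dimensions with PCA, so that models with similar behavior appear close together.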

[NLP-36] COMM: Concentrated Margin Maximization for Robust Document-Level Relation Extraction AAAI2025

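【Quick read】: The paper tackles two data problems in document-level relation extraction (DocRE): frequent labeling errors caused by the complexity of the task, and the sparsity of positive samples, both of which make relation extraction harder. Its solution is a framework called COMM. COMM first uses an instance-aware reasoning method to dynamically capture the relevant information about each entity pair and extract relational features. It then applies Concentrated Margin Maximization, which adjusts the margins between prediction logits and the decision threshold according to the relation distribution and sample difficulty. This improves robustness on low-quality data, where the reported performance gain exceeds 10%.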

Link: https://arxiv.org/abs/2503.13885
Authors: Zhichao Duan, Tengyu Pan, Zhenyu Li, Xiuxing Li, Jianyong Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI 2025 poster

Abstract:Document-level relation extraction (DocRE) is the process of identifying and extracting relations between entities that span multiple sentences within a document. Due to its realistic settings, DocRE has garnered increasing research attention in recent years. Previous research has mostly focused on developing sophisticated encoding models to better capture the intricate patterns between entity pairs. While these advancements are undoubtedly crucial, an even more foundational challenge lies in the data itself. The complexity inherent in DocRE makes the labeling process prone to errors, compounded by the extreme sparsity of positive relation samples, which is driven by both the limited availability of positive instances and the broad diversity of positive relation types. These factors can lead to biased optimization processes, further complicating the task of accurate relation extraction. Recognizing these challenges, we have developed a robust framework called COMM to better solve DocRE. COMM operates by initially employing an instance-aware reasoning method to dynamically capture pertinent information of entity pairs within the document and extract relational features. Following this, COMM takes into account the distribution of relations and the difficulty of samples to dynamically adjust the margins between prediction logits and the decision threshold, a process we call Concentrated Margin Maximization. In this way, COMM not only enhances the extraction of relevant relational features but also boosts DocRE performance by addressing the specific challenges posed by the data. Extensive experiments and analysis demonstrate the versatility and effectiveness of COMM, especially its robustness when trained on low-quality data (achieves >10% performance gains).

[NLP-37] Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations

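【Quick read】: The paper addresses the challenge of timely evidence synthesis for systematic reviews in comparative effectiveness research, where preprints speed up knowledge dissemination but vary in quality. It proposes AutoConfidence, a framework that predicts whether a preprint will be published, reducing manual curation and widening the range of predictors. The solution rests on three advances: (1) automated data extraction using natural language processing; (2) semantic embeddings of titles and abstracts; and (3) evaluation scores produced by a large language model. Two prediction models are used: a random forest classifier for the binary outcome, and a survival cure model that predicts both the binary outcome and publication risk over time. Together these improve predictive performance, showing the value of automated extraction combined with multiple feature types.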

Link: https://arxiv.org/abs/2503.13857
Authors: Rui Yang, Jiayi Tong, Haoyuan Wang, Hui Huang, Ziyang Hu, Peiyu Li, Nan Liu, Christopher J. Lindsell, Michael J. Pencina, Yong Chen, Chuan Hong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 6 figures

Abstract:Background. Systematic reviews in comparative effectiveness research require timely evidence synthesis. Preprints accelerate knowledge dissemination but vary in quality, posing challenges for systematic reviews. Methods. We propose AutoConfidence (automated confidence assessment), an advanced framework for predicting preprint publication, which reduces reliance on manual curation and expands the range of predictors, including three key advancements: (1) automated data extraction using natural language processing techniques, (2) semantic embeddings of titles and abstracts, and (3) large language model (LLM)-driven evaluation scores. Additionally, we employed two prediction models: a random forest classifier for the binary outcome and a survival cure model that predicts both the binary outcome and publication risk over time. Results. The random forest classifier achieved AUROC 0.692 with LLM-driven scores, improving to 0.733 with semantic embeddings and 0.747 with article usage metrics. The survival cure model reached AUROC 0.716 with LLM-driven scores, improving to 0.731 with semantic embeddings. For publication risk prediction, it achieved a concordance index of 0.658, increasing to 0.667 with semantic embeddings. Conclusion. Our study advances the framework for preprint publication prediction through automated data extraction and multiple feature integration. By combining semantic embeddings with LLM-driven evaluations, AutoConfidence enhances predictive performance while reducing manual annotation burden. The framework has the potential to facilitate systematic incorporation of preprint articles in evidence-based medicine, supporting researchers in more effective evaluation and utilization of preprint resources.
Submitted by Rui Yang (v1): Tue, 18 Mar 2025 03:14:23 UTC.
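
The two metrics reported in the results, AUROC for the binary outcome and the concordance index for publication risk over time, can be computed as in the sketch below. It uses the standard pairwise definitions (Harrell's C for the concordance index); the toy labels, times, and scores are invented and unrelated to the paper's data.

```python
def auroc(labels, scores) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def concordance_index(times, events, risks) -> float:
    """Harrell's C: among comparable pairs, how often the earlier event has higher risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            # A pair is comparable when i has an observed event strictly before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                concordant += (risks[i] > risks[j]) + 0.5 * (risks[i] == risks[j])
    return concordant / comparable


# Hypothetical preprints: 1 = eventually published, 0 = not published.
published = [1, 0, 1, 0]
predicted = [0.9, 0.4, 0.35, 0.2]
auc = auroc(published, predicted)  # 3 of 4 positive/negative pairs ranked correctly

# Hypothetical months to publication; event 0 means still unpublished (censored).
months = [5, 10, 12, 20]
observed = [1, 1, 0, 1]
risk = [0.8, 0.3, 0.5, 0.1]
c_index = concordance_index(months, observed, risk)  # 4 of 5 comparable pairs
```

Both metrics equal 0.5 for a random ranking and 1.0 for a perfect one, which gives the reported values of roughly 0.66 to 0.75 their scale.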

[NLP-38] Spotting Persuasion: A Low-cost Model for Persuasion Detection in Political Ads on Social Media

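【Quick read】: The paper addresses the detection of persuasive text in political advertising, and analyzes political ad strategies on social media platforms to make social media political campaigns more transparent. Its main contribution is a lightweight persuasive-text detection model that reaches state-of-the-art performance on Subtask 3 of SemEval 2023 Task 3 while substantially reducing computational requirements. The practical benefit is that persuasion strategies in political ads can be detected and analyzed with limited resources.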

Link: https://arxiv.org/abs/2503.13844
Authors: Elyas Meguellati, Stefano Civelli, Pietro Bernardelle, Shazia Sadiq, Gianluca Demartini
Affiliations: University of Queensland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:In the realm of political advertising, persuasion operates as a pivotal element within the broader framework of propaganda, exerting profound influences on public opinion and electoral outcomes. In this paper, we (1) introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3, while significantly reducing the computational resource requirements; and (2) leverage the proposed model to gain insights into political campaigning strategies on social media platforms by applying it to a real-world dataset we curated, consisting of Facebook political ads from the 2022 Australian Federal election campaign. Our study shows how subtleties can be found in persuasive political advertisements and presents a pragmatic approach to detect and analyze such strategies with limited resources, enhancing transparency in social media political campaigns.

[NLP-39] Self-Vocabularizing Training for Neural Machine Translation NAACL

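【Quick read】: The paper addresses a limitation of conventional vocabulary learning for neural machine translation: these techniques rely on statistical and entropy-based assumptions and largely ignore the dynamics of model training. The authors observe that trained translation models tend to use a Byte-Pair Encoding (BPE) subset that differs from the original vocabulary, and that retraining with this induced vocabulary improves performance. They analyze how vocabulary and entropy shift during self-training, where each iteration generates a labeled dataset from the model's predictions and uses it to define a new vocabulary. On this basis they propose self-vocabularizing training, an iterative method that selects a smaller, better vocabulary and yields up to a 1.49 BLEU improvement. They also find that deeper architectures increase unique token usage and reduce vocabulary size by 6-8%.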

Link: https://arxiv.org/abs/2503.13837
Authors: Pin-Jie Lin, Ernie Chang
Affiliations: Virginia Tech; Meta
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to NAACL SRW 2025

Abstract:Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, leading to performance improvements when retrained with the induced vocabulary. In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training–where each iteration generates a labeled dataset by pairing source sentences with the model’s predictions to define a new vocabulary. Building on these insights, we propose self-vocabularizing training, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6-8% reduction in vocabulary size.
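
The iteration described above, in which the model's own predictions define the next vocabulary, can be sketched with a toy loop. This is an illustration only: `toy_translate` is a stub standing in for a trained translation model, retraining between iterations is omitted, and greedy longest-match segmentation is a simplification of real BPE.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation; single characters are always allowed."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens


def induced_vocab(predictions, vocab):
    """Subset of tokens that actually appear when the predictions are segmented."""
    return {tok for sent in predictions for w in sent.split() for tok in segment(w, vocab)}


def self_vocabularize(sources, translate, vocab, iterations=3):
    """Repeatedly replace the vocabulary with the one induced from model predictions."""
    for _ in range(iterations):
        predictions = [translate(src, vocab) for src in sources]
        new_vocab = induced_vocab(predictions, vocab)
        if new_vocab == vocab:
            break
        vocab = new_vocab  # a real run would retrain the model with this vocabulary
    return vocab


def toy_translate(src, vocab):
    # Hypothetical stand-in for a trained NMT model (ignores the vocabulary).
    return {"niedriger": "lower", "neueste": "newest"}[src]


initial = {"low", "er", "lower", "new", "est", "wid", "th"}
final = self_vocabularize(["niedriger", "neueste"], toy_translate, initial)
```

In this toy run the vocabulary shrinks from seven entries to the three that the predictions actually use, which mirrors the abstract's observation that the induced vocabulary is a smaller subset of the original.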

[NLP-40] Mitigating KV Cache Competition to Enhance User Experience in LLM Inference

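【Quick read】: The paper addresses the KV-cache (KVC) bottleneck in large language model (LLM) serving, which causes high tail Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT) and degrades user experience, especially in time-sensitive applications. It notes that meeting service-level objectives (SLOs) for both TTFT and TBT at once is difficult.

The solution is a system called CacheOPT, which mitigates KV-cache competition through five components. First, it estimates each request's output length and adjusts the estimate according to the request arrival rate. Second, it allocates the estimated KVC demand to the request and reuses KVC already allocated to other requests to reduce waiting time. Third, it allocates KVC proactively, before a request exhausts its allocation, and reserves KVC globally to avoid preemptions. Fourth, when a request must be preempted, it prefers one with a long TBT SLO, a long remaining job time, and a short preemption time. Fifth, for each preemption it picks whichever of swapping and recomputation has the lower latency. Experiments show that CacheOPT lowers tail TBT and tail TTFT, raises SLO attainment, and supports a higher request arrival rate than existing methods.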

Link: https://arxiv.org/abs/2503.13773
Authors: Haiying Shen, Tanmoy Sen
Affiliations: University of Virginia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In Large Language Model (LLM) serving, the KV-cache (KVC) bottleneck causes high tail Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT), impairing user experience, particularly in time-sensitive applications. However, satisfying both TTFT and TBT service-level objectives (SLOs) is challenging. To address this, we propose a system, named CacheOPT for mitigating KV Cache competition, based on key insights from our measurements, incorporating novel components. First, it estimates a request’s output length, bounding the deviation with a high specified probability, adjusted based on the request arrival rate. Second, it allocates the estimated KVC demand to a request, and reuses other requests’ allocated KVC to avoid preemptions while reducing waiting time. Third, it proactively allocates KVC before instead of at the time a request exhausts its allocation and reserves KVC globally to prevent preemptions. Fourth, it chooses a request that has long TBT SLO, long job remaining time and short preemption time to preempt. Fifth, it selects the shortest-latency strategy between swapping and recomputation for preemptions. Experiments show that CacheOPT achieves up to 3.29× and 2.83× lower tail TBT and tail TTFT, 47% and 53% higher TTFT and TBT SLO attainments, and supports up to 1.58× higher request arrival rate than the state-of-the-art methods.
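
The fourth and fifth components (which request to preempt, and how) can be sketched as a small policy. The abstract names the three factors but not how they are combined, so the product-over-cost score below is a hypothetical choice, as are the `Request` fields and all numbers.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tbt_slo_ms: float    # time-between-tokens SLO
    remaining_ms: float  # estimated remaining job time
    swap_ms: float       # estimated latency if its KV cache is swapped out and back
    recompute_ms: float  # estimated latency if its KV cache is recomputed

    def preemption(self):
        """Latency and name of the cheaper preemption strategy."""
        return min((self.swap_ms, "swap"), (self.recompute_ms, "recompute"))


def victim_score(r: Request) -> float:
    # Hypothetical scoring: favors long TBT SLO, long remaining time, cheap preemption.
    cost, _ = r.preemption()
    return r.tbt_slo_ms * r.remaining_ms / cost


def choose_victim(running):
    """Pick the running request whose preemption should hurt SLOs the least."""
    return max(running, key=victim_score)


running = [
    Request("a", tbt_slo_ms=50, remaining_ms=200, swap_ms=20, recompute_ms=40),
    Request("b", tbt_slo_ms=200, remaining_ms=4000, swap_ms=30, recompute_ms=10),
    Request("c", tbt_slo_ms=100, remaining_ms=100, swap_ms=5, recompute_ms=8),
]
victim = choose_victim(running)
latency, strategy = victim.preemption()
```

Here request `b` is chosen because it has the loosest SLO and the most work left, and it is recomputed rather than swapped because recomputation is the faster of its two options.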

[NLP-41] AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications

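【Quick read】: The paper addresses a mixed-prompt scenario in large language model (LLM) inference serving, where the system must handle short and long prompts together while meeting heterogeneous iteration-time Service Level Objectives (SLOs) across applications. Earlier work uses chunking to raise throughput on long prompts but does not handle heterogeneous SLOs. The paper proposes AccelGen, a high-throughput LLM inference serving system that guarantees heterogeneous SLOs for diverse applications.

AccelGen is described as having four core components, of which three are listed: (1) SLO-guaranteed dynamic chunking, which adjusts chunk sizes dynamically to maximize GPU compute utilization while meeting iteration-level SLOs; (2) iteration-level SLO-based task prioritization, which prioritizes tight-SLO requests and batches requests with similar SLOs; and (3) multi-resource-aware batching, which selects queued requests to maximize utilization of both GPU compute and the key-value cache (KVC). Experiments show that, compared with the best existing methods, AccelGen achieves 1.42-11.21x higher throughput, 1.43-13.71x higher goodput, 37-90% higher SLO attainment, and 1.61-12.22x lower response latency, with performance close to an optimal Oracle system.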

Link: https://arxiv.org/abs/2503.13737
Authors: Haiying Shen, Tanmoy Sen
Affiliations: University of Virginia
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we consider a mixed-prompt scenario for a large language model (LLM) inference serving system that supports diverse applications with both short prompts and long prompts and heterogeneous SLOs for iteration time. To improve throughput when handling long prompts, previous research introduces a chunking method, but has not addressed heterogeneous SLOs. To address the limitation, we propose AccelGen, a high-throughput LLM inference serving system with heterogeneous SLO guarantees for diverse applications. AccelGen introduces four core components: (1) SLO-guaranteed dynamic chunking, which dynamically adjusts chunk sizes to maximize GPU compute utilization while meeting iteration-level SLOs; (2) Iteration-level SLO-based task prioritization, which prioritizes tight-SLO requests and batches requests with similar SLOs; (3) Multi-resource-aware batching, which selects queued requests to maximize the utilizations of both GPU compute resource and key-value cache (KVC). Trace-driven real experiments demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency compared to the state-of-the-art approaches. It achieves performance near the Oracle, which optimally maximizes goodput.
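
The first component, sizing the prefill chunk so that an iteration still meets the tightest SLO in the batch, can be sketched as below. The linear cost model (a fixed overhead plus a per-token cost) and every constant are assumptions made for illustration; the paper's actual latency model is not given in the abstract. Times are integer microseconds to keep the arithmetic exact.

```python
def max_chunk_size(slo_us, decode_tokens, base_us=5_000, per_token_us=50, max_chunk=4096):
    """Largest prefill chunk whose predicted iteration time stays within slo_us.

    Assumed cost model: iteration_time = base_us + per_token_us * (decode_tokens + chunk).
    """
    budget_tokens = (slo_us - base_us) // per_token_us - decode_tokens
    return max(0, min(max_chunk, budget_tokens))


def plan_iteration(decode_slos_us, prompt_remaining):
    """Chunk size for this iteration, bounded by the tightest SLO among batched decodes."""
    tightest = min(decode_slos_us)
    chunk = max_chunk_size(tightest, decode_tokens=len(decode_slos_us))
    return min(chunk, prompt_remaining)


# Three decoding requests with 50 ms, 200 ms, and 100 ms iteration SLOs share a batch
# with one long prompt that still has 5000 tokens to prefill.
chunk = plan_iteration([50_000, 200_000, 100_000], prompt_remaining=5000)
```

With a fixed chunk size the system would either violate the 50 ms SLO or waste GPU time when only loose-SLO requests are present; recomputing the chunk each iteration is what lets it use the slack.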

[NLP-42] CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual Multi-Generator and Multi-Domain Settings

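【Quick read】: The paper addresses the detection of code generated by large language models (LLMs), which matters for accountability and integrity in programming skills, ethics, and assessment. Existing work lacks domain coverage and robustness and covers only a few programming languages. The paper proposes a framework that distinguishes human-written from LLM-generated code across multiple programming languages, code generators, and domains. It relies on a large-scale dataset, rigorous data quality checks, feature engineering, and a comparative analysis of traditional machine learning models, pre-trained language models (PLMs), and LLMs. The framework is also evaluated in out-of-domain settings, including authorship identification of generated code, hybrid-authorship detection, and generalization to unseen models, domains, and programming languages. Experiments show that it separates human from LLM-generated code effectively and sets a new benchmark for the task.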

Link: https://arxiv.org/abs/2503.13733
Authors: Daniil Orel, Dilshod Azizov, Preslav Nakov
Affiliations: Mohamed bin Zayed University of Artificial Intelligence, UAE
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, these advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. While there has been some research on this problem, it generally lacks domain coverage and robustness, and only covers a small number of programming languages. To this end, we propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, alongside rigorous data quality checks, feature engineering, and a comparative evaluation of traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We perform an evaluation on out-of-domain scenarios, such as detecting the authorship and hybrid authorship of generated code and generalizing to unseen models, domains, and programming languages. Moreover, our extensive experiments show that our framework effectively distinguishes human- from LLM-written code and sets a new benchmark for this task.

[NLP-43] TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark

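【Quick read】: The paper addresses the difficulty that existing diffusion-based text-to-image models have with embedding text in images, including spelling accuracy, contextual relevance, and visual coherence. Evaluating this ability is hard because comprehensive benchmarks are lacking. The paper introduces TextInVision, a large-scale benchmark driven by text and prompt complexity that evaluates how well diffusion models integrate visual text into images. It combines a diverse set of prompts and texts with a dedicated image dataset for testing Variational Autoencoder (VAE) models across different character representations, showing that the VAE architecture can itself be a source of difficulty for text generation within diffusion frameworks. By analyzing the common errors of several models, the paper identifies core issues such as spelling inaccuracies and contextual mismatches, laying the groundwork for progress in AI-generated multimodal content.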

Link: https://arxiv.org/abs/2503.13730
Authors: Forouzan Fallah, Maitreya Patel, Agneet Chatterjee, Vlad I. Morariu, Chitta Baral, Yezhou Yang
Affiliations: Arizona State University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Generating images with embedded text is crucial for the automatic production of visual and multimodal documents, such as educational materials and advertisements. However, existing diffusion-based text-to-image models often struggle to accurately embed text within images, facing challenges in spelling accuracy, contextual relevance, and visual coherence. Evaluating the ability of such models to embed text within a generated image is complicated due to the lack of comprehensive benchmarks. In this work, we introduce TextInVision, a large-scale, text and prompt complexity driven benchmark designed to evaluate the ability of diffusion models to effectively integrate visual text into images. We crafted a diverse set of prompts and texts that consider various attributes and text characteristics. Additionally, we prepared an image dataset to test Variational Autoencoder (VAE) models across different character representations, highlighting that VAE architectures can also pose challenges in text generation within diffusion frameworks. Through extensive analysis of multiple models, we identify common errors and highlight issues such as spelling inaccuracies and contextual mismatches. By pinpointing the failure points across different prompts and texts, our research lays the foundation for future advancements in AI-generated multimodal content.

[NLP-44] Atyaephyra at SemEval-2025 Task 4: Low-Rank NPO SEMEVAL ACL

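【Quick read】: The paper addresses the unlearning of sensitive content from large language models (LLMs). Its approach combines negative preference optimization with low-rank adaptation, and uses that combination to compute additional regularization terms cheaply, which stabilizes the unlearning process. The results significantly exceed the shared task baselines.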

Link: https://arxiv.org/abs/2503.13690
Authors: Jan Bronec (1), Jindřich Helcl (1) ((1) Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics)
Affiliations: Charles University, Faculty of Mathematics and Physics; Institute of Formal and Applied Linguistics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 1 figure, 1 table, submitted to SemEval proceedings for ACL Anthology

Abstract:We present a submission to the SemEval 2025 shared task on unlearning sensitive content from LLMs. Our approach employs negative preference optimization using low-rank adaptation. We show that we can utilize this combination to cheaply compute additional regularization terms, which help with unlearning stabilization. The results of our approach significantly exceed the shared task baselines.
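
For readers unfamiliar with negative preference optimization (NPO), the sketch below computes its standard loss from sequence log-probabilities on the forget set: (2/β) · log(1 + (π_θ/π_ref)^β), averaged over examples. It is not the authors' code, the extra regularization terms they mention are not reproduced, and β = 0.1 is an arbitrary choice. One plausible reading of the "cheap" computation, stated here as an assumption, is that with a low-rank adapter the reference log-probabilities come from the same base weights with the adapter switched off.

```python
import math


def npo_loss(logp_policy, logp_ref, beta=0.1):
    """Mean NPO loss given summed token log-probs of each forget-set sequence.

    Per example: (2 / beta) * log(1 + exp(beta * (logp_policy - logp_ref))).
    """
    total = 0.0
    for lp, lr in zip(logp_policy, logp_ref):
        z = beta * (lp - lr)
        # Numerically stable softplus: log(1 + exp(z)).
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += (2.0 / beta) * softplus
    return total / len(logp_policy)


# Before unlearning the policy equals the reference, so the ratio is 1.
loss_start = npo_loss([-5.0, -7.0], [-5.0, -7.0])
# After the policy assigns much lower probability to the forget data, the loss falls.
loss_later = npo_loss([-50.0, -70.0], [-5.0, -7.0])
```

The loss is bounded below by zero and flattens as the policy moves away from the reference, which is the property usually credited with making NPO less prone to collapse than plain gradient ascent.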

[NLP-45] Feature Extraction and Analysis for GPT-Generated Text

Link: https://arxiv.org/abs/2503.13687
Authors: A. Selvioğlu, V. Adanova, M. Atagoziev
Affiliations: TED University; OSTIM Technical University
Categories: Computation and Language (cs.CL)
Comments:

[NLP-46] Pensez: Less Data Better Reasoning – Rethinking French LLM

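【Quick read】: The paper explores an alternative to training on massive datasets: targeted supervised fine-tuning (SFT) on a small (2,000 samples), high-quality bilingual (English-French) dataset, with the aim of improving both the mathematical reasoning and the French proficiency of large language models (LLMs). The central question is whether careful data curation and an optimized fine-tuning strategy can match or exceed what scale achieves. The results show clear gains in mathematical reasoning, with accuracy up to 20% higher than the base model on AIME25 and 12% higher on a French MATH level 5 benchmark. This challenges the conventional view that massive data is a prerequisite, and underlines the importance of strategic data selection and optimized fine-tuning.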

Link: https://arxiv.org/abs/2503.13661
Authors: Huy Hoang Ha
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks. However, achieving strong performance in specialized domains like mathematical reasoning and non-English languages often requires extensive training on massive datasets. This paper investigates a contrasting approach: strategic fine-tuning on a small, high-quality, bilingual (English-French) dataset to enhance both the reasoning capabilities and French language proficiency of a large language model. Rather than relying on scale, we explore the hypothesis that targeted data curation and optimized training can achieve competitive, or even superior, performance. We demonstrate, through targeted supervised fine-tuning (SFT) on only 2,000 carefully selected samples, significant improvements in mathematical reasoning. Specifically, Pensez 7B exhibits an increase in accuracy of the base model up to 20% on the AIME25 and a 12% increase on a French MATH level 5 benchmark. These results challenge the prevailing assumption that massive datasets are a prerequisite for strong reasoning performance in LLMs, highlighting the potential of strategic data curation and optimized fine-tuning for enhancing both specialized skills and multilingual capabilities. Our findings have implications for the efficient development of high-performing, multilingual LLMs, especially in resource-constrained scenarios.

[NLP-47] Does the Appearance of Autonomous Conversational Robots Affect User Spoken Behaviors in Real-World Conference Interactions?

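【Quick read】: The paper investigates how a robot's appearance affects users' spoken behavior in real-world interaction. It compares ERICA, a highly human-like android, with TELECO, a less anthropomorphic humanoid, and analyzes linguistic features of the users' conversations, such as disfluencies and syntactic complexity. The key finding is that users' language patterns differ depending on the robot's appearance. Classification models (for example Naïve Bayes, with an F1-score of 71.60%) and feature importance analysis show that disfluencies and syntactic complexity play an important role in how users communicate with robots of different appearances. Drawing on cognitive load theory and Communication Accommodation Theory, the paper argues that designing robots to elicit more structured and fluent user speech can improve communicative alignment between robots and humans.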

Link: https://arxiv.org/abs/2503.13625
Authors: Zi Haur Pang, Yahui Fu, Divesh Lala, Mikey Elmers, Koji Inoue, Tatsuya Kawahara
Affiliations: Kyoto University
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: This paper has been accepted as Late-Breaking Work at CHI Conference on Human Factors in Computing Systems (CHI EA '25)

Abstract:We investigate the impact of robot appearance on users’ spoken behavior during real-world interactions by comparing a human-like android, ERICA, with a less anthropomorphic humanoid, TELECO. Analyzing data from 42 participants at SIGDIAL 2024, we extracted linguistic features such as disfluencies and syntactic complexity from conversation transcripts. The results showed moderate effect sizes, suggesting that participants produced fewer disfluencies and employed more complex syntax when interacting with ERICA. Further analysis involving training classification models like Naïve Bayes, which achieved an F1-score of 71.60%, and conducting feature importance analysis, highlighted the significant role of disfluencies and syntactic complexity in interactions with robots of varying human-like appearances. Discussing these findings within the frameworks of cognitive load and Communication Accommodation Theory, we conclude that designing robots to elicit more structured and fluent user speech can enhance their communicative alignment with humans.

[NLP-48] Analytic Subspace Routing: How Recursive Least Squares Works in Continual Learning of Large Language Model

Link: https://arxiv.org/abs/2503.13575
Authors: Kai Tong, Kang Pan, Xiao Zhang, Erli Meng, Run He, Yawen Cui, Nuoyan Guo, Huiping Zhuang
Affiliations: South China University of Technology; Xiaomi Corporation; The Hong Kong Polytechnic University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 4 figures

[NLP-49] ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts

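【Quick read】: The paper addresses the acceleration of large language model (LLM) inference without any loss of accuracy relative to the 16-bit model. Conventional speculative decoding (SD) requires a draft model that has been pre-trained and aligned with the target model, which rules out plug-and-play use. The key idea is to use an MXFP4 weight-only-quantized model as the draft, obtained simply by direct-casting the target model's BF16 weights to MXFP4. This gives a plug-and-play SD scheme with no pre-training and speedups of up to 2x over the BF16 baseline. The paper then introduces Multi-Level Speculative Decoding (ML-SpecQD), in which an even smaller quantized draft accelerates the MXFP4 draft's own token generation. The combination outperforms state-of-the-art speculative decoding, with speedups of up to 2.72x over the BF16 baseline.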

Link: https://arxiv.org/abs/2503.13565
Authors: Evangelos Georganas, Dhiraj Kalamkar, Alexander Kozlov, Alexander Heinecke
Affiliations: Intel Corporation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Speculative decoding (SD) has emerged as a method to accelerate LLM inference without sacrificing any accuracy over the 16-bit model inference. In a typical SD setup, the idea is to use a full-precision, small, fast model as “draft” to generate the next few tokens and use the “target” large model to verify the draft-generated tokens. The efficacy of this method heavily relies on the acceptance ratio of the draft-generated tokens and the relative token throughput of the draft versus the target model. Nevertheless, an efficient SD pipeline requires pre-training and aligning the draft model to the target model, making it impractical for LLM inference in a plug-and-play fashion. In this work, we propose using MXFP4 models as drafts in a plug-and-play fashion since the MXFP4 Weight-Only-Quantization (WOQ) merely direct-casts the BF16 target model weights to MXFP4. In practice, our plug-and-play solution gives speedups up to 2x over the BF16 baseline. Then we pursue an opportunity for further acceleration: the MXFP4 draft token generation itself can be accelerated via speculative decoding by using yet another smaller draft. We call our method ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts since it recursively applies speculation for accelerating the draft-token generation. Combining Multi-Level Speculative Decoding with MXFP4 Quantized Drafts we outperform state-of-the-art speculative decoding, yielding speedups up to 2.72x over the BF16 baseline.
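
The recursive structure described above can be sketched with toy deterministic "models". The sketch uses greedy decoding, so a draft token is accepted only if it equals the target's own next token; sampling-based acceptance and the bonus token a real implementation gains from each verification pass are left out. The three toy models are invented functions standing in for the BF16 target, the MXFP4 draft, and a smaller second-level draft.

```python
def autoregressive(model):
    """Wrap a next-token function as a proposer: (prefix, n) -> n greedy tokens."""
    def propose(prefix, n):
        out = []
        for _ in range(n):
            out.append(model(list(prefix) + out))
        return out
    return propose


def speculative(target, proposer, k=4):
    """Proposer that returns exactly target's greedy tokens, drafting with `proposer`."""
    def propose(prefix, n):
        seq = list(prefix)
        goal = len(seq) + n
        while len(seq) < goal:
            draft = proposer(seq, min(k, goal - len(seq)))
            for i, tok in enumerate(draft):
                expected = target(seq + draft[:i])  # one batched pass in a real system
                if tok != expected:
                    draft = draft[:i] + [expected]  # accepted prefix plus correction
                    break
            seq.extend(draft)
        return seq[len(prefix):]
    return propose


def target_model(prefix):  # stands in for the BF16 target
    return (sum(prefix) * 7 + 3) % 10


def mid_model(prefix):  # stands in for the quantized draft; disagrees now and then
    t = target_model(prefix)
    return t if len(prefix) % 5 else (t + 1) % 10


def small_model(prefix):  # stands in for the smaller second-level draft
    m = mid_model(prefix)
    return m if len(prefix) % 3 else (m + 1) % 10


prompt = [1, 2, 3]
baseline = autoregressive(target_model)(prompt, 20)
one_level = speculative(target_model, autoregressive(mid_model))(prompt, 20)
two_level = speculative(
    target_model, speculative(mid_model, autoregressive(small_model))
)(prompt, 20)
```

All three decoders produce the same 20 tokens. Output equivalence is what makes speculation lossless; the speedup in practice depends on how often drafts are accepted and on how much cheaper each draft level is than the level above it.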

[NLP-50] MES-RAG: Bringing Multi-modal Entity-Storage and Secure Enhancements to RAG NAACL2025

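[Quick Read]: This paper addresses the weakness of Retrieval-Augmented Generation (RAG) in retrieving precise entity information. The key to the solution is the MES-RAG framework, which strengthens entity-specific query handling to give accurate, secure, and consistent responses. MES-RAG introduces proactive security measures that apply protections before data access to ensure system integrity, supports real-time multi-modal outputs (text, images, audio, and video), and integrates seamlessly into existing RAG architectures. Experiments show that MES-RAG improves both accuracy and recall, raising accuracy on the targeted task to 0.83 (+0.25).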

Link: https://arxiv.org/abs/2503.13563
Authors: Pingyu Wu,Daiheng Gao,Jing Tang,Huimin Chen,Wenbo Zhou,Weiming Zhang,Nenghai Yu
Affiliations: USTC (University of Science and Technology of China); Hefei ZhikeShuzi; Eliza Labs; HUST (Huazhong University of Science and Technology); Independent Researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NAACL 2025

Abstract:Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by using external knowledge, but it struggles with precise entity information retrieval. In this paper, we proposed MES-RAG framework, which enhances entity-specific query handling and provides accurate, secure, and consistent responses. MES-RAG introduces proactive security measures that ensure system integrity by applying protections prior to data access. Additionally, the system supports real-time multi-modal outputs, including text, images, audio, and video, seamlessly integrating into existing RAG architectures. Experimental results demonstrate that MES-RAG significantly improves both accuracy and recall, highlighting its effectiveness in advancing the security and utility of question-answering, increasing accuracy to 0.83 (+0.25) on targeted task. Our code and data are available at this https URL.

[NLP-51] Pareidolic Illusions of Meaning: ChatGPT Pseudolaw and the Triumph of Form over Substance

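[Quick Read]: This paper examines two social phenomena, pseudolaw and generative AI/LLMs, and the problems they cause. Both elevate form and appearance over substance and content, and users of both routinely mistake the form for the substance. Drawing on legal theory, computer science, linguistics, and cognitive psychology, the paper argues that both rely on creating illusions of meaning that users mistake for the underlying primary phenomenon. It then explores four implications: first, both rely on conceptual pareidolia, the erroneous perception of meaningful linguistic legal patterns in nebulous inputs; second, both rely on the confidence heuristic, the human tendency to treat confidence as a proxy for competence; third, both succeed when the primary concern is the form of the output rather than its content; fourth, both rely heavily on users' magical thinking and their desire for the promise of the approach to be real.

The key to the solution is improving users' legal and technological literacy: only users with sufficient legal and technological literacy can be shown the illusory nature of these phenomena.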

Link: https://arxiv.org/abs/2503.13556
Authors: Joe McIntyre
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 54 pages, 6 figures

Abstract:The early 2020s has seen the rise of two strange and potentially quite impactful social phenomena, namely pseudolaw, where users rely upon pseudolegal arguments that mimic the form and ritual of legal argumentation but fundamentally distort the content of law, and generative AI/LLMs, which generate content that uses probabilistic calculations to create outputs that look like human generated text. This article argues that the juxtaposition of the two phenomena helps to reveal that they both share two fundamental traits as both elevate form and appearance over substance and content, and users of both routinely mistake the form for the substance. In drawing upon legal theory, computer science, linguistics and cognitive psychology, the article argues that both phenomena rely upon creating illusions of meaning that users mistake for the underlying primary phenomenon. I then explore four implications of this conception of both phenomena. Firstly, both rely on human tendencies of conceptual pareidolia resulting in the erroneous perception of meaningful linguistic legal patterns from nebulous inputs. Secondly, both rely upon the confidence heuristic, the human cognitive bias for treating confidence as a proxy for competence. Thirdly, both succeed when the primary concern is with the form of the output and not its content. Fourthly, both rely heavily upon the magical thinking of users and the desire for the promise of the approach to be real. The article argues that the legal context helps to reveal a solution for the problems caused by both phenomena as it is only where users possess sufficient legal and technological literacy that it becomes possible to reveal to them the illusionary nature of the phenomena.

[NLP-52] LLM-Mediated Guidance of MARL Systems

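[Quick Read]: This paper addresses the challenge of achieving efficient learning and desirable behaviours in Multi-Agent Reinforcement Learning (MARL) systems in complex multi-agent environments. The key to the solution is combining MARL with Large Language Model (LLM)-mediated interventions, delivered by two kinds of controller, a Natural Language (NL) Controller and a Rule-Based (RB) Controller, that shape the agents' learning trajectories. The NL Controller, which uses an LLM to simulate human-like interventions, had a stronger impact than the RB Controller, and early interventions were particularly beneficial for training efficiency and performance. Both intervention types outperform the baseline without interventions, which points to the potential of LLM-mediated guidance for accelerating training and improving MARL performance.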

Link: https://arxiv.org/abs/2503.13553
Authors: Philipp D. Siedler,Ian Gemp
Affiliations: Aleph Alpha Research; Google DeepMind
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 31 pages, 50 figures

Abstract:In complex multi-agent environments, achieving efficient learning and desirable behaviours is a significant challenge for Multi-Agent Reinforcement Learning (MARL) systems. This work explores the potential of combining MARL with Large Language Model (LLM)-mediated interventions to guide agents toward more desirable behaviours. Specifically, we investigate how LLMs can be used to interpret and facilitate interventions that shape the learning trajectories of multiple agents. We experimented with two types of interventions, referred to as controllers: a Natural Language (NL) Controller and a Rule-Based (RB) Controller. The NL Controller, which uses an LLM to simulate human-like interventions, showed a stronger impact than the RB Controller. Our findings indicate that agents particularly benefit from early interventions, leading to more efficient training and higher performance. Both intervention types outperform the baseline without interventions, highlighting the potential of LLM-mediated guidance to accelerate training and enhance MARL performance in challenging environments.

[NLP-53] Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models

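[Quick Read]: This paper addresses the reward hacking that makes the Process Reward Model (PRM) unreliable at identifying the best intermediate reasoning steps. Its key contribution is the Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at fine-grained and coarse-grained levels, and so assesses reasoning coherence and self-reflection better, particularly when the previous reasoning step is incorrect. To address the inefficiency of automatically generating PRM training data with Monte Carlo Tree Search (MCTS), the paper also introduces Hierarchical Node Compression (HNC), a lightweight data augmentation strategy that merges nodes in the tree (combining two consecutive reasoning steps into one). HNC diversifies MCTS results at negligible computational cost and improves label robustness by introducing noise. Experiments on PRM800K show that HRM combined with HNC is more stable and reliable than PRM, and cross-domain evaluations on MATH500 and GSM8K show better generalization and robustness.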

Link: https://arxiv.org/abs/2503.13551
Authors: Teng Wang,Zhangyi Jiang,Zhenqi He,Wenhan Yang,Yanan Zheng,Zeyu Li,Zifan He,Shenyang Tong,Hailei Gong
Affiliations: The University of Hong Kong; Peking University; National University of Singapore; Georgia Institute of Technology; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent studies show that Large Language Models (LLMs) achieve strong reasoning capabilities through supervised fine-tuning or reinforcement learning. However, a key approach, the Process Reward Model (PRM), suffers from reward hacking, making it unreliable in identifying the best intermediate steps. In this paper, we propose a novel reward model approach, Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps from fine-grained and coarse-grained level. HRM performs better in assessing reasoning coherence and self-reflection, particularly when the previous reasoning step is incorrect. Furthermore, to address the inefficiency of autonomous generating PRM training data via Monte Carlo Tree Search (MCTS), we introduce a lightweight and effective data augmentation strategy called Hierarchical Node Compression (HNC) based on node merging (combining two consecutive reasoning steps into one step) in the tree structure. This approach diversifies MCTS results for HRM with negligible computational overhead, enhancing label robustness by introducing noise. Empirical results on the PRM800K dataset demonstrate that HRM, in conjunction with HNC, achieves superior stability and reliability in evaluation compared to PRM. Furthermore, cross-domain evaluations on MATH500 and GSM8K confirm HRM’s superior generalization and robustness across diverse reasoning tasks. The code for all experiments will be released at https://github.com/tengwang0318/hierarchial_reward_model.
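
A small sketch of the node-merging idea behind HNC, assuming a reasoning path is a plain list of step strings. The function names and the separator are hypothetical, and how the paper assigns labels to merged nodes is not stated in the abstract, so it is not modelled here.

```python
from typing import List


def merge_consecutive_steps(steps: List[str], i: int, sep: str = " ") -> List[str]:
    """Return a copy of the path with steps i and i+1 combined into one step."""
    if not 0 <= i < len(steps) - 1:
        raise IndexError("i must index a step that has a successor")
    merged = steps[i] + sep + steps[i + 1]
    return steps[:i] + [merged] + steps[i + 2:]


def augment_path(steps: List[str]) -> List[List[str]]:
    """Generate every path obtained by merging exactly one pair of consecutive steps."""
    return [merge_consecutive_steps(steps, i) for i in range(len(steps) - 1)]
```

For a path with n steps this yields n - 1 coarser variants, each one step shorter than the original, without any further model calls.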

[NLP-54] Agent-Enhanced Large Language Models for Researching Political Institutions

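[Quick Read]: This paper explores the potential of Large Language Models (LLMs) in political science and proposes a way to streamline research tasks. The key is agentic retrieval-augmented generation (Agentic RAG), which equips LLMs with action-calling capabilities for interacting with external knowledge bases. Augmented with predefined functions and specialized tools, LLMs become dynamic agents that can streamline data collection, preprocessing, and analysis. The paper also shows how modular tools extend LLM agents to document summarization, transcript coding, qualitative variable classification, and statistical modeling. To demonstrate the approach, the authors introduce CongressRA, an LLM agent designed to support scholars studying the U.S. Congress, and use it to show how LLM agents can reduce the cost of replicating, testing, and extending empirical research that relies on domain-specific data.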

Link: https://arxiv.org/abs/2503.13524
Authors: Joseph R. Loffredo,Suyeol Yun
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 46 pages, 6 figures

Abstract:The applications of Large Language Models (LLMs) in political science are rapidly expanding. This paper demonstrates how LLMs, when augmented with predefined functions and specialized tools, can serve as dynamic agents capable of streamlining tasks such as data collection, preprocessing, and analysis. Central to this approach is agentic retrieval-augmented generation (Agentic RAG), which equips LLMs with action-calling capabilities for interaction with external knowledge bases. Beyond information retrieval, LLM agents may incorporate modular tools for tasks like document summarization, transcript coding, qualitative variable classification, and statistical modeling. To demonstrate the potential of this approach, we introduce CongressRA, an LLM agent designed to support scholars studying the U.S. Congress. Through this example, we highlight how LLM agents can reduce the costs of replicating, testing, and extending empirical research using the domain-specific data that drives the study of political institutions.

[NLP-55] Evaluating the Process Modeling Abilities of Large Language Models – Preliminary Foundations and Results

Link: https://arxiv.org/abs/2503.13520
Authors: Peter Fettke,Constantin Houy
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 10 pages, 1 figure, submitted to 20th International Conference on Wirtschaftsinformatik 2025

[NLP-56] Examples as the Prompt: A Scalable Approach for Efficient LLM Adaptation in E-Commerce

Link: https://arxiv.org/abs/2503.13518
Authors: Jingying Zeng,Zhenwei Dai,Hui Liu,Samarth Varshney,Zhiji Liu,Chen Luo,Zhen Li,Qi He,Xianfeng Tang
Affiliations: Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-57] CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning ICLR2025

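[Quick Read]: This paper addresses the limits of Large Language Models (LLMs) in applying expert knowledge and synthesizing information during scientific problem-solving. To measure their potential, it introduces the CURIE benchmark: ten challenging tasks with a total of 580 problem and solution pairs, curated by experts in six disciplines (materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins). The tasks cover experimental and theoretical scientific workflows and require domain expertise, comprehension of long in-context information, and multi-step reasoning.

The key to the solution is a benchmark that measures LLM performance on scientific problem-solving, together with an evaluation of closed and open models across domains that exposes their strengths and limits. Gemini Flash 2.0 and Claude-3 show consistently high comprehension across domains, whereas the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks, and the best overall performance is only 32%. LLMs therefore have much room for improvement in scientific applications, and CURIE can help guide future model development.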

Link: https://arxiv.org/abs/2503.13517
Authors: Hao Cui,Zahra Shamsi,Gowoon Cheon,Xuejian Ma,Shutong Li,Maria Tikhanovskaya,Peter Norgaard,Nayantara Mudur,Martyna Plomecka,Paul Raccuglia,Yasaman Bahri,Victor V. Albert,Pranesh Srinivasan,Haining Pan,Philippe Faist,Brian Rohr,Michael J. Statt,Dan Morris,Drew Purves,Elise Kleeman,Ruth Alcantara,Matthew Abraham,Muqthar Mohammad,Ean Phing VanLee,Chenfei Jiang,Elizabeth Dorfman,Eun-Ah Kim,Michael P Brenner,Viren Jain,Sameera Ponda,Subhashini Venugopalan
Affiliations: Google; Harvard; University of Zurich; NIST; UMD College Park; Rutgers; FU Berlin; Modelyst; Cornell
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2025 main conference

Abstract:Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding, Reasoning and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving and assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines - materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins - covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on tasks in CURIE which requires domain expertise, comprehension of long in-context information, and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistent high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32% there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in sciences. Evaluation code and data are in this https URL

[NLP-58] RAG-KG-IL: A Multi-Agent Hybrid Framework for Reducing Hallucinations and Enhancing LLM Reasoning through RAG and Incremental Knowledge Graph Learning Integration

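[Quick Read]: This paper addresses the significant challenges that Large Language Models (LLMs) face in reasoning with structured data, handling dynamic knowledge evolution, and mitigating hallucinations, particularly in mission-critical domains. The proposed RAG-KG-IL framework combines Retrieval-Augmented Generation (RAG) and Knowledge Graphs (KGs) with an Incremental Learning (IL) approach. Its key is a multi-agent architecture that enables continuous knowledge updates, integrates structured knowledge, and incorporates autonomous agents for better explainability and reasoning. RAG grounds generated responses in verifiable information, KGs supply structured domain knowledge for consistency and depth of understanding, and incremental learning updates the knowledge base dynamically without full retraining, which reduces computational overhead and improves adaptability.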

Link: https://arxiv.org/abs/2503.13514
Authors: Hong Qing Yu (University of Derby),Frank McQuade (Bloc Digital)
Affiliations: University of Derby; Bloc Digital
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents RAG-KG-IL, a novel multi-agent hybrid framework designed to enhance the reasoning capabilities of Large Language Models (LLMs) by integrating Retrieval-Augmented Generation (RAG) and Knowledge Graphs (KGs) with an Incremental Learning (IL) approach. Despite recent advancements, LLMs still face significant challenges in reasoning with structured data, handling dynamic knowledge evolution, and mitigating hallucinations, particularly in mission-critical domains. Our proposed RAG-KG-IL framework addresses these limitations by employing a multi-agent architecture that enables continuous knowledge updates, integrates structured knowledge, and incorporates autonomous agents for enhanced explainability and reasoning. The framework utilizes RAG to ensure the generated responses are grounded in verifiable information, while KGs provide structured domain knowledge for improved consistency and depth of understanding. The Incremental Learning approach allows for dynamic updates to the knowledge base without full retraining, significantly reducing computational overhead and improving the model’s adaptability. We evaluate the framework using real-world case studies involving health-related queries, comparing it to state-of-the-art models like GPT-4o and a RAG-only baseline. Experimental results demonstrate that our approach significantly reduces hallucination rates and improves answer completeness and reasoning accuracy. The results underscore the potential of combining RAG, KGs, and multi-agent systems to create intelligent, adaptable systems capable of real-time knowledge integration and reasoning in complex domains.

[NLP-59] Prompt Sentiment: The Catalyst for LLM Change

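[Quick Read]: Although Large Language Models (LLMs) have transformed natural language processing (NLP), the influence of prompt sentiment, a latent affective characteristic of input text, remains underexplored. This paper examines how sentiment variations in prompts affect the coherence, factuality, and bias of LLM-generated outputs.

The key to the solution is a systematic analysis of how prompts with different sentiment affect the outputs of five leading LLMs (Claude, DeepSeek, GPT-4, Gemini, and LLaMA). The authors combine lexicon-based and transformer-based sentiment analysis to categorize prompts, then evaluate output quality across six AI-driven applications: content generation, conversational AI, legal and financial analysis, healthcare AI, creative writing, and technical documentation. By transforming prompt sentiment, the study finds that negative prompts often reduce factual accuracy and amplify bias, while positive prompts tend to increase verbosity and sentiment propagation. The paper therefore stresses sentiment-aware prompt engineering for fair and reliable AI-generated content.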

Link: https://arxiv.org/abs/2503.13510
Authors: Vishal Gandhi,Sagar Gandhi
Affiliations: Joyspace AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rise of large language models (LLMs) has revolutionized natural language processing (NLP), yet the influence of prompt sentiment, a latent affective characteristic of input text, remains underexplored. This study systematically examines how sentiment variations in prompts affect LLM-generated outputs in terms of coherence, factuality, and bias. Leveraging both lexicon-based and transformer-based sentiment analysis methods, we categorize prompts and evaluate responses from five leading LLMs: Claude, DeepSeek, GPT-4, Gemini, and LLaMA. Our analysis spans six AI-driven applications, including content generation, conversational AI, legal and financial analysis, healthcare AI, creative writing, and technical documentation. By transforming prompts, we assess their impact on output quality. Our findings reveal that prompt sentiment significantly influences model responses, with negative prompts often reducing factual accuracy and amplifying bias, while positive prompts tend to increase verbosity and sentiment propagation. These results highlight the importance of sentiment-aware prompt engineering for ensuring fair and reliable AI-generated content.

[NLP-60] MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance

Link: https://arxiv.org/abs/2503.13509
Authors: Jia Xu,Tianyi Wei,Bojian Hou,Patryk Orzechowski,Shu Yang,Ruochen Jin,Rachael Paulbeck,Joost Wagenaar,George Demiris,Li Shen
Affiliations: University of Pennsylvania
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

[NLP-61] It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

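[Quick Read]: This paper asks whether the performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks reflects real medical capability, or is driven in part by factors beyond medical content knowledge and reasoning. To test this, the authors created FreeMedQA, a benchmark of free-response questions with paired MCQs. Evaluating three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-instruct), they found an average absolute deterioration of 39.43% on free-response questions relative to multiple-choice, greater than the human decline of 22.29%. To isolate the effect of the MCQ format, they ran a masking study that iteratively masks parts of the question stem. At 100% masking, average LLM multiple-choice performance was still 6.70% above random chance (p = 0.002), while free-response performance was near zero. The results indicate that current medical MCQ benchmarks overestimate LLM capabilities in medicine, and they point to LLM-evaluated free-response questions as a way to improve both human and machine assessment. The key to the solution is the FreeMedQA evaluation framework combined with the masking study, which separates the effect of the MCQ format from medical knowledge.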

Link: https://arxiv.org/abs/2503.13508
Authors: Shrutika Singh,Anton Alyakin,Daniel Alexander Alber,Jaden Stryker,Ai Phuong S Tong,Karl Sangwon,Nicolas Goff,Mathew de la Paz,Miguel Hernandez-Rovira,Ki Yun Park,Eric Claude Leuthardt,Eric Karl Oermann
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 14 pages, 5 figures

Abstract:The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice (p = 1.3 * 10^-5) which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format on performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002) with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs the free-response performance was near zero. Our results highlight the shortcomings in medical MCQ benchmarks for overestimating the capabilities of LLMs in medicine, and, broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.
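
A rough sketch of what one step of a stem-masking study could look like. The abstract does not say which parts of the stem are masked or how, so masking trailing words with a placeholder token is an assumption made here for illustration, and `mask_stem` is a hypothetical helper rather than the authors' code.

```python
def mask_stem(stem: str, fraction: float, mask_token: str = "[MASK]") -> str:
    """Replace the trailing `fraction` of the stem's words with a placeholder token."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    words = stem.split()
    n_masked = round(fraction * len(words))
    kept = words[: len(words) - n_masked]
    return " ".join(kept + [mask_token] * n_masked)
```

At a fraction of 1.0 the stem carries no information, so any multiple-choice accuracy above chance has to come from the answer options themselves, which is the effect the abstract reports.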

[NLP-62] NeurIPS 2023 LLM Efficiency Fine-tuning Competition

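[Quick Read]: This paper addresses benchmark overfitting in generative Large Language Models (LLMs) and the limits of current benchmark-based evaluation. Analyzing the results of the NeurIPS 2023 LLM fine-tuning competition, it finds that top-performing models overfit significantly on benchmark datasets, mirroring the broader problem on popular leaderboards, and that data curation is essential for a high-performing LLM. The paper argues for more robust evaluation methods and releases all competition entries, Docker files, and evaluation infrastructure to support reproducibility and further research on fine-tuning and overfitting.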

Link: https://arxiv.org/abs/2503.13507
Authors: Mark Saroufim,Yotam Perlitz,Leshem Choshen,Luca Antiga,Greg Bowyer,Christian Puhrsch,Driss Guessous,Supriya Rao,Geeta Chauhan,Ashvini Kumar Jindal,Pawan Kumar Rajpoot,Ankur Parikh,Joe Isaacson,Weiwei Yang
Affiliations: Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 10 figures

Abstract:Our analysis of the NeurIPS 2023 large language model (LLM) fine-tuning competition revealed the following trend: top-performing models exhibit significant overfitting on benchmark datasets, mirroring the broader issue of benchmark overfitting on popular leaderboards and that data curation is essential in order to get a high performing LLM. The competition, which consisted of two stages - an open evaluation stage with publicly available tasks and a closed evaluation stage with unseen tasks - allowed us to assess the generalizability of fine-tuned LLMs. Our results highlight the limitations of current benchmark-based evaluation schemes for generative models and demonstrate the need for more robust evaluation methods. Notably, the winning submissions utilized standard open-source libraries and focused primarily on data curation. To facilitate further research and promote reproducibility, we release all competition entries, Docker files, and evaluation infrastructure, providing a valuable resource for the community to explore fine-tuning, overfitting, and reproducibility in LLMs.

[NLP-63] Ensemble Learning for Large Language Models in Text and Code Generation: A Survey

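[Quick Read]: This paper addresses the limits of individual Large Language Models (LLMs) in text and code generation: inconsistent outputs caused by fixed language parameters, a restricted ability to represent diverse language patterns, and the data privacy concerns and limited industry adoption that follow from closed-source models. The key to the solution is a review of LLM ensemble approaches, which combine multiple models to overcome these limits. The paper categorizes the approaches into seven main methods (weight merging, knowledge fusion, mixture of experts, reward ensemble, output ensemble, routing, and cascading) and focuses on four that show strong performance and potential for broader application: weight merging, knowledge fusion, mixture of experts, and reward ensemble. Through collaboration between models, these methods improve the representation of diversity, output quality, and flexibility in applications.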

Link: https://arxiv.org/abs/2503.13505
Authors: Mari Ashiga,Wei Jie,Fan Wu,Vardan Voskanyan,Fateme Dinmohammadi,Paul Brookes,Jingzhi Gong,Zheng Wang
Affiliations: School of Computing and Engineering, University of West London; Turing Intelligence Technology Limited, London EC2M 2PF, United Kingdom; School of Computer Science, University of Leeds, Leeds LS2 9JT, United Kingdom
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE TAI

Abstract:Generative pretrained transformers (GPT) are the common large language models (LLMs) used for generating text from natural language inputs. However, the fixed properties of language parameters in individual LLMs can lead to inconsistencies in the generated outputs. This limitation also restricts the models’ ability to represent diverse language patterns due to inherent biases. Moreover, many powerful LLMs are closed-source. This prevents organizations from integrating their data into these systems, raising concerns about data privacy and limiting industry applications. Inspired by the successful application of LLM ensemble models in text generation, recent literature has also investigated their potential in code generation. This article reviews these emerging LLM ensemble approaches. Our goal is to enhance readers’ understanding of existing techniques and encourage further research and practical implementation, aiming to expand the real-world applications of LLM ensemble models in both text and code generation. We categorize these approaches into seven main methods: weight merging, knowledge fusion, mixture of experts, reward ensemble, output ensemble, routing, and cascading. From this list, we focus on four methods and models that show strong performance and potential for broader applications. We analyze their modeling steps, training methods, and output features to provide a clear understanding of their capabilities. Our findings highlight the benefits of LLM ensemble techniques. These include better representation of diversity, improved output quality, and greater flexibility in applications. This information offers valuable insights for selecting models for various real-world tasks involving text and code generation, and potentially applying methods to multimodal LLMs.
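
Two of the seven categories, output ensemble and cascading, are simple enough to sketch in a few lines. The code below is a generic illustration, not taken from the survey: majority voting is only one possible form of output ensemble, and the confidence-threshold rule, the function names, and the `(answer, confidence)` model interface are all assumptions.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: prompt in, (answer, confidence in [0, 1]) out.
ScoredModel = Callable[[str], Tuple[str, float]]


def majority_vote(outputs: Sequence[str]) -> str:
    """Output ensemble: return the most frequent answer (earliest wins a tie)."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(outputs)
    best = max(counts.values())
    return next(o for o in outputs if counts[o] == best)


def cascade(models: List[ScoredModel], prompt: str, threshold: float) -> str:
    """Cascading: try models in order (cheapest first) and stop once one is confident."""
    if not models:
        raise ValueError("need at least one model")
    answer = ""
    for model in models:
        answer, confidence = model(prompt)
        if confidence >= threshold:
            break
    return answer  # falls back to the last model's answer if none is confident
```

Voting spends every model's compute on every prompt, whereas a cascade only pays for the larger models when the cheaper ones are unsure.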

[NLP-64] SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models

Link: https://arxiv.org/abs/2503.13503
Authors: Chuan Qin,Xin Chen,Chengrui Wang,Pengmin Wu,Xi Chen,Yihang Cheng,Jingyi Zhao,Meng Xiao,Xiangchao Dong,Qingqing Long,Boya Pan,Han Wu,Chengzan Li,Yuanchun Zhou,Hui Xiong,Hengshu Zhu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments:

[NLP-65] Recent Developments in Deep Learning-based Author Name Disambiguation

Link: https://arxiv.org/abs/2503.13448
Authors: Francesca Cappelli,Giovanni Colavizza,Silvio Peroni
Affiliations: University of Bologna; University of Copenhagen; Digital Humanities Advanced Research Center (DHARC), University of Bologna; Research Centre for Open Scholarly Metadata, University of Bologna
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments:

[NLP-66] CorrSynth – A Correlated Sampling Method for Diverse Dataset Generation from LLMs EMNLP2024

Link: https://arxiv.org/abs/2411.08553
Authors: Suhas S Kowshik,Abhishek Divekar,Vijit Malik
Affiliations: Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published as a main conference paper at EMNLP 2024; First two authors contributed equally

[NLP-67] MoonCast: High-Quality Zero-Shot Podcast Generation

Link: https://arxiv.org/abs/2503.14345
Authors: Zeqian Ju,Dongchao Yang,Jianwei Yu,Kai Shen,Yichong Leng,Zhengtao Wang,Xu Tan,Xinyu Zhou,Tao Qin,Xiangyang Li
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Computer Vision

[CV-0] MusicInfuser: Making Video Diffusion Listen and Dance

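[Quick Read]: This paper addresses the generation of high-quality dance videos synchronized to a specified music track. The key to the solution is adapting existing video diffusion models to musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. The approach needs no motion capture data and fine-tunes only on dance videos, achieving high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models.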

Link: https://arxiv.org/abs/2503.14505
Authors: Susung Hong,Ira Kemelmacher-Shlizerman,Brian Curless,Steven M. Seitz
Affiliations: University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:We introduce MusicInfuser, an approach for generating high-quality dance videos that are synchronized to a specified music track. Rather than attempting to design and train a new multimodal audio-video model, we show how existing video diffusion models can be adapted to align with musical inputs by introducing lightweight music-video cross-attention and a low-rank adapter. Unlike prior work requiring motion capture data, our approach fine-tunes only on dance videos. MusicInfuser achieves high-quality music-driven video generation while preserving the flexibility and generative capabilities of the underlying models. We introduce an evaluation framework using Video-LLMs to assess multiple dimensions of dance generation quality. The project page and code are available at this https URL.
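
The abstract mentions a low-rank adapter without giving details. Below is a minimal sketch of the generic low-rank adaptation idea, assuming the common formulation y = W x + (alpha / r) B A x with B initialised to zero. The shapes, the names, and the scaling are assumptions for illustration, not MusicInfuser's implementation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float = 1.0) -> Vector:
    """Frozen base layer plus a scaled low-rank update.

    W is d_out x d_in (frozen), A is r x d_in, B is d_out x r (both trainable).
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Number of trainable parameters in the adapter (A and B only)."""
    return r * (d_in + d_out)
```

With B at zero the adapted layer reproduces the base layer exactly, so fine-tuning starts from the pretrained model's behaviour. For a 1024 x 1024 layer and r = 8, the adapter trains 16,384 parameters instead of 1,048,576.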

[CV-1] Aligning Multimodal LLM with Human Preference: A Survey

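[Quick Read]: This paper targets issues that remain insufficiently addressed in Multimodal Large Language Models (MLLMs): truthfulness, safety, o1-like reasoning, and alignment with human preference. These gaps have spurred a variety of alignment algorithms, each designed for different application scenarios and optimization goals. The paper's contribution is a comprehensive, systematic review organized around four aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) potential future directions. The authors hope to help researchers organize current progress in the field and inspire better alignment methods.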

Link: https://arxiv.org/abs/2503.14504
Authors: Tao Yu,Yi-Fan Zhang,Chaoyou Fu,Junkang Wu,Jinda Lu,Kun Wang,Xingyu Lu,Yunhang Shen,Guibin Zhang,Dingjie Song,Yibo Yan,Tianlong Xu,Qingsong Wen,Zhang Zhang,Yan Huang,Liang Wang,Tieniu Tan
Affiliations: Institute of Automation, Chinese Academy of Sciences; Nanjing University; University of Science and Technology of China; Nanyang Technological University; Shenzhen International Graduate School, Tsinghua University; Tencent Youtu Lab; National University of Singapore; Lehigh University; The Hong Kong University of Science and Technology; Squirrel Ai Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Large language models (LLMs) can handle a wide variety of general tasks with simple prompts, without the need for task-specific training. Multimodal Large Language Models (MLLMs), built upon LLMs, have demonstrated impressive potential in tackling complex tasks involving visual, auditory, and textual data. However, critical issues related to truthfulness, safety, o1-like reasoning, and alignment with human preference remain insufficiently addressed. This gap has spurred the emergence of various alignment algorithms, each targeting different application scenarios and optimization goals. Recent studies have shown that alignment algorithms are a powerful approach to resolving the aforementioned challenges. In this paper, we aim to provide a comprehensive and systematic review of alignment algorithms for MLLMs. Specifically, we explore four key aspects: (1) the application scenarios covered by alignment algorithms, including general image understanding, multi-image, video, and audio, and extended multimodal applications; (2) the core factors in constructing alignment datasets, including data sources, model responses, and preference annotations; (3) the benchmarks used to evaluate alignment algorithms; and (4) a discussion of potential future directions for the development of alignment algorithms. This work seeks to help researchers organize current advancements in the field and inspire better alignment methods. The project page of this paper is available at this https URL.

[CV-2] The Power of Context: How Multimodality Improves Image Super-Resolution CVPR2025

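[Quick Read]: This paper addresses single-image super-resolution (SISR), which is challenging because recovering fine-grained details and preserving perceptual quality from low-resolution inputs is inherently difficult. Existing methods often rely on limited image priors and give suboptimal results. The paper proposes using the rich contextual information in multiple modalities (depth, segmentation, edges, and text prompts) to learn a powerful generative prior within a diffusion model framework. The key to the solution is a flexible network architecture that fuses multimodal information and accommodates an arbitrary number of input modalities without significant changes to the diffusion process. The method also mitigates the hallucinations often introduced by text prompts by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can be controlled independently, which allows steering the output, for example increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments show that the model surpasses state-of-the-art generative SISR methods in visual quality and fidelity.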

Link: https://arxiv.org/abs/2503.14503
Authors: Kangfu Mei,Hossein Talebi,Mojtaba Ardakani,Vishal M. Patel,Peyman Milanfar,Mauricio Delbracio
Affiliations: Google; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: accepted by CVPR2025

Abstract:Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities – including depth, segmentation, edges, and text prompts – to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality’s guidance strength can also be controlled independently, allowing steering outputs toward different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at this https URL.

[CV-3] Advances in 4D Generation: A Survey

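[Quick read]: This paper systematically surveys the theoretical foundations, key methodologies, and practical applications of four-dimensional (4D) generation, to help readers understand the current state and future potential of the field. Its central question is how to incorporate the temporal dimension into generative tasks so that dynamic data can be generated with spatiotemporal consistency. The survey covers the enabling technologies (spatiotemporal modeling, neural representations, and generative frameworks) and reviews studies that combine diverse control mechanisms and representation strategies to capture and generate complex dynamic scenes. It also proposes future research directions for challenges such as data availability, computational efficiency, and spatiotemporal consistency.

Link: https://arxiv.org/abs/2503.14501
Authors: Qiaowei Miao,Kehan Li,Jinsheng Quan,Zhiyuan Min,Shaojie Ma,Yichao Xu,Yi Yang,Yawei Luo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: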

Abstract:Generative artificial intelligence has witnessed remarkable advancements across multiple domains in recent years. Building on the successes of 2D and 3D content generation, 4D generation, which incorporates the temporal dimension into generative tasks, has emerged as a burgeoning yet rapidly evolving research area. This paper presents a comprehensive survey of this emerging field, systematically examining its theoretical foundations, key methodologies, and practical applications, with the aim of providing readers with a holistic understanding of the current state and future potential of 4D generation. We begin by introducing the core concepts of 4D data representations, encompassing both structured and unstructured formats, and their implications for generative tasks. Building upon this foundation, we delve into the enabling technologies that drive 4D generation, including advancements in spatiotemporal modeling, neural representations, and generative frameworks. We further review recent studies that employ diverse control mechanisms and representation strategies for generating 4D outputs, categorizing these approaches and summarizing their research trajectories. In addition, we explore the wide-ranging applications of 4D generation techniques, spanning dynamic object modeling, scene generation, digital human synthesis, 4D content editing, and autonomous driving. Finally, we analyze the key challenges inherent to 4D generation, such as data availability, computational efficiency, and spatiotemporal consistency, and propose promising directions for future research. Our code is publicly available at: this https URL.

[CV-4] Utilization of Neighbor Information for Image Classification with Different Levels of Supervision

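[Quick read]: This paper aims to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well on both generalized category discovery (GCD) and image clustering. Existing methods are restricted to one task: GCD methods rely on the labeled portion of the data, while deep image clustering methods cannot use labels efficiently. The key contribution is an approach that Utilizes Neighbor Information for Classification (UNIC), unifying unsupervised clustering and semi-supervised GCD. The method first uses a sampling and cleaning strategy to identify accurate positive and negative neighbors, then fine-tunes the backbone with clustering losses computed from both types of neighbors. It extends this pipeline to GCD by treating labeled images as ground-truth neighbors. The method achieves state-of-the-art results on both clustering (ImageNet-100, ImageNet200) and GCD (ImageNet-100, CUB, SCars, Aircraft).

Link: https://arxiv.org/abs/2503.14500
Authors: Gihan Jayatilaka,Abhinav Shrivastava,Matthew Gwilliam
Affiliations: University of Maryland
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 16 figures, 7 tables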

Abstract:We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering. Despite the overlap in motivation between these tasks, the methods themselves are restricted to a single task – GCD methods are reliant on the labeled portion of the data, and deep image clustering methods have no built-in way to leverage the labels efficiently. We connect the two regimes with an innovative approach that Utilizes Neighbor Information for Classification (UNIC) both in the unsupervised (clustering) and semisupervised (GCD) setting. State-of-the-art clustering methods already rely heavily on nearest neighbors. We improve on their results substantially in two parts, first with a sampling and cleaning strategy where we identify accurate positive and negative neighbors, and secondly by finetuning the backbone with clustering losses computed by sampling both types of neighbors. We then adapt this pipeline to GCD by utilizing the labelled images as ground truth neighbors. Our method yields state-of-the-art results for both clustering (+3% ImageNet-100, Imagenet200) and GCD (+0.8% ImageNet-100, +5% CUB, +2% SCars, +4% Aircraft).
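
The neighbor-based idea above can be illustrated with a minimal sketch. The abstract does not give the sampling and cleaning rules, so this assumes the simplest possible version: rank the other images by cosine similarity to an anchor, take the closest as positives and the farthest as negatives, and, in the GCD setting, treat images that share the anchor's label as known positives. All function names and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_neighbors(embeddings, anchor, num_pos, num_neg):
    """Return (positives, negatives) as index lists for one anchor image."""
    others = [i for i in range(len(embeddings)) if i != anchor]
    ranked = sorted(
        others,
        key=lambda i: cosine(embeddings[anchor], embeddings[i]),
        reverse=True,
    )
    # Most similar images first; least similar images at the end.
    return ranked[:num_pos], ranked[len(ranked) - num_neg:]


def labelled_neighbors(labels, anchor):
    """GCD variant: images sharing the anchor's label count as true positives."""
    if labels[anchor] is None:  # unlabelled anchor: no ground-truth neighbors
        return []
    return [i for i, y in enumerate(labels) if i != anchor and y == labels[anchor]]


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
positives, negatives = mine_neighbors(embeddings, anchor=0, num_pos=1, num_neg=1)
known = labelled_neighbors(["cat", "cat", None, "dog"], anchor=0)
```

A clustering loss would then pull the anchor toward its positives and push it away from its negatives; that loss is omitted here because the abstract does not define it.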

[CV-5] Tracking Meets Large Multimodal Models for Driving Scenario Understanding UAI

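[Quick read]: This paper addresses the limited effectiveness of existing large multimodal models (LMMs) in dynamic driving environments. Many existing methods rely mainly on image data and underuse 3D spatial and temporal information, which limits them in complex dynamic scenes. The key to the solution is a novel approach that embeds tracking information into LMMs as an additional input, recovering 3D spatial and temporal details that images do not capture effectively. A track encoder enriches the visual queries with spatial and temporal cues while avoiding the computational overhead of processing lengthy video sequences or extensive 3D inputs. The track encoder is also pretrained with a self-supervised approach to give LMMs additional contextual information, which significantly improves performance on perception, planning, and prediction tasks. Experiments show a 9.5% gain in accuracy on the DriveLM-nuScenes benchmark and a 3.7% improvement in final score on DriveLM-CARLA.

Link: https://arxiv.org/abs/2503.14498
Authors: Ayesha Ishaq,Jean Lahoud,Fahad Shahbaz Khan,Salman Khan,Hisham Cholakkal,Rao Muhammad Anwer
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI); Linköping University; Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 13 pages, 8 figures, Github: this https URL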

Abstract:Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and 9.4% increase in the overall score over baseline models on DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at this https URL

[CV-6] Deeply Supervised Flow-Based Generative Models

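[Quick read]: This paper addresses the problem that flow-based generative models learn velocity representations only from the final-layer output during training, which underuses the rich representations of intermediate layers and may impede convergence. To address this limitation, it proposes DeepFlow, a framework whose key idea is to enhance velocity representation through inter-layer communication: transformer layers are partitioned into branches with deep supervision, and a lightweight Velocity Refiner with Acceleration (VeRA) block is inserted between adjacent branches. With this design, DeepFlow converges 8 times faster on ImageNet at equivalent performance and, without classifier-free guidance, further reduces FID by 2.6 while halving training time compared with previous flow-based models. It also outperforms baselines on text-to-image generation.

Link: https://arxiv.org/abs/2503.14494
Authors: Inkyu Shin,Chenglin Yang,Liang-Chieh Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website at this https URL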

Abstract:Flow-based generative models have charted an impressive path across multiple visual generation tasks by adhering to a simple principle: learning velocity representations of a linear interpolant. However, we observe that training velocity solely from the final layer output underutilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce DeepFlow, a novel framework that enhances velocity representation through inter-layer communication. DeepFlow partitions transformer layers into balanced branches with deep supervision and inserts a lightweight Velocity Refiner with Acceleration (VeRA) block between adjacent branches, which aligns the intermediate velocity features within transformer blocks. Powered by the improved deep supervision via the internal velocity alignment, DeepFlow converges 8 times faster on ImageNet with equivalent performance and further reduces FID by 2.6 while halving training time compared to previous flow-based models without classifier-free guidance. DeepFlow also outperforms baselines in text-to-image generation tasks, as evidenced by evaluations on MSCOCO and zero-shot GenEval.
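
The "velocity of a linear interpolant" that these models learn can be written out in a few lines. The sketch below assumes the common convention that `x0` is a noise sample at `t = 0` and `x1` is a data sample at `t = 1`; the abstract does not state which convention DeepFlow uses. The `deep_supervision_loss` helper is a hypothetical illustration of supervising every branch output and does not model the VeRA block.

```python
def linear_interpolant(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Time derivative of the linear interpolant; constant along the path."""
    return [b - a for a, b in zip(x0, x1)]


def velocity_loss(predicted, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def deep_supervision_loss(branch_predictions, x0, x1):
    """Assumed form: sum the velocity loss over every branch, not only the last."""
    return sum(velocity_loss(p, x0, x1) for p in branch_predictions)


x0 = [0.0, 2.0]  # stand-in for a noise sample
x1 = [1.0, 0.0]  # stand-in for a data sample
x_half = linear_interpolant(x0, x1, 0.5)
loss = deep_supervision_loss([[1.0, -2.0], [0.0, -2.0]], x0, x1)
```

In a real model each entry of `branch_predictions` would come from a different group of transformer layers evaluated at the interpolated point.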

[CV-7] State Space Model Meets Transformer: A New Paradigm for 3D Object Detection ICLR2025

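[Quick read]: This paper studies a limitation of DETR-based methods in 3D indoor object detection: the scene point features inside the transformer decoder remain fixed, so later decoder layers contribute little, which limits performance gains. Inspired by the efficient context modeling of State Space Models (SSM), the paper proposes a new 3D object detection paradigm with an interactive state space model (DEST). The key is a novel state-dependent SSM parameterization that lets system states serve effectively as queries in 3D indoor detection. The paper also introduces four designs tailored to the characteristics of point clouds and SSMs: serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points; an inter-state attention mechanism models the relationships between state points; and a gated feed-forward network enhances inter-channel correlations. According to the authors, this is the first method to model queries as system states and scene points as system inputs, updating scene point features and query features simultaneously with linear complexity.

Link: https://arxiv.org/abs/2503.14493
Authors: Chuxin Wang,Wenfei Yang,Xiang Liu,Tianzhu Zhang
Affiliations: University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory; Hainan Aerospace Technology Innovation Center; Dongguan University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2025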

Abstract:DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, leading to minimal contributions from later decoder layers, thereby limiting performance improvement. Recently, State Space Models (SSM) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point cloud and SSM: The serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM. The inter-state attention mechanism models the relationships between state points, while the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which can simultaneously update scene point features and query features with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, our method sets a new SOTA on the ScanNetV2 and SUN RGB-D datasets.
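
To make the "iterative interaction between system states and inputs" concrete, here is a minimal sketch of a linear state space recurrence with a bidirectional scan over a serialized sequence. It deliberately uses a scalar, time-invariant SSM with fixed parameters `a`, `b`, and `c`; the state-dependent parameterization and the treatment of queries as states in DEST are not modeled, and all names are hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t in one pass."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # the state absorbs the current input
        ys.append(c * h)
    return ys


def bidirectional_scan(xs, a, b, c):
    """Sum a forward scan and a backward scan so each point sees both sides."""
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


# A serialized toy "point sequence" with a single non-zero feature.
sequence = [1.0, 0.0, 0.0]
forward_only = ssm_scan(sequence, a=0.5, b=1.0, c=1.0)
both_ways = bidirectional_scan(sequence, a=0.5, b=1.0, c=1.0)
```

Each scan visits every element once, which is where the linear complexity mentioned in the abstract comes from.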

[CV-8] Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

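[Quick read]: This paper addresses how to generate controllable world simulations from multiple spatial control inputs of different modalities, such as segmentation, depth, and edges. The key to the solution is Cosmos-Transfer, a conditional world generation model whose core innovation is an adaptive and customizable spatial conditioning scheme that allows different conditional inputs to be weighted differently at different spatial locations. This enables highly controllable world generation that is useful for world-to-world transfer tasks such as Sim2Real, and an inference scaling strategy achieves real-time world generation on an NVIDIA GB200 NVL72 rack.

Link: https://arxiv.org/abs/2503.14492
Authors: NVIDIA:Hassan Abu Alhaija,Jose Alvarez,Maciej Bala,Tiffany Cai,Tianshi Cao,Liz Cha,Joshua Chen,Mike Chen,Francesco Ferroni,Sanja Fidler,Dieter Fox,Yunhao Ge,Jinwei Gu,Ali Hassani,Michael Isaev,Pooya Jannaty,Shiyi Lan,Tobias Lasser,Huan Ling,Ming-Yu Liu,Xian Liu,Yifan Lu,Alice Luo,Qianli Ma,Hanzi Mao,Fabio Ramos,Xuanchi Ren,Tianchang Shen,Shitao Tang,Ting-Chun Wang,Jay Wu,Jiashu Xu,Stella Xu,Kevin Xie,Yuchong Ye,Xiaodong Yang,Xiaohui Zeng,Yu Zeng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: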

Abstract:We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at this https URL.
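
Weighting control inputs differently at different spatial locations can be sketched as a per-pixel weighted blend. The abstract does not say how the weights are combined, so this example assumes a normalized weighted average of per-modality feature maps, with a zero output where every weight is zero. `fuse_controls` and the toy maps are hypothetical.

```python
def fuse_controls(features, weight_maps):
    """Blend per-modality feature maps using a separate weight at each pixel.

    features[name][y][x] and weight_maps[name][y][x] share one grid shape.
    """
    names = list(features)
    height = len(features[names[0]])
    width = len(features[names[0]][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = sum(weight_maps[n][y][x] for n in names)
            if total > 0:  # leave 0.0 where no modality is active
                weighted = sum(weight_maps[n][y][x] * features[n][y][x] for n in names)
                fused[y][x] = weighted / total
    return fused


# One row, two pixels: depth is trusted only at the first pixel.
features = {"depth": [[2.0, 4.0]], "edge": [[6.0, 8.0]]}
weight_maps = {"depth": [[1.0, 0.0]], "edge": [[1.0, 1.0]]}
fused = fuse_controls(features, weight_maps)
```

Here the first pixel mixes both modalities equally, while the second follows the edge input alone.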

[CV-9] Stable Virtual Camera: Generative View Synthesis with Diffusion Models

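[Quick read]: This paper addresses the limitations of existing view synthesis methods, which struggle to generate large viewpoint changes or temporally smooth samples and depend on specific task configurations. The key is a simple model design, an optimized training recipe, and a flexible sampling strategy that together generalize across view synthesis tasks. The resulting samples stay highly consistent without additional 3D representation-based distillation, which simplifies view synthesis in the wild. The method can also generate high-quality videos lasting up to half a minute with seamless loop closure.

Link: https://arxiv.org/abs/2503.14489
Authors: Jensen(Jinghao)Zhou,Hang Gao,Vikram Voleti,Aaryaman Vasishta,Chun-Han Yao,Mark Boss,Philip Torr,Christian Rupprecht,Varun Jampani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: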

Abstract:We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through simple model design, optimized training recipe, and flexible sampling strategy that generalize across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings.

[CV-10] DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers

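[Quick read]: This paper addresses the performance limitation that arises because diffusion models process inputs uniformly across varying conditions and noise levels. It proposes DiffMoE, whose key idea is a batch-level global token pool that lets experts access global token distributions during training, promoting specialized expert behavior. DiffMoE also incorporates a capacity predictor that dynamically allocates computational resources according to noise level and sample complexity. With these changes, DiffMoE achieves state-of-the-art performance among diffusion models on the ImageNet benchmark, outperforming dense architectures with 3x activated parameters and existing MoE approaches while keeping 1x activated parameters, and it extends to more challenging tasks such as text-to-image generation.

Link: https://arxiv.org/abs/2503.14487
Authors: Minglei Shi,Ziyang Yuan,Haotian Yang,Xintao Wang,Mingwu Zheng,Xin Tao,Wenliang Zhao,Wenzhao Zheng,Jie Zhou,Jiwen Lu,Pengfei Wan,Di Zhang,Kun Gai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL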

Abstract:Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: this https URL
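
The batch-level global token pool can be pictured as routing over all tokens of a batch at once instead of per sample. The sketch below assumes an expert-choice style rule, where each expert keeps its top-scoring tokens from the flattened pool up to a fixed capacity; the abstract does not confirm this rule, and the learned capacity predictor is replaced by a constant. `route_global_pool` and the scores are hypothetical.

```python
def route_global_pool(scores, capacity):
    """Let each expert pick its top-`capacity` tokens from the whole batch.

    scores[token][expert] holds router scores for tokens flattened across
    every sample in the batch, so experts compete over one shared pool.
    """
    num_experts = len(scores[0])
    assignment = {}
    for expert in range(num_experts):
        ranked = sorted(
            range(len(scores)),
            key=lambda token: scores[token][expert],
            reverse=True,
        )
        assignment[expert] = sorted(ranked[:capacity])
    return assignment


# Four tokens pooled from two samples, scored by a toy two-expert router.
scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.6]]
assignment = route_global_pool(scores, capacity=2)
```

Because selection runs over the pool, one sample can send more tokens to an expert than another, which is one way to read the dynamic allocation described above.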

[CV-11] Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and a Hybrid Dataset CVPR2025

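[Quick read]: This paper addresses video portrait relighting, aiming to produce lighting effects that are both photorealistic and temporally consistent. This typically requires a strong model design and intensive training on a high-quality paired video dataset, such as dynamic one-light-at-a-time (OLAT) data. The key contribution is Lux Post Facto, a method that builds a conditional video diffusion model on a state-of-the-art pre-trained video diffusion model and adds a new lighting injection mechanism for precise lighting control. The approach uses strong spatial and temporal generative capability to produce plausible solutions to the ill-posed relighting problem. It trains on a hybrid dataset of static-expression OLAT data and in-the-wild portrait performance videos to jointly learn relighting and temporal modeling, which avoids the need to acquire paired video data under different lighting conditions. Experiments show state-of-the-art results in both photorealism and temporal consistency.

Link: https://arxiv.org/abs/2503.14485
Authors: Yiqun Mei,Mingming He,Li Ma,Julien Philip,Wenqi Xian,David M George,Xueming Yu,Gabriel Dedic,Ahmet Levent Taşel,Ning Yu,Vishal M. Patel,Paul Debevec
Affiliations: Netflix Eyeline Studios; Johns Hopkins University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025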

Abstract:Video portrait relighting remains challenging because the results need to be both photorealistic and temporally stable. This typically requires a strong model design that can capture complex facial reflections as well as intensive training on a high-quality paired video dataset, such as dynamic one-light-at-a-time (OLAT). In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. From the model side, we design a new conditional video diffusion model built upon state-of-the-art pre-trained video diffusion model, alongside a new lighting injection mechanism to enable precise control. This way we leverage strong spatial and temporal generative capability to generate plausible solutions to the ill-posed relighting problem. Our technique uses a hybrid dataset consisting of static expression OLAT data and in-the-wild portrait performance videos to jointly learn relighting and temporal modeling. This avoids the need to acquire paired video data in different lighting conditions. Our extensive experiments show that our model produces state-of-the-art results both in terms of photorealism and temporal consistency.

[CV-12] Multi-view Reconstruction via SfM-guided Monocular Depth Estimation CVPR2025

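[Quick read]: This paper addresses the insufficient accuracy of monocular depth estimation when it is used for multi-view geometric reconstruction. Because monocular depth estimation is ambiguous, the estimated depth values are usually not accurate enough, which limits their usefulness for multi-view reconstruction. The key contribution is to incorporate Structure from Motion (SfM) information, a strong multi-view prior, into the depth estimation process, which improves depth prediction quality and allows the predicted depth to be used directly in multi-view geometric reconstruction. Experiments show that the method significantly improves depth estimation accuracy and surpasses state-of-the-art multi-view stereo (MVS) methods in reconstruction quality across indoor, streetscape, and aerial scenes.

Link: https://arxiv.org/abs/2503.14483
Authors: Haoyu Guo,He Zhu,Sida Peng,Haotong Lin,Yunzhi Yan,Tao Xie,Wenguan Wang,Xiaowei Zhou,Hujun Bao
Affiliations: Zhejiang University; Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project page: this https URL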

Abstract:In this paper, we present a new method for multi-view geometric reconstruction. In recent years, large vision models have rapidly developed, performing excellently across various tasks and demonstrating remarkable generalization capabilities. Some works use large vision models for monocular depth estimation, which have been applied to facilitate multi-view reconstruction tasks in an indirect manner. Due to the ambiguity of the monocular depth estimation task, the estimated depth values are usually not accurate enough, limiting their utility in aiding multi-view reconstruction. We propose to incorporate SfM information, a strong multi-view prior, into the depth estimation process, thus enhancing the quality of depth prediction and enabling their direct application in multi-view geometric reconstruction. Experimental results on public real-world datasets show that our method significantly improves the quality of depth estimation compared to previous monocular depth estimation works. Additionally, we evaluate the reconstruction quality of our approach in various types of scenes including indoor, streetscape, and aerial views, surpassing state-of-the-art MVS methods. The code and supplementary materials are available at this https URL .

[CV-13] ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

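[Quick read]: This paper addresses the long-standing difficulty of evaluating image generation models. Existing evaluation approaches fall short in comprehensiveness and fairness and do not cover diverse practical needs. The paper proposes ICE-Bench, a unified and comprehensive benchmark with three key features: (1) Coarse-to-fine tasks: image generation is decomposed into four task categories (no-reference and reference-based image creating and editing) and further into 31 fine-grained tasks; (2) Multi-dimensional metrics: 11 metrics cover six dimensions, namely aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability, including VLLM-QA, a new metric that uses large models to assess the success of image editing; (3) Hybrid data: the data combines real scenes and virtual generation, which improves data diversity and alleviates bias in model evaluation. ICE-Bench reveals the gap between current model capabilities and real-world generation requirements, and the authors plan to open-source it as a resource for the research community.

Link: https://arxiv.org/abs/2503.14482
Authors: Yulin Pan,Xiangteng He,Chaojie Mao,Zhen Han,Zeyinzi Jiang,Jingfeng Zhang,Yu Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages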

Abstract:Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness could be summarized in the following key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories: No-ref/Ref Image Creating/Editing, based on the presence or absence of source images and reference images. We further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements, culminating in a comprehensive benchmark. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. 11 metrics are introduced to support the multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric designed to assess the success of image editing by leveraging large models. (3) Hybrid Data: The data comes from real scenes and virtual generation, which effectively improves data diversity and alleviates the bias problem in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements. To foster further advancements in the field, we will open-source ICE-Bench, including its dataset, evaluation code, and models, thereby providing a valuable resource for the research community.

[CV-14] Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

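[Quick read]: This paper addresses the lack of research on evaluating the creative capabilities of Multimodal Large Language Models (MLLMs). The key to the solution is Creation-MMBench, a new benchmark of 765 test cases spanning 51 fine-grained tasks, with instance-specific evaluation criteria defined for each test case to assess both general response quality and factual consistency with visual inputs in real-world, image-based tasks. Experiments show that current open-source MLLMs significantly underperform proprietary models on creative tasks, and the analysis shows that visual fine-tuning can harm the base LLM's creative abilities. The work offers a reference for improving MLLM creativity and a foundation for future progress in multimodal generative intelligence.

Link: https://arxiv.org/abs/2503.14478
Authors: Xinyu Fang,Zhijian Chen,Kai Lan,Shengyuan Ding,Yingji Liang,Xiangyu Zhao,Farong Wen,Zicheng Zhang,Guofeng Zhang,Haodong Duan,Kai Chen,Dahua Lin
Affiliations: Zhejiang University; Shanghai AI Laboratory; Tongji University; Nanjing University; East China Normal University; Shanghai Jiaotong University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Evaluation Code and dataset see this https URL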

Abstract:Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM’s creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code is released on this https URL.

[CV-15] Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation

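[Quick read]: This paper addresses the high GPU memory and disk storage requirements that limit the practical use of novel view synthesis based on 3D Gaussian Splatting (3DGS). It proposes Opti3DGS, a frequency-modulated coarse-to-fine optimization framework whose key idea is to reduce the number of Gaussian primitives used to represent a scene, thereby lowering memory and storage demands. Opti3DGS uses image frequency modulation: it first enforces a coarse scene representation and then refines it progressively by modulating the frequency details in the training images. On baseline 3DGS, it reduces the number of Gaussians by 62% on average, training GPU memory by 40%, and optimization time by 20% without sacrificing visual quality. The method also integrates seamlessly with many 3DGS-based techniques and produces a level-of-detail scene representation as a natural byproduct.

Link: https://arxiv.org/abs/2503.14475
Authors: Umar Farooq,Jean-Yves Guillemaut,Adrian Hilton,Marco Volino
Affiliations: CVSSP, University of Surrey
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: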

Abstract:The field of Novel View Synthesis has been revolutionized by 3D Gaussian Splatting (3DGS), which enables high-quality scene reconstruction that can be rendered in real-time. 3DGS-based techniques typically suffer from high GPU memory and disk storage requirements which limits their practical application on consumer-grade devices. We propose Opti3DGS, a novel frequency-modulated coarse-to-fine optimization framework that aims to minimize the number of Gaussian primitives used to represent a scene, thus reducing memory and storage demands. Opti3DGS leverages image frequency modulation, initially enforcing a coarse scene representation and progressively refining it by modulating frequency details in the training images. On the baseline 3DGS, we demonstrate an average reduction of 62% in Gaussians, a 40% reduction in the training GPU memory requirements and a 20% reduction in optimization time without sacrificing the visual quality. Furthermore, we show that our method integrates seamlessly with many 3DGS-based techniques, consistently reducing the number of Gaussian primitives while maintaining, and often improving, visual quality. Additionally, Opti3DGS inherently produces a level-of-detail scene representation at no extra cost, a natural byproduct of the optimization pipeline. Results and code will be made publicly available.
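
One way to picture coarse-to-fine image frequency modulation is to low-pass filter the training images heavily at the start of optimization and relax the filter over time. The abstract does not give the filter or the schedule, so the sketch below assumes a Gaussian blur whose strength decays linearly to zero, and it works on a single image row for brevity. `blur_sigma`, `gaussian_kernel`, `blur_row`, and the default `sigma_max` are hypothetical.

```python
import math


def blur_sigma(step, total_steps, sigma_max=4.0):
    """Linearly decay blur strength from sigma_max (coarse) to 0 (full detail)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return sigma_max * (1.0 - fraction)


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel; a single tap when sigma is zero."""
    if sigma <= 0:
        return [1.0]
    radius = max(1, int(math.ceil(3 * sigma)))
    taps = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def blur_row(row, sigma):
    """Low-pass filter one image row, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(row) - 1
    blurred = []
    for i in range(len(row)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            acc += weight * row[min(max(i + j - radius, 0), last)]
        blurred.append(acc)
    return blurred


row = [0.0, 0.0, 1.0, 0.0, 0.0]
early = blur_row(row, blur_sigma(0, 100))    # heavily smoothed target
late = blur_row(row, blur_sigma(100, 100))   # original detail restored
```

Training against the smoothed targets first gives the optimizer no reason to add Gaussians for fine detail until the schedule reveals it, which is consistent with the primitive reduction reported above.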

[CV-16] SIR-DIFF: Sparse Image Sets Restoration with Multi-View Diffusion Model

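[Quick read]: This paper addresses the important but highly ill-posed task of restoring true scene information from a single degraded image. It takes a different perspective and restores images by jointly denoising multiple photographs of the same scene. The key is to exploit the complementary information in multiple degraded images of a shared scene: a powerful multi-view diffusion model extracts rich information from multi-view relationships and jointly generates uncorrupted views, outperforming existing single-view image methods and even video-based methods on image deblurring and super-resolution. The model is also trained to output 3D-consistent images, which makes it promising for applications that need robust multi-view integration, such as 3D reconstruction or pose estimation.

Link: https://arxiv.org/abs/2503.14463
Authors: Yucheng Mao,Boyang Wang,Nilesh Kulkarni,Jeong Joon Park
Affiliations: University of Michigan, Ann Arbor
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: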

Abstract:The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.

[CV-17] Bolt3D: Generating 3D Scenes in Seconds

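[Quick read]: This paper addresses fast generation of high-quality 3D scene representations. Prior multi-view generative models usually require per-scene optimization for 3D reconstruction, which is computationally expensive. The paper proposes Bolt3D, a latent diffusion model whose key idea is to build on powerful, scalable existing 2D diffusion network architectures and to train on a large-scale multiview-consistent dataset created with state-of-the-art dense 3D reconstruction techniques. Bolt3D directly samples a high-fidelity 3D scene representation in under seven seconds on a single GPU and reduces inference cost by a factor of up to 300.

Link: https://arxiv.org/abs/2503.14445
Authors: Stanislaw Szymanowicz,Jason Y. Zhang,Pratul Srinivasan,Ruiqi Gao,Arthur Brussee,Aleksander Holynski,Ricardo Martin-Brualla,Jonathan T. Barron,Philipp Henzler
Affiliations: Google Research; VGG – University of Oxford; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL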

Abstract:We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.

[CV-18] Joint Image-Instance Spatial-Temporal Attention for Few-shot Action Recognition

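[Quick read]: This paper addresses a weakness of Few-shot Action Recognition (FSAR) methods that rely mainly on image-level features: such features include background noise and pay too little attention to the foreground (action-related instances), which hurts recognition, especially with few samples. The paper proposes a novel joint Image-Instance level Spatial-temporal attention approach (I2ST). The core idea of I2ST is to perceive action-related instances and integrate them with image features through spatial-temporal attention. It has two key components: Action-related Instance Perception, which perceives action-related instances under the guidance of a text-guided segmentation model, and Joint Image-Instance Spatial-temporal Attention, which builds the feature dependency between instances and images.

Link: https://arxiv.org/abs/2503.14430
Authors: Zefeng Qian,Chongyang Zhang,Yifei Huang,Gang Wang,Jiangyong Ying
Affiliations: Shanghai Jiao Tong University; The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Computer Vision and Image Understanding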

Abstract:Few-shot Action Recognition (FSAR) constitutes a crucial challenge in computer vision, entailing the recognition of actions from a limited set of examples. Recent approaches mainly focus on employing image-level features to construct temporal dependencies and generate prototypes for each action category. However, a considerable number of these methods utilize mainly image-level features that incorporate background noise and focus insufficiently on real foreground (action-related instances), thereby compromising the recognition capability, particularly in the few-shot scenario. To tackle this issue, we propose a novel joint Image-Instance level Spatial-temporal attention approach (I2ST) for Few-shot Action Recognition. The core concept of I2ST is to perceive the action-related instances and integrate them with image features via spatial-temporal attention. Specifically, I2ST consists of two key components: Action-related Instance Perception and Joint Image-Instance Spatial-temporal Attention. Given the basic representations from the feature extractor, the Action-related Instance Perception is introduced to perceive action-related instances under the guidance of a text-guided segmentation model. Subsequently, the Joint Image-Instance Spatial-temporal Attention is used to construct the feature dependency between instances and images…

[CV-19] MagicComp: Training-free Dual-Phase Refinement for Compositional Video Generation

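[Quick Read]: This paper addresses the limitations of existing Text-to-Video (T2V) generation methods in accurately binding attributes, determining spatial relationships, and capturing complex action interactions among multiple subjects. It proposes MagicComp, a training-free method that improves compositional T2V generation through dual-phase refinement. The key points are: (1) in the conditioning stage, Semantic Anchor Disambiguation reinforces subject-specific semantics and resolves inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors; (2) in the denoising stage, Dynamic Layout Fusion Attention combines grounding priors with model-adaptive spatial perception and flexibly binds subjects to their spatiotemporal regions through masked attention modulation. MagicComp is model-agnostic and versatile, integrates seamlessly into existing T2V architectures, and outperforms state-of-the-art methods on multiple benchmarks.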

Link: https://arxiv.org/abs/2503.14428
Authors: Hongyu Zhang,Yufan Deng,Shenghai Yuan,Peng Jin,Zesen Cheng,Yian Zhao,Chang Liu,Jie Chen
Affiliations: School of Electronic and Computer Engineering, Peking University, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China; Tsinghua University, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project webpage: this https URL

Abstract:Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) During the Conditioning Stage: We introduce Semantic Anchor Disambiguation to reinforce subject-specific semantics and resolve inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embedding; (2) During the Denoising Stage: We propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: this https URL.
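
The phrase "masked attention modulation" can be illustrated with a generic sketch. This is not MagicComp's actual Dynamic Layout Fusion Attention; it only assumes that a binary layout mask is available and that scores inside a subject's region are raised by an additive bias. The function name `layout_modulated_attention` and the `strength` parameter are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layout_modulated_attention(q, k, v, layout_mask, strength=4.0):
    """Cross-attention with an additive layout bias (illustrative assumption).

    q: (P, d) queries for P spatial positions.
    k: (T, d) keys and v: (T, d_v) values for T text tokens.
    layout_mask: (P, T), 1 where position p lies in the region assigned to token t.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Raise the scores of tokens whose assigned region contains the position.
    scores = scores + strength * layout_mask
    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

With uniform raw scores, each position attends most strongly to the token whose region covers it, while every attention row still sums to one.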

[CV-20] DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers CVPR-2025

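[Quick Read]: This paper studies heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario in which the teacher models differ significantly in (a) their design objectives and (b) their training data. The key is an exploration of data-sharing strategies and teacher-specific encoding, leading to DUNE, a single encoder that excels in 2D vision, 3D understanding, and 3D human perception. Experiments show that DUNE matches, and sometimes surpasses, its larger teachers on their respective tasks; in particular, it surpasses MASt3R on Map-free Visual Relocalization with a much smaller encoder.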

Link: https://arxiv.org/abs/2503.14405
Authors: Mert Bulent Sariyildiz,Philippe Weinzaepfel,Thomas Lucas,Pau de Jorge,Diane Larlus,Yannis Kalantidis
Affiliations: NAVER LABS Europe
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR-2025. Project page: this https URL

Abstract:Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks like classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across both 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore data-sharing strategies and teacher-specific encoding, and introduce DUNE, a single encoder excelling in 2D vision, 3D understanding, and 3D human perception. Our model achieves performance comparable to that of its larger teachers, sometimes even outperforming them, on their respective tasks. Notably, DUNE surpasses MASt3R in Map-free Visual Relocalization with a much smaller encoder.
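
One common way to set up multi-teacher distillation with teacher-specific heads is sketched below. It is only an illustration under stated assumptions, not DUNE's actual objective: the cosine-distance loss, the linear projectors, and the teacher names `teacher_2d` and `teacher_3d` used in the usage note are all hypothetical.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    # Mean (1 - cosine similarity) over the rows of a and b.
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=-1)))

def multi_teacher_loss(student_features, projectors, teacher_features):
    """Average distillation loss over teachers.

    student_features: (N, d) output of the shared student encoder.
    projectors: dict name -> (d, d_t) matrix, one head per teacher (assumed linear).
    teacher_features: dict name -> (N, d_t) target features from that teacher.
    """
    total = 0.0
    for name, target in teacher_features.items():
        projected = student_features @ projectors[name]
        total += cosine_distance(projected, target)
    return total / len(teacher_features)
```

A call such as `multi_teacher_loss(feats, {"teacher_2d": W2d, "teacher_3d": W3d}, targets)` keeps a single shared encoder while letting each teacher see the student through its own projection, which is one way to accommodate teachers with different feature dimensions.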

[CV-21] Diffusion-based Facial Aesthetics Enhancement with 3D Structure Guidance

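[Quick Read]: This paper addresses Facial Aesthetics Enhancement (FAE), which aims to improve facial attractiveness while affecting identity as little as possible. Most existing methods rely on deep-feature-based or score-based guidance for generative models, and they can produce over-beautified results with low identity consistency or insufficient improvement in attractiveness. The paper proposes a diffusion-based solution, Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion). The key is to use 3D face structure to guide the beautification of a 2D face image: FAE guidance is extracted from a nearest-neighbor reference face, and a 3D face model is recovered from both the input face and the reference face so that depth and contour guidance can be extracted. This guidance controls Stable Diffusion through ControlNet, yielding more natural beautification with better identity preservation. Experiments show that the method outperforms existing methods at enhancing facial attractiveness while preserving identity.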

Link: https://arxiv.org/abs/2503.14402
Authors: Lisha Li,Jingwen Hou,Weide Liu,Yuming Fang,Jiebin Yan
Affiliations: School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, China; Boston Children’s Hospital and Harvard Medical School, Boston, MA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial Aesthetics Enhancement (FAE) aims to improve facial attractiveness by adjusting the structure and appearance of a facial image while preserving its identity as much as possible. Most existing methods adopted deep feature-based or score-based guidance for generation models to conduct FAE. Although these methods achieved promising results, they potentially produced excessively beautified results with lower identity consistency or insufficiently improved facial attractiveness. To enhance facial aesthetics with less loss of identity, we propose the Nearest Neighbor Structure Guidance based on Diffusion (NNSG-Diffusion), a diffusion-based FAE method that beautifies a 2D facial image with 3D structure guidance. Specifically, we propose to extract FAE guidance from a nearest neighbor reference face. To allow for less change of facial structures in the FAE process, a 3D face model is recovered by referring to both the matched 2D reference face and the 2D input face, so that the depth and contour guidance can be extracted from the 3D face model. Then the depth and contour clues can provide effective guidance to Stable Diffusion with ControlNet for FAE. Extensive experiments demonstrate that our method is superior to previous relevant methods in enhancing facial aesthetics while preserving facial identity.

[CV-22] Impossible Videos

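[Quick Read]: This paper asks two core questions: 1) can current video generation models effectively follow prompts to create impossible video content, and 2) are current video understanding models good enough to understand impossible videos? To answer them, it introduces IPV-Bench, a novel benchmark for evaluating and advancing video understanding and generation. IPV-Bench is built on a comprehensive taxonomy covering 4 domains and 14 categories, with diverse scenes that defy physical, biological, geographical, or social laws. The key is that the taxonomy drives both a prompt suite that challenges the prompt following and creativity of video generation models and a curated video benchmark that assesses how well Video-LLMs understand impossible videos, which particularly requires reasoning about temporal dynamics and world knowledge. Comprehensive evaluation reveals the limitations of existing video models and offers insights for future work.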

Link: https://arxiv.org/abs/2503.14378
Authors: Zechen Bai,Hai Ci,Mike Zheng Shou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages

Abstract:Synthetic videos are now widely used to compensate for the scarcity and limited diversity of real-world videos. Current synthetic datasets primarily replicate real-world scenarios, leaving impossible, counterfactual and anti-reality video concepts underexplored. This work aims to answer two questions: 1) Can today’s video generation models effectively follow prompts to create impossible video content? 2) Are today’s video understanding models good enough for understanding impossible videos? To this end, we introduce IPV-Bench, a novel benchmark designed to evaluate and foster progress in video understanding and generation. IPV-Bench is underpinned by a comprehensive taxonomy, encompassing 4 domains and 14 categories. It features diverse scenes that defy physical, biological, geographical, or social laws. Based on the taxonomy, a prompt suite is constructed to evaluate video generation models, challenging their prompt following and creativity capabilities. In addition, a video benchmark is curated to assess Video-LLMs on their ability to understand impossible videos, which particularly requires reasoning on temporal dynamics and world knowledge. Comprehensive evaluations reveal limitations and insights for future directions of video models, paving the way for next-generation video models.

[CV-23] Advancing Medical Representation Learning Through High-Quality Data

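[Quick Read]: This paper examines how the quality of medical vision-language datasets affects model performance, a question that remains under-explored even as such datasets grow in scale. It introduces Open-PMC, a high-quality medical dataset of 2.2 million image-text pairs enriched with image modality annotations, subfigures, and summarized in-text references. Unlike captions, which typically carry only abstract information, the in-text references provide richer medical context. In extensive experiments on retrieval and zero-shot classification, the paper benchmarks Open-PMC against larger datasets and finds that dataset quality, not just size, drives significant performance gains. The key contribution is the high-quality, context-rich Open-PMC dataset, complemented by an in-depth analysis of feature representations that underlines the importance of data curation quality for multimodal medical AI.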

Link: https://arxiv.org/abs/2503.14377
Authors: Negin Baghbanzadeh,Adibvafa Fallahpour,Yasaman Parhizkar,Franklin Ogidi,Shuvendu Roy,Sajad Ashkezari,Vahid Reza Khazaie,Michael Colacci,Ali Etemad,Arash Afkanpour,Elham Dolatabadi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.

[CV-24] ImViD: Immersive Volumetric Videos for Enhanced VR Engagement CVPR2025

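[Quick Read]: This paper targets the reconstruction of immersive volumetric videos. The next frontier for VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large six-degrees-of-freedom (6-DoF) interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. The key contribution is ImViD, a multi-view, multi-modal dataset with space-oriented data capture across a variety of indoor and outdoor scenes. Unlike existing datasets, ImViD supports multi-view video-audio capture while on the move, which significantly improves the completeness, flexibility, and efficiency of data capture. The paper also benchmarks existing reconstruction methods on the dataset and builds a base pipeline that constructs immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results validate the dataset and the baseline method.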

Link: https://arxiv.org/abs/2503.14359
Authors: Zhengxian Yang,Shi Pan,Shengqi Wang,Haoxiang Wang,Li Lin,Guanjun Li,Zhengqi Wen,Borong Lin,Jianhua Tao,Tao Yu
Affiliations: Tsinghua University; Migu Beijing Research Institute; Institute of Automation, Chinese Academy of Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, large 6-DoF interaction space, multi-modal feedback, and high resolution frame-rate contents. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audios) are in 5K resolution at 60FPS, lasting from 1-5 minutes, and include rich foreground-background elements, and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.

[CV-25] RFMI: Estimating Mutual Information on Rectified Flow for Text-to-Image Alignment ICLR2025

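[Quick Read]: This paper addresses the poor alignment between synthesized images and prompts in Text-to-Image (T2I) conditional generation with Rectified Flow (RF) models, such as wrong attribute binding, subject positioning, and numeracy. Existing methods for improving T2I alignment target Diffusion Models and depend on auxiliary datasets, scoring models, and linguistic analysis of the prompt, which limits their practicality. To fill these gaps, the paper proposes: 1) RFMI, a novel Mutual Information (MI) estimator that uses the pre-trained model itself for MI estimation; and 2) a self-supervised fine-tuning approach based on RFMI that needs no auxiliary information beyond the pre-trained model. Specifically, the fine-tuning set is built by selecting synthetic images, generated by the pre-trained RF model, that have high point-wise MI with their prompts, which improves T2I alignment while maintaining image quality.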

Link: https://arxiv.org/abs/2503.14358
Authors: Chao Wang,Giulio Franzese,Alessandro Finamore,Pietro Michiardi
Affiliations: EURECOM; Huawei Technologies SASU, France
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: to appear at ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy

Abstract:Rectified Flow (RF) models trained with a Flow matching framework have achieved state-of-the-art performance on Text-to-Image (T2I) conditional generation. Yet, multiple benchmarks show that synthetic images can still suffer from poor alignment with the prompt, i.e., images show wrong attribute binding, subject positioning, numeracy, etc. While the literature offers many methods to improve T2I alignment, they all consider only Diffusion Models, and require auxiliary datasets, scoring models, and linguistic analysis of the prompt. In this paper we aim to address these gaps. First, we introduce RFMI, a novel Mutual Information (MI) estimator for RF models that uses the pre-trained model itself for the MI estimation. Then, we investigate a self-supervised fine-tuning approach for T2I alignment based on RFMI that does not require auxiliary information other than the pre-trained model itself. Specifically, a fine-tuning set is constructed by selecting synthetic images generated from the pre-trained RF model and having high point-wise MI between images and prompts. Our experiments on MI estimation benchmarks demonstrate the validity of RFMI, and empirical fine-tuning on SD3.5-Medium confirms the effectiveness of RFMI for improving T2I alignment while maintaining image quality.
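
The data selection step described in the abstract can be sketched as follows. This is a minimal illustration that assumes point-wise MI scores have already been estimated for each (prompt, image) pair; the estimator itself is not shown, and the function name `select_finetuning_set` and the `keep_per_prompt` parameter are hypothetical rather than taken from the paper.

```python
from collections import defaultdict

def select_finetuning_set(samples, keep_per_prompt=1):
    """Keep, for each prompt, the generated images with the highest point-wise MI.

    samples: iterable of (prompt, image_id, pmi_score) tuples, where pmi_score
    is assumed to come from an MI estimator such as RFMI.
    Returns a list of (prompt, image_id) pairs.
    """
    by_prompt = defaultdict(list)
    for prompt, image_id, score in samples:
        by_prompt[prompt].append((score, image_id))

    selected = []
    for prompt, scored in by_prompt.items():
        # Sort candidates for this prompt from highest to lowest score.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for _, image_id in scored[:keep_per_prompt]:
            selected.append((prompt, image_id))
    return selected
```

Because the candidates are generated by the pre-trained model and scored with that same model, a selection loop of this kind needs no external dataset or scoring model, which is the self-supervised property the abstract emphasizes.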

[CV-26] MAST-Pro: Dynamic Mixture-of-Experts for Adaptive Segmentation of Pan-Tumors with Knowledge-Driven Prompts

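[Quick Read]: This paper addresses the accuracy of tumor segmentation for cancer diagnosis and treatment, which faces three main challenges: (1) limited incorporation of medical priors, (2) imbalance between generic and tumor-specific features, and (3) high computational cost for clinical adaptation. It proposes MAST-Pro, a framework that combines knowledge-driven prompts with a dynamic Mixture-of-Experts (D-MoE). Text and anatomical prompts provide domain-specific priors that guide tumor representation learning, while D-MoE dynamically selects experts to balance generic and tumor-specific feature learning, improving segmentation accuracy across tumor types. With Parameter-Efficient Fine-Tuning (PEFT), MAST-Pro achieves up to a 5.20% improvement in average Dice Similarity Coefficient (DSC) while reducing trainable parameters by 91.04%, without compromising accuracy.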

Link: https://arxiv.org/abs/2503.14355
Authors: Runqi Meng,Sifan Song,Pengfei Jin,Yujin Oh,Lin Teng,Yulin Wang,Yiqun Sun,Ling Chen,Xiang Li,Quanzheng Li,Ning Guo,Dinggang Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures

Abstract:Accurate tumor segmentation is crucial for cancer diagnosis and treatment. While foundation models have advanced general-purpose segmentation, existing methods still struggle with: (1) limited incorporation of medical priors, (2) imbalance between generic and tumor-specific features, and (3) high computational costs for clinical adaptation. To address these challenges, we propose MAST-Pro (Mixture-of-experts for Adaptive Segmentation of pan-Tumors with knowledge-driven Prompts), a novel framework that integrates dynamic Mixture-of-Experts (D-MoE) and knowledge-driven prompts for pan-tumor segmentation. Specifically, text and anatomical prompts provide domain-specific priors, guiding tumor representation learning, while D-MoE dynamically selects experts to balance generic and tumor-specific feature learning, improving segmentation accuracy across diverse tumor types. To enhance efficiency, we employ Parameter-Efficient Fine-Tuning (PEFT), optimizing MAST-Pro with significantly reduced computational overhead. Experiments on multi-anatomical tumor datasets demonstrate that MAST-Pro outperforms state-of-the-art approaches, achieving up to a 5.20% improvement in average DSC while reducing trainable parameters by 91.04%, without compromising accuracy.
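
Dynamic expert selection is often implemented with sparse top-k gating, sketched below. This is a generic Mixture-of-Experts routing example, not the D-MoE module from the paper; the function names and the choice of `k` are assumptions made for illustration.

```python
import numpy as np

def top_k_gate(logits, k):
    """Sparse routing weights: softmax over the k largest logits, zero elsewhere."""
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(logits)[-k:]  # indices of the k largest gate logits
    weights = np.zeros_like(logits)
    shifted = np.exp(logits[top] - logits[top].max())
    weights[top] = shifted / shifted.sum()
    return weights

def mixture_output(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts' outputs for input x."""
    weights = top_k_gate(gate_logits, k)
    # Only experts with a non-zero routing weight are evaluated.
    return sum(w * expert(x) for w, expert in zip(weights, experts) if w > 0)
```

Only the selected experts run for a given input, so capacity can grow with the number of experts while the per-input cost stays roughly constant.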

[CV-27] Retrospective: A CORDIC Based Configurable Activation Function for NN Applications

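[Quick Read]: This paper addresses the tension between functional flexibility and hardware efficiency in neural network accelerators for resource-constrained systems. Conventional Activation Function (AF) designs are usually fixed implementations of a specific function and lack flexibility, whereas this line of work proposed a configuration based on the Coordinate Rotation Digital Computer (CORDIC) that accelerates ASIC hardware design by providing functional reconfigurability. The key is a dynamically configurable, precision-adjustable activation function core that uses the Shift-and-Add CORDIC technique to support multiple activation functions (such as Swish, SoftMax, SeLU, and GeLU). The earlier design, optimized for MAC, Sigmoid, and Tanh, is incorporated into ReLU AFs to form the accumulative NEURIC compute unit. This improves adaptability to a range of activation functions in AI workloads and provides a building block for a resource-efficient vector engine for accelerating DNNs, RNNs/LSTMs, and Transformers, achieving a quality of results (QoR) of 98.5%.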

Link: https://arxiv.org/abs/2503.14354
Authors: Omkar Kokane,Gopal Raut,Salim Ullah,Mukul Lokhande,Adam Teman,Akash Kumar,Santosh Kumar Vishvakarma
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
Comments:

Abstract:A CORDIC-based configuration for the design of Activation Functions (AF) was previously suggested to accelerate ASIC hardware design for resource-constrained systems by providing functional reconfigurability. Since its introduction, this new approach for neural network acceleration has gained widespread popularity, influencing numerous designs for activation functions in both academic and commercial AI processors. In this retrospective analysis, we explore the foundational aspects of this initiative, summarize key developments over recent years, and introduce the DA-VINCI AF tailored for the evolving needs of AI applications. This new generation of dynamically configurable and precision-adjustable activation function cores promises greater adaptability for a range of activation functions in AI workloads, including Swish, SoftMax, SeLU, and GeLU, utilizing the Shift-and-Add CORDIC technique. The previously presented design has been optimized for MAC, Sigmoid, and Tanh functionalities and incorporated into ReLU AFs, culminating in an accumulative NEURIC compute unit. These enhancements position NEURIC as a fundamental component in the resource-efficient vector engine for the realization of AI accelerators that focus on DNNs, RNNs/LSTMs, and Transformers, achieving a quality of results (QoR) of 98.5%.
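
To show how shift-and-add CORDIC can yield activation functions, here is a textbook hyperbolic CORDIC in rotation mode. It is a floating-point emulation, not the DA-VINCI or NEURIC design: multiplications by `2.0 ** -i` stand in for hardware right shifts, `math.atanh` stands in for a small lookup table, and the final division is a simplification. All function names are hypothetical.

```python
import math

def hyperbolic_iterations(n):
    """First n iteration indices 1, 2, 3, 4, 4, 5, ...

    Hyperbolic CORDIC repeats indices 4, 13, 40, ... to guarantee convergence.
    """
    indices, i, repeat_at = [], 1, 4
    while len(indices) < n:
        indices.append(i)
        if i == repeat_at and len(indices) < n:
            indices.append(i)
            repeat_at = 3 * repeat_at + 1
        i += 1
    return indices

def cordic_tanh(z, n=24):
    """Approximate tanh(z) for |z| below roughly 1.11 (the convergence range)."""
    x, y = 1.0, 0.0
    for i in hyperbolic_iterations(n):
        d = 1.0 if z >= 0 else -1.0   # rotate toward z = 0
        shift = 2.0 ** -i             # a right shift by i bits in hardware
        x, y = x + d * y * shift, y + d * x * shift
        z -= d * math.atanh(shift)    # table constant in hardware
    # x and y carry the same CORDIC gain, so it cancels in the ratio.
    return y / x

def cordic_sigmoid(x, n=24):
    """Sigmoid via the identity sigmoid(x) = 0.5 * (1 + tanh(x / 2))."""
    return 0.5 * (1.0 + cordic_tanh(0.5 * x, n))
```

Reusing one iterative datapath for both Tanh and Sigmoid is the kind of functional reconfigurability the abstract refers to; inputs outside the convergence range would need range reduction, which this sketch omits.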

[CV-28] 3D Densification for Multi-Map Monocular VSLAM in Endoscopy

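[Quick Read]: This paper addresses the poor environment representation of sparse monocular visual SLAM on endoscopic sequences. Multi-map sparse methods robustly recover tracking after the frequent losses caused by motion blur, temporal occlusion, tool interaction, or water jets, but the reconstructed 3D point clouds are noisy, contain a high percentage of inaccurate points and significant outliers, and are too sparse for clinical use. The key is a method that combines depth predictions from the LightDepth neural network with CudaSIFT submaps and uses the robust Least Median of Squares (LMedS) estimator to remove outliers and densify the maps of the existing CudaSIFT-SLAM system. The system mitigates the scale ambiguity inherent in monocular depth estimation while filtering outliers, producing reliable densified 3D maps. Experiments report densified maps with 4.15 mm RMS accuracy on the C3VD phantom colon dataset and qualitative results on real colonoscopies from the Endomapper dataset.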

Link: https://arxiv.org/abs/2503.14346
Authors: X. Anadón,Javier Rodríguez-Puigvert,J.M.M. Montiel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-map Sparse Monocular visual Simultaneous Localization and Mapping applied to monocular endoscopic sequences has proven efficient at robustly recovering tracking after the frequent losses in endoscopy due to motion blur, temporal occlusion, tool interaction or water jets. The sparse multi-maps are adequate for robust camera localization; however, they are very poor for environment representation: they are noisy, with a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, their density is unacceptably low for clinical applications. We propose a method to remove outliers and densify the maps of the state of the art for sparse endoscopy multi-map SLAM, CudaSIFT-SLAM. The up-to-scale dense depth predictions of the LightDepth neural network are aligned with the sparse CudaSIFT submaps by means of LMedS, which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps, with 4.15 mm RMS accuracy at affordable computing time, in the C3VD phantom colon dataset. We report qualitative results on the real colonoscopy from the Endomapper dataset.
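
The idea of aligning up-to-scale depth with sparse map points using Least Median of Squares can be sketched as below. This is a simplified assumption, not the paper's pipeline: it estimates a single scale factor from per-pixel depth correspondences, tries every correspondence as a minimal sample instead of sampling randomly, and uses the usual 1.4826 robust scale factor with a 2.5 sigma inlier cut. The function name `lmeds_scale` is hypothetical.

```python
import numpy as np

def lmeds_scale(sparse_depth, predicted_depth):
    """Estimate scale s so that sparse_depth ~ s * predicted_depth, robustly.

    sparse_depth: depths of sparse map points seen in the frame.
    predicted_depth: up-to-scale network depths at the same pixels.
    Returns (scale, inlier_mask).
    """
    s = np.asarray(sparse_depth, dtype=float)
    p = np.asarray(predicted_depth, dtype=float)

    candidates = s / p  # each correspondence proposes one scale
    # Squared residuals of every correspondence under every candidate scale.
    residuals = (s[None, :] - candidates[:, None] * p[None, :]) ** 2
    medians = np.median(residuals, axis=1)

    best = int(np.argmin(medians))  # least median of squares
    scale = candidates[best]

    # Robust standard deviation from the median, then a 2.5 sigma inlier test.
    sigma = 1.4826 * np.sqrt(medians[best])
    inliers = np.abs(s - scale * p) <= 2.5 * sigma + 1e-9
    return scale, inliers
```

Because the median ignores up to half of the residuals, a minority of badly reconstructed sparse points cannot pull the scale estimate, and the inlier mask identifies the points to discard.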

[CV-29] ADAPT: An Autonomous Forklift for Construction Site Operation

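[Quick Read]: This paper addresses inefficient material logistics in the construction industry, where manual forklift operation is prone to human error and affected by labor shortages. The key contribution is the development and evaluation of the Autonomous Dynamic All-terrain Pallet Transporter (ADAPT), a fully autonomous off-road forklift designed for construction sites. Unlike structured warehouse environments, construction sites involve dynamic obstacles, unstructured terrain, and varying weather. To cope with these, the system integrates AI-driven perception with traditional approaches to decision making, planning, and control, enabling reliable operation in complex environments. Extensive real-world testing, including a comparison against an experienced human operator, shows that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.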

Link: https://arxiv.org/abs/2503.14331
Authors: Johannes Huemer,Markus Murschitz,Matthias Schörghuber,Lukas Reisinger,Thomas Kadiofsky,Christoph Weidinger,Mario Niedermeyer,Benedikt Widy,Marcel Zeilinger,Csaba Beleznai,Tobias Glück,Andreas Kugi,Patrik Zips
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments:

Abstract:Efficient material logistics play a critical role in controlling costs and schedules in the construction industry. However, manual material handling remains prone to inefficiencies, delays, and safety risks. Autonomous forklifts offer a promising solution to streamline on-site logistics, reducing reliance on human operators and mitigating labor shortages. This paper presents the development and evaluation of the Autonomous Dynamic All-terrain Pallet Transporter (ADAPT), a fully autonomous off-road forklift designed for construction environments. Unlike structured warehouse settings, construction sites pose significant challenges, including dynamic obstacles, unstructured terrain, and varying weather conditions. To address these challenges, our system integrates AI-driven perception techniques with traditional approaches for decision making, planning, and control, enabling reliable operation in complex environments. We validate the system through extensive real-world testing, comparing its long-term performance against an experienced human operator across various weather conditions. We also provide a comprehensive analysis of challenges and key lessons learned, contributing to the advancement of autonomous heavy machinery. Our findings demonstrate that autonomous outdoor forklifts can operate near human-level performance, offering a viable path toward safer and more efficient construction logistics.

[CV-30] EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment

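[Quick Read]: This paper addresses the limited generalization of dexterous robotic hands in complex environments, which stems from training on low-diversity data. Real-world scenarios vary without bound, so enumerating every possible case is impractical. To address this, the paper proposes EvolvingGrasp, an evolution-inspired grasp generation method that continuously improves grasping performance through efficient preference alignment.

The key is Handpose-wise Preference Optimization (HPO), which lets the model keep aligning with preferences from both positive and negative feedback while progressively refining its grasping strategy. To improve efficiency and reliability during online adjustment, a Physics-aware Consistency Model is incorporated into HPO; it accelerates inference, reduces the number of timesteps needed for preference fine-tuning, and ensures physical plausibility throughout. Experiments on four benchmark datasets show state-of-the-art grasp success rate and sampling efficiency, supporting robust, physically feasible, and preference-aligned grasping in both simulation and real scenarios.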

Link: https://arxiv.org/abs/2503.14329
Authors: Yufei Zhu,Yiming Zhong,Zemin Yang,Peishan Cong,Jingyi Yu,Xinge Zhu,Yuexin Ma
Affiliations: ShanghaiTech University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dexterous robotic hands often struggle to generalize effectively in complex environments due to the limitations of models trained on low-diversity data. However, the real world presents an inherently unbounded range of scenarios, making it impractical to account for every possible variation. A natural solution is to enable robots to learn from experience in complex environments, an approach akin to evolution, where systems improve through continuous feedback, learning from both failures and successes, and iterating toward optimal performance. Motivated by this, we propose EvolvingGrasp, an evolutionary grasp generation method that continuously enhances grasping performance through efficient preference alignment. Specifically, we introduce Handpose-wise Preference Optimization (HPO), which allows the model to continuously align with preferences from both positive and negative feedback while progressively refining its grasping strategies. To further enhance efficiency and reliability during online adjustments, we incorporate a Physics-aware Consistency Model within HPO, which accelerates inference, reduces the number of timesteps needed for preference finetuning, and ensures physical plausibility throughout the process. Extensive experiments across four benchmark datasets demonstrate state-of-the-art performance of our method in grasp success rate and sampling efficiency. Our results validate that EvolvingGrasp enables evolutionary grasp generation, ensuring robust, physically feasible, and preference-aligned grasping in both simulation and real scenarios.

[CV-31] LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

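[Quick Read]: This paper addresses the computational overhead of Video Variational Autoencoders (Video VAEs), which becomes a bottleneck when training Latent Video Diffusion Models (LVDMs), especially for high-resolution videos. It proposes LeanVAE, a novel and ultra-efficient Video VAE framework with two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, which drastically reduces computational cost; and (2) wavelet transforms and compressed sensing techniques that enhance reconstruction quality. Experiments show that LeanVAE performs well in video reconstruction and generation, offering up to 50x fewer FLOPs and 44x faster inference than existing methods while maintaining competitive reconstruction quality. The models and code have been released.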

Link: https://arxiv.org/abs/2503.14325
Authors: Yu Cheng,Fajie Yuan
Affiliations: Westlake University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE’s superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient video generation. Our models and code are available at this https URL.
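
Non-overlapping patch operations, mentioned as part of the lightweight architecture, can be illustrated with a simple patchify/unpatchify pair. This is a generic sketch on a single frame, assuming a channels-last layout and a patch size that divides the frame; it does not reproduce LeanVAE's NAF module, and the function names are hypothetical.

```python
import numpy as np

def patchify(frame, p):
    """Split an (H, W, C) frame into non-overlapping p x p patches.

    Returns an array of shape (H/p * W/p, p * p * C), one row per patch.
    """
    h, w, c = frame.shape
    assert h % p == 0 and w % p == 0, "patch size must divide the frame"
    x = frame.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)

def unpatchify(tokens, h, w, c, p):
    """Inverse of patchify: rebuild the (H, W, C) frame from patch rows."""
    x = tokens.reshape(h // p, w // p, p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, p, cols, p, C)
    return x.reshape(h, w, c)
```

Since each pixel belongs to exactly one patch, the operation is a pure reshape with no redundant computation, unlike overlapping convolutions with stride smaller than the kernel.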

[CV-32] PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

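[Quick Read]: This paper addresses insufficient control over facial animation in audio-driven talking face generation. Current methods lack flexibility in controlling aspects such as speaking style and emotional expression, which leads to uniform outputs. To make talking videos more diverse and user-friendly, the paper focuses on two key factors: lip-audio alignment and emotion control. The key is a new framework, PC-Talk, which achieves both through implicit keypoint deformations. The lip-audio alignment module supports precise word-level editing of speaking style and adjusts the scale of lip movements to simulate different vocal loudness levels, while the emotion control module generates vivid facial expressions through pure emotional deformation and allows fine adjustment of intensity and the combination of multiple emotions across different facial regions.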

Link: https://arxiv.org/abs/2503.14295
Authors: Baiqin Wang,Xiangyu Zhu,Fan Shen,Hao Xu,Zhen Lei
Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Psyche AI.INC; The Hong Kong University of Science and Technology; CAIR, HKISI, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.

[CV-33] Towards synthetic generation of realistic wooden logs

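[Quick Read]: This paper addresses the difficulty of accurately measuring the distribution of knots inside logs and the structure of their outer surface, both of which efficient sawmilling depends on. Computed Tomography (CT) gives accurate knot information but is often not feasible in a sawmill, so an alternative is to predict the inner structure of logs from surface measurements with machine learning; obtaining enough training data, however, remains a challenge. The paper focuses on two aspects: modeling knot growth inside the tree, and realistically synthesizing the log surface, including the regions where knots reach the surface. The result is the first log synthesis approach that generates both the internal knots and the external surface structure of wood, and the proposed mathematical log model is shown to fit real CT scan data accurately, enabling the generation of realistic logs.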

Link: https://arxiv.org/abs/2503.14277
Authors: Fedor Zolotarev,Borek Reich,Tuomas Eerola,Tomi Kauppi,Pavel Zemcik
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this work, we propose a novel method to synthetically generate realistic 3D representations of wooden logs. Efficient sawmilling heavily relies on accurate measurement of logs and the distribution of knots inside them. Computed Tomography (CT) can be used to obtain accurate information about the knots but is often not feasible in a sawmill environment. A promising alternative is to utilize surface measurements and machine learning techniques to predict the inner structure of the logs. However, obtaining enough training data remains a challenge. We focus mainly on two aspects of log generation: the modeling of knot growth inside the tree, and the realistic synthesis of the surface, including the regions where the knots reach the surface. This results in the first log synthesis approach capable of generating both the internal knot and external surface structures of wood. We demonstrate that the proposed mathematical log model accurately fits real data obtained from CT scans and enables the generation of realistic logs.

[CV-34] Free-Lunch Color-Texture Disentanglement for Stylized Image Generation

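[Quick Read]: This paper addresses the need for independently controlled style elements in the Disentangled Stylized Image Generation (DisIG) task. Existing diffusion-based methods fall short in fine-grained style customization because they struggle to control multiple style attributes, such as color and texture, at the same time. The paper proposes a tuning-free solution, Style Attributes Disentanglement (SADis). The key is to exploit the Image-Prompt Additivity property of the CLIP image embedding space to separate and extract Color-Texture Embeddings (CTE) from individual color and texture reference images. In addition, a Regularized Whitening and Coloring Transformation (RegWCT) keeps the color palette of the generated image closely aligned with the color reference, and an added noise term preserves textural fidelity, preventing the texture loss caused by the signal-leak bias in diffusion training. Experiments on the WikiArt and StyleDrop datasets show that SADis outperforms existing stylization methods both qualitatively and quantitatively.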

Link: https://arxiv.org/abs/2503.14275
Authors: Jiang Qin,Senmao Li,Alexandra Gomez-Villa,Shiqi Yang,Yaxing Wang,Kai Wang,Joost van de Weijer
Affiliations: Harbin Institute of Technology; VCIP, CS, Nankai University; Computer Vision Center; Universitat de València; Universitat Autònoma de Barcelona; SB Intuitions, SoftBank
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task.
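To make the whitening and coloring step more concrete, here is a minimal sketch of a simplified, per-channel (diagonal-covariance) variant applied to RGB tuples. It is not the paper's RegWCT: the noise term is omitted, a full WCT would use the complete covariance matrix, and the function names `channel_stats` and `diagonal_wct` are hypothetical.

```python
import math

def channel_stats(pixels):
    """Per-channel mean and standard deviation of a list of RGB tuples."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    stds = [math.sqrt(sum((p[c] - means[c]) ** 2 for p in pixels) / n)
            for c in range(3)]
    return means, stds

def diagonal_wct(content, reference, eps=1e-8):
    """Whiten each channel of `content`, then color it with `reference` statistics.

    Simplification (assumption): channels are treated independently, so only
    means and variances are matched, not cross-channel correlations.
    """
    c_mean, c_std = channel_stats(content)
    r_mean, r_std = channel_stats(reference)
    return [
        tuple((p[c] - c_mean[c]) / (c_std[c] + eps) * r_std[c] + r_mean[c]
              for c in range(3))
        for p in content
    ]
```

After the transform, the per-channel mean and spread of the content match those of the color reference, which is the intuition behind aligning a generated image's palette with a reference.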

[CV-35] Improving Adaptive Density Control for 3D Gaussian Splatting

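Quick read: This paper addresses how 3D Gaussian Splatting (3DGS) manages the number of Gaussian primitives during scene reconstruction, in particular the rendering artifacts that the adaptive density control (ADC) mechanism can introduce. The authors observe that the existing criteria for densifying and pruning Gaussians sometimes produce under-reconstructed backgrounds or overfitted foreground regions. They propose three improvements: a corrected scene extent calculation that does not rely only on camera positions, an exponentially ascending gradient threshold that improves training convergence, and a significance-aware pruning strategy that avoids background artifacts. Together these changes improve rendering quality with the same number of Gaussian primitives, make training more than twice as fast, and remain compatible with most existing derivative works of 3DGS.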

Link: https://arxiv.org/abs/2503.14274
Authors: Glenn Grubert, Florian Barthel, Anna Hilsmann, Peter Eisert
Affiliations: Humboldt Universität zu Berlin, Germany; Fraunhofer HHI, Berlin, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has become one of the most influential works in the past year. Due to its efficient and high-quality novel view synthesis capabilities, it has been widely adopted in many research fields and applications. Nevertheless, 3DGS still faces challenges to properly manage the number of Gaussian primitives that are used during scene reconstruction. Following the adaptive density control (ADC) mechanism of 3D Gaussian Splatting, new Gaussians in under-reconstructed regions are created, while Gaussians that do not contribute to the rendering quality are pruned. We observe that those criteria for densifying and pruning Gaussians can sometimes lead to worse rendering by introducing artifacts. We especially observe under-reconstructed background or overfitted foreground regions. To counter both problems, we propose three new improvements to the adaptive density control mechanism. Those include a correction for the scene extent calculation that does not rely only on camera positions, an exponentially ascending gradient threshold to improve training convergence, and a significance-aware pruning strategy to avoid background artifacts. With these adaptations, we show that the rendering quality improves while using the same number of Gaussian primitives. Furthermore, with our improvements, the training converges considerably faster, allowing for more than twice as fast training times while yielding better quality than 3DGS. Finally, our contributions are easily compatible with most existing derivative works of 3DGS making them relevant for future works.
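As a rough illustration of an exponentially ascending gradient threshold, the sketch below interpolates geometrically between a start and an end value over training. The specific values (`2e-4`, `8e-4`), the function names, and the exact schedule form are assumptions for illustration; the abstract does not state them.

```python
def gradient_threshold(step, total_steps, start=2e-4, end=8e-4):
    """Threshold that rises exponentially from `start` to `end` over training.

    `start` and `end` are placeholder values, not taken from the paper.
    """
    t = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    return start * (end / start) ** t

def should_densify(grad_norm, step, total_steps):
    """Densify a Gaussian only if its view-space gradient exceeds the threshold."""
    return grad_norm > gradient_threshold(step, total_steps)
```

Under such a schedule, densification is permissive early in training and becomes stricter later, so fewer new Gaussians are created once the scene is mostly reconstructed.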

[CV-36] Manual Labelling Artificially Inflates Deep Learning-Based Segmentation Performance on Closed Canopy: Validation Using TLS

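Quick read: This paper addresses the lack of rigorous, independent ground truth when validating individual tree crown (ITC) segmentation from drone RGB imagery in mixed unmanaged boreal and Mediterranean forests. The key step is to generate high-fidelity validation labels from co-located Terrestrial Laser Scanning (TLS) data and use them to evaluate two widely used deep learning ITC segmentation models, DeepForest and Detectree2. Validated against TLS-derived ground truth, model performance drops sharply relative to hand-labelled data, especially in Mediterranean forests, which suggests fundamental limitations of aerial segmentation approaches in closed canopy forests.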

Link: https://arxiv.org/abs/2503.14273
Authors: Matthew J. Allen, Harry J. F. Owen, Stuart W. D. Grieve, Emily R. Lines
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 3 figures

Abstract:Monitoring forest dynamics at an individual tree scale is essential for accurately assessing ecosystem responses to climate change, yet traditional methods relying on field-based forest inventories are labor-intensive and limited in spatial coverage. Advances in remote sensing using drone-acquired RGB imagery combined with deep learning models have promised precise individual tree crown (ITC) segmentation; however, existing methods are frequently validated against human-annotated images, lacking rigorous independent ground truth. In this study, we generate high-fidelity validation labels from co-located Terrestrial Laser Scanning (TLS) data for drone imagery of mixed unmanaged boreal and Mediterranean forests. We evaluate the performance of two widely used deep learning ITC segmentation models - DeepForest (RetinaNet) and Detectree2 (Mask R-CNN) - on these data, and compare to performance on further Mediterranean forest data labelled manually. When validated against TLS-derived ground truth from Mediterranean forests, model performance decreased significantly compared to assessment based on hand-labelled data from an ecologically similar site (AP50: 0.094 vs. 0.670). Restricting evaluation to only canopy trees shrank this gap considerably (Canopy AP50: 0.365), although performance was still far lower than on similar hand-labelled data. Models also performed poorly on boreal forest data (AP50: 0.142), although again increasing when evaluated on canopy trees only (Canopy AP50: 0.308). Both models showed very poor localisation accuracy at stricter IoU thresholds, even when restricted to canopy trees (Max AP75: 0.051). Similar results have been observed in studies using aerial LiDAR data, suggesting fundamental limitations in aerial-based segmentation approaches in closed canopy forests.
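The AP50 and AP75 figures depend on an intersection-over-union (IoU) cutoff for counting a prediction as correct. The sketch below shows that matching step for axis-aligned boxes only; it is a simplified illustration, not the paper's evaluation code, and it does not compute the full average precision. The function names are hypothetical, and predictions are assumed to be sorted by descending confidence.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_true_positives(preds, truths, iou_threshold):
    """Greedily match each prediction to its best unmatched ground-truth box."""
    matched, true_positives = set(), 0
    for pred in preds:  # assumed sorted by descending confidence
        best_j, best_iou = None, iou_threshold
        for j, truth in enumerate(truths):
            if j in matched:
                continue
            iou = box_iou(pred, truth)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            matched.add(best_j)
            true_positives += 1
    return true_positives
```

A crown predicted with an IoU of 0.6 counts as a hit at the 0.5 threshold but as a miss at 0.75, which is why localisation errors show up much more strongly in AP75 than in AP50.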

[CV-37] CTSR: Controllable Fidelity-Realness Trade-off Distillation for Real-World Image Super Resolution

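Quick read: This paper addresses real-world image super-resolution, where it is hard to balance fidelity to the original image against the visual realness of the generated result. Existing diffusion-based methods excel at visual realness but often struggle with this trade-off. The key contribution is a distillation-based approach that uses a geometric decomposition of fidelity and realness, together with the strengths of multiple teacher models, to reach a more balanced trade-off. The paper also explores the controllability of the trade-off, yielding a flexible and adjustable method called CTSR (Controllable Trade-off Super-Resolution). Experiments on several real-world super-resolution benchmarks show that the method surpasses existing state-of-the-art approaches on both fidelity and realness metrics.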

Link: https://arxiv.org/abs/2503.14272
Authors: Runyi Li, Bin Chen, Jian Zhang, Radu Timofte
Affiliations: School of Electronic and Computer Engineering, Peking University, China; Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Real-world image super-resolution is a critical image processing task, where two key evaluation criteria are the fidelity to the original image and the visual realness of the generated results. Although existing methods based on diffusion models excel in visual realness by leveraging strong priors, they often struggle to achieve an effective balance between fidelity and realness. In our preliminary experiments, we observe that a linear combination of multiple models outperforms individual models, motivating us to harness the strengths of different models for a more effective trade-off. Based on this insight, we propose a distillation-based approach that leverages the geometric decomposition of both fidelity and realness, alongside the performance advantages of multiple teacher models, to strike a more balanced trade-off. Furthermore, we explore the controllability of this trade-off, enabling a flexible and adjustable super-resolution process, which we call CTSR (Controllable Trade-off Super-Resolution). Experiments conducted on several real-world image super-resolution benchmarks demonstrate that our method surpasses existing state-of-the-art approaches, achieving superior performance across both fidelity and realness metrics.

[CV-38] Deep Unsupervised Segmentation of Log Point Clouds

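Quick read: This paper addresses the segmentation of log surface point clouds produced by laser scanners in sawmills, a step that underpins the prediction of a log's inner structure. X-ray CT measurement devices are the costlier and slower alternative. The key to the solution is an unsupervised point cloud segmentation technique based on Point Transformer. Its loss function uses the geometric properties of a cylinder while accounting for the shape variation common in timber logs, so that the model learns to identify the points belonging to the log surface. That segmentation is the basis for finding the fine surface details that give cues about the inner structure. The approach could also be applied to other cylindrical objects.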

Link: https://arxiv.org/abs/2503.14244
Authors: Fedor Zolotarev, Tuomas Eerola, Tomi Kauppi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In sawmills, it is essential to accurately measure the raw material, i.e. wooden logs, to optimise the sawing process. Earlier studies have shown that accurate predictions of the inner structure of the logs can be obtained using just surface point clouds produced by a laser scanner. This provides a cost-efficient and fast alternative to the X-ray CT-based measurement devices. An essential step in analysing log point clouds is segmentation, as it forms the basis for finding the fine surface details that provide the cues about the inner structure of the log. We propose a novel Point Transformer-based point cloud segmentation technique that learns to find the points belonging to the log surface in an unsupervised manner. This is obtained using a loss function that utilises the geometrical properties of a cylinder while taking into account the shape variation common in timber logs. We demonstrate the accuracy of the method on wooden logs, but the approach could also be utilised on other cylindrical objects.
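One way to picture a loss built on cylinder geometry is a probability-weighted radial residual, sketched below. This is an assumption-laden illustration, not the paper's loss: it fixes the cylinder axis along z, uses a single known radius, and ignores the taper and sweep that real logs exhibit. The names `cylinder_residual` and `weighted_cylinder_loss` are hypothetical.

```python
import math

def cylinder_residual(point, center_xy, radius):
    """Distance from a 3D point to the surface of a z-aligned cylinder."""
    x, y, _z = point
    return abs(math.hypot(x - center_xy[0], y - center_xy[1]) - radius)

def weighted_cylinder_loss(points, surface_probs, center_xy, radius):
    """Mean residual, weighted by each point's predicted log-surface probability."""
    total_weight = sum(surface_probs)
    if total_weight == 0:
        return 0.0
    weighted = sum(w * cylinder_residual(p, center_xy, radius)
                   for p, w in zip(points, surface_probs))
    return weighted / total_weight
```

A segmentation network that assigns low probability to points far from the cylinder reduces this quantity without any labels. A practical loss would also need a term that prevents the trivial solution of assigning zero probability everywhere.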

[CV-39] Make Your Training Flexible: Towards Deployment-Efficient Video Models

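Quick read: This paper addresses the sub-optimal accuracy-computation trade-off of popular video training methods, which operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid. Such methods also cannot adapt to varying computational budgets, which hinders the use of the most competitive models in real-world scenarios. The paper proposes a new test setting, Token Optimization, which maximizes input information across budgets by optimizing a size-limited set of input tokens, and introduces a novel augmentation tool called Flux. By making the sampling grid flexible and leveraging token selection, Flux is easy to adopt in most popular video training frameworks and improves model robustness at almost no additional cost. Integrated into large-scale video pre-training, the resulting FluxViT achieves state-of-the-art results across a wide range of tasks at standard cost.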

Link: https://arxiv.org/abs/2503.14237
Authors: Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; University of Chinese Academy of Sciences; University of Science and Technology of China; State Key Laboratory for Novel Software Technology, Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering applications of the most competitive model in real-world scenes. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness with nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with 1/4 tokens only, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. All models and data are available at this https URL.
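Token selection under a fixed budget can be sketched as keeping the highest-scoring tokens and preserving their original order. How tokens are scored is not described in the abstract, so the `scores` input and the function name `select_tokens` are assumptions made for illustration.

```python
def select_tokens(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in input order.

    `scores` is a hypothetical per-token importance value; the paper's actual
    selection criterion may differ.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

For example, with a budget of one quarter of the tokens, `select_tokens(scores, len(scores) // 4)` yields the subset that would be passed to the model.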

[CV-40] CRCE: Coreference-Retention Concept Erasure in Text-to-Image Diffusion Models

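Quick read: This paper addresses imprecise concept erasure in text-to-image diffusion models. Existing methods suffer from under-erasure, which leaves residual traces of the target concept, or over-erasure, which mistakenly removes unrelated but visually similar concepts. The key to the solution is CRCE (Coreference-Retention Concept Erasure), a framework that uses large language models to identify both the semantically related concepts that should be erased alongside the target and the distinct concepts that should be preserved. By explicitly modeling coreferential and retained concepts semantically, CRCE enables more precise concept removal without unintended erasure.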

Link: https://arxiv.org/abs/2503.14232
Authors: Yuyang Xue, Edward Moroshko, Feng Chen, Steven McDonagh, Sotirios A. Tsaftaris
Affiliations: School of Engineering, University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Text-to-Image diffusion models can produce undesirable content that necessitates concept erasure techniques. However, existing methods struggle with under-erasure, leaving residual traces of targeted concepts, or over-erasure, mistakenly eliminating unrelated but visually similar concepts. To address these limitations, we introduce CRCE, a novel concept erasure framework that leverages Large Language Models to identify both semantically related concepts that should be erased alongside the target and distinct concepts that should be preserved. By explicitly modeling coreferential and retained concepts semantically, CRCE enables more precise concept removal, without unintended erasure. Experiments demonstrate that CRCE outperforms existing methods on diverse erasure tasks.

[CV-41] Multi-task Learning for Identification of Porcelain in Song and Yuan Dynasties

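Quick read: This paper addresses the classification of Chinese porcelain for archaeological research and cultural heritage preservation, where traditional methods rely on expert analysis that is time-consuming, subjective, and difficult to scale. It applies deep learning (DL) and transfer learning to automate classification across four attributes: dynasty, glaze, ware, and type. The key finding is that transfer learning significantly improves accuracy, particularly on complex tasks such as type classification, where models with pre-trained weights clearly outperform models trained from scratch. MobileNetV2 and ResNet50 achieve high accuracy and robustness across all tasks, while VGG16 struggles with more diverse classifications. The paper also discusses the impact of dataset limitations and proposes future directions, including domain-specific pre-training, attention mechanisms, explainable AI methods, and generalization to other cultural artifacts.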

Link: https://arxiv.org/abs/2503.14231
Authors: Ziyao Ling, Giovanni Delnevo, Paola Salomoni, Silvia Mirri
Affiliations: University of Bologna
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Chinese porcelain holds immense historical and cultural value, making its accurate classification essential for archaeological research and cultural heritage preservation. Traditional classification methods rely heavily on expert analysis, which is time-consuming, subjective, and difficult to scale. This paper explores the application of DL and transfer learning techniques to automate the classification of porcelain artifacts across four key attributes: dynasty, glaze, ware, and type. We evaluate four Convolutional Neural Networks (CNNs) - ResNet50, MobileNetV2, VGG16, and InceptionV3 - comparing their performance with and without pre-trained weights. Our results demonstrate that transfer learning significantly enhances classification accuracy, particularly for complex tasks like type classification, where models trained from scratch exhibit lower performance. MobileNetV2 and ResNet50 consistently achieve high accuracy and robustness across all tasks, while VGG16 struggles with more diverse classifications. We further discuss the impact of dataset limitations and propose future directions, including domain-specific pre-training, integration of attention mechanisms, explainable AI methods, and generalization to other cultural artifacts.

[CV-42] HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions Real-World Validation and an Open Leaderboard

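Quick read: This paper addresses a limitation of conventional Vision-and-Language Navigation (VLN) systems, which usually focus on either the discrete (panoramic) or the continuous (free-motion) paradigm alone and overlook the complexity of human-populated, dynamic environments. It proposes a unified Human-Aware VLN (HA-VLN) benchmark that merges the two paradigms under explicit social-awareness constraints. The core elements are a standardized task definition that balances discrete-continuous navigation with personal-space requirements, an enhanced human motion dataset (HAPS 2.0), and upgraded simulators that capture realistic multi-human interactions, outdoor contexts, and refined motion-language alignment. Real-world robot tests validate sim-to-real transfer in crowded indoor spaces, underscoring the importance of integrating social context and supporting safer, more capable, and socially responsible VLN research.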

Link: https://arxiv.org/abs/2503.14229
Authors: Yifei Dong, Fengyi Wu, Qi He, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Zhi-Qi Cheng, Alexander G Hauptmann
Affiliations: University of Washington; Galbot; University of Mannheim; Carnegie Mellon University; Microsoft Research
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 27 pages, website: this https URL

Abstract:Vision-and-Language Navigation (VLN) systems often focus on either discrete (panoramic) or continuous (free-motion) paradigms alone, overlooking the complexities of human-populated, dynamic environments. We introduce a unified Human-Aware VLN (HA-VLN) benchmark that merges these paradigms under explicit social-awareness constraints. Our contributions include: 1. A standardized task definition that balances discrete-continuous navigation with personal-space requirements; 2. An enhanced human motion dataset (HAPS 2.0) and upgraded simulators capturing realistic multi-human interactions, outdoor contexts, and refined motion-language alignment; 3. Extensive benchmarking on 16,844 human-centric instructions, revealing how multi-human dynamics and partial observability pose substantial challenges for leading VLN agents; 4. Real-world robot tests validating sim-to-real transfer in crowded indoor spaces; and 5. A public leaderboard supporting transparent comparisons across discrete and continuous tasks. Empirical results show improved navigation success and fewer collisions when social context is integrated, underscoring the need for human-centric design. By releasing all datasets, simulators, agent code, and evaluation tools, we aim to advance safer, more capable, and socially responsible VLN research.

[CV-43] Panoramic Distortion-Aware Tokenization for Person Detection and Localization Using Transformers in Overhead Fisheye Images

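Quick read: This paper addresses accurate person detection in overhead fisheye images, where person rotation and small-sized persons are the main difficulties. To handle rotation, the fisheye images are converted into panoramic images; for small persons, the method exploits the geometry of the panoramas. Conventional methods tend to focus on larger persons, which occupy large significant areas in the feature maps, and lose information about smaller ones. The authors find that in equirectangular panoramic images a person's height decreases linearly near the top of the image, and they use this to aggregate tokens sorted by significance values so that the significant areas are balanced. The key component is panoramic distortion-aware tokenization, which divides a panoramic image using self-similarity figures to obtain optimal divisions without gaps and uses the maximum significance value in each tile of a token group to preserve the significant areas of smaller persons. Combining panoramic-image remapping with this tokenization yields a person detection and localization method that outperforms conventional methods on large-scale datasets.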

Link: https://arxiv.org/abs/2503.14228
Authors: Nobuhiko Wakai, Satoshi Sato, Yasunori Ishii, Takayoshi Yamashita
Affiliations: Panasonic Holdings Corporation; Chubu University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Person detection methods are used widely in applications including visual surveillance, pedestrian detection, and robotics. However, accurate detection of persons from overhead fisheye images remains an open challenge because of factors including person rotation and small-sized persons. To address the person rotation problem, we convert the fisheye images into panoramic images. For smaller people, we focused on the geometry of the panoramas. Conventional detection methods tend to focus on larger people because these larger people yield large significant areas for feature maps. In equirectangular panoramic images, we find that a person’s height decreases linearly near the top of the images. Using this finding, we leverage the significance values and aggregate tokens that are sorted based on these values to balance the significant areas. In this leveraging process, we introduce panoramic distortion-aware tokenization. This tokenization procedure divides a panoramic image using self-similarity figures that enable determination of optimal divisions without gaps, and we leverage the maximum significant values in each tile of token groups to preserve the significant areas of smaller people. To achieve higher detection accuracy, we propose a person detection and localization method that combines panoramic-image remapping and the tokenization procedure. Extensive experiments demonstrated that our method outperforms conventional methods when applied to large-scale datasets.
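The fisheye-to-panorama conversion can be pictured as a polar unwrapping: each panorama column corresponds to an angle around the image centre and each row to a radius. The sketch below computes where a panorama pixel samples from in the fisheye image. The conventions here (top row maps to the fisheye centre, radius grows linearly with the row) and the function name are assumptions; the paper's remapping may use a different projection model.

```python
import math

def panorama_to_fisheye(u, v, pano_w, pano_h, cx, cy, max_radius):
    """Fisheye (x, y) source coordinate for panorama pixel (u, v).

    Assumes column u spans 0..2*pi around (cx, cy) and row v spans
    radius 0..max_radius from top to bottom.
    """
    theta = 2.0 * math.pi * u / pano_w
    radius = max_radius * v / (pano_h - 1)
    return cx + radius * math.cos(theta), cy + radius * math.sin(theta)
```

Sampling the fisheye image at these coordinates for every panorama pixel turns rotated persons into roughly upright ones, at the cost of stretching content near the fisheye centre.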

[CV-44] Segmentation-Guided Neural Radiance Fields for Novel Street View Synthesis

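Quick read: This paper addresses the challenges of extending Neural Radiance Fields (NeRF) to large-scale outdoor scenes, such as transient objects, sparse cameras and textures, and varying lighting conditions. The key is a segmentation-guided enhancement that extends ZipNeRF and uses Grounded SAM to generate segmentation masks, which enables effective handling of transient objects, modeling of the sky, and regularization of the ground. Appearance embeddings are added to adapt to inconsistent lighting across view sequences. Experiments show that the method outperforms the baseline ZipNeRF in novel view synthesis quality, with fewer artifacts and sharper details.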

Link: https://arxiv.org/abs/2503.14219
Authors: Yizhou Li, Yusuke Monno, Masatoshi Okutomi, Yuuichi Tanaka, Seiichi Kataoka, Teruaki Kosiba
Affiliations: Institute of Science Tokyo; Micware Mobility Co., Ltd.; Micware Automotive Co., Ltd.; Micware Navigations Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Presented at VISAPP2025. Project page: this http URL

Abstract:Recent advances in Neural Radiance Fields (NeRF) have shown great potential in 3D reconstruction and novel view synthesis, particularly for indoor and small-scale scenes. However, extending NeRF to large-scale outdoor environments presents challenges such as transient objects, sparse cameras and textures, and varying lighting conditions. In this paper, we propose a segmentation-guided enhancement to NeRF for outdoor street scenes, focusing on complex urban environments. Our approach extends ZipNeRF and utilizes Grounded SAM for segmentation mask generation, enabling effective handling of transient objects, modeling of the sky, and regularization of the ground. We also introduce appearance embeddings to adapt to inconsistent lighting across view sequences. Experimental results demonstrate that our method outperforms the baseline ZipNeRF, improving novel view synthesis quality with fewer artifacts and sharper details.
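One common way to use segmentation masks for transient objects is to exclude the masked pixels from the photometric loss. The sketch below shows that idea on flat lists of pixel values; it is an assumption about how such masks could be applied, not the paper's stated loss, and `masked_mse` is a hypothetical name.

```python
def masked_mse(pred, target, transient_mask):
    """Mean squared error over pixels that are not flagged as transient.

    `transient_mask[i]` is True where a segmentation model marked pixel i as
    belonging to a moving object such as a car or pedestrian.
    """
    kept = [(p - t) ** 2
            for p, t, is_transient in zip(pred, target, transient_mask)
            if not is_transient]
    return sum(kept) / len(kept) if kept else 0.0
```

With the mask applied, a pixel covered by a passing car no longer pulls the reconstruction toward an object that is absent in other views.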

[CV-45] AI-Driven Diabetic Retinopathy Diagnosis Enhancement through Image Processing and Salp Swarm Algorithm-Optimized Ensemble Network

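Quick read: This paper addresses the early diagnosis of diabetic retinopathy (DR). Traditional diagnostic methods are time-consuming and error-prone, and single deep learning models often struggle to extract key features from complex retinal images. The paper proposes an ensemble method built as a multi-stage pipeline. First, images are pre-processed with contrast-limited adaptive histogram equalization (CLAHE) and gamma correction to make features easier to recognize. Next, the Discrete Wavelet Transform (DWT) fuses multi-resolution details into a richer dataset. Then the three best-performing pre-trained models (DenseNet169, MobileNetV1, and Xception) are selected, and an improved residual block is integrated into each to strengthen feature extraction. Finally, the base-model predictions are aggregated by a weighted ensemble whose weights are optimized with the Salp Swarm Algorithm (SSA). On the multiclass Kaggle APTOS 2019 dataset the method reaches 88.52% accuracy.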

Link: https://arxiv.org/abs/2503.14209
Authors: Saif Ur Rehman Khan, Muhammad Nabeel Asim, Sebastian Vollmer, Andreas Dengel
Affiliations: DFKI (German Research Center for Artificial Intelligence)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diabetic retinopathy is a leading cause of blindness in diabetic patients and early detection plays a crucial role in preventing vision loss. Traditional diagnostic methods are often time-consuming and prone to errors. The emergence of deep learning techniques has provided innovative solutions to improve diagnostic efficiency. However, single deep learning models frequently face issues related to extracting key features from complex retinal images. To handle this problem, we present an effective ensemble method for DR diagnosis comprising four main phases: image pre-processing, selection of backbone pre-trained models, feature enhancement, and optimization. Our methodology initiates with the pre-processing phase, where we apply CLAHE to enhance image contrast and Gamma correction is then used to adjust the brightness for better feature recognition. We then apply Discrete Wavelet Transform (DWT) for image fusion by combining multi-resolution details to create a richer dataset. Then, we select the three best-performing pre-trained models, DenseNet169, MobileNetV1, and Xception, for diverse feature extraction. To further improve feature extraction, an improved residual block is integrated into each model. Finally, the predictions from these base models are aggregated using a weighted ensemble approach, with the weights optimized using the Salp Swarm Algorithm (SSA). SSA intelligently explores the weight space and finds the optimal configuration of base architectures to maximize the performance of the ensemble model. The proposed model is evaluated on the multiclass Kaggle APTOS 2019 dataset and obtained 88.52% accuracy.
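The final aggregation step, a weighted ensemble over base-model predictions, can be sketched as a weighted average of class probabilities followed by an argmax. The weights below are supplied by the caller; the Salp Swarm Algorithm search that the paper uses to find them is not reproduced here, and `weighted_ensemble` is a hypothetical name.

```python
def weighted_ensemble(prob_lists, weights):
    """Fuse per-model class probabilities with the given weights.

    Returns the fused probability vector and the index of the predicted class.
    """
    total = sum(weights)
    n_classes = len(prob_lists[0])
    fused = [
        sum(w * probs[c] for probs, w in zip(prob_lists, weights)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=fused.__getitem__)
    return fused, predicted
```

An optimizer such as SSA would repeatedly propose weight vectors, score each by validation accuracy of the fused predictions, and keep the best one.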

[CV-46] RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images CVPR2025

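Quick read: This paper addresses the synthesis of high-fidelity novel views of unseen humans from sparse multi-view images without cumbersome per-subject optimization. Existing methods typically struggle when the views are sparse and overlap little, and they are less effective at reconstructing complex human geometry. The proposed RoGSplat lifts SMPL vertices to dense and reliable 3D prior points that represent accurate human body geometry, then regresses human Gaussian parameters from those points. To account for possible misalignment between the SMPL model and the images, it predicts image-aligned 3D prior points from both pixel-level and voxel-level features and regresses coarse Gaussians from them. To capture high-frequency details, it renders depth maps from the coarse Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments show that the method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization.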

Link: https://arxiv.org/abs/2503.14198
Authors: Junjin Xiao, Qing Zhang, Yongwei Nie, Lei Zhu, Wei-Shi Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR2025

Abstract:This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images, while requiring no cumbersome per-subject optimization. Unlike previous methods that typically struggle with sparse views with little overlap and are less effective in reconstructing complex human geometry, the proposed method enables robust reconstruction in such challenging conditions. Our key idea is to lift SMPL vertices to dense and reliable 3D prior points representing accurate human body geometry, and then regress human Gaussian parameters based on the points. To account for possible misalignment between the SMPL model and images, we propose to predict image-aligned 3D prior points by leveraging both pixel-level features and voxel-level features, from which we regress the coarse Gaussians. To enhance the ability to capture high-frequency details, we further render depth maps from the coarse 3D Gaussians to help regress fine-grained pixel-wise Gaussians. Experiments on several benchmark datasets demonstrate that our method outperforms state-of-the-art methods in novel view synthesis and cross-dataset generalization. Our code is available at this https URL.

[CV-47] Mapping Urban Villages in China: Progress and Challenges

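Quick read: This paper addresses the mapping of urban villages, an issue that has drawn growing attention as China shifts toward high-quality urbanization. Geospatial data on urban villages is scarce, which makes mapping them a priority. The paper presents a comprehensive review that summarizes the study areas, data sources, and approaches used for urban village mapping in China and identifies current challenges and future directions. It finds that existing studies cover very limited areas and periods and do not sufficiently investigate the scalability, transferability, and interpretability of identification approaches, owing to fuzzy concepts, spatial heterogeneity, and limited data availability. Future research should extend the current work in these directions to achieve large-area mapping across the whole nation.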

Link: https://arxiv.org/abs/2503.14195
Authors: Rui Cao, Wei Tu, Dongsheng Chen, Wenyu Zhang
Affiliations: Hong Kong University of Science and Technology
Categories: Databases (cs.DB); Computer Vision and Pattern Recognition (cs.CV)
Comments: Updated review at this https URL

Abstract:The shift toward high-quality urbanization has brought increased attention to the issue of “urban villages”, which has become a prominent social problem in China. However, there is a lack of available geospatial data on urban villages, making it crucial to prioritize urban village mapping. In order to assess the current progress in urban village mapping and identify challenges and future directions, we have conducted a comprehensive review, which to the best of our knowledge is the first of its kind in this field. Our review begins by providing a clear context for urban villages and elaborating the method for literature review, then summarizes the study areas, data sources, and approaches used for urban village mapping in China. We also address the challenges and future directions for further research. Through thorough investigation, we find that current studies only cover very limited study areas and periods and lack sufficient investigation into the scalability, transferability, and interpretability of identification approaches due to the challenges in concept fuzziness and variances, spatial heterogeneity and variances of urban villages, and data availability. Future research can complement and further the current research in the following potential directions in order to achieve large-area mapping across the whole nation…

[CV-48] Bridging Past and Future: End-to-End Autonomous Driving with Historical Prediction and Planning CVPR2025

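Quick read: This paper addresses how existing end-to-end autonomous driving methods use historical information. Current methods aggregate history either through dense historical bird's-eye-view (BEV) features or by querying a sparse memory bank, but these paradigms either omit historical information in motion planning or fail to align with its multi-step nature, which requires predicting or planning multiple future time steps. The proposed BridgeAD reformulates motion and planning queries as multi-step queries so that the queries for each future time step are distinct. This lets historical prediction and planning be applied to the appropriate parts of the end-to-end system according to time step, improving both perception and motion planning. Specifically, historical queries for the current frame are combined with perception, while queries for future frames are integrated with motion planning. Experiments on the nuScenes dataset in both open-loop and closed-loop settings show that BridgeAD achieves state-of-the-art performance.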

Link: https://arxiv.org/abs/2503.14182
Authors: Bozhou Zhang, Nan Song, Xin Jin, Li Zhang
Affiliations: School of Data Science, Fudan University; Eastern Institute of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:End-to-end autonomous driving unifies tasks in a differentiable framework, enabling planning-oriented optimization and attracting growing attention. Current methods aggregate historical information either through dense historical bird’s-eye-view (BEV) features or by querying a sparse memory bank, following paradigms inherited from detection. However, we argue that these paradigms either omit historical information in motion planning or fail to align with its multi-step nature, which requires predicting or planning multiple future time steps. In line with the philosophy of future is a continuation of past, we propose BridgeAD, which reformulates motion and planning queries as multi-step queries to differentiate the queries for each future time step. This design enables the effective use of historical prediction and planning by applying them to the appropriate parts of the end-to-end system based on the time steps, which improves both perception and motion planning. Specifically, historical queries for the current frame are combined with perception, while queries for future frames are integrated with motion planning. In this way, we bridge the gap between past and future by aggregating historical insights at every time step, enhancing the overall coherence and accuracy of the end-to-end autonomous driving pipeline. Extensive experiments on the nuScenes dataset in both open-loop and closed-loop settings demonstrate that BridgeAD achieves state-of-the-art performance.

[CV-49] Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images

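Quick read: This paper addresses slow rendering and common artifacts of 3D Gaussian Splatting (3DGS) on lightweight GPUs. The key is a gradient-aware upscaling technique for 3DGS images: it directly uses the analytical image gradients of the Gaussians for gradient-based bicubic spline interpolation, which upscales low-resolution 3DGS renderings at a marginal increase in cost while keeping reconstruction fidelity high. The technique is agnostic to the specific 3DGS implementation and achieves novel view synthesis at rates 3x to 4x higher than the baseline implementation. The paper also integrates gradient-aware upscaling into the gradient-based optimization of a 3DGS model and analyzes its effects on reconstruction quality and performance.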

Link: https://arxiv.org/abs/2503.14171
Authors: Simon Niedermayr, Christoph Neuhauser, Rüdiger Westermann
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:We introduce an image upscaling technique tailored for 3D Gaussian Splatting (3DGS) on lightweight GPUs. Compared to 3DGS, it achieves significantly higher rendering speeds and reduces artifacts commonly observed in 3DGS reconstructions. Our technique upscales low-resolution 3DGS renderings with a marginal increase in cost by directly leveraging the analytical image gradients of Gaussians for gradient-based bicubic spline interpolation. The technique is agnostic to the specific 3DGS implementation, achieving novel view synthesis at rates 3x-4x higher than the baseline implementation. Through extensive experiments on multiple datasets, we showcase the performance improvements and high reconstruction fidelity attainable with gradient-aware upscaling of 3DGS images. We further demonstrate the integration of gradient-aware upscaling into the gradient-based optimization of a 3DGS model and analyze its effects on reconstruction quality and performance.
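To see why known gradients help interpolation, consider a one-dimensional analogue: cubic Hermite interpolation between two samples using both their values and their derivatives. This is only a sketch of the underlying idea, assuming unit spacing between samples; the paper's method works on 2D images with bicubic splines, and `cubic_hermite` is a hypothetical name.

```python
def cubic_hermite(p0, m0, p1, m1, t):
    """Interpolate between samples p0 and p1 with derivatives m0 and m1.

    `t` is the position in [0, 1] between the two samples (unit spacing assumed).
    """
    t2, t3 = t * t, t * t * t
    h00 = 2 * t3 - 3 * t2 + 1   # weight of the left value
    h10 = t3 - 2 * t2 + t       # weight of the left derivative
    h01 = -2 * t3 + 3 * t2      # weight of the right value
    h11 = t3 - t2               # weight of the right derivative
    return h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1
```

Because the derivatives are supplied rather than estimated from neighbouring samples, a cubic signal such as f(x) = x^3 is reproduced exactly between x = 0 and x = 1 from two samples alone, for example `cubic_hermite(0.0, 0.0, 1.0, 3.0, 0.5)`.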

[CV-50] CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models

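【Quick Read】: This paper addresses the weakness of Vision-Language Models (VLMs) in Continuous Space Perception: the ability to reconstruct an entire space from the series of spatially continuous images produced when a scene is observed from a static viewpoint while the orientation shifts. Current benchmarks usually focus on spatially irrelevant images or on discrete images captured from varied viewpoints, and overlook the compositional characteristic of images taken from a static viewpoint. The key contribution is CoSpace, a multi-image visual understanding benchmark with 2,918 images and 1,626 question-answer pairs covering seven task types, designed to assess the continuous space perception ability of VLMs. An evaluation of 19 proprietary and open-source VLMs shows that most models, including proprietary ones, fall short on this ability, and that the main gap between open-source and proprietary models lies in response consistency rather than accuracy.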

Link: https://arxiv.org/abs/2503.14161
Authors: Yiqi Zhu,Ziyue Wang,Can Zhang,Peng Li,Yang Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Vision-Language Models (VLMs) have recently witnessed significant progress in visual comprehension. As the permitted length of image context grows, VLMs can now comprehend a broader range of views and spaces. Current benchmarks provide insightful analysis of VLMs in tasks involving complex visual instruction following, multi-image understanding and spatial reasoning. However, they usually focus on spatially irrelevant images or discrete images captured from varied viewpoints. The compositional characteristic of images captured from a static viewpoint remains underestimated. We term this characteristic Continuous Space Perception. Observing a scene from a static viewpoint while shifting orientations produces a series of spatially continuous images, enabling the reconstruction of the entire space. In this paper, we present CoSpace, a multi-image visual understanding benchmark designed to assess the Continuous Space perception ability for VLMs. CoSpace contains 2,918 images and 1,626 question-answer pairs, covering seven types of tasks. We conduct evaluation across 19 proprietary and open-source VLMs. Results reveal that there exist pitfalls on the continuous space perception ability for most of the evaluated models, including proprietary ones. Interestingly, we find that the main discrepancy between open-source and proprietary models lies not in accuracy but in the consistency of responses. We believe that enhancing the ability of continuous space perception is essential for VLMs to perform effectively in real-world tasks and encourage further research to advance this capability.

[CV-51] RBFIM: Perceptual Quality Assessment for Compressed Point Clouds Using Radial Basis Function Interpolation

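【Quick Read】: This paper addresses how to evaluate perceived distortion in point cloud compression (PCC) so that codecs can be optimized for perceptual quality. Current standard practice has a major limitation: single-feature metrics are widely used to assess compression distortion, but the classic point-to-point nearest-neighbor search often fails to build precise correspondences between point clouds, so human perceptual features are captured poorly. The key to the solution is a new assessment method, RBFIM, which uses radial basis function (RBF) interpolation to convert the discrete point features of the distorted point cloud into a continuous feature function. Substituting the geometry coordinates of the original point cloud into this function yields bijective sets of point features, which establishes precise feature correspondences between the distorted and original point clouds, significantly improves the accuracy of quality assessment, and avoids the complexity of bidirectional searches. Experiments show that RBFIM performs well on human perception tasks and provides robust support for PCC optimization.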

Link: https://arxiv.org/abs/2503.14154
Authors: Zhang Chen,Shuai Wan,Siyu Ren,Fuzheng Yang,Mengting Yu,Junhui Hou
Affiliations: School of Electronics and Information, Northwestern Polytechnical University; School of Engineering, Royal Melbourne Institute of Technology; Department of Computer Science, City University of Hong Kong; School of Telecommunication Engineering, Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments:

Abstract:One of the main challenges in point cloud compression (PCC) is how to evaluate the perceived distortion so that the codec can be optimized for perceptual quality. Current standard practices in PCC highlight a primary issue: while single-feature metrics are widely used to assess compression distortion, the classic method of searching point-to-point nearest neighbors frequently fails to adequately build precise correspondences between point clouds, resulting in an ineffective capture of human perceptual features. To overcome the related limitations, we propose a novel assessment method called RBFIM, utilizing radial basis function (RBF) interpolation to convert discrete point features into a continuous feature function for the distorted point cloud. By substituting the geometry coordinates of the original point cloud into the feature function, we obtain the bijective sets of point features. This enables an establishment of precise corresponding features between distorted and original point clouds and significantly improves the accuracy of quality assessments. Moreover, this method avoids the complexity caused by bidirectional searches. Extensive experiments on multiple subjective quality datasets of compressed point clouds demonstrate that our RBFIM excels in addressing human perception tasks, thereby providing robust support for PCC optimization efforts.
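The interpolation step described above can be sketched with a small Gaussian RBF fit. This is an illustration under stated assumptions rather than the RBFIM implementation: the Gaussian kernel, the `eps` shape parameter, the helper names `fit_rbf` and `eval_rbf`, and the toy coordinates are all hypothetical, and the abstract does not say which kernel or features the authors use.

```python
import numpy as np


def fit_rbf(points, features, eps=1.0):
    """Fit Gaussian RBF weights so the interpolant matches features at points."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-eps * d2)
    return np.linalg.solve(kernel, features)


def eval_rbf(points, weights, queries, eps=1.0):
    """Evaluate the fitted feature function at arbitrary query coordinates."""
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-eps * d2) @ weights


# Hypothetical toy data: 4 distorted points with one scalar feature each.
distorted_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
distorted_feat = np.array([0.2, 0.5, 0.1, 0.9])
original_xyz = np.array([[0.1, 0.0, 0.0], [0.9, 0.1, 0.0]])

weights = fit_rbf(distorted_xyz, distorted_feat)
# One interpolated feature per original point: a one-to-one pairing.
feat_at_original = eval_rbf(distorted_xyz, weights, original_xyz)
```

Because the feature function is evaluated at every original coordinate, each original point receives exactly one distorted-side feature, so no nearest-neighbor search in either direction is needed.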

[CV-52] Concat-ID: Towards Universal Identity-Preserving Video Synthesis

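【Quick Read】: This paper addresses identity-preserving video generation in both single-identity and multi-identity settings. It proposes Concat-ID, a unified framework whose key idea is to extract image features with Variational Autoencoders (VAEs) and concatenate them with the video latents along the sequence dimension, fusing the features with 3D self-attention alone and without additional modules. A novel cross-video pairing strategy and a multi-stage training regimen balance identity consistency against facial editability while improving video naturalness. Experiments show that Concat-ID outperforms existing methods in single- and multi-identity generation and scales seamlessly to multi-subject scenarios such as virtual try-on and background-controllable generation.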

Link: https://arxiv.org/abs/2503.14151
Authors: Yong Zhong,Zhuoyi Yang,Jiayan Teng,Xiaotao Gu,Chongxuan Li
Affiliations: Gaoling School of AI, Renmin University of China; Tsinghua University; Zhipu AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present Concat-ID, a unified framework for identity-preserving video generation. Concat-ID employs Variational Autoencoders to extract image features, which are concatenated with video latents along the sequence dimension, leveraging solely 3D self-attention mechanisms without the need for additional modules. A novel cross-video pairing strategy and a multi-stage training regimen are introduced to balance identity consistency and facial editability while enhancing video naturalness. Extensive experiments demonstrate Concat-ID’s superiority over existing methods in both single and multi-identity generation, as well as its seamless scalability to multi-subject scenarios, including virtual try-on and background-controllable generation. Concat-ID establishes a new benchmark for identity-preserving video synthesis, providing a versatile and scalable solution for a wide range of applications.
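A rough shape-level sketch of concatenating along the sequence dimension follows. It is not the Concat-ID architecture: the single-head attention with identity projections, the token counts, and the random latents are assumptions chosen only to show that reference-image tokens and video tokens share one attention sequence instead of being merged along channels.

```python
import numpy as np


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


rng = np.random.default_rng(0)
channels = 8
video_latents = rng.normal(size=(2 * 4 * 4, channels))  # frames*height*width tokens
image_latents = rng.normal(size=(4 * 4, channels))      # one reference image

# Concatenate along the sequence (token) axis, not the channel axis.
sequence = np.concatenate([image_latents, video_latents], axis=0)
fused = self_attention(sequence)
# Keep only the video positions after attention.
video_out = fused[image_latents.shape[0]:]
```

Adding a second identity in this sketch only lengthens the sequence, which is one way to read the abstract's claim that no additional modules are required.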

[CV-53] Comparative and Interpretative Analysis of CNN and Transformer Models in Predicting Wildfire Spread Using Remote Sensing Data

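【Quick Read】: This paper addresses the uncertainty in selecting deep learning methods for wildfire prediction, which stems from the lack of quantitative and explainable comparative analysis and hinders both better prevention measures and model refinement. The key to the solution is a thorough comparison of the performance, efficiency, and explainability of four prevalent deep learning architectures (Autoencoder, ResNet, UNet, and Transformer-based Swin-UNet). The comparison finds that Transformer-based Swin-UNet and UNet generally achieve better predictive accuracy and interpretability, which the paper attributes to the advanced attention mechanisms of Swin-UNet and the effective use of skip connections in both UNet and Swin-UNet. Explainable AI (XAI) techniques further improve model transparency and trustworthiness and reveal that UNet and Swin-UNet focus more effectively on critical features such as "Previous Fire Mask", "Drought", and "Vegetation" while maintaining balanced attention to the remaining features, which leads to their superior predictive ability. These comparisons offer insights for future model design and guidance for model selection in different scenarios.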

Link: https://arxiv.org/abs/2503.14150
Authors: Yihang Zhou,Ruige Kong,Zhengsen Xu,Linlin Xu,Sibo Cheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract: Facing the escalating threat of global wildfires, numerous computer vision techniques using remote sensing data have been applied in this area. However, the selection of deep learning methods for wildfire prediction remains uncertain due to the lack of comparative analysis in a quantitative and explainable manner, which is crucial for improving prevention measures and refining models. This study aims to thoroughly compare the performance, efficiency, and explainability of four prevalent deep learning architectures: Autoencoder, ResNet, UNet, and Transformer-based Swin-UNet. Employing a real-world dataset that includes nearly a decade of remote sensing data from California, U.S., these models predict the spread of wildfires for the following day. Through detailed quantitative comparison analysis, we discovered that Transformer-based Swin-UNet and UNet generally outperform Autoencoder and ResNet, particularly due to the advanced attention mechanisms in Transformer-based Swin-UNet and the efficient use of skip connections in both UNet and Transformer-based Swin-UNet, which contribute to superior predictive accuracy and model interpretability. We then applied XAI techniques to all four models, which not only enhances the clarity and trustworthiness of the models but also promotes focused improvements in wildfire prediction capabilities. The XAI analysis reveals that UNet and Transformer-based Swin-UNet are able to focus on critical features such as ‘Previous Fire Mask’, ‘Drought’, and ‘Vegetation’ more effectively than the other two models, while also maintaining balanced attention to the remaining features, leading to their superior performance. The insights from our thorough comparative analysis offer substantial implications for future model design and also provide guidance for model selection in different scenarios.

[CV-54] Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding CVPR2025

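【Quick Read】: This paper addresses the alignment of the visual and language modalities in document-level multi-modal large language models (MLLMs), and in particular how to design a suitable image-text pre-training task to bridge the two modalities. The key to the solution is a new visual-language alignment method, VQAMask, which casts the problem as Visual Question Answering with mask generation and optimizes two tasks simultaneously: VQA-based text parsing and mask generation. The former lets the model implicitly align images and text at the semantic level. The latter adds an explicit mask generator (discarded during inference) that ensures spatially-aware alignment between visual texts within images and their corresponding image regions, which prevents hallucinations when parsing visual text and promotes spatially-aware feature representation learning. To support the VQAMask task, the authors build a comprehensive image-mask generation pipeline and provide MTMask6M, a large-scale dataset with 6M data points. Experiments show that adding the mask generation task yields competitive document-level understanding performance, and the resulting training-efficient Marten model performs strongly on multiple document-centric tasks.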

Link: https://arxiv.org/abs/2503.14140
Authors: Zining Wang,Tongkun Guan,Pei Fu,Chen Duan,Qianyi Jiang,Zhentao Guo,Shan Guo,Junfeng Luo,Wei Shen,Xiaokang Yang
Affiliations: Meituan; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025

Abstract:Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at this https URL.

[CV-55] Exploring Disparity-Accuracy Trade-offs in Face Recognition Systems: The Role of Datasets Architectures and Loss Functions AAAI

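【Quick Read】: This paper addresses the uneven performance of deep-learning-based face recognition systems (FRSs) across demographic groups, and in particular the accuracy-disparity trade-off in the gender prediction task. Its focus is an in-depth analysis of how three core components (model architecture, loss function, and face image dataset) affect model performance and bias. The authors build ten deep learning models from different architectures, combine them with four loss functions, and benchmark them on seven face datasets across 266 evaluation configurations. They find that the three components affect accuracy and disparity both individually and in combination. The paper stresses that inherent properties of a dataset can make different models perform similarly, and that the choice of dataset directly determines a model's perceived bias. In addition, the models fail to generalize a uniform definition of a "female face" as opposed to a "male face", which is closely tied to dataset diversity. The paper reveals the nature of these factors and offers model developers a blueprint for designing fairer, unbiased systems.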

Link: https://arxiv.org/abs/2503.14138
Authors: Siddharth D Jaiswal,Sagnik Basu,Sandipan Sikdar,Animesh Mukherjee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: This work has been accepted for publication at AAAI ICWSM 2025

Abstract: Automated Face Recognition Systems (FRSs), developed using deep learning models, are deployed worldwide for identity verification and facial attribute analysis. The performance of these models is determined by a complex interdependence among the model architecture, optimization/loss function and datasets. Although FRSs have surpassed human-level accuracy, they continue to be disparate against certain demographics. Due to the ubiquity of applications, it is extremely important to understand the impact of the three components (model architecture, loss function and face image dataset) on the accuracy-disparity trade-off to design better, unbiased platforms. In this work, we perform an in-depth analysis of three FRSs for the task of gender prediction, with various architectural modifications resulting in ten deep-learning models coupled with four loss functions and benchmark them on seven face datasets across 266 evaluation configurations. Our results show that all three components have an individual as well as a combined impact on both accuracy and disparity. We identify that datasets have an inherent property that causes them to perform similarly across models, independent of the choice of loss functions. Moreover, the choice of dataset determines the model’s perceived bias: the same model reports bias in opposite directions for three gender-balanced datasets of "in-the-wild" face images of popular individuals. Studying the facial embeddings shows that the models are unable to generalize a uniform definition of what constitutes a "female face" as opposed to a "male face", due to dataset diversity. We provide recommendations to model developers on using our study as a blueprint for model development and subsequent deployment.

[CV-56] SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models CVPR2025

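【Quick Read】: This paper addresses the limitations of existing foundation models in sketch understanding. Specifically, Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches and shows a pronounced spatial-frequency bias that suppresses the low-frequency components needed for sketch understanding. To address these challenges, the paper combines SD with CLIP, using CLIP's strong semantic understanding to compensate for SD's spatial-frequency bias. The key is to dynamically inject CLIP features into SD's denoising process and adaptively aggregate features across semantic levels, which yields state-of-the-art performance on sketch retrieval, recognition, segmentation, and correspondence learning, and which the authors present as the first truly universal sketch feature representation in the era of foundation models.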

Link: https://arxiv.org/abs/2503.14129
Authors: Subhadeep Koley,Tapas Kumar Dutta,Aneeshan Sain,Pinaki Nath Chowdhury,Ayan Kumar Bhunia,Yi-Zhe Song
Affiliations: SketchX, CVSSP, University of Surrey; iFlyTek-Surrey Joint Research Centre on Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR 2025. Project page available at this https URL

Abstract:While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD’s spatial-frequency biases. By dynamically injecting CLIP features into SD’s denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.

[CV-57] Condensing Action Segmentation Datasets via Generative Network Inversion CVPR2025

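【Quick Read】: This paper addresses the tension between storage efficiency and retained performance for the procedural video datasets used in temporal action segmentation (TAS). It proposes a novel condensation framework whose key idea is to use a generative prior learned from the dataset together with network inversion to condense video data into compact latent codes, which significantly reduces storage along both the temporal and channel dimensions. Sampling diverse and representative action sequences further reduces redundancy across videos. Experiments on standard benchmarks show effective condensation with competitive performance: on the Breakfast dataset, storage is reduced by more than 500× while 83% of the performance of training on the full dataset is retained, and the method outperforms the state of the art on a downstream incremental learning task.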

Link: https://arxiv.org/abs/2503.14112
Authors: Guodong Ding,Rongyu Chen,Angela Yao
Affiliations: National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 5 tables, Accepted to CVPR2025

Abstract: This work presents the first condensation approach for procedural video datasets used in temporal action segmentation. We propose a condensation framework that leverages a generative prior learned from the dataset and network inversion to condense data into compact latent codes, with storage significantly reduced along the temporal and channel dimensions. Orthogonally, we propose sampling diverse and representative action sequences to minimize video-wise redundancy. Our evaluation on standard benchmarks demonstrates consistent effectiveness in condensing TAS datasets and achieving competitive performances. Specifically, on the Breakfast dataset, our approach reduces storage by over 500× while retaining 83% of the performance compared to training with the full dataset. Furthermore, when applied to a downstream incremental learning task, it yields superior performance compared to the state-of-the-art.

[CV-58] Towards properties of adversarial image perturbations

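【Quick Read】: This paper studies adversarial perturbations, produced with a stochastic gradient approach, that cause noticeable growth of the VMAF (Video Multimethod Assessment Fusion) image quality metric. It examines the structure of these perturbations, analyzes how it depends on the acceptable PSNR values, and characterizes the perturbations in the frequency domain through Fourier power spectrum computations. The study finds that a moderate change of image brightness (about 10 pixel units in a restricted region of the image) can raise VMAF by about 60% while subjective image quality remains almost unchanged. The paper also shows that perturbation amplitudes may depend approximately linearly on image brightness, and that the perturbations are obtained by direct VMAF optimization in PyTorch. However, when the same method is used to restore images from noise, the metric values and subjective judgements disagree significantly. In short, the paper examines how VMAF can be raised while subjective quality is preserved, with the key being the mechanism that generates such adversarial perturbations.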

Link: https://arxiv.org/abs/2503.14111
Authors: Egor Kuznetsov,Kirill Aistov,Maxim Koroteev
Affiliations: Huawei Luzin Research Center, Moscow, Russia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 40 figures

Abstract: Using a stochastic gradient approach, we study the properties of adversarial perturbations resulting in noticeable growth of the VMAF image quality metric. The structure of the perturbations is investigated depending on the acceptable PSNR values and based on the Fourier power spectrum computations for the perturbations. It is demonstrated that moderate variation of image brightness (~10 pixel units in a restricted region of an image) can result in VMAF growth by ~60%. Unlike some other methods demonstrating similar VMAF growth, the subjective quality of an image remains almost unchanged. It is also shown that the adversarial perturbations may demonstrate approximately linear dependence of perturbation amplitudes on the image brightness. The perturbations are studied based on the direct VMAF optimization in PyTorch. The significant discrepancies between the metric values and subjective judgements are also demonstrated when image restoration from noise is carried out using the same direct VMAF optimization.
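To give a feel for what a brightness change of about 10 pixel units in a restricted region means in PSNR terms, the sketch below computes PSNR for a toy image. The 8x8 grey image, the 4x4 perturbed block, and the 8-bit peak value of 255 are assumptions for illustration; the paper's actual images, regions, and PSNR thresholds are not given in the abstract, and VMAF itself is not computed here.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error = 0.0
    count = 0
    for ref_row, dist_row in zip(reference, distorted):
        for r, d in zip(ref_row, dist_row):
            squared_error += (r - d) ** 2
            count += 1
    mse = squared_error / count
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical 8x8 grey image; brighten only the top-left 4x4 block by 10.
size = 8
image = [[100.0] * size for _ in range(size)]
perturbed = [row[:] for row in image]
for y in range(4):
    for x in range(4):
        perturbed[y][x] += 10.0

value_db = psnr(image, perturbed)  # MSE = 25, so about 34.15 dB
```

Confining the perturbation to a quarter of the pixels lowers the mean squared error fourfold compared with perturbing the whole image, which is why a localized change can stay within a given PSNR budget.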

[CV-59] Operational Change Detection for Geographical Information: Overview and Challenges

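【Quick Read】: This paper addresses the need for prompt and effective updates to the large-scale geographic databases maintained by national mapping agencies, as territories evolve rapidly under climate change and human activity. It reviews change detection methods suited to the operational updating of such databases. It first defines the nature of change along several dimensions, from temporal to semantic characterization, and groups automatic change detection methods into four families (rule-based, statistical, machine learning, and simulation methods), discussing the strengths, limitations, and applicability of each. It then identifies key applications for national mapping agencies, including optimized geospatial database updating and the monitoring of change-based phenomena and dynamics. Finally, it highlights current challenges: the variability of change definitions, the lack of relevant large-scale datasets, the diversity of input data, unstudied no-change detection, human-in-the-loop integration, and operational constraints. The paper concludes that ongoing innovation in change detection is necessary to meet the future needs of geographic information systems.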

Link: https://arxiv.org/abs/2503.14109
Authors: Nicolas Gonthier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint under review

Abstract: Rapid evolution of territories due to climate change and human impact requires prompt and effective updates to geospatial databases maintained by the National Mapping Agency. This paper presents a comprehensive overview of change detection methods tailored for the operational updating of large-scale geographic databases. This review first outlines the fundamental definition of change, emphasizing its multifaceted nature, from temporal to semantic characterization. It categorizes automatic change detection methods into four main families: rule-based, statistical, machine learning, and simulation methods. The strengths, limitations, and applicability of every family are discussed in the context of various input data. Then, key applications for National Mapping Agencies are identified, particularly the optimization of geospatial database updating, change-based phenomena, and dynamics monitoring. Finally, the paper highlights the current challenges for leveraging change detection, such as the variability of change definition, the lack of relevant large-scale datasets, the diversity of input data, the unstudied no-change detection, human-in-the-loop integration, and the operational constraints. The discussion underscores the necessity for ongoing innovation in change detection techniques to address the future needs of geographic information systems for national mapping agencies.

[CV-60] Reliable uncertainty quantification for 2D/3D anatomical landmark localization using multi-output conformal prediction

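【Quick Read】: This paper addresses inaccurate uncertainty quantification in automatic anatomical landmark localization for medical imaging, where current approaches, especially those combined with normality assumptions, systematically underestimate total predictive uncertainty. The key innovation is to introduce the conformal prediction (CP) framework and to propose two new methods that guarantee finite-sample validity for multi-output prediction: Multi-output Regression-as-Classification Conformal Prediction (M-R2CCP) and its variant, Multi-output Regression to Classification Conformal Prediction set to Region (M-R2C2R). Unlike conventional methods that produce axis-aligned hyperrectangular or ellipsoidal regions, these methods generate flexible, non-convex prediction regions that better capture the uncertainty structure of landmark predictions. Extensive evaluation on multiple 2D and 3D datasets shows that the proposed methods outperform existing approaches in both validity and efficiency, giving clinical decision support reliable confidence measures.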

Link: https://arxiv.org/abs/2503.14106
Authors: Jef Jonkers,Frank Coopman,Luc Duchateau,Glenn Van Wallendael,Sofie Van Hoecke
Affiliations: IDLab; Department of Electronics and Information Systems, Ghent University, Belgium; Biometrics Research Group, Department of Morphology, Imaging, Orthopedics, Rehabilitation and Nutrition, Ghent University, Belgium; Ghent University - imec, Belgium
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 33 pages, 10 figures

Abstract:Automatic anatomical landmark localization in medical imaging requires not just accurate predictions but reliable uncertainty quantification for effective clinical decision support. Current uncertainty quantification approaches often fall short, particularly when combined with normality assumptions, systematically underestimating total predictive uncertainty. This paper introduces conformal prediction as a framework for reliable uncertainty quantification in anatomical landmark localization, addressing a critical gap in automatic landmark localization. We present two novel approaches guaranteeing finite-sample validity for multi-output prediction: Multi-output Regression-as-Classification Conformal Prediction (M-R2CCP) and its variant Multi-output Regression to Classification Conformal Prediction set to Region (M-R2C2R). Unlike conventional methods that produce axis-aligned hyperrectangular or ellipsoidal regions, our approaches generate flexible, non-convex prediction regions that better capture the underlying uncertainty structure of landmark predictions. Through extensive empirical evaluation across multiple 2D and 3D datasets, we demonstrate that our methods consistently outperform existing multi-output conformal prediction approaches in both validity and efficiency. This work represents a significant advancement in reliable uncertainty estimation for anatomical landmark localization, providing clinicians with trustworthy confidence measures for their diagnoses. While developed for medical imaging, these methods show promise for broader applications in multi-output regression problems.
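For orientation, the finite-sample guarantee that conformal prediction provides can be sketched with the standard split-conformal recipe. This is the textbook baseline, not M-R2CCP or M-R2C2R: it uses the Euclidean error as nonconformity score and therefore yields a circular region, which is exactly the kind of rigid shape the paper aims to improve on. The function name `conformal_radius` and the calibration data are hypothetical.

```python
import math


def conformal_radius(calib_pred, calib_true, alpha=0.1):
    """Radius r such that a ball of radius r around a new prediction covers
    the true landmark with probability >= 1 - alpha (exchangeable data)."""
    # Nonconformity score: Euclidean distance between prediction and truth.
    scores = sorted(math.dist(p, t) for p, t in zip(calib_pred, calib_true))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[rank - 1]


# Hypothetical calibration set: 9 predicted vs. true 2D landmarks whose
# errors are 1, 2, ..., 9 pixels along the x-axis.
calib_true = [(50.0, 50.0)] * 9
calib_pred = [(50.0 + e, 50.0) for e in range(1, 10)]

radius = conformal_radius(calib_pred, calib_true, alpha=0.25)
```

The `(n + 1)` correction in the rank is what turns an empirical quantile into a coverage guarantee that holds for any calibration set size, without assuming normally distributed errors.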

[CV-61] SCJD: Sparse Correlation and Joint Distillation for Efficient 3D Human Pose Estimation

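【Quick Read】: This paper targets two problems: existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from heavy computation and slow inference, and knowledge distillation methods fail to handle the spatial relationships between joints and the temporal correlations in multi-frame inputs. It proposes a new framework, Sparse Correlation and Joint Distillation (SCJD). The key components are: Sparse Correlation Input Sequence Downsampling, which reduces redundancy in the student network's inputs while preserving inter-frame correlations; Dynamic Joint Spatial Attention Distillation, comprising Dynamic Joint Embedding Distillation and Adjacent Joint Attention Distillation, which strengthens the student's feature representation and its spatial understanding of adjacent-joint relationships; and Temporal Consistency Distillation, which aligns the temporal correlations of the teacher and student networks through upsampling and global supervision. Together, these components balance efficiency and accuracy.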

Link: https://arxiv.org/abs/2503.14097
Authors: Weihong Chen,Xuemiao Xu,Haoxin Yang,Yi Xie,Peng Xiao,Cheng Xu,Huaidong Zhang,Pheng-Ann Heng
Affiliations: South China University of Technology; The Hong Kong Polytechnic University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student’s feature representation using the teacher’s multi-frame context feature, and Adjacent Joint Attention Distillation to improve the student network’s focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at this https URL.

[CV-62] Limb-Aware Virtual Try-On Network with Progressive Clothing Warping

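【Quick Read】: This paper addresses two main problems in image-based virtual try-on. First, existing methods usually warp clothing with a single global deformation, which lacks fine-grained modeling of the in-shop clothing and distorts its appearance. Second, because they rely on a clothing-agnostic person representation, they do not generate limb details well. To solve these problems, the paper proposes the Limb-aware Virtual Try-on Network (PL-VTON). Its core is Progressive Clothing Warping (PCW), which warps clothing progressively at a fine-grained level, combined with a gravity-aware loss for handling clothing edges; a Person Parsing Estimator (PPE) with a non-limb target parsing map that provides structural constraints and reduces texture bleeding; and a Limb-aware Texture Fusion (LTF) module that focuses on generating realistic details in limb regions. Together, these components aim at accurate and realistic try-on results.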

Link: https://arxiv.org/abs/2503.14074
Authors: Shengping Zhang,Xiaoyu Han,Weigang Zhang,Xiangyuan Lan,Hongxun Yao,Qingming Huang
Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Multimedia (TMM). The code is available at this https URL

Abstract:Image-based virtual try-on aims to transfer an in-shop clothing image to a person image. Most existing methods adopt a single global deformation to perform clothing warping directly, which lacks fine-grained modeling of in-shop clothing and leads to distorted clothing appearance. In addition, existing methods usually fail to generate limb details well because they are limited by the used clothing-agnostic person representation without referring to the limb textures of the person image. To address these problems, we propose Limb-aware Virtual Try-on Network named PL-VTON, which performs fine-grained clothing warping progressively and generates high-quality try-on results with realistic limb details. Specifically, we present Progressive Clothing Warping (PCW) that explicitly models the location and size of in-shop clothing and utilizes a two-stage alignment strategy to progressively align the in-shop clothing with the human body. Moreover, a novel gravity-aware loss that considers the fit of the person wearing clothing is adopted to better handle the clothing edges. Then, we design Person Parsing Estimator (PPE) with a non-limb target parsing map to semantically divide the person into various regions, which provides structural constraints on the human body and therefore alleviates texture bleeding between clothing and body regions. Finally, we introduce Limb-aware Texture Fusion (LTF) that focuses on generating realistic details in limb regions, where a coarse try-on result is first generated by fusing the warped clothing image with the person image, then limb textures are further fused with the coarse result under limb-aware guidance to refine limb details. Extensive experiments demonstrate that our PL-VTON outperforms the state-of-the-art methods both qualitatively and quantitatively.

[CV-63] Fast Autoregressive Video Generation with Diagonal Decoding

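【Quick Read】: This paper addresses the efficiency bottleneck of sequential token-by-token decoding in autoregressive Transformer models for video generation, particularly for long videos represented by tens of thousands of tokens. It proposes Diagonal Decoding (DiagD), a training-free inference acceleration algorithm. The key idea is to exploit the spatial and temporal correlations in videos by generating tokens along diagonal paths in the spatial-temporal token grid, which enables parallel decoding within each frame and partially overlapping decoding across consecutive frames. The paper also proposes a cost-effective finetuning strategy that aligns the model's attention patterns with the decoding order, further narrowing the training-inference gap on small-scale models. Experiments show that DiagD achieves up to a 10× speedup over naive sequential decoding while maintaining comparable visual fidelity.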

Link: https://arxiv.org/abs/2503.14070
Authors: Yang Ye,Junliang Guo,Haoyu Wu,Tianyu He,Tim Pearce,Tabish Rashid,Katja Hofmann,Jiang Bian
Affiliations: Microsoft Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. In this paper, we propose Diagonal Decoding (DiagD), a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Furthermore, we propose a cost-effective finetuning strategy that aligns the attention patterns of the model with our decoding order, further mitigating the training-inference gap on small-scale models. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to 10× speedup compared to naive sequential decoding, while maintaining comparable visual fidelity.
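The source of the speedup can be seen by counting decoding steps for a single frame. The sketch below assumes that a "diagonal" is the set of grid positions with equal `row + col` and that each diagonal is decoded in one parallel step; the abstract does not specify the exact paths or how decoding overlaps across frames, so this only illustrates the intra-frame case with a hypothetical helper `diagonal_schedule`.

```python
def diagonal_schedule(height, width):
    """Group the token positions of one frame into anti-diagonals.

    Positions (row, col) with the same row + col form one group; in this
    sketch each group stands for one parallel decoding step.
    """
    steps = [[] for _ in range(height + width - 1)]
    for row in range(height):
        for col in range(width):
            steps[row + col].append((row, col))
    return steps


schedule = diagonal_schedule(height=3, width=4)
sequential_steps = 3 * 4          # one token per step
diagonal_steps = len(schedule)    # one anti-diagonal per step
```

For an H x W token grid this reduces the step count from H * W to H + W - 1 under the stated assumption, so the gap widens as frames get larger.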

[CV-64] AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark

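【Quick Read】: This paper addresses the lack of a unified framework, the fragmentation of methodologies, and the limited generalization in AI-generated video evaluation. Existing metrics do not categorize methodologies systematically, which limits a holistic understanding of the evaluation landscape; implementations are fragmented and lack standardized interfaces, which adds redundant processing overhead; and many existing methods depend on specific datasets, which limits their applicability across diverse video domains. To address these challenges, the paper proposes AIGVE-Tool, a unified framework whose key idea is to integrate multiple evaluation methodologies within a novel five-category taxonomy and to allow flexible customization through a modular configuration system, yielding a comprehensive and extensible evaluation pipeline. The paper also introduces AIGVE-Bench, a large-scale benchmark dataset built with five SOTA video generation models, which systematically evaluates video generation models across nine critical quality dimensions and is used to demonstrate that the toolkit provides standardized and reliable evaluation results.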

Link: https://arxiv.org/abs/2503.14064
Authors: Xinhao Xiang,Xiao Liu,Zizhong Li,Zhuosheng Liu,Jiawei Zhang
Affiliations: Institution1; Institution2
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The rapid advancement in AI-generated video synthesis has led to a growing demand for standardized and effective evaluation metrics. Existing metrics lack a unified framework for systematically categorizing methodologies, limiting a holistic understanding of the evaluation landscape. Additionally, fragmented implementations and the absence of standardized interfaces lead to redundant processing overhead. Furthermore, many prior approaches are constrained by dataset-specific dependencies, limiting their applicability across diverse video domains. To address these challenges, we introduce AIGVE-Tool (AI-Generated Video Evaluation Toolkit), a unified framework that provides a structured and extensible evaluation pipeline for comprehensive AI-generated video evaluation. Organized within a novel five-category taxonomy, AIGVE-Tool integrates multiple evaluation methodologies while allowing flexible customization through a modular configuration system. Additionally, we propose AIGVE-Bench, a large-scale benchmark dataset created with five SOTA video generation models based on hand-crafted instructions and prompts. This dataset systematically evaluates various video generation models across nine critical quality dimensions. Extensive experiments demonstrate the effectiveness of AIGVE-Tool in providing standardized and reliable evaluation results, highlighting specific strengths and limitations of current models and facilitating the advancement of next-generation AI-generated video techniques.
zh

[CV-65] Foundation Feature-Driven Online End-Effector Pose Estimation: A Marker-Free and Learning-Free Approach

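【Quick read】: This paper addresses two limitations: traditional hand-eye calibration relies on offline image collection and is unsuitable for online self-calibration, while learning-based robot pose estimation struggles with cross-robot generalization and requires the robot to be fully visible. The key contribution is the Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, which is training-free and generalizes across end-effectors. FEEPE uses pre-trained visual features to derive 2D-3D correspondences between the CAD model and the target image, then estimates the 6D pose with the PnP algorithm. To resolve ambiguities caused by partial observations and symmetry, it adds a pose optimization algorithm enhanced with multiple historical key frames, which uses temporal information to improve accuracy. Compared with traditional hand-eye calibration, FEEPE enables marker-free online calibration; unlike robot pose estimation methods, it generalizes across robots and end-effectors without training. Experiments demonstrate its flexibility, generalization, and performance.

Link: https://arxiv.org/abs/2503.14051
Authors: Tianshu Wu, Jiyao Zhang, Shiqian Liang, Zhengxiao Han, Hao Dong
Affiliations: Center on Frontiers of Computing Studies, School of Computer Science, Peking University; PKU-Agibot Lab; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: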

Abstract:Accurate transformation estimation between camera space and robot space is essential. Traditional methods using markers for hand-eye calibration require offline image collection, limiting their suitability for online self-calibration. Recent learning-based robot pose estimation methods, while advancing online calibration, struggle with cross-robot generalization and require the robot to be fully visible. This work proposes a Foundation feature-driven online End-Effector Pose Estimation (FEEPE) algorithm, characterized by its training-free and cross end-effector generalization capabilities. Inspired by the zero-shot generalization capabilities of foundation models, FEEPE leverages pre-trained visual features to estimate 2D-3D correspondences derived from the CAD model and target image, enabling 6D pose estimation via the PnP algorithm. To resolve ambiguities from partial observations and symmetry, a multi-historical key frame enhanced pose optimization algorithm is introduced, utilizing temporal information for improved accuracy. Compared to traditional hand-eye calibration, FEEPE enables marker-free online calibration. Unlike robot pose estimation, it generalizes across robots and end-effectors in a training-free manner. Extensive experiments demonstrate its superior flexibility, generalization, and performance.

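The PnP step mentioned above looks for the pose that best explains a set of 2D-3D correspondences. As a rough illustration (not the authors' code), the sketch below computes the reprojection error that a PnP solver typically minimizes. The pinhole intrinsics, the four model points, and the helper names `project` and `mean_reprojection_error` are all hypothetical.

```python
import math

def project(point, rotation, translation, fx, fy, cx, cy):
    """Project a 3D model point into the image with a pinhole camera."""
    # Camera-frame coordinates: X_c = R @ X + t
    xc = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
          for i in range(3)]
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)

def mean_reprojection_error(points_3d, pixels, rotation, translation,
                            fx, fy, cx, cy):
    """Mean pixel distance between observed pixels and projected points."""
    total = 0.0
    for point, (u, v) in zip(points_3d, pixels):
        pu, pv = project(point, rotation, translation, fx, fy, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(points_3d)

# Hypothetical setup: identity rotation, object 2 units in front of the camera.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
INTRINSICS = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
model_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
true_t = [0.0, 0.0, 2.0]
observed = [project(p, IDENTITY, true_t, **INTRINSICS) for p in model_points]

# The true pose reproduces the observations; a shifted pose does not.
error_true = mean_reprojection_error(model_points, observed, IDENTITY,
                                     true_t, **INTRINSICS)
error_off = mean_reprojection_error(model_points, observed, IDENTITY,
                                    [0.1, 0.0, 2.0], **INTRINSICS)
```

In practice one would hand the correspondences to an existing PnP solver rather than evaluate candidate poses by hand; the point here is only what "fitting a pose to 2D-3D correspondences" measures.
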
[CV-66] MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization

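【Quick read】: This paper addresses information loss and limited realism in existing full-body co-speech gesture generation. Existing methods typically pair an autoregressive model with vector-quantized tokens, which loses information and reduces the realism of the generated gestures. The paper proposes MAG, a multi-modal aligned framework that synthesizes high-quality, diverse co-speech gestures without discrete tokenization.

The key components are: (1) a motion-text-audio-aligned variational autoencoder (MTA-VAE) that uses text and audio embeddings from pre-trained WavCaps to strengthen the semantic and rhythmic alignment of motion, producing more realistic gestures; and (2) a multimodal masked autoregressive model (MMAG) built on top of it, which performs autoregressive modeling over continuous motion embeddings through diffusion and thereby avoids the information loss of vector quantization. MMAG also includes a hybrid-granularity audio-text fusion block that conditions the diffusion process to further ensure multi-modal consistency. Experiments on two benchmark datasets show state-of-the-art performance both quantitatively and qualitatively.

Link: https://arxiv.org/abs/2503.14040
Authors: Binjie Liu, Lina Liu, Sanyi Zhang, Songen Gu, Yihao Zhi, Tianyi Zhu, Lei Yang, Long Ye
Affiliations: Communication University of China; China Mobile Research Institute, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences; The Chinese University of Hong Kong
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: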

Abstract:This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model accompanied by vector-quantized tokens for gesture generation, which results in information loss and compromises the realism of the generated gestures. To address this, inspired by the natural continuity of real-world human motion, we propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization. Specifically, (1) we introduce a motion-text-audio-aligned variational autoencoder (MTA-VAE), which leverages pre-trained WavCaps' text and audio embeddings to enhance both semantic and rhythmic alignment with motion, ultimately producing more realistic gestures. (2) Building on this, we propose a multimodal masked autoregressive model (MMAG) that enables autoregressive modeling in continuous motion embeddings through diffusion without vector quantization. To further ensure multi-modal consistency, MMAG incorporates a hybrid granularity audio-text fusion block, which serves as conditioning for the diffusion process. Extensive experiments on two benchmark datasets demonstrate that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. The code will be released to facilitate future research.

[CV-67] Intra and Inter Parser-Prompted Transformers for Effective Image Restoration AAAI-25

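【Quick read】: This paper targets image restoration, that is, recovering high-quality images from degraded observations, covering image deraining, defocus deblurring, desnowing, and low-light enhancement. It proposes Intra and Inter Parser-Prompted Transformers (PPTformer), which consists of two parts: an Image Restoration Network (IRNet) that restores images, and a Parser-Prompted Feature Generation Network (PPFGNet) that supplies IRNet with reliable parser information. To better integrate parser information into IRNet, the paper proposes two attention mechanisms, Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA), which learn parser features implicitly and explicitly, respectively. A parser-prompted feed-forward network further guides restoration through pixel-wise gating modulation. Together, these components achieve state-of-the-art restoration performance.

Link: https://arxiv.org/abs/2503.14037
Authors: Cong Wang, Jinshan Pan, Liyan Wang, Wei Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This version is accepted by the Association for the Advancement of Artificial Intelligence (AAAI-25)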

Abstract:We propose Intra and Inter Parser-Prompted Transformers (PPTformer) that explore useful features from visual foundation models for image restoration. Specifically, PPTformer contains two parts: an Image Restoration Network (IRNet) for restoring images from degraded observations and a Parser-Prompted Feature Generation Network (PPFGNet) for providing IRNet with reliable parser information to boost restoration. To enhance the integration of the parser within IRNet, we propose Intra Parser-Prompted Attention (IntraPPA) and Inter Parser-Prompted Attention (InterPPA) to implicitly and explicitly learn useful parser features to facilitate restoration. The IntraPPA re-considers cross attention between parser and restoration features, enabling implicit perception of the parser from a long-range and intra-layer perspective. Conversely, the InterPPA initially fuses restoration features with those of the parser, followed by formulating these fused features within an attention mechanism to explicitly perceive parser information. Further, we propose a parser-prompted feed-forward network to guide restoration within pixel-wise gating modulation. Experimental results show that PPTformer achieves state-of-the-art performance on image deraining, defocus deblurring, desnowing, and low-light enhancement.

[CV-68] A Revisit to the Decoder for Camouflaged Object Detection BMVC2024

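【Quick read】: This paper addresses camouflaged object detection (COD), which aims to accurately segment camouflaged objects hidden in their background. Because such objects closely resemble the background, the decoder must be designed to extract their features effectively and to delineate their complex boundaries carefully. The key contribution is a novel architecture that augments the prevalent decoding strategy with an Enrich Decoder and a Retouch Decoder. The Enrich Decoder uses channel-wise attention to amplify the feature channels that matter most for COD, and the Retouch Decoder then refines the segmentation map with spatial attention focused on important pixels such as boundary regions. Experiments show that the two components complement each other and improve performance across various encoders.

Link: https://arxiv.org/abs/2503.14035
Authors: Seung Woo Ko, Joopyo Hong, Suyoung Kim, Seungjai Bang, Sungzoon Cho, Nojun Kwak, Hyung-Sin Kim, Joonseok Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in BMVC 2024, 13 pages, 7 figures (Appendix: 5 pages, 2 figures)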

Abstract:Camouflaged object detection (COD) aims to generate a fine-grained segmentation map of camouflaged objects hidden in their background. Due to the hidden nature of camouflaged objects, it is essential for the decoder to be tailored to effectively extract proper features of camouflaged objects and extra-carefully generate their complex boundaries. In this paper, we propose a novel architecture that augments the prevalent decoding strategy in COD with Enrich Decoder and Retouch Decoder, which help to generate a fine-grained segmentation map. Specifically, the Enrich Decoder amplifies the channels of features that are important for COD using channel-wise attention. Retouch Decoder further refines the segmentation maps by spatially attending to important pixels, such as the boundary regions. With extensive experiments, we demonstrate that ENTO shows superior performance using various encoders, with the two novel components playing their unique roles that are mutually complementary.

[CV-69] Uncertainty-Aware Global-View Reconstruction for Multi-View Multi-Label Feature Selection AAAI25

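【Quick read】: This paper addresses feature selection in multi-view multi-label learning (MVML), aiming to balance model performance and efficiency. Existing methods usually extract information separately from the consistency part and the complementary part, and the unclear separation between the two can introduce noise. The key contribution is a unified model built from the perspective of global-view reconstruction, which incorporates awareness of sample uncertainty into the reconstruction process to improve trustworthiness. The global view is reconstructed from the graph structure between samples, sample confidence, and the relationships between views, and an accurate mapping is established between the reconstructed view and the label matrix. Experiments on multi-view datasets show the method's superior performance.

Link: https://arxiv.org/abs/2503.14024
Authors: Pingting Hao, Kunpeng Liu, Wanfu Gao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, accepted at AAAI 25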

Abstract:In recent years, multi-view multi-label learning (MVML) has gained popularity due to its close resemblance to real-world scenarios. However, the challenge of selecting informative features to ensure both performance and efficiency remains a significant question in MVML. Existing methods often extract information separately from the consistency part and the complementary part, which may result in noise due to unclear segmentation. In this paper, we propose a unified model constructed from the perspective of global-view reconstruction. Additionally, while feature selection methods can discern the importance of features, they typically overlook the uncertainty of samples, which is prevalent in realistic scenarios. To address this, we incorporate the perception of sample uncertainty during the reconstruction process to enhance trustworthiness. Thus, the global-view is reconstructed through the graph structure between samples, sample confidence, and the view relationship. The accurate mapping is established between the reconstructed view and the label matrix. Experimental results demonstrate the superior performance of our method on multi-view datasets.

[CV-70] MP-GUI: Modality Perception with MLLMs for GUI Understanding CVPR2025

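【Quick read】: This paper addresses two challenges that multi-modal large language models (MLLMs) face in graphical user interface (GUI) understanding: existing models do not explicitly model the spatial structure specific to GUIs, and high-quality spatial structure data is hard to obtain because of privacy issues and noisy environments. The paper proposes MP-GUI, an MLLM designed specifically for GUI understanding. Its key idea is to extract graphical, textual, and spatial modalities with three specialized perceivers, apply a spatial structure refinement strategy, and combine the results adaptively through a fusion gate to suit the needs of different GUI understanding tasks. To mitigate the scarcity of training data, the paper also introduces an automatic data collection pipeline. Experiments show that MP-GUI performs well with limited data.

Link: https://arxiv.org/abs/2503.14021
Authors: Ziwei Wang, Weizhi Chen, Leyang Yang, Sheng Zhou, Shengchu Zhao, Hanbei Zhan, Jiongchao Jin, Liangcheng Li, Zirui Shao, Jiajun Bu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Paper accepted to CVPR 2025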

Abstract:Graphical user interfaces (GUIs) have become integral to modern society, making GUI understanding crucial for human-centric systems. However, unlike natural images or documents, GUIs comprise artificially designed graphical elements arranged to convey specific semantic meanings. Current multi-modal large language models (MLLMs), although already proficient in processing graphical and textual components, struggle with GUI understanding because they lack explicit spatial structure modeling. Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To address these challenges, we present MP-GUI, a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers that extract graphical, textual, and spatial modalities from the screen as GUI-tailored visual clues, together with a spatial structure refinement strategy; the clues are adaptively combined via a fusion gate to meet the specific preferences of different GUI understanding tasks. To cope with the scarcity of training data, we also introduce a pipeline for automatic data collection. Extensive experiments demonstrate that MP-GUI achieves impressive results on various GUI understanding tasks with limited data.

[CV-71] Boosting Semi-Supervised Medical Image Segmentation via Masked Image Consistency and Discrepancy Learning

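【Quick read】: This paper addresses a shortcoming of semi-supervised medical image segmentation with unlabeled data. Within the co-training framework, prior work focuses mainly on network initialization differences and pseudo-label generation and overlooks the balance between information exchange and the preservation of model diversity. The paper proposes the Masked Image Consistency and Discrepancy Learning (MICD) framework, built on three modules. The Masked Cross Pseudo Consistency (MCPC) module enriches context perception and small-sample learning through pseudo-labeling across masked-input branches. The Cross Feature Consistency (CFC) module strengthens information exchange and model robustness by enforcing consistency between decoder features. The Cross Model Discrepancy (CMD) module uses EMA teacher networks to oversee outputs and preserve branch diversity. Together, the modules focus on fine-grained local information and maintain diversity within a heterogeneous framework. Experiments on two public medical image datasets, AMOS and Synapse, show that the method outperforms state-of-the-art approaches.

Link: https://arxiv.org/abs/2503.14013
Authors: Pengcheng Zhou, Lantian Zhang, Wei Li
Affiliations: School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; School of Computer Science and Engineering, Southeast University, Nanjing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: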

Abstract:Semi-supervised learning is of great significance in medical image segmentation by exploiting unlabeled data. Among its strategies, the co-training framework is prominent. However, previous co-training studies predominantly concentrate on network initialization variances and pseudo-label generation, while overlooking the equilibrium between information interchange and model diversity preservation. In this paper, we propose the Masked Image Consistency and Discrepancy Learning (MICD) framework with three key modules. The Masked Cross Pseudo Consistency (MCPC) module enriches context perception and small sample learning via pseudo-labeling across masked-input branches. The Cross Feature Consistency (CFC) module fortifies information exchange and model robustness by ensuring decoder feature consistency. The Cross Model Discrepancy (CMD) module utilizes EMA teacher networks to oversee outputs and preserve branch diversity. Together, these modules address existing limitations by focusing on fine-grained local information and maintaining diversity in a heterogeneous framework. Experiments on two public medical image datasets, AMOS and Synapse, demonstrate that our approach outperforms state-of-the-art methods.

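The CMD module above relies on EMA teacher networks. As general background (not the MICD implementation), an exponential-moving-average teacher is usually maintained by blending the student's weights into the teacher's after each training step. The sketch below treats weights as a flat list of floats; the decay value and the name `ema_update` are assumptions for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

# Hypothetical weights: the teacher drifts slowly toward a fixed student.
teacher = [1.0, 0.0]
student = [0.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is why such teachers tend to give smoother supervision targets than the student itself.
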
[CV-72] LEGNet: Lightweight Edge-Gaussian Driven Network for Low-Quality Remote Sensing Image Object Detection

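【Quick read】: This paper addresses remote sensing object detection (RSOD) in complex visual environments, in particular the reduced feature discriminability of low-quality remote sensing images caused by low spatial resolution, sensor noise, blurred objects, low-light degradation, and partial occlusion. These factors reduce foreground-background contrast, introduce structural discontinuities in edge representations, and produce ambiguous feature responses under varying illumination, which weakens model robustness and deployment feasibility. The paper proposes LEGNet, a lightweight network whose key innovation is an edge-Gaussian aggregation (EGA) module that combines Scharr operator-based edge priors with uncertainty-aware Gaussian modeling: (a) orientation-aware Scharr filters preserve high-frequency edge details with rotational invariance; (b) uncertainty-aware Gaussian layers probabilistically refine low-confidence features through variance estimation. The design improves precision while keeping the architecture simple. Evaluations show that LEGNet achieves state-of-the-art performance on five benchmark datasets while remaining computationally efficient, making it suitable for resource-constrained edge devices.

Link: https://arxiv.org/abs/2503.14012
Authors: Wei Lu, Si-Bao Chen, Hui-Dong Li, Qing-Ling Shu, Chris H. Q. Ding, Jin Tang, Bin Luo
Affiliations: MOE Key Laboratory of ICSP, IMIS Laboratory of Anhui, Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Zenmorn-AHU AI Joint Laboratory, School of Computer Science and Technology, Anhui University, Hefei 230601, China; School of Data Science (SDS), Chinese University of Hong Kong, Shenzhen, Shenzhen 518172, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 5 figures. Remote Sensing Image Object Detection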

Abstract:Remote sensing object detection (RSOD) faces formidable challenges in complex visual environments. Aerial and satellite images inherently suffer from limitations such as low spatial resolution, sensor noise, blurred objects, low-light degradation, and partial occlusions. These degradation factors collectively compromise the feature discriminability in detection models, resulting in three key issues: (1) reduced contrast that hampers foreground-background separation, (2) structural discontinuities in edge representations, and (3) ambiguous feature responses caused by variations in illumination. These collectively weaken model robustness and deployment feasibility. To address these challenges, we propose LEGNet, a lightweight network that incorporates a novel edge-Gaussian aggregation (EGA) module specifically designed for low-quality remote sensing images. Our key innovation lies in the synergistic integration of Scharr operator-based edge priors with uncertainty-aware Gaussian modeling: (a) The orientation-aware Scharr filters preserve high-frequency edge details with rotational invariance; (b) The uncertainty-aware Gaussian layers probabilistically refine low-confidence features through variance estimation. This design enables precision enhancement while maintaining architectural simplicity. Comprehensive evaluations across four RSOD benchmarks (DOTA-v1.0, v1.5, DIOR-R, FAIR1M-v1.0) and a UAV-view dataset (VisDrone2019) demonstrate significant improvements. LEGNet achieves state-of-the-art performance across five benchmark datasets while ensuring computational efficiency, making it well-suited for deployment on resource-constrained edge devices in real-world remote sensing applications. The code is available at this https URL.

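For readers unfamiliar with the Scharr operator named above, the sketch below applies the standard 3x3 Scharr kernels to a tiny grayscale image and returns the gradient magnitude. It illustrates only the classical filter, not how LEGNet embeds it in the EGA module; the function name and the toy image are hypothetical.

```python
import math

# Standard 3x3 Scharr kernels for horizontal and vertical gradients.
SCHARR_X = [[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]]
SCHARR_Y = [[-3, -10, -3], [0, 0, 0], [3, 10, 3]]

def scharr_magnitude(image):
    """Gradient magnitude of a 2D grayscale image given as a list of rows.

    Kernels are applied by cross-correlation; border pixels stay at 0.0.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SCHARR_X[dy][dx] * pixel
                    gy += SCHARR_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out

# Toy image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
edges = scharr_magnitude(image)
```

The 3/10/3 weighting gives the Scharr kernels better rotational symmetry than the 1/2/1 Sobel weights, which is the usual reason for preferring them when edge orientation matters.
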
[CV-73] MeshFleet: Filtered and Annotated 3D Vehicle Dataset for Domain Specific Generative Modeling

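【Quick read】: This paper addresses the limited applicability of large generative models in specific domains such as engineering, where existing models do not deliver the accuracy, quality, and controllability that domain tasks require. The key to the solution is building high-quality, domain-specific 3D datasets for fine-tuning large models, for which data filtering and annotation are the main bottleneck. The paper proposes an automated data filtering pipeline based on a quality classifier. The classifier is trained on a manually labeled subset of Objaverse, combines DINOv2 and SigLIP embeddings, and is refined through caption-based analysis and uncertainty estimation. Experiments show that the approach outperforms techniques based on captions and image aesthetic scores, and fine-tuning experiments with SV3D highlight the importance of targeted data selection for domain-specific 3D generative modeling.

Link: https://arxiv.org/abs/2503.14002
Authors: Damian Boborzi, Phillip Mueller, Jonas Emrich, Dominik Schmid, Sebastian Mueller, Lars Mikelsons
Affiliations: University of Augsburg; BMW Group; TU Darmstadt
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: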

Abstract:Generative models have recently made remarkable progress in the field of 3D objects. However, their practical application in fields like engineering remains limited since they fail to deliver the accuracy, quality, and controllability needed for domain-specific tasks. Fine-tuning large generative models is a promising perspective for making these models available in these fields. Creating high-quality, domain-specific 3D datasets is crucial for fine-tuning large generative models, yet the data filtering and annotation process remains a significant bottleneck. We present MeshFleet, a filtered and annotated 3D vehicle dataset extracted from Objaverse-XL, the most extensive publicly available collection of 3D objects. Our approach proposes a pipeline for automated data filtering based on a quality classifier. This classifier is trained on a manually labeled subset of Objaverse, incorporating DINOv2 and SigLIP embeddings, refined through caption-based analysis and uncertainty estimation. We demonstrate the efficacy of our filtering method through a comparative analysis against caption and image aesthetic score-based techniques and fine-tuning experiments with SV3D, highlighting the importance of targeted data selection for domain-specific 3D generative modeling.

[CV-74] Multimodal Feature-Driven Deep Learning for the Prediction of Duck Body Dimensions and Weight

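【Quick read】: This paper addresses non-invasive, accurate measurement of body dimensions and weight in poultry such as ducks. Traditional methods rely on manual handling, which is inefficient, error-prone, and potentially stressful for the animals. The paper proposes a deep learning model that uses multimodal data (2D RGB images from different views, depth maps, and 3D point clouds) to estimate duck body dimensions and weight without contact. The key idea is to extract key feature points from 3D point clouds with PointNet++, fuse multi-view convolutional 2D features with 3D geometric features, and use a Transformer encoder to capture long-range dependencies and refine feature interactions, which improves prediction robustness and accuracy. Across eight morphometric parameters, the model achieves a mean absolute percentage error (MAPE) of 6.33% and an R² of 0.953.

Link: https://arxiv.org/abs/2503.14001
Authors: Yi Xiao, Qiannan Han, Guiping Liang, Hongyan Zhang, Song Wang, Zhihao Xu, Weican Wan, Chuang Li, Guitao Jiang, Wenbo Xiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: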

Abstract:Accurate body dimension and weight measurements are critical for optimizing poultry management, health assessment, and economic efficiency. This study introduces an innovative deep learning-based model leveraging multimodal data (2D RGB images from different views, depth images, and 3D point clouds) for the non-invasive estimation of duck body dimensions and weight. A dataset of 1,023 Linwu ducks, comprising over 5,000 samples with diverse postures and conditions, was collected to support model training. The proposed method innovatively employs PointNet++ to extract key feature points from point clouds, extracts and computes corresponding 3D geometric features, and fuses them with multi-view convolutional 2D features. A Transformer encoder is then utilized to capture long-range dependencies and refine feature interactions, thereby enhancing prediction robustness. The model achieved a mean absolute percentage error (MAPE) of 6.33% and an R² of 0.953 across eight morphometric parameters, demonstrating strong predictive capability. Unlike conventional manual measurements, the proposed model enables high-precision estimation while eliminating the necessity for physical handling, thereby reducing animal stress and broadening its application scope. This study marks the first application of deep learning techniques to poultry body dimension and weight estimation, providing a valuable reference for the intelligent and precise management of the livestock industry with far-reaching practical significance.

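The two metrics reported above, MAPE and R², have standard definitions that are easy to check by hand. The sketch below implements them in plain Python on made-up numbers; the sample values are hypothetical and unrelated to the duck dataset.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (targets must be nonzero)."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical measurements and predictions (e.g. body weight in kg).
y_true = [2.0, 4.0, 5.0]
y_pred = [2.0, 4.0, 4.0]
sample_mape = mape(y_true, y_pred)
sample_r2 = r2_score(y_true, y_pred)
```

MAPE is scale-free, which makes it convenient when averaging over parameters with different units, while R² indicates how much of the variance in the measurements the predictions explain.
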
[CV-75] BI-RADS prediction of mammographic masses using uncertainty information extracted from a Bayesian Deep Learning model

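【Quick read】: This paper addresses uncertainty in BI-RADS classification of mammograms, which arises from significant variability in how radiologists describe breast masses. It proposes predicting the BI-RADS score from uncertainty information extracted by a Bayesian deep learning model, in order to support the radiologist's final decision. The key is a Bayesian deep learning model that can quantify and report its uncertainty about the malignancy of a lesion. Validated against pathology information, the model reaches F1-scores of 73.33%, 59.60%, and 59.26% on the BI-RADS 2, 3, and 5 samples, respectively, outperforming the radiologist; it also correctly identifies all malignant samples in the BI-RADS 0 category as BI-RADS 5. Grad-CAM visualization shows that the model attends to the morphological features of the lesions.

Link: https://arxiv.org/abs/2503.13999
Authors: Mohaddeseh Chegini, Ali Mahloojifar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: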

Abstract:The BI-RADS score is a probabilistic reporting tool used by radiologists to express the level of uncertainty in predicting breast cancer based on some morphological features in mammography images. There is significant variability in describing masses, which sometimes leads to BI-RADS misclassification. A BI-RADS prediction system is needed to support the radiologist's final decision. In this study, the uncertainty information extracted by a Bayesian deep learning model is utilized to predict the BI-RADS score. The investigation results based on the pathology information demonstrate that the F1-scores of the radiologist's predictions are 42.86%, 48.33% and 48.28%, while the F1-scores of the model are 73.33%, 59.60% and 59.26% on the BI-RADS 2, 3 and 5 dataset samples, respectively. Also, the model can distinguish malignant from benign samples in the BI-RADS 0 category of the used dataset with an accuracy of 75.86% and correctly identify all malignant samples as BI-RADS 5. The Grad-CAM visualization shows the model pays attention to the morphological features of the lesions. Therefore, this study shows that the uncertainty-aware Bayesian deep learning model can report its uncertainty about the malignancy of a lesion based on morphological features, like a radiologist.

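The abstract does not say how the uncertainty is extracted, so the following is only one common possibility, stated as an assumption: run several stochastic forward passes (for example with Monte Carlo dropout), average the class probabilities, and use the entropy of the averaged distribution as the uncertainty score. The function name and the sample probabilities are hypothetical.

```python
import math

def predictive_entropy(mc_probs):
    """Mean class probabilities and their entropy (in nats).

    mc_probs holds one class-probability vector per stochastic forward pass.
    This is a generic uncertainty measure, assumed here for illustration.
    """
    n_passes = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean = [sum(p[c] for p in mc_probs) / n_passes for c in range(n_classes)]
    entropy = -sum(m * math.log(m) for m in mean if m > 0.0)
    return mean, entropy

# Hypothetical benign/malignant probabilities from repeated passes.
confident_mean, confident_h = predictive_entropy([[1.0, 0.0], [1.0, 0.0]])
uncertain_mean, uncertain_h = predictive_entropy([[1.0, 0.0], [0.0, 1.0]])
```

A score near zero means the passes agree on one class; a score near ln(2) for two classes means the passes disagree, which is the kind of signal that could be mapped to an intermediate BI-RADS category.
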
[CV-76] TarPro: Targeted Protection against Malicious Image Editing

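【Quick read】: This paper addresses the misuse of image editing techniques to generate Not-Safe-for-Work (NSFW) content, where protection must block malicious edits while preserving normal editability. Existing protection methods cannot distinguish benign from malicious edits, so they either disrupt normal editing excessively or still allow some harmful content to be generated. The paper proposes TarPro, a targeted protection framework. Its key components are a semantic-aware constraint that disrupts only malicious content and a lightweight perturbation generator that produces more stable, imperceptible, and robust perturbations, so that malicious edits are blocked with minimal impact on normal edits. Experiments show that TarPro strikes a better balance between protection efficacy and normal usability.

Link: https://arxiv.org/abs/2503.13994
Authors: Kaixin Shen, Ruijie Quan, Jiaxu Miao, Jun Xiao, Yi Yang
Affiliations: Zhejiang University; Nanyang Technological University; Sun Yat-sen University
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: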

Abstract:The rapid advancement of image editing techniques has raised concerns about their misuse for generating Not-Safe-for-Work (NSFW) content. This necessitates a targeted protection mechanism that blocks malicious edits while preserving normal editability. However, existing protection methods fail to achieve this balance, as they indiscriminately disrupt all edits while still allowing some harmful content to be generated. To address this, we propose TarPro, a targeted protection framework that prevents malicious edits while maintaining benign modifications. TarPro achieves this through a semantic-aware constraint that only disrupts malicious content and a lightweight perturbation generator that produces a more stable, imperceptible, and robust perturbation for image protection. Extensive experiments demonstrate that TarPro surpasses existing methods, achieving a high protection efficacy while ensuring minimal impact on normal edits. Our results highlight TarPro as a practical solution for secure and controlled image editing.

[CV-77] GraphTEN: Graph Enhanced Texture Encoding Network

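【Quick read】: This paper addresses the difficulty of modeling non-local context relations in texture recognition, which stems from the variability and randomness of texture primitives in their spatial distribution. The key contribution is a graph-enhanced texture encoding network (GraphTEN), which models global associations with fully connected graphs and captures cross-scale dependencies of texture primitives with bipartite graphs. It also introduces a codebook-based patch encoding module that represents texture in an orderless manner by encoding multi-scale patch features into a unified feature space. GraphTEN outperforms existing methods on five publicly available datasets.

Link: https://arxiv.org/abs/2503.13991
Authors: Bo Peng, Jintao Chen, Mufeng Yao, Chenhao Zhang, Jianghui Zhang, Mingmin Chi, Jiang Tao
Affiliations: School of Information, Shanghai Ocean University, Shanghai; School of Computer Science, Fudan University, Shanghai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 7 figures, conference paper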

Abstract:Texture recognition is a fundamental problem in computer vision and pattern recognition. Recent progress leverages feature aggregation into discriminative descriptions based on convolutional neural networks (CNNs). However, modeling non-local context relations through visual primitives remains challenging due to the variability and randomness of texture primitives in spatial distributions. In this paper, we propose a graph-enhanced texture encoding network (GraphTEN) designed to capture both local and global features of texture primitives. GraphTEN models global associations through fully connected graphs and captures cross-scale dependencies of texture primitives via bipartite graphs. Additionally, we introduce a patch encoding module that utilizes a codebook to achieve an orderless representation of texture by encoding multi-scale patch features into a unified feature space. The proposed GraphTEN achieves superior performance compared to state-of-the-art methods across five publicly available datasets.

[CV-78] Rethinking Cell Counting Methods: Decoupling Counting and Localization MICCAI2024

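【Quick read】: This paper addresses cell counting in microscopy images, a task that is vital in medicine and biology but extremely tedious and time-consuming to perform manually. It proposes a conceptually simple yet effective decoupled learning scheme with separate counter and localizer networks. In contrast to jointly learning counting and density map estimation, the key finding is that decoupling these objectives improves results. The counter operates on intermediate feature maps rather than pixel space to exploit global context and produce count estimates, and it also generates coarse density maps. The localizer then reconstructs high-resolution density maps that precisely localize individual cells, conditioned on the original image and the counter's coarse density map. To further improve counting accuracy, a global message passing module integrates cross-region patterns. Experiments show that, despite its simplicity, the scheme challenges common practice and achieves state-of-the-art performance on multiple datasets. The core insight is that decoupled learning removes the need to learn counting directly on high-resolution density maps, letting the model focus on the global features that matter for accurate estimates.

Link: https://arxiv.org/abs/2503.13989
Authors: Zixuan Zheng, Yilei Shi, Chunlei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024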

Abstract:Cell counting in microscopy images is vital in medicine and biology but extremely tedious and time-consuming to perform manually. While automated methods have advanced in recent years, state-of-the-art approaches tend toward increasingly complex model designs. In this paper, we propose a conceptually simple yet effective decoupled learning scheme for automated cell counting, consisting of separate counter and localizer networks. In contrast to jointly learning counting and density map estimation, we show that decoupling these objectives surprisingly improves results. The counter operates on intermediate feature maps rather than pixel space to leverage global context and produce count estimates, while also generating coarse density maps. The localizer then reconstructs high-resolution density maps that precisely localize individual cells, conditional on the original images and coarse density maps from the counter. Besides, to boost counting accuracy, we further introduce a global message passing module to integrate cross-region patterns. Extensive experiments on four datasets demonstrate that our approach, despite its simplicity, challenges common practice and achieves state-of-the-art performance by significant margins. Our key insight is that decoupled learning alleviates the need to learn counting on high-resolution density maps directly, allowing the model to focus on global features critical for accurate estimates. Code is available at this https URL.

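As background for the density maps discussed above: in density-based counting, each annotated cell contributes a small blob that sums to one, so the sum over the whole map equals the cell count. The sketch below builds such a map from point annotations with normalized Gaussians. It illustrates the general idea only, not the paper's counter or localizer; the grid size, sigma, and point coordinates are hypothetical.

```python
import math

def density_map(points, height, width, sigma=1.0):
    """Place one Gaussian per annotated point, each normalized to sum to 1."""
    dmap = [[0.0] * width for _ in range(height)]
    for py, px in points:
        weights = [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2.0 * sigma ** 2))
                    for x in range(width)] for y in range(height)]
        total = sum(map(sum, weights))
        for y in range(height):
            for x in range(width):
                dmap[y][x] += weights[y][x] / total
    return dmap

def count_from_density(dmap):
    """The estimated count is the sum over the density map."""
    return sum(map(sum, dmap))

# Three hypothetical cell centres on a 10x10 grid.
cells = [(2, 2), (5, 7), (8, 3)]
dmap = density_map(cells, height=10, width=10)
estimated_count = count_from_density(dmap)
```

Because the count is just a global sum, it can in principle be predicted from coarse features, which is consistent with the paper's argument that counting need not be learned on high-resolution maps.
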
[CV-79] DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection CVPR2025

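【Quick read】: This paper addresses the difficulty of training visual inspection models when defect data is scarce. Prior approaches synthesize defect images with image generation models, but producing highly realistic defects remains challenging. The paper proposes DefectFill, a method that generates realistic defects from only a few reference defect images. Its key idea is to fine-tune an inpainting diffusion model with custom loss functions (defect, object, and attention terms) so that it captures detailed, localized defect features and integrates them seamlessly into defect-free objects. A Low-Fidelity Selection method further improves the quality of the defect samples. Experiments show that the generated images enable visual inspection models to achieve state-of-the-art performance on the MVTec AD dataset.

Link: https://arxiv.org/abs/2503.13985
Authors: Jaewoo Song, Daemin Park, Kanghyun Baek, Sangyub Lee, Jooyoung Choi, Eunji Kim, Sungroh Yoon
Affiliations: Department of Electrical and Computer Engineering, Seoul National University; Global Technology Research, Samsung Electronics; IPAI, 4AIIS, ASRI, INMC, ISRC, Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by CVPR 2025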

Abstract:Developing effective visual inspection models remains challenging due to the scarcity of defect data. While image generation models have been used to synthesize defect images, producing highly realistic defects remains difficult. We propose DefectFill, a novel method for realistic defect generation that requires only a few reference defect images. It leverages a fine-tuned inpainting diffusion model, optimized with our custom loss functions incorporating defect, object, and attention terms. It enables precise capture of detailed, localized defect features and their seamless integration into defect-free objects. Additionally, our Low-Fidelity Selection method further enhances the defect sample quality. Experiments show that DefectFill generates high-quality defect images, enabling visual inspection models to achieve state-of-the-art performance on the MVTec AD dataset.

[CV-80] SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

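【Quick read】: This paper addresses the limitations of multimodal large language models (MLLMs) in spatio-temporal video grounding. Two challenges are central: it is difficult to extract accurate spatio-temporal information from each frame of a video, and the large number of visual tokens makes it hard to map the tokens of each frame precisely to their spatial coordinates. The paper adopts a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information, and introduces a Query-Guided Space Decoder to connect the queries to spatial coordinates. To compensate for the lack of spatio-temporal datasets, it constructs the Unified Spatio-Temporal Grounding (Uni-STG) dataset, comprising 480K instances across three tasks. Experiments show state-of-the-art performance on 11 benchmarks covering temporal, spatial, spatio-temporal, and video understanding tasks.

Link: https://arxiv.org/abs/2503.13983
Authors: Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, Yongdong Zhang
Affiliations: University of Science and Technology of China; Renmin University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: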

Abstract:Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. Firstly, it is difficult to extract accurate spatio-temporal information of each frame in the video. Secondly, the substantial number of visual tokens makes it challenging to precisely map visual tokens of each frame to their corresponding spatial coordinates. To address these issues, we introduce SpaceVLLM, a MLLM endowed with spatio-temporal video grounding capability. Specifically, we adopt a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information. Moreover, we propose a Query-Guided Space Decoder to establish a corresponding connection between the queries and spatial coordinates. Additionally, due to the lack of spatio-temporal datasets, we construct the Unified Spatio-Temporal Grounding (Uni-STG) dataset, comprising 480K instances across three tasks. This dataset fully exploits the potential of MLLM to simultaneously facilitate localization in both temporal and spatial dimensions. Extensive experiments demonstrate that SpaceVLLM achieves the state-of-the-art performance across 11 benchmarks covering temporal, spatial, spatio-temporal and video understanding tasks, highlighting the effectiveness of our approach. Our code, datasets and model will be released.
zh

[CV-81] A-SCoRe: Attention-based Scene Coordinate Regression for wide-ranging scenarios

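Quick read: This paper tackles the heavy storage and compute cost of feature-matching methods for visual localization. The key to the solution is A-SCoRe, a new Scene Coordinate Regression (SCR) architecture that applies attention at the descriptor-map level to produce meaningful, semantically rich 2D descriptors, compensating for the failure of conventional Convolutional Neural Network (CNN) approaches to fully capture spatial relationships between pixels. The design is lightweight and also makes the model more flexible across data modalities (dense or sparse depth maps, SLAM, Structure-from-Motion (SfM)), so it suits mobile-robot applications in varied environments and conditions.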

Link: https://arxiv.org/abs/2503.13982
Authors: Huy-Hoang Bui, Bach-Thuan Bui, Quang-Vinh Tran, Yasuyuki Fujii, Joo-Ho Lee
Affiliations: Graduate School of Information Science and Engineering, Ritsumeikan University, Osaka, Japan; College of Information Science and Engineering, Ritsumeikan University, Osaka, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Visual localization is considered one of the crucial components of many robotic and vision systems. While state-of-the-art methods that rely on feature matching have proven accurate for visual localization, their storage and compute requirements are a burden. Scene coordinate regression (SCR) is an alternative approach that removes the storage barrier by learning to map 2D pixels to 3D scene coordinates. Most popular SCR methods use a Convolutional Neural Network (CNN) to extract 2D descriptors, which we argue misses the spatial relationships between pixels. Inspired by the success of the vision transformer architecture, we present a new SCR architecture, called A-SCoRe, an attention-based model that leverages attention at the descriptor-map level to produce meaningful, high-semantic 2D descriptors. Since the operation is performed on the descriptor map, our model can work with multiple data modalities, whether dense or sparse, from depth maps and SLAM to Structure-from-Motion (SfM). This versatility allows A-SCoRe to operate in different kinds of environments and conditions and to achieve the level of flexibility that is important for mobile robots. Results show that our method achieves performance comparable to state-of-the-art methods on multiple benchmarks while being lightweight and much more flexible. Code and pre-trained models are public in our repository: this https URL.
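
The central idea in the A-SCoRe abstract, attention applied over a map of 2D descriptors so that each descriptor can draw on the others, can be illustrated with plain scaled dot-product self-attention. This is a minimal sketch and not A-SCoRe's architecture: it has no learned query/key/value projections, and the function name `self_attention` and the toy descriptors are assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(descriptors):
    """Scaled dot-product self-attention with Q = K = V = descriptors.

    Assumption: no learned projections; each output descriptor is a
    softmax-weighted average of all input descriptors.
    """
    d = len(descriptors[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in descriptors:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in descriptors]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, descriptors))
                    for j in range(d)])
    return out

# Three toy 2D descriptors (hypothetical values)
mixed = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each output row is a convex combination of the inputs, which is the sense in which a descriptor "sees" the rest of the map.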

[CV-82] SoccerSynth Field: enhancing field detection with synthetic data from virtual soccer simulator

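Quick read: This paper addresses a challenge in soccer field detection for sports video analysis: collecting large-scale, diverse real-world datasets is costly and time-consuming. By studying the effectiveness of pretraining models on synthetic data, it proposes a solution based on a synthetic dataset (SoccerSynth-Field). The key is a synthetic soccer field dataset used to pretrain models; these models are shown to detect soccer fields better than models trained on real-world datasets, which demonstrates that synthetic data can improve model robustness and accuracy and offers a cost-effective, scalable solution for sports field detection.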

Link: https://arxiv.org/abs/2503.13969
Authors: HaoBin Qin, Jiale Fang, Keisuke Fujii
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Field detection in team sports is an essential task in sports video analysis. However, collecting large-scale and diverse real-world datasets for training detection models is often costly and time-consuming. Synthetic datasets, which allow controlled variability in lighting, textures, and camera angles, are a promising alternative for addressing these problems. This study addresses the high cost and difficulty of collecting real-world datasets by investigating the effectiveness of pretraining models on synthetic datasets. In this paper, we examine the effectiveness of using a synthetic dataset (SoccerSynth-Field) for soccer field detection. A synthetic soccer field dataset was created to pretrain models, and the performance of these models was compared with that of models trained on real-world datasets. The results demonstrate that models pretrained on the synthetic dataset exhibit superior performance in detecting soccer fields. This highlights the effectiveness of synthetic data in enhancing model robustness and accuracy, offering a cost-effective and scalable solution for sports field detection.

[CV-83] FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks

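Quick read: This paper addresses the problem that existing Vision-and-Language Navigation (VLN) methods generally depend on dataset-specific training and do not generalize across datasets. To transfer navigation ability seamlessly across diverse tasks, it proposes FlexVLN, a hierarchical approach. The key is to combine the basic navigation ability of a supervised-learning-based Instruction Follower with the strong generalization ability of an LLM Planner, so that the hierarchical architecture generalizes effectively across VLN datasets. To mitigate possible hallucinations by the LLM Planner and improve the Instruction Follower's execution accuracy, the paper also designs a verification mechanism and a multi-model integration mechanism. Experiments show that FlexVLN generalizes substantially better than existing methods on the out-of-domain datasets REVERIE, SOON, and CVDN-target.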

Link: https://arxiv.org/abs/2503.13966
Authors: Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Longteng Guo, Zhihua Wei, Jing Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract: The aspiration of the Vision-and-Language Navigation (VLN) task has long been to develop an embodied agent with robust adaptability, capable of seamlessly transferring its navigation capabilities across various tasks. Despite remarkable advancements in recent years, most methods necessitate dataset-specific training, thereby lacking the capability to generalize across diverse datasets encompassing distinct types of instructions. Large language models (LLMs) have demonstrated exceptional reasoning and generalization abilities, exhibiting immense potential in robot action planning. In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner, enabling effective generalization across diverse VLN datasets. Moreover, a verification mechanism and a multi-model integration mechanism are proposed to mitigate potential hallucinations by the LLM Planner and enhance the execution accuracy of the Instruction Follower. We take REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability. The generalization performance of FlexVLN surpasses that of all previous methods by a large margin.

[CV-84] Survey of Adversarial Robustness in Multimodal Large Language Models

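Quick read: This paper addresses the adversarial vulnerabilities that Multimodal Large Language Models (MLLMs) face in real-world applications, which can compromise their safety and reliability. Unlike unimodal models, MLLMs are exposed to modality-specific threats and cross-modal adversarial manipulation because of the interdependencies among modalities. The paper's main contribution is a comprehensive review of adversarial robustness research on MLLMs across modalities, including a taxonomy of modality-specific adversarial attacks, the key datasets and evaluation metrics used to assess MLLM robustness, and an in-depth analysis of attack methods targeting different modalities. By identifying open challenges, it suggests potential directions for future research on improving the adversarial robustness of MLLMs.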

Link: https://arxiv.org/abs/2503.13962
Authors: Chengze Jiang, Zhuangzhuang Wang, Minjing Dong, Jie Gui
Affiliations: School of Cyber Science and Engineering, Southeast University, Nanjing, China; Purple Mountain Laboratories, Nanjing, China; Department of Computer Science, City University of Hong Kong, Hong Kong, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance in artificial intelligence by facilitating integrated understanding across diverse modalities, including text, images, video, audio, and speech. However, their deployment in real-world applications raises significant concerns about adversarial vulnerabilities that could compromise their safety and reliability. Unlike unimodal models, MLLMs face unique challenges due to the interdependencies among modalities, making them susceptible to modality-specific threats and cross-modal adversarial manipulations. This paper reviews the adversarial robustness of MLLMs, covering different modalities. We begin with an overview of MLLMs and a taxonomy of adversarial attacks tailored to each modality. Next, we review key datasets and evaluation metrics used to assess the robustness of MLLMs. After that, we provide an in-depth review of attacks targeting MLLMs across different modalities. Our survey also identifies critical challenges and suggests promising future research directions.

[CV-85] BG-Triangle: Bézier Gaussian Triangle for 3D Vectorization and Rendering

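Quick read: This paper addresses the weakness of conventional differentiable rendering methods at preserving sharp boundaries. Existing methods ensure differentiability by approximating or reformulating traditional rendering operations with smooth probabilistic proxies such as volumes or Gaussian primitives, but without explicit boundary definitions they struggle to keep edges sharp at object boundaries. The paper proposes a novel hybrid representation, Bézier Gaussian Triangle (BG-Triangle), which combines Bézier-triangle-based vector graphics primitives with Gaussian-based probabilistic models to maintain accurate shape modeling while supporting resolution-independent, efficient differentiable rendering. The key is a robust and effective discontinuity-aware rendering technique that reduces uncertainty at object boundaries, together with an adaptive densification and pruning scheme that allows efficient training while reliably handling level-of-detail (LoD) variations. Experiments show that BG-Triangle preserves boundaries better than 3DGS while using fewer primitives, demonstrating the benefits of vectorized graphics primitives and the potential to bridge classic and emerging representations.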

Link: https://arxiv.org/abs/2503.13961
Authors: Minye Wu, Haizhao Dai, Kaixin Yao, Tinne Tuytelaars, Jingyi Yu
Affiliations: KU Leuven; ShanghaiTech University; Cellverse Co, Ltd.
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Differentiable rendering enables efficient optimization by allowing gradients to be computed through the rendering process, facilitating 3D reconstruction, inverse rendering, and neural scene representation learning. To ensure differentiability, existing solutions approximate or re-formulate traditional rendering operations using smooth, probabilistic proxies such as volumes or Gaussian primitives. Consequently, they struggle to preserve sharp edges due to the lack of explicit boundary definitions. We present a novel hybrid representation, Bézier Gaussian Triangle (BG-Triangle), that combines Bézier triangle-based vector graphics primitives with Gaussian-based probabilistic models to maintain accurate shape modeling while conducting resolution-independent differentiable rendering. We present a robust and effective discontinuity-aware rendering technique to reduce uncertainties at object boundaries. We also employ an adaptive densification and pruning scheme for efficient training while reliably handling level-of-detail (LoD) variations. Experiments show that BG-Triangle achieves rendering quality comparable to 3DGS but with superior boundary preservation. More importantly, BG-Triangle uses a much smaller number of primitives than its alternatives, showcasing the benefits of vectorized graphics primitives and the potential to bridge the gap between classic and emerging representations.

[CV-86] DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation CVPR2025

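Quick read: This paper addresses the shortcomings of leading Video Scene Graph Generation (VSGG) methods: they cannot handle real-time video streams, consume large amounts of GPU memory, and fall short in temporal reasoning. Specifically, conventional methods usually rely on an offline pipeline and reason over temporal context only by aggregating frame-level predictions, without deeper modeling of temporal associations.

The key to the solution is DIFFVSGG, an online VSGG method that recasts the task as an iterative scene graph update problem. Inspired by Latent Diffusion Models (LDMs), DIFFVSGG unifies three tasks (object classification, bounding box regression, and graph generation) through one shared feature embedding, and performs step-wise denoising within the LDM framework on an embedding containing unified features of object pairs so that it clearly expresses the relationships between objects. The resulting clean embedding is fed to task-specific heads (object classification, scene graph generation, and so on). DIFFVSGG further strengthens continuous temporal reasoning: predictions for subsequent frames use the results of past frames as conditional inputs to guide the reverse diffusion process for the current frame. Experiments show that DIFFVSGG performs strongly under all three setups of Action Genome.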

Link: https://arxiv.org/abs/2503.13957
Authors: Mu Chen, Liulei Li, Wenguan Wang, Yi Yang
Affiliations: ReLER, CCAI, Zhejiang University; ReLER, AAII, University of Technology Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025, Code: this https URL

Abstract: Top-leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DIFFVSGG, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs), which generate images by denoising a latent feature embedding, we unify the decoding of three tasks (object classification, bounding box regression, and graph generation) using one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct step-wise denoising on it within LDMs, so as to deliver a clean embedding that clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs, to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DIFFVSGG.

[CV-87] Improving LLM Video Understanding with 16 Frames Per Second

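Quick read: This paper addresses the loss of critical visual information in existing video understanding methods, which extract static features at a fixed low frame rate (≤ 2 FPS). The key to the solution is F-16, a Multimodal Large Language Model (MLLM) designed for high-frame-rate video understanding. By raising the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 captures dynamic visual features efficiently while preserving key semantic information, which markedly improves video understanding. The paper also proposes a novel decoding method that lets F-16 run efficient low-frame-rate inference without retraining. Experiments show that F-16 reaches state-of-the-art performance among 7-billion-parameter video LLMs on multiple benchmarks and, on complex spatiotemporal tasks in particular, outperforms existing proprietary visual models.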

Link: https://arxiv.org/abs/2503.13956
Authors: Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of ≤ 2 frames per second (FPS), leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (e.g., basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. Upon acceptance, we will release the source code, model checkpoints, and data.
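
As a rough illustration of "compressing visual tokens within each 1-second clip", the sketch below groups per-frame feature vectors into 16-frame clips and mean-pools each clip into one vector. Mean pooling is a stand-in chosen for simplicity; the abstract does not say how F-16 compresses tokens, and `pool_clips` and its parameters are hypothetical names.

```python
def pool_clips(frame_features, fps=16):
    """Mean-pool per-frame feature vectors over consecutive 1-second clips.

    Assumption: one feature vector per frame and simple averaging; this is
    an illustrative stand-in, not the compression module used by F-16.
    """
    pooled = []
    for start in range(0, len(frame_features), fps):
        clip = frame_features[start:start + fps]  # the last clip may be shorter
        dim = len(clip[0])
        pooled.append([sum(f[j] for f in clip) / len(clip) for j in range(dim)])
    return pooled

# Two seconds of video sampled at 16 FPS, one scalar feature per frame
frames = [[float(i)] for i in range(32)]
clips = pool_clips(frames)
```

Under these assumptions, 32 frames collapse to two clip-level vectors, so the sequence the language model sees grows with video duration rather than with frame count.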

[CV-88] SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model

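Quick read: This paper addresses the problem that data scarcity limits accuracy gains for perception models in autonomous driving, and explores how to build a data generation engine for real-world application scenes that can generate data for challenging scenes at scale. The key to the solution is a simulator-conditioned scene generation engine based on a world model: a simulation system consistent with real-world scenes is built to collect simulation data and labels, which serve as the conditions for data generation in the world model. Combining the simulation engine's strong scene simulation capability with the world model's robust data generation capability yields a novel data generation pipeline. The paper also provides a benchmark built from virtual and real data in set proportions for exploring the capabilities of world models in real-world scenes; quantitative results show that the generated images significantly improve the performance of downstream perception models.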

Link: https://arxiv.org/abs/2503.13952
Authors: Xinqing Li, Ruiqi Song, Qingyu Xie, Ye Wu, Nanxin Zeng, Yunfeng Ai
Affiliations: The School of Artificial Intelligence, University of Chinese Academy of Sciences; Waytous Inc.; The State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures

Abstract: With the rapid advancement of autonomous driving technology, a lack of data has become a major obstacle to enhancing perception model accuracy. Researchers are now exploring controllable data generation using world models to diversify datasets. However, previous work has been limited to studying image generation quality on specific public datasets. There is still relatively little research on how to build data generation engines for real-world application scenes to achieve large-scale data generation for challenging scenes. In this paper, a simulator-conditioned scene generation engine based on a world model is proposed. By constructing a simulation system consistent with real-world scenes, simulation data and labels for any scene can be collected and serve as the conditions for data generation in the world model. The result is a novel data generation pipeline that combines the powerful scene simulation capabilities of the simulation engine with the robust data generation capabilities of the world model. In addition, a benchmark with proportionally constructed virtual and real data is provided for exploring the capabilities of world models in real-world scenes. Quantitative results show that the generated images significantly improve the performance of downstream perception models. Finally, we explore the generative performance of the world model in urban autonomous driving scenarios. All the data and code will be available at this https URL.

[CV-89] FrustumFusionNets: A Three-Dimensional Object Detection Network Based on Tractor Road Scene

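Quick read: This paper addresses the underuse of image information by existing frustum-based methods for three-dimensional object detection on roads, and the lack of research on agricultural scenes. The key is a new network, FrustumFusionNets (FFNets), that fuses point-cloud and image features for multimodal detection. The pipeline has four steps: first, the results of image-based two-dimensional object detection narrow the search region in the three-dimensional point-cloud space; second, a Gaussian mask enhances the point-cloud information; third, a point-cloud feature extraction pipeline and an image feature extraction pipeline extract features from the frustum point cloud and the cropped image, respectively; finally, the features of the two modalities are concatenated and fused to perform three-dimensional object detection. Experiments show that in the tractor road scene FrustumFusionNetv2 reaches three-dimensional detection accuracy of 82.28% for cars and 95.68% for people, 1.83% and 2.33% higher than the original model, and that it shows a significant advantage in detecting pedestrians on the validation set of the KITTI benchmark.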

Link: https://arxiv.org/abs/2503.13951
Authors: Lili Yang, Mengshuai Chang, Xiao Guo, Yuxin Feng, Yiwen Mei, Caicong Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: To address the underutilization of image information by existing frustum-based methods in road three-dimensional object detection, as well as the lack of research on agricultural scenes, we constructed an object detection dataset using an 80-line Light Detection And Ranging (LiDAR) sensor and a camera in a complex tractor road scene and proposed a new network called FrustumFusionNets (FFNets). Initially, we utilize the results of image-based two-dimensional object detection to narrow down the search region in the three-dimensional space of the point cloud. Next, we introduce a Gaussian mask to enhance the point cloud information. Then, we extract features from the frustum point cloud and the cropped image using the point cloud feature extraction pipeline and the image feature extraction pipeline, respectively. Finally, we concatenate and fuse the data features from both modalities to achieve three-dimensional object detection. Experiments demonstrate that on the constructed test set of tractor road data, FrustumFusionNetv2 achieves 82.28% and 95.68% accuracy in the three-dimensional object detection of the two main road objects, cars and people, respectively. This performance is 1.83% and 2.33% better than the original model. It offers a hybrid fusion-based multi-object, high-precision, real-time three-dimensional object detection technique for unmanned agricultural machines in tractor road scenarios. On the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) Benchmark Suite validation set, FrustumFusionNetv2 also demonstrates significant superiority in detecting road pedestrian objects compared with other frustum-based three-dimensional object detection methods.
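
The first two steps of the FFNets pipeline (narrowing the search region with a 2D detection, then applying a Gaussian mask) can be sketched as follows. The abstract does not define the mask, so this sketch assumes one plausible form: LiDAR points already projected to pixel coordinates are kept only if they fall inside the 2D box, and each kept point is weighted by a Gaussian centered on the box center. The function name, the `sigma_scale` parameter, and the sample values are all hypothetical.

```python
import math

def frustum_gaussian_weights(points_uv, box, sigma_scale=0.5):
    """Weight image-projected LiDAR points inside a 2D detection box.

    points_uv: list of (u, v) pixel coordinates of projected points.
    box: (x1, y1, x2, y2) from a 2D detector, with x2 > x1 and y2 > y1.
    Returns {point_index: weight} for points inside the box only.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Assumed: standard deviations proportional to the box width and height
    sx, sy = sigma_scale * (x2 - x1), sigma_scale * (y2 - y1)
    weights = {}
    for i, (u, v) in enumerate(points_uv):
        if not (x1 <= u <= x2 and y1 <= v <= y2):
            continue  # outside the frustum defined by the 2D box
        z = ((u - cx) ** 2) / (2 * sx ** 2) + ((v - cy) ** 2) / (2 * sy ** 2)
        weights[i] = math.exp(-z)
    return weights

# A point at the box center, one at a corner, and one outside the box
w = frustum_gaussian_weights([(5, 10), (10, 20), (11, 5)], box=(0, 0, 10, 20))
```

Points near the box center get weights close to 1 and points near the edges, which are more likely to be background, are down-weighted.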

[CV-90] Light4GS: Lightweight Compact 4D Gaussian Splatting Generation via Context Model

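Quick read: This paper addresses the excessive storage demands of deformable 3D Gaussian Splatting (3DGS) for dynamic scenes, which stem from high-dimensional embeddings and a very large number of primitives. It proposes Light4GS, a lightweight 4D GS framework whose key idea is to combine a spatio-temporal significance pruning strategy with a deep context model. The pruning strategy removes over 64% of the deformable primitives and applies entropy-constrained spherical harmonics compression to the remainder; the deep context model integrates intra- and inter-prediction with a hyperprior into a coarse-to-fine context structure for efficient multiscale latent embedding compression. Without sacrificing rendering quality, the approach achieves over 120x compression and raises rendering FPS by up to 20% compared with the baseline 4DGS, and it outperforms state-of-the-art frame-wise 3DGS compression methods.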

Link: https://arxiv.org/abs/2503.13948
Authors: Mufan Liu, Qi Yang, He Huang, Wenjie Huang, Zhenlong Yuan, Zhu Li, Yiling Xu
Affiliations: Shanghai Jiao Tong University; University of Missouri, Kansas City; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: 3D Gaussian Splatting (3DGS) has emerged as an efficient and high-fidelity paradigm for novel view synthesis. To adapt 3DGS for dynamic content, deformable 3DGS incorporates temporally deformable primitives with learnable latent embeddings to capture complex motions. Despite its impressive performance, the high-dimensional embeddings and vast number of primitives lead to substantial storage requirements. In this paper, we introduce a Lightweight 4D GS framework, called Light4GS, that employs significance pruning with a deep context model to provide a lightweight, storage-efficient dynamic 3DGS representation. The proposed Light4GS is based on 4DGS, a typical representation of deformable 3DGS. Specifically, our framework is built upon two core components: (1) a spatio-temporal significance pruning strategy that eliminates over 64% of the deformable primitives, followed by an entropy-constrained spherical harmonics compression applied to the remainder; and (2) a deep context model that integrates intra- and inter-prediction with hyperprior into a coarse-to-fine context structure to enable efficient multiscale latent embedding compression. Our approach achieves over 120x compression and increases rendering FPS by up to 20% compared to the baseline 4DGS, and it is also superior to frame-wise state-of-the-art 3DGS compression methods, revealing the effectiveness of Light4GS in terms of both intra- and inter-prediction without sacrificing rendering quality.
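
The pruning component of Light4GS can be pictured as a ranking problem: score every primitive, then drop the lowest-scoring fraction. The sketch below assumes a significance score has already been computed for each primitive; how Light4GS defines that score is not stated in the abstract, and `prune_by_significance` and the sample scores are hypothetical.

```python
def prune_by_significance(scores, prune_ratio=0.64):
    """Return the sorted indices of the primitives kept after pruning.

    scores: one precomputed significance value per primitive (assumed given).
    prune_ratio: fraction of primitives to remove, lowest scores first.
    """
    n_prune = round(len(scores) * prune_ratio)
    # Rank primitive indices from least to most significant
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[n_prune:])

# Ten primitives with made-up scores; remove the six least significant
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6, 0.4, 0.5]
kept = prune_by_significance(scores, prune_ratio=0.6)
```

With the default ratio of 0.64, 100 primitives would be reduced to 36, which matches the "over 64%" figure only as a lower bound on what the paper reports.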

[CV-91] Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation CVPR2025

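Quick read: This paper addresses the prediction uncertainty that Scene Graph Generation (SGG) methods face in practice, in particular the challenges caused by long-tailed class distributions and prediction variability. It proposes a novel Conformal Prediction (CP) framework that can be adapted to any existing SGG method and quantifies predictive uncertainty by constructing well-calibrated prediction sets over the generated scene graphs. The key is that the framework gives the prediction sets statistically rigorous coverage guarantees and pairs them with a post-processing strategy based on a multimodal large language model (MLLM) that selects the most visually and semantically plausible scene graphs from each set, which improves overall SGG performance and interpretability.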

Link: https://arxiv.org/abs/2503.13947
Authors: Sayak Nag, Udita Ghosh, Sarosij Bose, Calvin-Khang Ta, Jiachen Li, Amit K Roy Chowdhury
Affiliations: University of California, Riverside; Dolby Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025

Abstract: Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content. However, inherent challenges like long-tailed class distributions and prediction variability necessitate uncertainty quantification in SGG for its practical viability. In this paper, we introduce a novel Conformal Prediction (CP) based framework, adaptive to any existing SGG method, for quantifying their predictive uncertainty by constructing well-calibrated prediction sets over their generated scene graphs. These scene graph prediction sets are designed to achieve statistically rigorous coverage guarantees. Additionally, to ensure these prediction sets contain the most practically interpretable scene graphs, we design an effective MLLM-based post-processing strategy for selecting the most visually and semantically plausible scene graphs within these prediction sets. We show that our proposed approach can produce diverse possible scene graphs from an image, assess the reliability of SGG methods, and improve overall SGG performance.
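
For readers unfamiliar with conformal prediction, the coverage guarantee mentioned in the abstract comes from a calibration step. The sketch below shows standard split conformal prediction for a single classification decision (for example, one predicate label), using the nonconformity score 1 - p(true label). It is a generic textbook construction, not the paper's scene-graph-level procedure; the function names, labels, and probabilities are hypothetical.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Threshold from calibration nonconformity scores at miscoverage alpha.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)); if that rank
    exceeds n, every label must be included, so the threshold is infinite.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """All labels whose nonconformity score 1 - p does not exceed qhat."""
    return [label for label, p in probs.items() if 1.0 - p <= qhat]

# Calibration scores are 1 - p(true label) on held-out examples (made up)
qhat = conformal_quantile([0.4, 0.1, 0.3, 0.2], alpha=0.5)
labels = prediction_set({"on": 0.75, "under": 0.2, "near": 0.05}, qhat)
```

Under exchangeability of calibration and test data, sets built this way contain the true label with probability at least 1 - alpha; a lower alpha yields larger sets.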

[CV-92] Is Discretization Fusion All You Need for Collaborative Perception?

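Quick read: This paper addresses two problems in the feature fusion of mainstream collaborative perception methods for multi-agent systems: methods based on discretized feature maps lack flexibility in extracting and transmitting informative features, and they can hardly focus on those features during fusion. It proposes a new Anchor-Centric paradigm for Collaborative Object detection (ACCO). The key is three main components: (1) an Anchor Featuring Block (AFB), which generates anchor proposals and projects the prepared anchor queries onto image features; (2) an Anchor Confidence Generator (ACG), which minimizes communication by transmitting only the features in confident anchors; and (3) a local-global fusion module, in which local fusion is anchor-alignment-based fusion (LAAF) and global fusion uses spatial-aware cross-attention (SACA). Both fusion mechanisms run over multiple layers, so agents iteratively adjust the anchor proposals, enabling more flexible and efficient anchor-centric communication and fusion. Experiments show that ACCO significantly reduces communication volume and improves perception range and detection performance.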

Link: https://arxiv.org/abs/2503.13946
Authors: Kang Yang, Tianci Bu, Lantao Li, Chunxu Li, Yongcai Wang, Deying Li
Affiliations: School of Information, Renmin University of China; National University of Defense Technology; Sony Research and Development Center China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Collaborative perception in multi-agent systems enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which, however, lacks flexibility in extracting and transmitting informative features and can hardly focus on those features during fusion. To address these problems, this paper proposes a novel Anchor-Centric paradigm for Collaborative Object detection (ACCO). It avoids grid precision issues and allows more flexible and efficient anchor-centric communication and fusion. ACCO is composed of three main components: (1) an Anchor featuring block (AFB) that generates anchor proposals and projects prepared anchor queries to image features; (2) an Anchor confidence generator (ACG) designed to minimize communication by selecting only the features in the confident anchors to transmit; and (3) a local-global fusion module, in which local fusion is anchor alignment-based fusion (LAAF) and global fusion is conducted by spatial-aware cross-attention (SACA). LAAF and SACA run in multiple layers, so agents conduct anchor-centric fusion iteratively to adjust the anchor proposals. Comprehensive experiments on the OPV2V and Dair-V2X datasets demonstrate ACCO’s superiority in reducing communication volume and in improving perception range and detection performance. Code can be found at: this https URL.
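
The communication saving described for the Anchor confidence generator can be pictured as a filter applied before transmission. The sketch below assumes each anchor already has a feature vector and a confidence score, and uses a fixed threshold; the abstract does not say how ACG scores or selects anchors, so `select_confident_anchors`, the threshold, and the sample values are hypothetical.

```python
def select_confident_anchors(anchor_features, confidences, threshold=0.5):
    """Keep only the features of anchors whose confidence reaches the threshold.

    Returns (selected, ratio): a dict {anchor_index: feature} to transmit and
    the fraction of anchors transmitted. The thresholding rule is an assumed
    stand-in for the paper's confidence generator.
    """
    selected = {
        i: feat
        for i, (feat, conf) in enumerate(zip(anchor_features, confidences))
        if conf >= threshold
    }
    ratio = len(selected) / len(anchor_features) if anchor_features else 0.0
    return selected, ratio

features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
confidences = [0.9, 0.2, 0.6, 0.1]
to_send, ratio = select_confident_anchors(features, confidences)
```

Keeping the anchor index alongside each feature lets the receiving agent align what it gets with its own anchor proposals.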

[CV-93] Make the Most of Everything: Further Considerations on Disrupting Diffusion-based Customization

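Quick read: This paper addresses the risks of privacy breaches and opinion manipulation created by fine-tuning techniques for text-to-image diffusion models. Existing work concentrates on prompt-level or image-level adversarial attacks to prevent customization but overlooks the correlation between the two levels and the relationship between internal modules and inputs, which limits anti-customization performance in practical threat scenarios. The paper proposes Dual Anti-Diffusion (DADiff), a two-stage adversarial attack that for the first time integrates a prompt-level adversarial attack into the generation of image-level adversarial examples. In stage 1, prompt-level adversarial vectors are generated to guide the subsequent image-level attack. In stage 2, besides an end-to-end attack on the UNet model, the method disrupts its self- and cross-attention modules, aiming to break the correlations between image pixels and to align the cross-attention results computed from instance prompts and adversarial prompt vectors within the images. A local random timestep gradient ensemble strategy additionally updates the adversarial perturbation by integrating random gradients from multiple segmented time intervals. On several mainstream facial datasets, DADiff improves cross-prompt, keyword-mismatch, cross-model, and cross-mechanism anti-customization by 10%-30% over existing methods.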

Link: https://arxiv.org/abs/2503.13945
Authors: Long Tang, Dengpan Ye, Sirun Chen, Xiuwen Shi, Yunna Lv, Ziyi Liu
Affiliations: School of Cyber Science and Engineering, Wuhan University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: The fine-tuning technique for text-to-image diffusion models facilitates image customization but risks privacy breaches and opinion manipulation. Current research focuses on prompt- or image-level adversarial attacks for anti-customization, yet it overlooks the correlation between these two levels and the relationship between internal modules and inputs. This hinders anti-customization performance in practical threat scenarios. We propose Dual Anti-Diffusion (DADiff), a two-stage adversarial attack targeting diffusion customization, which, for the first time, integrates the adversarial prompt-level attack into the generation process of image-level adversarial examples. In stage 1, we generate prompt-level adversarial vectors to guide the subsequent image-level attack. In stage 2, besides conducting the end-to-end attack on the UNet model, we disrupt its self- and cross-attention modules, aiming to break the correlations between image pixels and align the cross-attention results computed using instance prompts and adversarial prompt vectors within the images. Furthermore, we introduce a local random timestep gradient ensemble strategy, which updates adversarial perturbations by integrating random gradients from multiple segmented timesets. Experimental results on various mainstream facial datasets demonstrate 10%-30% improvements in cross-prompt, keyword mismatch, cross-model, and cross-mechanism anti-customization with DADiff compared to existing methods.

[CV-94] Multi-Modal Self-Supervised Semantic Communication

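Quick read: This paper addresses the significant communication overhead that semantic communication incurs during training in dynamic wireless environments. The key to the solution is a multi-modal semantic communication system that uses multi-modal self-supervised learning in the pre-training phase to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. This dual-phase strategy captures both modality-invariant and modality-specific features while minimizing training-related communication overhead. Experiments confirm that the method significantly reduces training-related communication overhead while matching or exceeding the performance of existing supervised learning approaches.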

Link: https://arxiv.org/abs/2503.13940
Authors: Hang Zhao, Hongru Li, Dongfang Xu, Shenghui Song, Khaled B. Letaief
Affiliations: The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Abstract: Semantic communication is emerging as a promising paradigm that focuses on the extraction and transmission of semantic meanings using deep learning techniques. While current research primarily addresses the reduction of semantic communication overhead, it often overlooks the training phase, which can incur significant communication costs in dynamic wireless environments. To address this challenge, we propose a multi-modal semantic communication system that leverages multi-modal self-supervised learning to enhance task-agnostic feature extraction. The proposed approach employs self-supervised learning during the pre-training phase to extract task-agnostic semantic features, followed by supervised fine-tuning for downstream tasks. This dual-phase strategy effectively captures both modality-invariant and modality-specific features while minimizing training-related communication overhead. Experimental results on the NYU Depth V2 dataset demonstrate that the proposed method significantly reduces training-related communication overhead while maintaining or exceeding the performance of existing supervised learning approaches. The findings underscore the advantages of multi-modal self-supervised learning in semantic communication, paving the way for more efficient and scalable edge inference systems.

[CV-95] Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models

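Quick read: This paper addresses the limited generalizability and trustworthiness of Vision-Language Models (VLMs) on medical imaging reasoning tasks. Conventional approaches such as Supervised Fine-Tuning (SFT) tend to overfit on complex medical images and generalize poorly. The paper proposes Med-R1, a framework whose key idea is to use Reinforcement Learning (RL), specifically Group Relative Policy Optimization (GRPO) following the DeepSeek strategy, to guide reasoning paths. Reward signals encourage more robust and diverse reasoning, which markedly improves performance across medical imaging modalities while keeping outputs transparent and interpretable. Experiments show that Med-R1 outperforms baseline models with far more parameters on a range of medical reasoning tasks, demonstrating its effectiveness at improving the generalizability and trustworthiness of medical VLMs.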

链接: https://arxiv.org/abs/2503.13939
作者: Yuxiang Lai,Jike Zhong,Ming Li,Shitian Zhao,Xiaofeng Yang
机构: Department of Computer Science, Emory University (埃默里大学); Department of Computer Science, University of Southern California (南加州大学); Department of Computer Science, University of Tokyo (东京大学); Department of Computer Science, Johns Hopkins University (约翰斯·霍普金斯大学); Department of Biomedical Engineering, Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have advanced reasoning in natural scenes, but their role in medical imaging remains underexplored. Medical reasoning tasks demand robust image analysis and well-justified answers, posing challenges due to the complexity of medical images. Transparency and trustworthiness are essential for clinical adoption and regulatory compliance. We introduce Med-R1, a framework exploring reinforcement learning (RL) to enhance VLMs’ generalizability and trustworthiness in medical reasoning. Leveraging the DeepSeek strategy, we employ Group Relative Policy Optimization (GRPO) to guide reasoning paths via reward signals. Unlike supervised fine-tuning (SFT), which often overfits and lacks generalization, RL fosters robust and diverse reasoning. Med-R1 is evaluated across eight medical imaging modalities: CT, MRI, Ultrasound, Dermoscopy, Fundus Photography, Optical Coherence Tomography (OCT), Microscopy, and X-ray Imaging. Compared to its base model, Qwen2-VL-2B, Med-R1 achieves a 29.94% accuracy improvement and outperforms Qwen2-VL-72B, which has 36 times more parameters. Testing across five question types-modality recognition, anatomy identification, disease diagnosis, lesion grading, and biological attribute analysis Med-R1 demonstrates superior generalization, exceeding Qwen2-VL-2B by 32.06% and surpassing Qwen2-VL-72B in question-type generalization. These findings show that RL improves medical reasoning and enables parameter-efficient models to outperform significantly larger ones. With interpretable reasoning outputs, Med-R1 represents a promising step toward generalizable, trustworthy, and clinically viable medical VLMs.

[CV-96] ChatBEV: A Visual Language Model that Understands BEV Maps

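【Quick read】: This paper addresses traffic scene understanding for intelligent transportation systems and autonomous driving, in particular the underexplored use of vision-language models (VLMs) on bird's-eye-view (BEV) maps. Existing methods are typically constrained by limited task design and small data scale, which prevents comprehensive scene understanding. The proposed solution has four parts: ChatBEV-QA, a new BEV visual question answering (VQA) benchmark with more than 137,000 questions covering global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions; a new data collection pipeline that generates scalable and informative BEV-VQA data; ChatBEV, a fine-tuned, specialized vision-language model that interprets diverse question prompts and extracts context-relevant information from BEV maps; and a language-driven traffic scene generation pipeline in which ChatBEV supplies map understanding and text-aligned navigation guidance, significantly improving the realism and consistency of generated traffic scenes.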

Link: https://arxiv.org/abs/2503.13938
Authors: Qingyao Xu, Siheng Chen, Guang Chen, Yanfeng Wang, Ya Zhang
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Tongji University; Multi-Agent Governance & Intelligence Crew (MAGIC)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traffic scene understanding is essential for intelligent transportation systems and autonomous driving, ensuring safe and efficient vehicle operation. While recent advancements in VLMs have shown promise for holistic scene understanding, the application of VLMs to traffic scenarios, particularly using BEV maps, remains underexplored. Existing methods often suffer from limited task design and small data scale, hindering comprehensive scene understanding. To address these challenges, we introduce ChatBEV-QA, a novel BEV VQA benchmark containing over 137k questions, designed to encompass a wide range of scene understanding tasks, including global scene understanding, vehicle-lane interactions, and vehicle-vehicle interactions. This benchmark is constructed using a novel data collection pipeline that generates scalable and informative VQA data for BEV maps. We further fine-tune a specialized vision-language model, ChatBEV, enabling it to interpret diverse question prompts and extract relevant context-aware information from BEV maps. Additionally, we propose a language-driven traffic scene generation pipeline, where ChatBEV facilitates map understanding and text-aligned navigation guidance, significantly enhancing the generation of realistic and consistent traffic scenarios. The dataset, code and the fine-tuned model will be released.

[CV-97] SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization

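【Quick read】: This paper addresses the substantial storage cost that dataset condensation (DC) incurs while achieving high performance. Recent methods encode knowledge into realistic images with soft labels; they generalize well across domains and scale well, but their storage requirement can far exceed that of the original dataset. To ease this performance-storage trade-off, the paper proposes SCORE, a dataset condensation framework centered on soft-label compression. It formulates dataset condensation as a min-max optimization problem that balances, from an information-theoretic perspective, three key properties of the condensed data: informativeness, discriminativeness, and compressibility. The authors show theoretically that the coding-rate-inspired objective is submodular and that optimizing it naturally induces a low-rank structure in the set of soft labels attached to each condensed sample. Experiments on large-scale datasets such as ImageNet-1K and Tiny-ImageNet show that SCORE outperforms existing methods in most cases; even with 30× compression of soft labels, performance on ImageNet-1K drops by only 5.5% (10 images per class) and 2.7% (50 images per class).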

Link: https://arxiv.org/abs/2503.13935
Authors: Bowen Yuan, Yuxia Fu, Zijian Wang, Yadan Luo, Zi Huang
Affiliations: The University of Queensland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dataset Condensation (DC) aims to obtain a condensed dataset that allows models trained on the condensed dataset to achieve performance comparable to those trained on the full dataset. Recent DC approaches increasingly focus on encoding knowledge into realistic images with soft labeling, for their scalability to ImageNet-scale datasets and strong capability of cross-domain generalization. However, this strong performance comes at a substantial storage cost which could significantly exceed the storage cost of the original dataset. We argue that the three key properties to alleviate this performance-storage dilemma are informativeness, discriminativeness, and compressibility of the condensed data. Towards this end, this paper proposes a Soft label compression-centric dataset condensation framework using COding RatE (SCORE). SCORE formulates dataset condensation as a min-max optimization problem, which aims to balance the three key properties from an information-theoretic perspective. In particular, we theoretically demonstrate that our coding rate-inspired objective function is submodular, and its optimization naturally enforces low-rank structure in the soft label set corresponding to each condensed data. Extensive experiments on large-scale datasets, including ImageNet-1K and Tiny-ImageNet, demonstrate that SCORE outperforms existing methods in most cases. Even with 30× compression of soft labels, performance decreases by only 5.5% and 2.7% for ImageNet-1K with IPC 10 and 50, respectively. Code will be released upon paper acceptance.
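
The abstract describes the objective only as "coding rate-inspired" and does not state it. To give a feel for why a coding rate relates to low-rank structure, the sketch below uses the coding-rate expression common in the rate-reduction literature, R(Z, ε) = ½ log det(I + d/(n ε²) ZᵀZ) for n row vectors of dimension d. Whether SCORE uses exactly this form is an assumption, and the function names, the `eps` value, and the toy vectors are all hypothetical.

```python
import math


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
                log_det += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return log_det


def coding_rate(rows, eps=0.5):
    """0.5 * logdet(I + d / (n * eps^2) * Z^T Z) for n row vectors of dimension d."""
    n, d = len(rows), len(rows[0])
    scale = d / (n * eps ** 2)
    gram = [
        [(1.0 if a == b else 0.0) + scale * sum(r[a] * r[b] for r in rows)
         for b in range(d)]
        for a in range(d)
    ]
    return 0.5 * log_det_spd(gram)


# Toy "soft label" sets: one spans two directions, one collapses onto a single direction
spread = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
rate_spread = coding_rate(spread)
rate_collapsed = coding_rate(collapsed)
```

In this toy case the rank-1 set has the lower coding rate, which matches the intuition that low-rank label sets are cheaper to encode.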

[CV-98] Learning Shape-Independent Transformation via Spherical Representations for Category-Level Object Pose Estimation ICLR2025

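【Quick read】: This paper addresses the semantic incoherence in category-level object pose estimation that stems from the shape dependence of canonical coordinates. Existing correspondence-based methods typically use point-based representations to establish correspondences between observed points and normalized object coordinates, but because canonical coordinates inherently depend on object shape, these methods become semantically incoherent across diverse shapes. The paper instead uses a sphere as a shared proxy shape for objects and learns a shape-independent transformation via spherical representations. The key is a new architecture, SpherePose, with three core designs: first, point-wise feature extraction is made SO(3)-invariant so that the mapping between camera and object coordinate spaces is robust to rotation; second, a spherical attention mechanism propagates and integrates features among spherical anchors from a global perspective, reducing interference from noise and incomplete point clouds; third, a hyperbolic correspondence loss distinguishes subtle differences and improves the precision of correspondence prediction. Experiments on the CAMERA25, REAL275, and HouseCat6D datasets show superior performance and confirm the effectiveness of the spherical representation and the architectural innovations.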

Link: https://arxiv.org/abs/2503.13926
Authors: Huan Ren, Wenfei Yang, Xiang Liu, Shifeng Zhang, Tianzhu Zhang
Affiliations: University of Science and Technology of China; National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory; Jianghuai Advance Technology Center; Dongguan University of Technology; Sangfor Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2025. Project page is available at this https URL

Abstract:Category-level object pose estimation aims to determine the pose and size of novel objects in specific categories. Existing correspondence-based approaches typically adopt point-based representations to establish the correspondences between primitive observed points and normalized object coordinates. However, due to the inherent shape-dependence of canonical coordinates, these methods suffer from semantic incoherence across diverse object shapes. To resolve this issue, we innovatively leverage the sphere as a shared proxy shape of objects to learn shape-independent transformation via spherical representations. Based on this insight, we introduce a novel architecture called SpherePose, which yields precise correspondence prediction through three core designs. Firstly, we endow the point-wise feature extraction with SO(3)-invariance, which facilitates robust mapping between camera coordinate space and object coordinate space regardless of rotation transformation. Secondly, the spherical attention mechanism is designed to propagate and integrate features among spherical anchors from a comprehensive perspective, thus mitigating the interference of noise and incomplete point clouds. Lastly, a hyperbolic correspondence loss function is designed to distinguish subtle distinctions, which can promote the precision of correspondence prediction. Experimental results on CAMERA25, REAL275 and HouseCat6D benchmarks demonstrate the superior performance of our method, verifying the effectiveness of spherical representations and architectural innovations.

[CV-99] Unlocking the Potential of Unlabeled Data in Semi-Supervised Domain Generalization CVPR2025

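【Quick read】: This paper addresses semi-supervised domain generalization (SSDG), where the training and test distributions differ and only a small amount of labeled data, together with a larger amount of unlabeled data, is available during training. Existing methods typically use only the unlabeled samples on which the model's predictions are highly confident ("confident-unlabeled" samples), which leaves much of the unlabeled data unused. The paper is, to the authors' knowledge, the first to explore incorporating the unconfident-unlabeled samples previously disregarded in the SSDG setting. The key is the proposed UPCSC method, which has two modules: 1) an Unlabeled Proxy-based Contrastive learning (UPC) module that treats unconfident-unlabeled samples as additional negative pairs; and 2) a Surrogate Class learning (SC) module that generates positive pairs for unconfident-unlabeled samples from their confusing class set. Both modules are plug-and-play, require no domain labels, and can be integrated into existing methods. Experiments on four widely used SSDG benchmarks show that the method consistently improves baselines and outperforms other plug-and-play methods. Analysis further shows that it enhances class-level discriminability and mitigates domain gaps. The code is publicly available.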

Link: https://arxiv.org/abs/2503.13915
Authors: Dongkwan Lee, Kyomin Hwang, Nojun Kwak
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2025

Abstract:We address the problem of semi-supervised domain generalization (SSDG), where the distributions of train and test data differ, and only a small amount of labeled data along with a larger amount of unlabeled data are available during training. Existing SSDG methods leverage only the unlabeled samples for which the model’s predictions are highly confident (confident-unlabeled samples), which limits the full utilization of the available unlabeled data. To the best of our knowledge, we are the first to explore a method for incorporating the unconfident-unlabeled samples that were previously disregarded in the SSDG setting. To this end, we propose UPCSC, which utilizes these unconfident-unlabeled samples in SSDG and consists of two modules: 1) Unlabeled Proxy-based Contrastive learning (UPC) module, treating unconfident-unlabeled samples as additional negative pairs and 2) Surrogate Class learning (SC) module, generating positive pairs for unconfident-unlabeled samples using their confusing class set. These modules are plug-and-play and do not require any domain labels, so they can be easily integrated into existing approaches. Experiments on four widely used SSDG benchmarks demonstrate that our approach consistently improves performance when attached to baselines and outperforms competing plug-and-play methods. We also analyze the role of our method in SSDG, showing that it enhances class-level discriminability and mitigates domain gaps. The code is available at this https URL.

[CV-100] PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds

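【Quick read】: This paper addresses the failure of existing self-supervised learning (SSL) methods for 3D point clouds to retain geometric information such as object pose and scale, which can hurt downstream localization and geometry-sensitive 3D scene understanding tasks such as 3D semantic segmentation and 3D object detection. The proposed method, PSA-SSL, learns object pose and size-aware features through a self-supervised bounding box regression pretext task, and adds LiDAR beam pattern augmentation to encourage sensor-agnostic features. This design retains geometric information and improves generalization when labels are limited. Experiments show significant gains on 3D semantic segmentation across several autonomous driving datasets, especially when few labeled samples are available.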

Link: https://arxiv.org/abs/2503.13914
Authors: Barza Nisar, Steven L. Waslander
Affiliations: University of Toronto Robotics Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Self-supervised learning (SSL) on 3D point clouds has the potential to learn feature representations that can transfer to diverse sensors and multiple downstream perception tasks. However, recent SSL approaches fail to define pretext tasks that retain geometric information such as object pose and scale, which can be detrimental to the performance of downstream localization and geometry-sensitive 3D scene understanding tasks, such as 3D semantic segmentation and 3D object detection. We propose PSA-SSL, a novel extension to point cloud SSL that learns object pose and size-aware (PSA) features. Our approach defines a self-supervised bounding box regression pretext task, which retains object pose and size information. Furthermore, we incorporate LiDAR beam pattern augmentation on input point clouds, which encourages learning sensor-agnostic features. Our experiments demonstrate that with a single pretrained model, our light-weight yet effective extensions achieve significant improvements on 3D semantic segmentation with limited labels across popular autonomous driving datasets (Waymo, nuScenes, SemanticKITTI). Moreover, our approach outperforms other state-of-the-art SSL methods on 3D semantic segmentation (using up to 10 times fewer labels), as well as on 3D object detection. Our code will be released on this https URL.

[CV-101] HSOD-BIT-V2: A New Challenging Benchmark for Hyperspectral Salient Object Detection AAAI2025

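【Quick read】: This paper addresses the limitations of RGB-based salient object detection in challenging scenes (such as small objects and similar color features) by using the rich spectral information in hyperspectral images for more accurate hyperspectral salient object detection (HSOD). Existing HSOD methods, however, are held back by the lack of large, available datasets. The paper introduces HSOD-BIT-V2, described as the largest and most challenging HSOD benchmark dataset to date, with five challenges focused on small objects and foreground-background similarity that emphasize spectral advantages and real-world complexity. To tackle them, the paper proposes Hyper-HRNet, a high-resolution HSOD network. Its key idea is to extract, integrate, and preserve useful spectral information while reducing dimensionality by capturing self-similar spectral features, and to combine global information with detailed object saliency representations so that fine details are conveyed and object contours are located precisely. Experimental analysis shows that Hyper-HRNet outperforms existing models in challenging scenarios.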

Link: https://arxiv.org/abs/2503.13906
Authors: Yuhao Qiu, Shuyan Bai, Tingfa Xu, Peifu Liu, Haolin Qin, Jianan Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI 2025

Abstract:Salient Object Detection (SOD) is crucial in computer vision, yet RGB-based methods face limitations in challenging scenes, such as small objects and similar color features. Hyperspectral images offer a promising route to more accurate Hyperspectral Salient Object Detection (HSOD) through their abundant spectral information, but HSOD methods are hindered by the lack of extensive and available datasets. In this context, we introduce HSOD-BIT-V2, the largest and most challenging HSOD benchmark dataset to date. Five distinct challenges focusing on small objects and foreground-background similarity are designed to emphasize spectral advantages and real-world complexity. To tackle these challenges, we propose Hyper-HRNet, a high-resolution HSOD network. Hyper-HRNet effectively extracts, integrates, and preserves effective spectral information while reducing dimensionality by capturing the self-similar spectral features. Additionally, it conveys fine details and precisely locates object contours by incorporating comprehensive global information and detailed object saliency representations. Experimental analysis demonstrates that Hyper-HRNet outperforms existing models, especially in challenging scenarios.

[CV-102] TGBFormer: Transformer-GraphFormer Blender Network for Video Object Detection AAAI2025

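【Quick read】: This paper addresses the difficulty of effectively fusing global and local information in video object detection. Existing methods rely solely on convolutional neural networks (CNNs) or vision transformers (ViTs) for feature aggregation and cannot exploit global and local information at the same time, which limits detection performance. The key solution is the Transformer-GraphFormer Blender Network (TGBFormer), which combines complementary strengths through three technical improvements: first, a spatial-temporal transformer module aggregates global contextual information into global representations with long-range feature dependencies; second, a spatial-temporal GraphFormer module aggregates features using local spatial and temporal relationships, producing local representations that complement the transformer outputs; third, a global-local feature blender module adaptively couples the transformer-based global representations with the GraphFormer-based local representations. With these improvements, TGBFormer sets new state-of-the-art results on the ImageNet VID dataset, reaching 86.5% mAP at about 41.0 FPS on a single Tesla A100 GPU.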

Link: https://arxiv.org/abs/2503.13903
Authors: Qiang Qi, Xiao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2025

Abstract:Video object detection has made significant progress in recent years thanks to convolutional neural networks (CNNs) and vision transformers (ViTs). Typically, CNNs excel at capturing local features but struggle to model global representations. Conversely, ViTs are adept at capturing long-range global features but face challenges in representing local feature details. Off-the-shelf video object detection methods solely rely on CNNs or ViTs to conduct feature aggregation, which hampers their capability to simultaneously leverage global and local information, thereby resulting in limited detection performance. In this paper, we propose a Transformer-GraphFormer Blender Network (TGBFormer) for video object detection, with three key technical improvements to fully exploit the advantages of transformers and graph convolutional networks while compensating for their limitations. First, we develop a spatial-temporal transformer module to aggregate global contextual information, constituting global representations with long-range feature dependencies. Second, we introduce a spatial-temporal GraphFormer module that utilizes local spatial and temporal relationships to aggregate features, generating new local representations that are complementary to the transformer outputs. Third, we design a global-local feature blender module to adaptively couple transformer-based global representations and GraphFormer-based local representations. Extensive experiments demonstrate that our TGBFormer establishes new state-of-the-art results on the ImageNet VID dataset. Particularly, our TGBFormer achieves 86.5% mAP while running at around 41.0 FPS on a single Tesla A100 GPU.

[CV-103] Evaluating Global Geo-alignment for Precision Learned Autonomous Vehicle Localization using Aerial Data ICRA

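【Quick read】: This paper addresses precise real-time localization of autonomous vehicles using aerial and satellite map data, focusing on the challenges posed by the sensor-modality gap and the viewpoint difference gap. The key finding is that improving the alignment between aerial data and vehicle sensor data at training time significantly improves learning-based localization. The study compares two data alignment methods built on a factor graph framework and uses ablation studies to show how closely aligned ground truth affects localization accuracy. Finally, the localization system is evaluated on a comprehensive dataset covering 1600 km of driving, achieving localization error below 0.3 m and 0.5°, which is sufficient for autonomous vehicle applications.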

Link: https://arxiv.org/abs/2503.13896
Authors: Yi Yang, Xuran Zhao, H. Charles Zhao, Shumin Yuan, Samuel M. Bateman, Tiffany A. Huang, Chris Beall, Will Maddern
Affiliations: Nuro Inc
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures, accepted by International Conference on Robotics and Automation (ICRA) 2025

Abstract:Recently there has been growing interest in the use of aerial and satellite map data for autonomous vehicles, primarily due to its potential for significant cost reduction and enhanced scalability. Despite the advantages, aerial data also comes with challenges such as a sensor-modality gap and a viewpoint difference gap. Learned localization methods have shown promise for overcoming these challenges to provide precise metric localization for autonomous vehicles. Most learned localization methods rely on coarsely aligned ground truth, or implicit consistency-based methods to learn the localization task – however, in this paper we find that improving the alignment between aerial data and autonomous vehicle sensor data at training time is critical to the performance of a learning-based localization system. We compare two data alignment methods using a factor graph framework and, using these methods, we then evaluate the effects of closely aligned ground truth on learned localization accuracy through ablation studies. Finally, we evaluate a learned localization system using the data alignment methods on a comprehensive (1600 km) autonomous vehicle dataset and demonstrate localization error below 0.3 m and 0.5°, sufficient for autonomous vehicle applications.

[CV-104] Exploiting Inherent Class Label: Towards Robust Scribble Supervised Semantic Segmentation

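【Quick read】: This paper tackles two main problems in scribble-based weakly supervised semantic segmentation: the sparsity of scribble annotations can lead to inconsistent predictions, and the variability among annotators' preferences can prevent the model from consistently capturing the discriminative regions of objects. To address them, the paper proposes a holistic framework called the class-driven scribble promotion network. The key is that the framework uses not only the provided scribble annotations but also their associated class labels to generate reliable pseudo-labels, adds a localization rectification module to reduce the impact of noisy labels, and adds a distance perception module to identify reliable regions around scribble annotations and pseudo-labels, yielding more robust scribble-supervised semantic segmentation.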

Link: https://arxiv.org/abs/2503.13895
Authors: Xinliang Zhang, Lei Zhu, Shuang Zeng, Hangzhou He, Ourui Fu, Zhengjian Yao, Zhaoheng Xie, Yanye Lu
Affiliations: Institute of Medical Technology, Peking University Health Science Center, Peking University, Beijing; Department of Biomedical Engineering, Peking University, Beijing, China; National Biomedical Imaging Center, Peking University, Beijing 100871, China; Peking University Shenzhen Graduate School, Shenzhen 518055, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Scribble-based weakly supervised semantic segmentation leverages only a few annotated pixels as labels to train a segmentation model, presenting significant potential for reducing the human labor involved in the annotation process. This approach faces two primary challenges: first, the sparsity of scribble annotations can lead to inconsistent predictions due to limited supervision; second, the variability in scribble annotations, reflecting differing human annotator preferences, can prevent the model from consistently capturing the discriminative regions of objects, potentially leading to unstable predictions. To address these issues, we propose a holistic framework, the class-driven scribble promotion network, for robust scribble-supervised semantic segmentation. This framework not only utilizes the provided scribble annotations but also leverages their associated class labels to generate reliable pseudo-labels. Within the network, we introduce a localization rectification module to mitigate noisy labels and a distance perception module to identify reliable regions surrounding scribble annotations and pseudo-labels. In addition, we introduce new large-scale benchmarks, ScribbleCOCO and ScribbleCityscapes, accompanied by a scribble simulation algorithm that enables evaluation across varying scribble styles. Our method demonstrates competitive performance in both accuracy and robustness, underscoring its superiority over existing approaches. The datasets and the codes will be made publicly available.

[CV-105] Where do Large Vision-Language Models Look at when Answering Questions?

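【Quick read】: This paper investigates how large vision-language models (LVLMs) behave in visual understanding and reasoning tasks, specifically how much they rely on visual input and which image regions contribute to their responses. Because LVLMs have complicated visual architectures (e.g., multiple encoders and multi-resolution) and variable-length outputs, their free-form generation is hard to interpret. The key is to extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs in open-ended visual question answering, and to propose a method for selecting visually relevant tokens that reflect the relevance between the generated answer and the input image. A comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information then gives further insight into model behavior.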

Link: https://arxiv.org/abs/2503.13891
Authors: Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu
Affiliations: Bytedance Intelligent Creation; Northwestern University; Oregon State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at this https URL.

[CV-106] YOLO-LLTS: Real-Time Low-Light Traffic Sign Detection via Prior-Guided Enhancement and Multi-Branch Feature Interaction

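【Quick read】: This paper addresses traffic sign detection under low-light conditions and proposes YOLO-LLTS, an end-to-end real-time traffic sign detection algorithm. The solution has three key modules: first, the High-Resolution Feature Map for Small Object Detection (HRFM-TOD) module uses high-resolution feature maps to mitigate the feature dilution problem of conventional PANet frameworks, improving both detection accuracy and inference speed; second, the Multi-branch Feature Interaction Attention (MFIA) module enables deep feature interaction across multiple receptive fields in both channel and spatial dimensions, significantly strengthening information extraction; third, the Prior-Guided Enhancement Module (PGFE) handles low-light image quality problems such as noise, low contrast, and blur by using prior knowledge to enrich image details and improve visibility, substantially boosting detection performance. The paper also builds a new dataset covering diverse nighttime scenes, the Chinese Nighttime Traffic Sign Sample Set (CNTSSS). Experiments show state-of-the-art results on multiple datasets, and deployment tests on edge devices confirm real-time applicability and effectiveness.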

Link: https://arxiv.org/abs/2503.13883
Authors: Ziyu Lin, Yunfan Wu, Yuhang Ma, Junzhou Chen, Ronghui Zhang, Jiaming Wu, Guodong Yin, Liang Lin
Affiliations: Guangdong Key Laboratory of Intelligent Transportation System, School of Intelligent Systems Engineering, Sun Yat-sen University; Department of Architecture and Civil Engineering, Chalmers University of Technology; School of Mechanical Engineering, Southeast University; School of Computer Science and Engineering, Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Detecting traffic signs effectively under low-light conditions remains a significant challenge. To address this issue, we propose YOLO-LLTS, an end-to-end real-time traffic sign detection algorithm specifically designed for low-light environments. Firstly, we introduce the High-Resolution Feature Map for Small Object Detection (HRFM-TOD) module to address indistinct small-object features in low-light scenarios. By leveraging high-resolution feature maps, HRFM-TOD effectively mitigates the feature dilution problem encountered in conventional PANet frameworks, thereby enhancing both detection accuracy and inference speed. Secondly, we develop the Multi-branch Feature Interaction Attention (MFIA) module, which facilitates deep feature interaction across multiple receptive fields in both channel and spatial dimensions, significantly improving the model’s information extraction capabilities. Finally, we propose the Prior-Guided Enhancement Module (PGFE) to tackle common image quality challenges in low-light environments, such as noise, low contrast, and blurriness. This module employs prior knowledge to enrich image details and enhance visibility, substantially boosting detection performance. To support this research, we construct a novel dataset, the Chinese Nighttime Traffic Sign Sample Set (CNTSSS), covering diverse nighttime scenarios, including urban, highway, and rural environments under varying weather conditions. Experimental evaluations demonstrate that YOLO-LLTS achieves state-of-the-art performance, outperforming the previous best methods by 2.7% mAP50 and 1.6% mAP50:95 on TT100K-night, 1.3% mAP50 and 1.9% mAP50:95 on CNTSSS, and achieving superior results on the CCTSDB2021 dataset. Moreover, deployment experiments on edge devices confirm the real-time applicability and effectiveness of our proposed approach.

[CV-107] MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation ICLR2025

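【Quick read】: This paper addresses the fact that existing reasoning segmentation datasets focus mainly on single-target, object-level reasoning, which limits fine-grained recognition of objects and their detailed parts in multi-target scenarios. The key solution is a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR), which contains 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, built on pre-existing image-mask sets. The paper also proposes a simple yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experiments show that the proposed framework reasons effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.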

Link: https://arxiv.org/abs/2503.13881
Authors: Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, Dae-Shik Kim
Affiliations: Department of Electrical Engineering, KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025. Code and dataset are available at this https URL

Abstract:The fusion of Large Language Models with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to "turn on the TV", there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV’s button or the remote’s button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on a single target object-level reasoning, which limits the detailed recognition of an object’s parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.

[CV-108] Robust3D-CIL: Robust Class-Incremental Learning for 3D Perception

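【Quick read】: This paper addresses the aggravated forgetting that corrupted data causes when real-world 3D point clouds are handled in a class-incremental learning (CIL) setting. The key contribution is an exemplar selection strategy inspired by Farthest Point Sampling that preserves intra-class diversity when choosing replay exemplars, mitigating forgetting induced by data corruption. The paper also introduces a replay method based on point cloud downsampling that uses the limited replay buffer memory more efficiently, further improving continual learning. Experiments show that the method improves replay-based CIL baselines by 2% to 11%, demonstrating its effectiveness and potential for real-world 3D applications.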

Link: https://arxiv.org/abs/2503.13869
Authors: Jinge Ma, Jiangpeng He, Fengqing Zhu
Affiliations: Purdue University; Massachusetts Institute of Technology; Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures

Abstract:3D perception plays a crucial role in real-world applications such as autonomous driving, robotics, and AR/VR. In practical scenarios, 3D perception models must continuously adapt to new data and emerging object categories, but retraining from scratch incurs prohibitive costs. Therefore, adopting class-incremental learning (CIL) becomes particularly essential. However, real-world 3D point cloud data often include corrupted samples, which poses significant challenges for existing CIL methods and leads to more severe forgetting on corrupted data. To address these challenges, we consider the scenario in which a CIL model can be updated using point clouds with unknown corruption to better simulate real-world conditions. Inspired by Farthest Point Sampling, we propose a novel exemplar selection strategy that effectively preserves intra-class diversity when selecting replay exemplars, mitigating forgetting induced by data corruption. Furthermore, we introduce a point cloud downsampling-based replay method to utilize the limited replay buffer memory more efficiently, thereby further enhancing the model’s continual learning ability. Extensive experiments demonstrate that our method improves the performance of replay-based CIL baselines by 2% to 11%, proving its effectiveness and promising potential for real-world 3D applications.
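
The exemplar selection strategy is described as inspired by Farthest Point Sampling (FPS). The abstract does not say which space the sampling runs in or how it is adapted, so the following is only a minimal sketch of classic greedy FPS, applied here to hypothetical per-sample feature vectors. The function name, the `start` parameter, and the toy data are illustrative assumptions.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices, each time taking the point farthest from those already picked."""
    selected = [start]
    # Distance from every point to its nearest selected point so far
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(k, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Hypothetical 2D features of five samples from one class: two tight pairs and an outlier
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (10.0, 0.0)]
exemplars = farthest_point_sampling(features, k=3)
```

With a budget of three, the greedy rule picks one sample from each well-separated region instead of two near-duplicates, which is the diversity-preserving behavior the summary refers to.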

[CV-109] HySurvPred: Multimodal Hyperbolic Embedding with Angle-Aware Hierarchical Contrastive Learning and Uncertainty Constraints for Survival Prediction IJCAI2025

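【Quick read】: This paper addresses three key challenges for multimodal learning in cancer survival prediction: 1) existing methods rely on multimodal mapping and metrics in Euclidean space, which cannot fully capture the hierarchical structures in histopathology images (among patches at different resolutions) and genomic data (from genes to pathways); 2) they discretize survival time into independent risk intervals, ignoring its continuous and ordinal nature and leading to ineffective optimization; 3) they treat censorship as a binary indicator and exclude censored samples from model optimization, so these data are not fully used. To address these challenges, the paper proposes HySurvPred, a framework built on three core modules: Multimodal Hyperbolic Mapping (MHM), Angle-aware Ranking-based Contrastive Loss (ARCL), and Censor-Conditioned Uncertainty Constraint (CUC). The MHM module explores the inherent hierarchical structure within each modality in hyperbolic space, while the ARCL and CUC modules preserve the ordinal nature of survival time and fully exploit censored data, enabling more effective multimodal feature integration and model optimization. Experiments show that the method outperforms state-of-the-art methods on five benchmark datasets.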

Link: https://arxiv.org/abs/2503.13862
Authors: Jiaqi Yang, Wenting Chen, Xiaohan Xing, Sean He, Xiaoling Luo, Xinheng Lyu, Linlin Shen, Guoping Qiu
Affiliations: Shenzhen University; University of Nottingham, Ningbo; City University of Hong Kong; Stanford University; University of Nottingham, UK
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to IJCAI 2025

Abstract:Multimodal learning that integrates histopathology images and genomic data holds great promise for cancer survival prediction. However, existing methods face key limitations: 1) They rely on multimodal mapping and metrics in Euclidean space, which cannot fully capture the hierarchical structures in histopathology (among patches from different resolutions) and genomics data (from genes to pathways). 2) They discretize survival time into independent risk intervals, which ignores its continuous and ordinal nature and fails to achieve effective optimization. 3) They treat censorship as a binary indicator, excluding censored samples from model optimization and not making full use of them. To address these challenges, we propose HySurvPred, a novel framework for survival prediction that integrates three key modules: Multimodal Hyperbolic Mapping (MHM), Angle-aware Ranking-based Contrastive Loss (ARCL) and Censor-Conditioned Uncertainty Constraint (CUC). Instead of relying on Euclidean space, we design the MHM module to explore the inherent hierarchical structures within each modality in hyperbolic space. To better integrate multimodal features in hyperbolic space, we introduce the ARCL module, which uses ranking-based contrastive learning to preserve the ordinal nature of survival time, along with the CUC module to fully explore the censored data. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on five benchmark datasets. The source code is to be released.
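
To illustrate why hyperbolic space suits hierarchies, here is a small sketch of the geodesic distance in the Poincaré ball, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The abstract does not say which hyperbolic model HySurvPred uses, so the choice of the Poincaré ball, the function name, and the sample points are assumptions made for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# The same Euclidean gap of 0.1 measured near the origin and near the boundary
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
origin_to_half = poincare_distance((0.0, 0.0), (0.5, 0.0))
```

Distances grow rapidly toward the boundary, so there is exponentially more room for the many fine-grained leaves of a tree (for example small patches or individual genes) than for the few coarse nodes near the origin.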

[CV-110] RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving

Quick read: This paper targets accurate understanding of, and decisions about, high-level meta-actions, which are needed for reliable and safe autonomous driving. Existing vision-language models (VLMs) often suffer from inadequate spatial perception and hallucination in complex driving scenes. The paper proposes a retrieval-augmented decision-making (RAD) framework whose key component is a retrieval-augmented generation (RAG) pipeline that improves decision accuracy through a three-stage process: an embedding flow, a retrieving flow, and a generating flow. In addition, the VLMs are fine-tuned on a dataset curated from NuScenes to strengthen spatial perception and bird's-eye-view image comprehension. On the curated dataset, RAD outperforms baseline methods on match accuracy, F1 score, and a self-defined overall score.

Link: https://arxiv.org/abs/2503.13861
Authors: Yujin Wang,Quanfeng Liu,Zhengxin Jiang,Tianyi Wang,Junfeng Jiao,Hongqing Chu,Bingzhao Gao,Hong Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurately understanding and deciding high-level meta-actions is essential for ensuring reliable and safe autonomous driving systems. While vision-language models (VLMs) have shown significant potential in various autonomous driving tasks, they often suffer from limitations such as inadequate spatial perception and hallucination, reducing their effectiveness in complex autonomous driving scenarios. To address these challenges, we propose a retrieval-augmented decision-making (RAD) framework, a novel architecture designed to enhance VLMs’ capabilities to reliably generate meta-actions in autonomous driving scenes. RAD leverages a retrieval-augmented generation (RAG) pipeline to dynamically improve decision accuracy through a three-stage process consisting of the embedding flow, retrieving flow, and generating flow. Additionally, we fine-tune VLMs on a specifically curated dataset derived from the NuScenes dataset to enhance their spatial perception and bird’s-eye view image comprehension capabilities. Extensive experimental evaluations on the curated NuScenes-based dataset demonstrate that RAD outperforms baseline methods across key evaluation metrics, including match accuracy, F1 score, and a self-defined overall score, highlighting its effectiveness in improving meta-action decision-making for autonomous driving tasks.
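The retrieving flow of a RAG pipeline can be sketched as a nearest-neighbour lookup over stored scene embeddings. The example below is only an illustration of that general step: the embeddings, the meta-action labels, and the function names are hypothetical, and the abstract does not specify the similarity measure RAD actually uses (cosine similarity is assumed here).

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_top_k(query, database, k=2):
    """Return the k (similarity, meta_action) pairs most similar to the query."""
    scored = [(cosine(query, emb), action) for emb, action in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical scene embeddings paired with hypothetical meta-action labels.
database = [
    ([1.0, 0.0, 0.0], "keep lane"),
    ([0.0, 1.0, 0.0], "turn left"),
    ([0.9, 0.1, 0.0], "slow down"),
]
top = retrieve_top_k([1.0, 0.05, 0.0], database, k=2)
```

In a full pipeline the retrieved examples would be placed in the VLM prompt before generation; that step is omitted here.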

[CV-111] Less is More: Improving Motion Diffusion Models with Sparse Keyframes

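Quick read: This paper addresses the training complexity and redundancy that arise when motion diffusion models represent motion as dense frame sequences, which limits performance on downstream tasks. Because dense sequences force the model to process redundant or uninformative frames, learning the distribution of large motion datasets becomes difficult. The paper proposes a diffusion framework built around sparse, geometrically meaningful keyframes. It reduces computation by masking non-keyframes and efficiently interpolating the missing frames, and it dynamically refines the keyframe mask during inference to prioritize informative frames in later diffusion steps. The approach improves text alignment and motion realism and maintains high performance with significantly fewer diffusion steps.

Link: https://arxiv.org/abs/2503.13859
Authors: Jinseok Bae,Inwoo Hwang,Young Yoon Lee,Ziyu Guo,Joseph Liu,Yizhak Ben-Shabat,Young Min Kim,Mubbasir Kapadia
Affiliations: Seoul National University; Roblox; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: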

Abstract:Recent advances in motion diffusion models have led to remarkable progress in diverse motion generation tasks, including text-to-motion synthesis. However, existing approaches represent motions as dense frame sequences, requiring the model to process redundant or less informative frames. The processing of dense animation frames imposes significant training complexity, especially when learning intricate distributions of large motion datasets even with modern neural architectures. This severely limits the performance of generative motion models for downstream tasks. Inspired by professional animators who mainly focus on sparse keyframes, we propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes. Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames. We dynamically refine the keyframe mask during inference to prioritize informative frames in later diffusion steps. Extensive experiments show that our approach consistently outperforms state-of-the-art methods in text alignment and motion realism, while also effectively maintaining high performance at significantly fewer diffusion steps. We further validate the robustness of our framework by using it as a generative prior and adapting it to different downstream tasks. Source code and pre-trained models will be released upon acceptance.
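As a rough illustration of "masking non-keyframes and interpolating missing frames", the sketch below keeps only selected keyframes and fills the gaps by linear interpolation. Linear interpolation and the toy one-dimensional poses are assumptions made for illustration; the abstract does not state which interpolation scheme or keyframe selection rule the authors use.

```python
def interpolate_from_keyframes(frames, keyframe_idx):
    """Rebuild a dense sequence by linearly interpolating between keyframes.

    frames: list of per-frame pose vectors (lists of floats).
    keyframe_idx: sorted, distinct frame indices to keep; it should contain
    at least the first and the last frame so every frame gets filled in.
    """
    dense = [None] * len(frames)
    for left, right in zip(keyframe_idx, keyframe_idx[1:]):
        for t in range(left, right + 1):
            w = (t - left) / (right - left)  # 0 at the left keyframe, 1 at the right
            dense[t] = [(1 - w) * a + w * b
                        for a, b in zip(frames[left], frames[right])]
    return dense

# Toy 1-D motion: frames 1 and 3 are treated as masked (their values are ignored).
motion = [[0.0], [9.9], [2.0], [9.9], [6.0]]
dense = interpolate_from_keyframes(motion, [0, 2, 4])
```

Only the three keyframes carry information here, so a model operating on them processes three frames instead of five.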

[CV-112] MamBEV: Enabling State Space Models to Learn Birds-Eye-View Representations

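Quick read: This paper addresses the low computational efficiency of 3D visual perception tasks such as 3D detection from multi-camera images. It proposes MamBEV, a Mamba-based framework that learns unified bird's-eye-view (BEV) representations using linear spatio-temporal attention based on state space models (SSMs), and it introduces SSM-based cross-attention through which BEV query representations interact with relevant image features. The approach significantly improves computational and memory efficiency and shows promising performance across several 3D perception tasks.

Link: https://arxiv.org/abs/2503.13858
Authors: Hongyu Ke,Jack Morris,Kentaro Oguchi,Xiaofei Cao,Yongkang Liu,Haoxin Wang,Yi Ding
Affiliations: Georgia State University; InfoTech Labs, Toyota Motor North America R&D
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: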

Abstract:3D visual perception tasks, such as 3D detection from multi-camera images, are essential components of autonomous driving and assistance systems. However, designing computationally efficient methods remains a significant challenge. In this paper, we propose a Mamba-based framework called MamBEV, which learns unified Bird’s Eye View (BEV) representations using linear spatio-temporal SSM-based attention. This approach supports multiple 3D perception tasks with significantly improved computational and memory efficiency. Furthermore, we introduce SSM based cross-attention, analogous to standard cross attention, where BEV query representations can interact with relevant image features. Extensive experiments demonstrate MamBEV’s promising performance across diverse visual perception metrics, highlighting its advantages in input scaling efficiency compared to existing benchmark models.
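The "linear" cost of SSM-based layers comes from processing a sequence with a recurrence instead of comparing every pair of positions. The following is a generic, scalar state space recurrence with fixed parameters, shown only to illustrate that property; it is not MamBEV's layer, and Mamba-style models additionally make the parameters depend on the input.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c * h_t over a sequence."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # state update: one step per input, so cost is linear in length
        ys.append(c * h)   # readout from the hidden state
    return ys

# An impulse at the first step decays geometrically through the state.
ys = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
```

Doubling the sequence length doubles the work, whereas standard attention grows quadratically with length.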

[CV-113] Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic

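Quick read: This paper addresses the difficulty of separating what a multimodal system learns during fine-tuning from what it already knew from pre-training. The key idea is to learn a probabilistic model with Hybrid Markov Logic Networks (HMLNs) that relates visual features extracted from the image to symbolic knowledge extracted from the caption, and to quantify the influence of training examples on a generated caption using the HMLN distribution. Two inference procedures are evaluated on the MSCOCO dataset across different types of captioning models.

Link: https://arxiv.org/abs/2503.13847
Authors: Monika Shah,Somdeb Sarkhel,Deepak Venugopal
Affiliations: University of Memphis; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 2024 IEEE International Conference on Big Data (BigData), 10 pages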

Abstract:Multimodal systems have highly complex processing pipelines and are pretrained over large datasets before being fine-tuned for specific tasks such as visual captioning. However, it becomes hard to disentangle what the model learns during the fine-tuning process from what it already knows due to its pretraining. In this work, we learn a probabilistic model using Hybrid Markov Logic Networks (HMLNs) over the training examples by relating symbolic knowledge (extracted from the caption) with visual features (extracted from the image). For a generated caption, we quantify the influence of training examples based on the HMLN distribution using probabilistic inference. We evaluate two types of inference procedures on the MSCOCO dataset for different types of captioning models. Our results show that for BLIP2 (a model that uses an LLM), the fine-tuning may have smaller influence on the knowledge the model has acquired since it may have more general knowledge to perform visual captioning as compared to models that do not use an LLM.

[CV-114] SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing CVPR2025

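Quick read: This paper addresses two limitations of text-driven motion generation. First, existing methods oversimplify the representations of skeletal joints, temporal frames, and textual words, which limits how well they capture each modality and the interactions among them. Second, using pre-trained models for downstream tasks such as editing typically requires manual intervention, optimization, or fine-tuning. The paper proposes Skeleton-Aware Latent Diffusion (SALAD), which explicitly models the relationships among joints, frames, and words. By leveraging the cross-attention maps produced during generation, it enables attention-based zero-shot text-driven motion editing that needs no user input beyond text prompts. The method significantly improves text-motion alignment without compromising generation quality and offers diverse editing capabilities.

Link: https://arxiv.org/abs/2503.13836
Authors: Seokhyeon Hong,Chaelin Kim,Serin Yoon,Junghyun Nam,Sihun Cha,Junyong Noh
Affiliations: Visual Media Lab, KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: CVPR 2025; Project page this https URL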

Abstract:Text-driven motion generation has advanced significantly with the rise of denoising diffusion models. However, previous methods often oversimplify representations for the skeletal joints, temporal frames, and textual words, limiting their ability to fully capture the information within each modality and their interactions. Moreover, when using pre-trained models for downstream tasks, such as editing, they typically require additional efforts, including manual interventions, optimization, or fine-tuning. In this paper, we introduce a skeleton-aware latent diffusion (SALAD), a model that explicitly captures the intricate inter-relationships between joints, frames, and words. Furthermore, by leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing using a pre-trained SALAD model, requiring no additional user input beyond text prompts. Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality, and demonstrates practical versatility by providing diverse editing capabilities beyond generation. Code is available at project page.

[CV-115] See-Saw Modality Balance: See Gradient and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias NAACL2025

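Quick read: This paper addresses the performance degradation that "dominant modality bias" causes in vision-language (VL) models, especially when one modality is impaired. Its analysis shows that the bias stems from unaligned gradients or differences in gradient magnitude across modalities, which prevent balanced convergence of the loss. To mitigate this, the paper proposes BalGrad, which combines inter-modality gradient reweighting (adjusting the gradient of the KL divergence according to each modality's contribution) with inter-task gradient projection (aligning task directions in a non-conflicting manner). Experiments on UPMC Food-101, Hateful Memes, and MM-IMDb confirm that BalGrad alleviates over-reliance on a specific modality.

Link: https://arxiv.org/abs/2503.13834
Authors: JuneHyoung Kwon,MiHyeon Kim,Eunju Lee,Juhwan Choi,YoungBin Kim
Affiliations: Department of Artificial Intelligence, Chung-Ang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NAACL 2025 Main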

Abstract:Vision-language (VL) models have demonstrated strong performance across various tasks. However, these models often rely on a specific modality for predictions, leading to "dominant modality bias." This bias significantly hurts performance, especially when one modality is impaired. In this study, we analyze model behavior under dominant modality bias and theoretically show that unaligned gradients or differences in gradient magnitudes prevent balanced convergence of the loss. Based on these findings, we propose a novel framework, BalGrad, to mitigate dominant modality bias. Our approach includes inter-modality gradient reweighting, adjusting the gradient of KL divergence based on each modality’s contribution, and inter-task gradient projection to align task directions in a non-conflicting manner. Experiments on UPMC Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively alleviates over-reliance on specific modalities when making predictions.
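One well-known way to align gradients "in a non-conflicting manner" is to project a gradient onto the plane orthogonal to another gradient whenever the two point in opposing directions, in the style of PCGrad. The sketch below shows that generic projection; whether BalGrad uses exactly this rule is an assumption, and the function name is hypothetical.

```python
def project_conflicting(g_task, g_other):
    """Remove from g_task the component that conflicts with g_other.

    If the two gradients already agree (non-negative dot product),
    g_task is returned unchanged.
    """
    dot = sum(a * b for a, b in zip(g_task, g_other))
    if dot >= 0.0:
        return list(g_task)
    norm_sq = sum(b * b for b in g_other)
    return [a - (dot / norm_sq) * b for a, b in zip(g_task, g_other)]

# Toy gradients: the second component of g_task opposes g_other.
projected = project_conflicting([1.0, -1.0], [0.0, 1.0])
```

After projection the two gradients are orthogonal, so a step along one no longer undoes progress on the other.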

[CV-116] Scale-Aware Contrastive Reverse Distillation for Unsupervised Medical Anomaly Detection ICLR2025

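Quick read: This paper addresses two limitations of existing reverse distillation methods for unsupervised anomaly detection: insufficient feature discriminability and an inability to handle variations in anomaly scale. (Earlier generative approaches, by contrast, fall short because of overgeneralization.) It proposes a scale-aware contrastive reverse distillation model that uses contrastive student-teacher learning to obtain more discriminative representations and a scale adaptation mechanism that softly weights the contrastive distillation losses at different scales. Experiments on benchmark datasets show state-of-the-art performance.

Link: https://arxiv.org/abs/2503.13828
Authors: Chunlei Li,Yilei Shi,Jingliang Hu,Xiao Xiang Zhu,Lichao Mou
Affiliations: MedAI Technology (Wuxi) Co. Ltd.; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025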

Abstract:Unsupervised anomaly detection using deep learning has garnered significant research attention due to its broad applicability, particularly in medical imaging where labeled anomalous data are scarce. While earlier approaches leverage generative models like autoencoders and generative adversarial networks (GANs), they often fall short due to overgeneralization. Recent methods explore various strategies, including memory banks, normalizing flows, self-supervised learning, and knowledge distillation, to enhance discrimination. Among these, knowledge distillation, particularly reverse distillation, has shown promise. Following this paradigm, we propose a novel scale-aware contrastive reverse distillation model that addresses two key limitations of existing reverse distillation methods: insufficient feature discriminability and inability to handle anomaly scale variations. Specifically, we introduce a contrastive student-teacher learning approach to derive more discriminative representations by generating and exploring out-of-normal distributions. Further, we design a scale adaptation mechanism to softly weight contrastive distillation losses at different scales to account for the scale variation issue. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, validating the efficacy of the proposed method. Code is available at this https URL.

[CV-117] Stitch-a-Recipe: Video Demonstration from Multistep Descriptions

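Quick read: This paper addresses the inability of existing methods to produce a coherent visual demonstration for a multistep description such as a cooking recipe. Current methods take a single caption or action description and retrieve or generate matching visual content for that one step, so handling each step in isolation yields an incoherent result. The paper proposes Stitch-a-Recipe, a retrieval-based method that assembles a video demonstration from a multistep description so that the clips accurately reflect every step while remaining visually coherent. Its core is a training pipeline that creates large-scale weakly supervised data containing diverse and novel recipes and injects hard negatives that promote both correctness and coherence.

Link: https://arxiv.org/abs/2503.13821
Authors: Chi Hsuan Wu,Kumar Ashutosh,Kristen Grauman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: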

Abstract:When obtaining visual illustrations from text descriptions, today’s methods take a description with a single text context (a caption or an action description) and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g. a cooking recipe composed of multiple steps. Furthermore, simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Recipe, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions, while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse and novel recipes and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Recipe achieves state-of-the-art performance, with quantitative gains up to 24% as well as dramatic wins in a human preference study.

[CV-118] MOSAIC: Generating Consistent Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

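Quick read: This paper addresses the generation of privacy-preserving digital twins of multi-room indoor environments from depth images alone. Its key contribution is the Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model, which explicitly accounts for cross-view dependencies within the same scene in a probabilistic sense. MOSAIC relies on inference-time optimization, which avoids the error accumulation common in sequential or single-room-constrained panorama-based approaches, and it scales to complex scenes without extra training. As more overlapping views are added, MOSAIC provably reduces the variance of the denoising process, improving generation quality. Experiments show that it outperforms state-of-the-art baselines on image fidelity metrics when reconstructing complex multi-room environments.

Link: https://arxiv.org/abs/2503.13816
Authors: Zhixuan Liu,Haokun Zhu,Rui Chen,Jonathan Francis,Soonmin Hwang,Ji Zhang,Jean Oh
Affiliations: Carnegie Mellon University; Bosch Center for AI; Hanyang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: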

Abstract:We introduce a novel diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a novel inference-time optimization that avoids error accumulation common in sequential or single-room constraint in panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during denoising processes when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments. Project page is available at: this https URL

[CV-119] FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification

Quick read: This paper explores the potential of world models in remote sensing and, to address low label efficiency in multimodal data fusion, proposes FusDreamer, a label-efficient remote sensing world model. It has three key components. First, a latent diffusion fusion and multimodal generation paradigm (LaMG) provides information integration and detail retention. Second, an open-world knowledge-guided consistency projection (OK-CP) module incorporates prompt representations and aligns language-visual features through contrastive learning, bridging the domain gap with limited samples. Third, an end-to-end multitask combinatorial optimization (MuCO) strategy captures slight feature bias and constrains the diffusion process toward a collaboratively learnable direction. Together these improve cross-modal interaction and learning efficiency.

Link: https://arxiv.org/abs/2503.13814
Authors: Jinping Wang,Weiwei Song,Hao Chen,Jinchang Ren,Huimin Zhao
Affiliations: School of Computer Sciences, Guangdong Polytechnic Normal University, Guangzhou, 510665, China; Guangdong Provincial Key Laboratory of Intellectual Property and Big Data, Guangdong Polytechnic Normal University, Guangzhou, 510665, China; Peng Cheng Laboratory, Shenzhen, 518000, China; Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, CB3 0WA, U.K.; School of Computer Sciences, Guangdong Polytechnic Normal University, Guangzhou 510640, China; National Subsea Centre, School of Computing, Engineering and Technology, Robert Gordon University, AB10 7AQ Aberdeen, U.K.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:World models significantly enhance hierarchical understanding, improving data integration and learning efficiency. To explore the potential of the world model in the remote sensing (RS) field, this paper proposes a label-efficient remote sensing world model for multimodal data fusion (FusDreamer). The FusDreamer uses the world model as a unified representation container to abstract common and high-level knowledge, promoting interactions across different types of data, i.e., hyperspectral (HSI), light detection and ranging (LiDAR), and text data. Initially, a new latent diffusion fusion and multimodal generation paradigm (LaMG) is utilized for its exceptional information integration and detail retention capabilities. Subsequently, an open-world knowledge-guided consistency projection (OK-CP) module incorporates prompt representations for visually described objects and aligns language-visual features through contrastive learning. In this way, the domain gap can be bridged by fine-tuning the pre-trained world models with limited samples. Finally, an end-to-end multitask combinatorial optimization (MuCO) strategy can capture slight feature bias and constrain the diffusion process in a collaboratively learnable direction. Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer. The corresponding code will be released at this https URL.

[CV-120] Organ-aware Multi-scale Medical Image Segmentation Using Text Prompt Engineering

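Quick read: This paper addresses two problems in medical image segmentation: reliance on uni-modal visual inputs (images or videos), which demands labor-intensive manual annotation, and the presence of multiple intertwined organs in a single scan, which hurts segmentation accuracy. In addition, existing models such as MedSAM rely mainly on geometric prompts (points and bounding boxes) and lack support for text prompts, which could help specify subtle or ambiguous anatomical structures. The paper proposes the Organ-aware Multi-scale Text-guided Medical Image Segmentation Model (OMT-SAM) for multi-organ segmentation. It introduces CLIP encoders as an image-text prompt encoder that works alongside the geometric prompt encoder, generating fused image-text embeddings through pre-trained CLIP encoders and a cross-attention mechanism. It also extracts multi-scale visual features from MedSAM to capture fine-grained anatomical detail at different levels of granularity.

Link: https://arxiv.org/abs/2503.13806
Authors: Wenjie Zhang,Ziyang Zhang,Mengnan He,Jiancheng Ye
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: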

Abstract:Accurate segmentation is essential for effective treatment planning and disease monitoring. Existing medical image segmentation methods predominantly rely on uni-modal visual inputs, such as images or videos, requiring labor-intensive manual annotations. Additionally, medical imaging techniques capture multiple intertwined organs within a single scan, further complicating segmentation accuracy. To address these challenges, MedSAM, a large-scale medical segmentation model based on the Segment Anything Model (SAM), was developed to enhance segmentation accuracy by integrating image features with user-provided prompts. While MedSAM has demonstrated strong performance across various medical segmentation tasks, it primarily relies on geometric prompts (e.g., points and bounding boxes) and lacks support for text-based prompts, which could help specify subtle or ambiguous anatomical structures. To overcome these limitations, we propose the Organ-aware Multi-scale Text-guided Medical Image Segmentation Model (OMT-SAM) for multi-organ segmentation. Our approach introduces CLIP encoders as a novel image-text prompt encoder, operating with the geometric prompt encoder to provide informative contextual guidance. We pair descriptive textual prompts with corresponding images, processing them through pre-trained CLIP encoders and a cross-attention mechanism to generate fused image-text embeddings. Additionally, we extract multi-scale visual features from MedSAM, capturing fine-grained anatomical details at different levels of granularity. We evaluate OMT-SAM on the FLARE 2021 dataset, benchmarking its performance against existing segmentation methods. Empirical results demonstrate that OMT-SAM achieves a mean Dice Similarity Coefficient of 0.937, outperforming MedSAM (0.893) and other segmentation models, highlighting its superior capability in handling complex medical image segmentation tasks.
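The Dice Similarity Coefficient reported above measures the overlap between a predicted mask and the ground truth: twice the intersection divided by the sum of the two mask sizes. A minimal sketch follows; the toy masks are made up, and returning 1.0 for two empty masks is a convention assumed here, not something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# Toy masks: one of the two predicted pixels overlaps the single true pixel.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```

A mean Dice score such as the 0.937 quoted in the abstract is typically the average of this quantity over organs or cases.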

[CV-121] Text-Guided Image Invariant Feature Learning for Robust Image Watermarking

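Quick read: This paper addresses the limited robustness of image watermarking under diverse transformations, with the goal of preserving content integrity. Self-supervised learning (SSL) approaches such as DINO focus on general feature representation rather than explicitly learning invariant features. The paper proposes a text-guided invariant feature learning framework for robust image watermarking. Its key idea is to use CLIP's multimodal capabilities, treating text embeddings as stable semantic anchors that enforce feature invariance under distortion. Across multiple datasets, the method achieves higher cosine similarity in feature consistency tests than state-of-the-art SSL methods and better extraction accuracy than existing watermarking schemes under severe distortions.

Link: https://arxiv.org/abs/2503.13805
Authors: Muhammad Ahtesham,Xin Zhong
Affiliations: Department of Computer Science, University of Nebraska Omaha
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: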

Abstract:Ensuring robustness in image watermarking is crucial for maintaining content integrity under diverse transformations. Recent self-supervised learning (SSL) approaches, such as DINO, have been leveraged for watermarking but primarily focus on general feature representation rather than explicitly learning invariant features. In this work, we propose a novel text-guided invariant feature learning framework for robust image watermarking. Our approach leverages CLIP’s multimodal capabilities, using text embeddings as stable semantic anchors to enforce feature invariance under distortions. We evaluate the proposed method across multiple datasets, demonstrating superior robustness against various image transformations. Compared to state-of-the-art SSL methods, our model achieves higher cosine similarity in feature consistency tests and outperforms existing watermarking schemes in extraction accuracy under severe distortions. These results highlight the efficacy of our method in learning invariant representations tailored for robust deep learning-based watermarking.

[CV-122] SMILE: a Scale-aware Multiple Instance Learning Method for Multicenter STAS Lung Cancer Histopathology Diagnosis

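Quick read: This paper addresses the pathological diagnosis of spread through air spaces (STAS), an aggressive pattern in lung cancer. Pathologists currently rely on time-consuming, subjective manual assessment that is prone to variation, so automated and precise diagnostic tools are needed. The paper proposes a scale-aware multiple instance learning method (SMILE). By introducing a scale-adaptive attention mechanism, the model adaptively adjusts high-attention instances, reducing over-reliance on local regions and promoting consistent detection of STAS lesions. Experiments show competitive diagnostic results on several datasets, surpassing the clinical average AUC, and the reported baselines establish an open benchmark for STAS research that supports future expansion, interpretability, and clinical integration of computational pathology.

Link: https://arxiv.org/abs/2503.13799
Authors: Liangrui Pan,Xiaoyu Li,Yutao Dou,Qiya Song,Jiadi Luo,Qingchun Liang,Shaoliang Peng
Affiliations: College of Computer Science and Electronic Engineering, Hunan University, China; College of Information Science and Engineering, Hunan Normal University, China; Department of Pathology, The Second Xiangya Hospital, Central South University, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: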

Abstract:Spread through air spaces (STAS) represents a newly identified aggressive pattern in lung cancer, which is known to be associated with adverse prognostic factors and complex pathological features. Pathologists currently rely on time-consuming manual assessments, which are highly subjective and prone to variation. This highlights the urgent need for automated and precise diagnostic solutions. We collected 2,970 lung cancer tissue slides from multiple centers, re-diagnosed them, and constructed and publicly released three lung cancer STAS datasets: STAS CSU (hospital), STAS TCGA, and STAS CPTAC. All STAS datasets provide corresponding pathological feature diagnoses and related clinical data. To address the biased, sparse, and heterogeneous nature of STAS, we propose a scale-aware multiple instance learning (SMILE) method for STAS diagnosis of lung cancer. By introducing a scale-adaptive attention mechanism, SMILE can adaptively adjust high-attention instances, reducing over-reliance on local regions and promoting consistent detection of STAS lesions. Extensive experiments show that SMILE achieved competitive diagnostic results on STAS CSU, diagnosing 251 and 319 STAS samples in CPTAC and TCGA, respectively, surpassing the clinical average AUC. The 11 open baseline results are the first to be established for STAS research, laying the foundation for the future expansion, interpretability, and clinical integration of computational pathology technologies. The datasets and code are available at this https URL.

[CV-123] LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

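Quick read: This paper addresses the biased synthetic data and overfitting that affect open-vocabulary object detection (OVD) methods built on data generation with large foundation models. Directly leveraging the hidden states of large language models (LLMs) can sidestep the biases of manually curated data generation, but this direction has rarely been explored. The paper presents a systematic method that transfers knowledge using intermediate hidden states from the decoder layers of the LLM inside a multimodal LLM (MLLM). Specifically, it designs a zero-initialized cross-attention adapter and names the resulting approach LED (LLM Enhanced Open-Vocabulary Object Detection). Experiments show that intermediate hidden states from early LLM layers retain strong spatial-semantic correlations, which significantly improves performance on complex free-form text queries while leaving performance on plain categories unchanged. For example, on Omnilabel, Qwen2-0.5B with Swin-T as the vision encoder improves GroundingDINO by 2.33% at the cost of 8.7% more GFLOPs, and Qwen2-0.5B with a larger vision encoder further boosts performance by 6.22%. Ablations cover adapter architecture, LLM size, and the choice of layers to adapt.

Link: https://arxiv.org/abs/2503.13794
Authors: Yang Zhou,Shiyu Zhao,Yuxiao Chen,Zhenting Wang,Dimitris N. Metaxas
Affiliations: Rutgers University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: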

Abstract:Large foundation models trained on large-scale visual-text data can significantly enhance Open Vocabulary Object Detection (OVD) through data generation. However, this may lead to biased synthetic data and overfitting to specific configurations. Directly leveraging the hidden states of Large Language Models (LLMs) can sidestep the biases of manually curated data generation, yet this direction is surprisingly rarely explored. This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge transfer from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We demonstrate that intermediate hidden states from early LLM layers retain strong spatial-semantic correlations that are beneficial to grounding tasks. Experiments show that our adaptation strategy significantly enhances the performance on complex free-form text queries while remaining the same on plain categories. With our adaptation, Qwen2-0.5B with Swin-T as the vision encoder improves GroundingDINO by 2.33% on Omnilabel, at the overhead of 8.7% more GFLOPs. Qwen2-0.5B with a larger vision encoder can further boost the performance by 6.22%. We further validate our design by ablating on varied adapter architectures, sizes of LLMs, and which layers to add adaptation.
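A zero-initialized adapter is useful because, at the start of training, it leaves the pre-trained detector's features untouched and only gradually mixes in new information. The sketch below illustrates that property with a single-head cross-attention residual branch whose output projection starts at zero. All names and shapes are hypothetical, learned query/key/value projections are omitted, and this is not LED's actual architecture.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention_adapter(det_feats, llm_states, w_out):
    """Add LLM information to detector features through a residual branch.

    det_feats: queries, shape (n, d); llm_states: keys and values, shape (m, d);
    w_out: output projection, shape (d, d), all zeros at initialization.
    """
    d = len(det_feats[0])
    keys_t = [list(col) for col in zip(*llm_states)]
    scores = matmul(det_feats, keys_t)
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    update = matmul(matmul(attn, llm_states), w_out)
    return [[f + u for f, u in zip(fr, ur)] for fr, ur in zip(det_feats, update)]

det = [[1.0, 2.0], [3.0, 4.0]]                 # hypothetical detector features
llm = [[0.5, -0.5], [1.5, 0.0], [0.0, 2.0]]    # hypothetical LLM hidden states
zero_w = [[0.0, 0.0], [0.0, 0.0]]
out = cross_attention_adapter(det, llm, zero_w)
```

With `zero_w` the output equals the input features exactly; once training moves `w_out` away from zero, the LLM states begin to influence the detector.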

[CV-124] Identifying and Mitigating Position Bias of Multi-image Vision-Language Models CVPR2025

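Quick read: This paper addresses the failure of large vision-language models (LVLMs) to use information robustly across multiple images, in particular the strong effect that changing image positions has on predictions. It introduces Position-wise Question Answering (PQA), a task that quantifies reasoning capability at each position, and finds a pronounced position bias: open-source models reason well about images placed later but underperform on images in the middle or at the beginning, while proprietary models do better on images at the beginning and end but still struggle with those in the middle. To mitigate this, the paper proposes SoFt Attention (SoFA), a simple, training-free approach that linearly interpolates between inter-image causal attention and its bidirectional counterpart. Experiments show that SoFA reduces position bias and improves the reasoning performance of existing LVLMs.

Link: https://arxiv.org/abs/2503.13792
Authors: Xinyu Tian,Shu Zou,Zhaoyuan Yang,Jing Zhang
Affiliations: Australian National University; GE Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR2025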

Abstract:The evolution of Large Vision-Language Models (LVLMs) has progressed from single to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.
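One possible reading of "linear interpolation between inter-image causal attention and bidirectional counterparts" is a convex combination of two attention distributions computed from the same scores, one with a causal mask and one without. The sketch below implements that reading for a toy case in which each position stands for one image; the mixing weight `sigma`, the function names, and the exact point at which the interpolation is applied are assumptions, not details taken from the paper.

```python
import math

def softmax_masked(row, allowed):
    """Softmax over the allowed positions; masked positions get weight 0."""
    m = max(x for x, ok in zip(row, allowed) if ok)
    exps = [math.exp(x - m) if ok else 0.0 for x, ok in zip(row, allowed)]
    s = sum(exps)
    return [e / s for e in exps]

def soft_attention(scores, sigma):
    """Blend causal and bidirectional attention; sigma=0 is purely causal."""
    n = len(scores)
    out = []
    for i, row in enumerate(scores):
        causal = softmax_masked(row, [j <= i for j in range(n)])
        bidir = softmax_masked(row, [True] * n)
        out.append([(1 - sigma) * c + sigma * b for c, b in zip(causal, bidir)])
    return out

# Toy scores for three images with equal affinity everywhere.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
blended = soft_attention(scores, sigma=0.5)
```

Because each row is a convex combination of two probability distributions, it still sums to one, and earlier images gain some visibility of later ones as `sigma` increases.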

[CV-125] Using 3D reconstruction from image motion to predict total leaf area in dwarf tomato plants

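Quick read: This paper addresses the difficulty of estimating total leaf area (TLA) accurately and non-destructively for plants with complex canopies, such as dwarf tomatoes. Traditional methods are often labor-intensive, damaging to plants, or unable to capture canopy complexity. The paper evaluates a non-destructive method that combines 3D reconstruction from RGB image sequences with machine learning. The best configuration builds meshes from the point clouds with the Alpha Shape algorithm (α = 3) and predicts TLA with an Extreme Gradient Boosting regression model, reaching R² = 0.80 and MAE = 489 cm²; cross-experiment validation across seasons remains robust (R² = 0.56, MAE = 579 cm²). The method is automated and scalable, with potential applications in urban farming and precision agriculture.

Link: https://arxiv.org/abs/2503.13778
Authors: Dmitrii Usenko,David Helman,Chen Giladi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 24 pages, 11 figures, submitted to Computers and Electronics in Agriculture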

Abstract:Accurate estimation of total leaf area (TLA) is crucial for evaluating plant growth, photosynthetic activity, and transpiration. However, it remains challenging for bushy plants like dwarf tomatoes due to their complex canopies. Traditional methods are often labor-intensive, damaging to plants, or limited in capturing canopy complexity. This study evaluated a non-destructive method combining sequential 3D reconstructions from RGB images and machine learning to estimate TLA for three dwarf tomato cultivars (Mohamed, Hahms Gelbe Topftomate, and Red Robin) grown under controlled greenhouse conditions. Two experiments (spring-summer and autumn-winter) included 73 plants, yielding 418 TLA measurements via an “onion” approach. High-resolution videos were recorded, and 500 frames per plant were used for 3D reconstruction. Point clouds were processed using four algorithms (Alpha Shape, Marching Cubes, Poisson’s, Ball Pivoting), and meshes were evaluated with seven regression models: Multivariable Linear Regression, Lasso Regression, Ridge Regression, Elastic Net Regression, Random Forest, Extreme Gradient Boosting, and Multilayer Perceptron. The Alpha Shape reconstruction (α = 3) with Extreme Gradient Boosting achieved the best performance (R² = 0.80, MAE = 489 cm²). Cross-experiment validation showed robust results (R² = 0.56, MAE = 579 cm²). Feature importance analysis identified height, width, and surface area as key predictors. This scalable, automated TLA estimation method is suited for urban farming and precision agriculture, offering applications in automated pruning, resource efficiency, and sustainable food production. The approach demonstrated robustness across variable environmental conditions and canopy structures.
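For readers unfamiliar with the two metrics quoted above, the sketch below computes the coefficient of determination (R²) and the mean absolute error (MAE) from paired measurements and predictions. The leaf-area values are made up for illustration and are not taken from the study.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the measurements."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual variation / total variation."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up total leaf areas in cm^2 (measured vs. predicted).
measured = [1000.0, 2000.0, 3000.0]
predicted = [1100.0, 1900.0, 3200.0]
error = mae(measured, predicted)
fit = r_squared(measured, predicted)
```

An R² of 0.80 therefore means the model explains about 80% of the variation in measured leaf area, while the MAE states the typical error directly in cm².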

[CV-126] 8-Calves Image dataset

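[Quick read]: This paper builds a benchmark dataset, 8-Calves, for evaluating object detection and identity classification in occlusion-rich, temporally consistent environments. The dataset consists of a continuous one-hour video (67,760 frames) of eight Holstein Friesian calves, plus static frames for the detection task. For object detection, 28 models (25 YOLO variants and 3 transformers) were fine-tuned; for identity classification, embeddings from 23 pretrained vision models (ResNet, ConvNextV2, ViTs, and others) were evaluated with linear classifiers and KNN. Smaller YOLO models outperformed larger ones, and modern architectures such as ConvNextV2 excelled at identity classification. The paper also highlights the value of minimal, targeted data augmentation and of effective pretraining strategies. A further contribution is the length of the sequence (1 hour versus 10 minutes in prior benchmarks), which makes the dataset a practical tool for stress-testing occlusion handling, temporal consistency, and efficiency.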

Link: https://arxiv.org/abs/2503.13777
Authors: Xuyang Fang, Sion Hannuna, Neill Campbell
Affiliations: University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures

Abstract:We introduce the 8-Calves dataset, a benchmark for evaluating object detection and identity classification in occlusion-rich, temporally consistent environments. The dataset comprises a 1-hour video (67,760 frames) of eight Holstein Friesian calves in a barn, with ground truth bounding boxes and identities, alongside 900 static frames for detection tasks. Each calf exhibits a unique coat pattern, enabling precise identity distinction. For cow detection, we fine-tuned 28 models (25 YOLO variants, 3 transformers) on 600 frames, testing on the full video. Results reveal smaller YOLO models (e.g., YOLOv9c) outperform larger counterparts despite potential bias from a YOLOv8m-based labeling pipeline. For identity classification, embeddings from 23 pretrained vision models (ResNet, ConvNextV2, ViTs) were evaluated via linear classifiers and KNN. Modern architectures like ConvNextV2 excelled, while larger models frequently overfit, highlighting inefficiencies in scaling. Key findings include: (1) Minimal, targeted augmentations (e.g., rotation) outperform complex strategies on simpler datasets; (2) Pretraining strategies (e.g., BEiT, DinoV2) significantly boost identity recognition; (3) Temporal continuity and natural motion patterns offer unique challenges absent in synthetic or domain-specific benchmarks. The dataset’s controlled design and extended sequences (1 hour vs. prior 10-minute benchmarks) make it a pragmatic tool for stress-testing occlusion handling, temporal consistency, and efficiency. The link to the dataset is this https URL.
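The KNN evaluation of frozen embeddings mentioned in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's protocol: the abstract does not specify the distance metric or the value of k, so cosine similarity and k = 3 are assumptions here, and the two-dimensional "embeddings" and the labels `calf_1` and `calf_2` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, gallery, labels, k=3):
    """Majority vote among the k gallery embeddings most similar to the query."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine_similarity(query, gallery[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical embeddings from a frozen pretrained backbone (assumed values).
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["calf_1", "calf_1", "calf_2", "calf_2"]

print(knn_predict([0.8, 0.2], gallery, labels, k=3))
```

Because the backbone stays frozen, this kind of evaluation measures the quality of the pretrained representation rather than the capacity of the classifier placed on top of it.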

[CV-127] Continual Unlearning for Foundational Text-to-Image Models without Generalization Erosion

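[Quick read]: This paper addresses how to remove selected concepts from pre-trained generative foundation models without extensive retraining. It proposes a new paradigm called continual unlearning and the Decremental Unlearning without Generalization Erosion (DUGE) algorithm as its core solution. DUGE relies on three losses: a cross-attention loss that steers the model toward images without the target concept; a prior-preservation loss that protects knowledge related to non-target concepts; and a regularization loss that prevents generalization from eroding as concepts are removed. Experiments show that the method removes specific concepts in a targeted way while preserving the model's overall performance and integrity. This offers a practical means of refining generative models, lowering the risks of copyright infringement, unauthorized use of material, and replication of artistic styles, while leaving non-target concepts unaffected and the model's core capabilities intact.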

Link: https://arxiv.org/abs/2503.13769
Authors: Kartik Thakral, Tamar Glaser, Tal Hassner, Mayank Vatsa, Richa Singh
Affiliations: IIT Jodhpur; Harman International; Weir AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under submission in T-PAMI

Abstract:How can we effectively unlearn selected concepts from pre-trained generative foundation models without resorting to extensive retraining? This research introduces 'continual unlearning', a novel paradigm that enables the targeted removal of multiple specific concepts from foundational generative models, incrementally. We propose the Decremental Unlearning without Generalization Erosion (DUGE) algorithm, which selectively unlearns the generation of undesired concepts while preserving the generation of related, non-targeted concepts and alleviating generalization erosion. For this, DUGE targets three losses: a cross-attention loss that steers the focus towards images devoid of the target concept; a prior-preservation loss that safeguards knowledge related to non-target concepts; and a regularization loss that prevents the model from suffering from generalization erosion. Experimental results demonstrate the ability of the proposed approach to exclude certain concepts without compromising the overall integrity and performance of the model. This offers a pragmatic solution for refining generative models, adeptly handling the intricacies of model training and concept management, and lowering the risks of copyright infringement, personal or licensed material misuse, and replication of distinctive artistic styles. Importantly, it maintains the non-targeted concepts, thereby safeguarding the model’s core capabilities and effectiveness.

[CV-128] Fast alignment of heterogeneous images in sliced Wasserstein distance

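[Quick read]: This paper addresses the alignment of heterogeneous images in computer vision, that is, the precise registration of images that are similar but not identical. Traditional methods often lack either efficiency or robustness on this problem. The paper proposes a fast algorithm based on optimal transport that combines the speed of fast Fourier methods with the robustness of sliced probability metrics. Using the sliced 2-Wasserstein distance, the algorithm computes the alignment of L \times L images in O(L^2 \log L) operations and is robust to translations, rotations, and deformations in the images.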

Link: https://arxiv.org/abs/2503.13756
Authors: Yunpeng Shi, Amit Singer, Eric J. Verbeke
Affiliations: Department of Mathematics, University of California at Davis, USA; Department of Mathematics, Princeton University, USA; Program in Applied and Computational Mathematics, Princeton University, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments:

Abstract:Many applications of computer vision rely on the alignment of similar but non-identical images. We present a fast algorithm for aligning heterogeneous images based on optimal transport. Our approach combines the speed of fast Fourier methods with the robustness of sliced probability metrics and allows us to efficiently compute the alignment between two L \times L images using the sliced 2-Wasserstein distance in O(L^2 \log L) operations. We show that our method is robust to translations, rotations and deformations in the images.
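To make the sliced 2-Wasserstein distance concrete, the sketch below computes it for two small 2D point sets. It is only an illustration of the metric, under simplifying assumptions: it works on equal-sized point sets rather than images, projects the points directly onto a handful of directions, and does not use the fast Fourier machinery of the paper, so it does not reach the O(L^2 \log L) cost. All function names are hypothetical.

```python
import math


def wasserstein2_squared_1d(xs, ys):
    """Squared 2-Wasserstein distance between two equal-sized 1D samples.

    In one dimension the optimal transport plan matches sorted values.
    """
    xs, ys = sorted(xs), sorted(ys)
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sliced_wasserstein2_squared(points_a, points_b, num_angles=16):
    """Average the 1D squared distance over evenly spaced projection angles."""
    total = 0.0
    for k in range(num_angles):
        theta = math.pi * k / num_angles
        c, s = math.cos(theta), math.sin(theta)
        proj_a = [c * x + s * y for x, y in points_a]
        proj_b = [c * x + s * y for x, y in points_b]
        total += wasserstein2_squared_1d(proj_a, proj_b)
    return total / num_angles


# A toy shape and a translated copy of it (shift chosen for illustration).
shape = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
shifted = [(x + 3.0, y + 4.0) for x, y in shape]

print(sliced_wasserstein2_squared(shape, shape))
print(sliced_wasserstein2_squared(shape, shifted))
```

For a pure translation by a vector t, each projection shifts by the component of t along that direction, so the averaged value equals |t|^2 / 2 (12.5 for the shift above). An alignment procedure would search for the transformation that minimizes this quantity.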

[CV-129] FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

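[Quick read]: This paper addresses the privacy concerns raised by the centralized nature of deep learning-based video super-resolution (VSR), particularly in settings with strict privacy requirements. Existing federated learning (FL) methods also perform poorly on low-level vision tasks, which leads to suboptimal reconstructions. To address both issues, the paper proposes FedVSR, an architecture-independent and stateless FL framework. Its key element is a lightweight loss term that improves local optimization and guides global aggregation at low computational overhead. Experiments show that FedVSR improves PSNR by 0.85 dB on average over general FL methods.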

Link: https://arxiv.org/abs/2503.13745
Authors: Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour
Affiliations: University of Calgary; Userful Corporation; University of California, Irvine
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Video Super-Resolution (VSR) reconstructs high-resolution videos from low-resolution inputs to restore fine details and improve visual clarity. While deep learning-based VSR methods achieve impressive results, their centralized nature raises serious privacy concerns, particularly in applications with strict privacy requirements. Federated Learning (FL) offers an alternative approach, but existing FL methods struggle with low-level vision tasks, leading to suboptimal reconstructions. To address this, we propose FedVSR, a novel, architecture-independent, and stateless FL framework for VSR. Our approach introduces a lightweight loss term that improves local optimization and guides global aggregation with minimal computational overhead. To the best of our knowledge, this is the first attempt at federated VSR. Extensive experiments show that FedVSR outperforms general FL methods by an average of 0.85 dB in PSNR, highlighting its effectiveness. The code is available at: this https URL
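Since the headline result is stated in dB of PSNR, a short sketch of how PSNR is computed may help put the 0.85 dB figure in context. This uses the standard definition, PSNR = 10 log10(MAX^2 / MSE); the function name, the flat pixel lists, and the sample values are hypothetical and are not taken from the FedVSR code.

```python
import math


def psnr(reference, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return float("inf")  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel intensities in [0, 1] (illustrative values only).
reference = [0.5, 0.5, 0.5, 0.5]
reconstructed = [0.6, 0.4, 0.6, 0.4]

print(round(psnr(reference, reconstructed), 2))
```

Because the scale is logarithmic, a gain of 0.85 dB at a fixed peak value corresponds to the mean squared error shrinking by a factor of 10^(-0.085), which is roughly an 18% reduction.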

[CV-130] MonoCT: Overcoming Monocular 3D Detection Domain Shift with Consistent Teacher Models ICRA2025

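[Quick read]: This paper addresses domain adaptation for monocular 3D object detection across different sensors, environments, and camera setups. Its solution is MonoCT, a novel unsupervised domain adaptation method that generates highly accurate pseudo labels for self-supervision. To mitigate domain shift, MonoCT introduces a Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. By exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy, it also introduces a Pseudo Label Scoring (PLS) module that yields high-quality pseudo labels for self-training. Experiments on six benchmarks show that MonoCT outperforms existing state-of-the-art domain adaptation methods by large margins (at least about 21% for AP Mod.) and generalizes well to car, traffic camera, and drone views.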

Link: https://arxiv.org/abs/2503.13743
Authors: Johannes Meier, Louis Inchingolo, Oussema Dhaouadi, Yan Xia, Jacques Kaiser, Daniel Cremers
Affiliations: DeepScenario; TU Munich; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICRA2025

Abstract:We tackle the problem of monocular 3D object detection across different sensors, environments, and camera setups. In this paper, we introduce a novel unsupervised domain adaptation approach, MonoCT, that generates highly accurate pseudo labels for self-supervision. Inspired by our observation that accurate depth estimation is critical to mitigating domain shifts, MonoCT introduces a novel Generalized Depth Enhancement (GDE) module with an ensemble concept to improve depth estimation accuracy. Moreover, we introduce a novel Pseudo Label Scoring (PLS) module by exploring inner-model consistency measurement and a Diversity Maximization (DM) strategy to further generate high-quality pseudo labels for self-training. Extensive experiments on six benchmarks show that MonoCT outperforms existing SOTA domain adaptation methods by large margins (~21% minimum for AP Mod.) and generalizes well to car, traffic camera and drone views.

[CV-131] C2D-ISR: Optimizing Attention-based Image Super-resolution from Continuous to Discrete Scales

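[Quick read]: This paper addresses the fact that existing attention-based single image super-resolution (SISR) methods are limited by simple training strategies and by network architectures designed for discrete up-sampling scales, which hinders their ability to capture multi-scale information. It proposes C2D-ISR, a framework that optimizes attention-based image super-resolution models from both the performance and the complexity perspective. The approach rests on a two-stage training methodology and a hierarchical encoding mechanism: discrete-scale models are trained at continuous scales to learn inter-scale correlations and multi-scale feature representations, and the hierarchical encoding is combined with existing attention-based network structures to improve spatial feature fusion and cross-scale information aggregation while substantially speeding up inference.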

Link: https://arxiv.org/abs/2503.13740
Authors: Yuxuan Jiang, Chengxi Zeng, Siyue Teng, Fan Zhang, Xiaoqing Zhu, Joel Sole, David Bull
Affiliations: Visual Information Laboratory, University of Bristol; Netflix Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advancements are still limited by the reliance on simple training strategies and network architectures designed for discrete up-sampling scales, which hinder the model’s ability to effectively capture information across multiple scales. To address these limitations, we propose a novel framework, C2D-ISR, for optimizing attention-based image super-resolution models from both performance and complexity perspectives. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology involves continuous-scale training for discrete scale models, enabling the learning of inter-scale correlations and multi-scale feature representation. In addition, we generalize the hierarchical encoding mechanism with existing attention-based network structures, which can achieve improved spatial feature fusion, cross-scale information aggregation, and more importantly, much faster inference. We have evaluated the C2D-ISR framework based on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the other existing optimization framework, HiT, in terms of super-resolution performance (up to 0.2dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at this http URL.

[CV-132] Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes CVPR2025

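[Quick read]: This paper addresses multi-view person association, a fundamental problem, in challenging scenes where persons look alike and conventional approaches based on person re-identification features become unreliable. For a more robust association, it proposes Self-MVA (Self-Supervised Uncalibrated Multi-View Person Association), which uses no annotations. The key idea is to use cross-view image synchronization as a self-supervised pretext task: the model encodes each person's unified geometric and appearance features, and Hungarian matching bridges the gap between instance-wise and image-wise distances so that the association can be learned without labels. Two self-supervised linear constraints, multi-view re-projection and pairwise edge association, further reduce the solution space, and the method achieves state-of-the-art results on several public benchmark datasets.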

Link: https://arxiv.org/abs/2503.13739
Authors: Keqi Chen, Vinkle Srivastav, Didier Mutter, Nicolas Padoy
Affiliations: University of Strasbourg, CNRS, INSERM, ICube, UMR7357; IHU Strasbourg; University Hospital of Strasbourg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for CVPR 2025. Code: this https URL

Abstract:Multi-view person association is a fundamental step towards multi-view analysis of human activities. Although the person re-identification features have been proven effective, they become unreliable in challenging scenes where persons share similar appearances. Therefore, cross-view geometric constraints are required for a more robust association. However, most existing approaches are either fully-supervised using ground-truth identity labels or require calibrated camera parameters that are hard to obtain. In this work, we investigate the potential of learning from synchronization, and propose a self-supervised uncalibrated multi-view person association approach, Self-MVA, without using any annotations. Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. The model encodes each person’s unified geometric and appearance features, and we train it by utilizing synchronization labels for supervision after applying Hungarian matching to bridge the gap between instance-wise and image-wise distances. To further reduce the solution space, we propose two types of self-supervised linear constraints: multi-view re-projection and pairwise edge association. Extensive experiments on three challenging public benchmark datasets (WILDTRACK, MVOR, and SOLDIERS) show that our approach achieves state-of-the-art results, surpassing existing unsupervised and fully-supervised approaches. Code is available at this https URL.
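The step that turns instance-wise distances into a single image-wise distance can be illustrated with a small assignment problem. The sketch below is an assumption-laden toy, not the Self-MVA implementation: it solves the assignment by brute force over permutations (practical only for a handful of persons, where a real system would use a polynomial-time Hungarian solver such as `scipy.optimize.linear_sum_assignment`), it assumes a square cost matrix, and it takes the image-wise distance to be the mean matched cost, which the abstract does not spell out.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Return (assignment, total_cost) minimizing the sum of matched costs.

    cost[i][j] is the distance between person i in view A and person j in
    view B. Brute force over permutations, so only suitable for small n.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


def image_wise_distance(cost):
    """Mean matched instance distance (an assumed aggregation rule)."""
    assignment, total = min_cost_assignment(cost)
    return total / len(assignment)


# Hypothetical pairwise feature distances between three persons in two views.
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.9, 0.3],
]

assignment, total = min_cost_assignment(cost)
print(assignment, round(total, 3), round(image_wise_distance(cost), 3))
```

Under this reading, two synchronized images should yield a small image-wise distance because every person finds a close match in the other view, whereas unsynchronized images should not, which is what lets a synchronization label supervise the instance-level features.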

[CV-133] Towards Scalable Modeling of Compressed Videos for Efficient Action Recognition

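[Quick read]: This paper addresses the computational challenges of deep video representation learning caused by heavy decoding overhead, the enormous size of raw video streams, and high temporal redundancy. Unlike existing methods that focus on a single modality or treat the task as a naive multi-modality problem, it proposes an efficient framework that operates in the compressed domain and exploits all available modalities of I-frames and P-frames (motion vectors and residuals) to build stronger shared video representations. The solution is a hybrid end-to-end framework that substantially reduces inference cost through three components: first, a dual-encoder scheme with efficient Spiking Temporal Modulators that reduces latency while retaining cross-domain feature aggregation; second, a unified transformer model that uses global self-attention to strengthen contextual interactions between I-frames and P-frames; third, a Multi-Modal Mixer Block that models rich representations from the joint spatiotemporal token embeddings. Experiments show that the method yields a lightweight architecture that reaches state-of-the-art video recognition performance on several benchmark datasets at low cost and with fast inference.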

Link: https://arxiv.org/abs/2503.13724
Authors: Shristi Das Biswas, Efstathia Soufleri, Arani Roy, Kaushik Roy
Affiliations: Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Training robust deep video representations has proven to be computationally challenging due to substantial decoding overheads, the enormous size of raw video streams, and their inherent high temporal redundancy. Different from existing schemes, operating exclusively in the compressed video domain and exploiting all freely available modalities, i.e., I-frames, and P-frames (motion vectors and residuals) offers a compute-efficient alternative. Existing methods approach this task as a naive multi-modality problem, ignoring the temporal correlation and implicit sparsity across P-frames for modeling stronger shared representations for videos of the same action, making training and generalization easier. By revisiting the high-level design of dominant video understanding backbones, we increase inference speed by a factor of 56 while retaining similar performance. For this, we propose a hybrid end-to-end framework that factorizes learning across three key concepts to reduce inference cost by 330× versus prior art: First, a specially designed dual-encoder scheme with efficient Spiking Temporal Modulators to minimize latency while retaining cross-domain feature aggregation. Second, a unified transformer model to capture inter-modal dependencies using global self-attention to enhance I-frame – P-frame contextual interactions. Third, a Multi-Modal Mixer Block to model rich representations from the joint spatiotemporal token embeddings. Experiments show that our method results in a lightweight architecture achieving state-of-the-art video recognition performance on UCF-101, HMDB-51, K-400, K-600 and SS-v2 datasets with favorable costs (0.73 J/V) and fast inference (16 V/s). Our observations bring new insights into practical design choices for efficient next-generation spatiotemporal learners. Code is available.

[CV-134] SED-MVS: Segmentation-Driven and Edge-Aligned Deformation Multi-View Stereo with Depth Restoration and Occlusion Constraint

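[Quick read]: This paper addresses a weakness of existing patch-deformation methods for multi-view stereo: deformable patches effectively enlarge the receptive field in textureless areas, but the methods neglect the deformation instability caused by easily overlooked edge-skipping, which can lead to matching distortions. The paper proposes SED-MVS, which uses panoptic segmentation and a multi-trajectory diffusion strategy for segmentation-driven and edge-aligned patch deformation. Specifically, SAM2 is first used for panoptic segmentation to provide depth-edge guidance for patch deformation, and a multi-trajectory diffusion strategy then ensures that patches are comprehensively aligned with depth edges. To avoid the inaccuracy that random initialization may introduce, sparse points from LoFTR are combined with monocular depth maps from DepthAnything V2 to restore a reliable, realistic depth map for initialization and supervised guidance. Finally, the segmentation image is combined with the monocular depth map to exploit inter-instance occlusion relationships, which are then treated as an occlusion map to apply two distinct edge constraints, facilitating occlusion-aware patch deformation. Extensive experiments validate the method's state-of-the-art performance and robust generalization on the ETH3D, Tanks Temples, BlendedMVS, and Strecha datasets.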

Link: https://arxiv.org/abs/2503.13721
Authors: Zhenlong Yuan, Zhidong Yang, Yujun Cai, Kuangxin Wu, Mufan Liu, Dapeng Zhang, Hao Jiang, Zhaoxin Li, Zhaoqi Wang
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; School of Electrical Engineering and Computer Science, The University of Queensland (UQ), Australia; Information Technology Department, Hunan Police Academy, Changsha 410100, China; Cooperative MediaNet Innovation Center, Shanghai Jiao Tong University, Shanghai, 200240, China; DSLAB, School of Information Science & Engineering, Lanzhou University, 730000, China; Agricultural Information Institute, Chinese Academy of Agricultural Sciences and Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, 100081, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, patch-deformation methods have exhibited significant effectiveness in multi-view stereo owing to the deformable and expandable patches in reconstructing textureless areas. However, such methods primarily emphasize broadening the receptive field in textureless areas, while neglecting deformation instability caused by easily overlooked edge-skipping, potentially leading to matching distortions. To address this, we propose SED-MVS, which adopts panoptic segmentation and a multi-trajectory diffusion strategy for segmentation-driven and edge-aligned patch deformation. Specifically, to prevent unanticipated edge-skipping, we first employ SAM2 for panoptic segmentation as depth-edge guidance to guide patch deformation, followed by a multi-trajectory diffusion strategy to ensure patches are comprehensively aligned with depth edges. Moreover, to avoid potential inaccuracy of random initialization, we combine both sparse points from LoFTR and the monocular depth map from DepthAnything V2 to restore a reliable and realistic depth map for initialization and supervised guidance. Finally, we integrate the segmentation image with the monocular depth map to exploit inter-instance occlusion relationships, then further regard them as an occlusion map to implement two distinct edge constraints, thereby facilitating occlusion-aware patch deformation. Extensive results on ETH3D, Tanks Temples, BlendedMVS and Strecha datasets validate the state-of-the-art performance and robust generalization capability of our proposed method.

[CV-135] Improving Geometric Consistency for 360-Degree Neural Radiance Fields in Indoor Scenarios

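[Quick read]: This paper addresses the difficulty Neural Radiance Fields (NeRFs) have with large low-textured areas, in particular the cloudy artifacts known as floaters that featureless architectural surfaces (such as walls, ceilings, and floors) cause in indoor environments and that markedly reduce scene realism. The paper proposes an efficient and robust method for computing dense depth priors, tailored to large low-textured architectural surfaces in indoor environments. The solution introduces a new depth loss function to improve rendering quality in low-feature regions, and a complementary depth-patch regularization further refines depth consistency in other areas. Experiments on two synthetic 360-degree indoor scenes show clearly improved visual fidelity compared with standard photometric loss and mean squared error depth supervision.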

Link: https://arxiv.org/abs/2503.13710
Authors: Iryna Repinetska, Anna Hilsmann, Peter Eisert
Affiliations: Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute; Department of Computer Science, Humboldt University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as "floaters" that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree "inside-out" views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.

[CV-136] Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory

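[Quick read]: This paper addresses the efficiency bottleneck of long-form video understanding in practical applications (such as video retrieval, summarization, and question answering), where traditional methods demand substantial compute and are often limited by GPU memory. The proposed Long-Video Memory Network (Long-VMNet) uses a fixed-size memory representation to store discriminative patches sampled from the input video and a neural sampler to identify discriminative tokens. Long-VMNet also needs only a single scan through the video, which greatly improves efficiency. Experiments on the Rest-ADL dataset show an 18x to 75x speed-up in inference time while maintaining competitive predictive performance.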

Link: https://arxiv.org/abs/2503.13707
Authors: Saket Gurukar, Asim Kadav
Affiliations: Samsung Research America
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Long-form video understanding is essential for various applications such as video retrieval, summarizing, and question answering. Yet, traditional approaches demand substantial computing power and are often bottlenecked by GPU memory. To tackle this issue, we present Long-Video Memory Network, Long-VMNet, a novel video understanding method that employs a fixed-size memory representation to store discriminative patches sampled from the input video. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Additionally, Long-VMNet only needs one scan through the video, greatly boosting efficiency. Our results on the Rest-ADL dataset demonstrate an 18x – 75x improvement in inference times for long-form video retrieval and answering questions, with a competitive predictive performance.
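The idea of a fixed-size memory filled in a single pass over the video can be sketched with a bounded min-heap. This is a simplified stand-in, not Long-VMNet itself: the paper uses a learned neural sampler, whereas the sketch assumes that a relevance score for each token is already given, and the class name `FixedMemory` and the token labels are hypothetical.

```python
import heapq


class FixedMemory:
    """Keep the `capacity` highest-scoring tokens seen in a single pass."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, arrival_index, token)

    def offer(self, score, index, token):
        """Insert a token, evicting the lowest-scoring one when full."""
        item = (score, index, token)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)

    def contents(self):
        """Stored tokens in their original temporal order."""
        return [token for _, _, token in sorted(self._heap, key=lambda it: it[1])]


# Hypothetical per-token scores, standing in for a sampler's output.
scores = [0.2, 0.9, 0.1, 0.7, 0.5]
memory = FixedMemory(capacity=3)
for index, score in enumerate(scores):
    memory.offer(score, index, f"patch_{index}")

print(memory.contents())
```

Memory use stays constant regardless of video length, and each token is inspected once, which matches the single-scan property described in the abstract; downstream queries would then read only from the memory rather than from the full video.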

[CV-137] Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds

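[Quick read]: This paper addresses three main problems of existing methods for audio-visual event perception: first, the limited vocabulary of the training data makes it hard for models to generalize to novel, unseen event categories; second, annotation is time-consuming and laborious, which limits scalability; third, existing models do not account for shifts in event distributions over time and typically rely on late fusion, which loses multimodal interactions. To address these problems, the paper proposes Audio-Visual Adaptive Video Analysis (AV^2A), a model-agnostic approach. Its key elements are a score-level fusion technique that retains richer multimodal interactions and a within-video label shift algorithm that uses the input video data and the predictions for prior frames to dynamically adjust the event distributions for subsequent frames. AV^2A also provides the first training-free, open-vocabulary baseline and substantially improves performance in zero-shot and weakly supervised settings.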

Link: https://arxiv.org/abs/2503.13693
Authors: Eitan Shaar, Ariel Shaulov, Gal Chechik, Lior Wolf
Affiliations: Bar-Ilan University; Tel-Aviv University; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis (AV^2A), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. AV^2A also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that AV^2A achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of AV^2A on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.

[CV-138] FiVE: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models

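[Quick read]: This paper addresses the lack of a standardized benchmark for fair evaluation in text-to-video (T2V) editing, as well as the difficulty existing methods have in fine-grained video editing, where they are sensitive to hyperparameters and struggle to maintain context and temporal consistency. The paper introduces FiVE (Fine-grained Video Editing Benchmark), a benchmark dataset for evaluating emerging diffusion and rectified flow models that covers multiple editing types with corresponding masks. It also adapts the latest rectified flow models, Pyramid-Flow and Wan2.1, into the training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit. Experiments on FiVE show that rectified flow-based methods outperform diffusion-based ones, and the paper proposes FiVE-Acc, a new metric that uses vision-language models to strengthen object-level evaluation.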

Link: https://arxiv.org/abs/2503.13684
Authors: Minghan Li, Chenxi Xie, Yichen Wu, Lei Zhang, Mengyu Wang
Affiliations: Harvard AI and Robotics Lab, Harvard University; Broad Institute; Hong Kong Polytechnic University; School of Engineering and Applied Sciences, Harvard University; City University of Hong Kong; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 14 figures, 16 tables

Abstract:Numerous text-to-video (T2V) editing methods have emerged recently, but the lack of a standardized benchmark for fair evaluation has led to inconsistent claims and an inability to assess model sensitivity to hyperparameters. Fine-grained video editing is crucial for enabling precise, object-level modifications while maintaining context and temporal consistency. To address this, we introduce FiVE, a Fine-grained Video Editing Benchmark for evaluating emerging diffusion and rectified flow models. Our benchmark includes 74 real-world videos and 26 generated videos, featuring 6 fine-grained editing types, 420 object-level editing prompt pairs, and their corresponding masks. Additionally, we adapt the latest rectified flow (RF) T2V generation models, Pyramid-Flow and Wan2.1, by introducing FlowEdit, resulting in training-free and inversion-free video editing models Pyramid-Edit and Wan-Edit. We evaluate five diffusion-based and two RF-based editing methods on our FiVE benchmark using 15 metrics, covering background preservation, text-video similarity, temporal consistency, video quality, and runtime. To further enhance object-level evaluation, we introduce FiVE-Acc, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. Experimental results demonstrate that RF-based editing significantly outperforms diffusion-based methods, with Wan-Edit achieving the best overall performance and exhibiting the least sensitivity to hyperparameters. More video demos are available on the anonymous website: this https URL

[CV-139] Web Artifact Attacks Disrupt Vision Language Models

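[Quick read]: This paper addresses the lack of robustness in vision-language models (VLMs) caused by unintended correlations in their training data. Prior work has used these correlations as an attack vector, manipulating model predictions by inserting text that exactly matches the target class into the image, an attack that exploits the models' text-heavy bias. Such attacks consider only matching text and overlook a broader range of correlations, such as non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To fill this gap, the paper introduces artifact-based attacks, which mislead models with non-matching text and graphical elements; because these artifacts are not predefined, they are harder to defend against. The paper frames artifact attacks as a search problem and demonstrates effective attacks on five datasets, with some artifact combinations reaching a 100% success rate and transfer across models reaching up to 90% effectiveness. As a defense, it extends prior work's artifact-aware prompting to the graphical setting, which lowers success rates by up to 15% relative to standard prompts and suggests a promising direction for improving robustness.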

Link: https://arxiv.org/abs/2503.13652
Authors: Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer
Affiliations: Boston University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models (VLMs) (e.g., CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a typographic attack. These attacks succeed due to VLMs’ text-heavy bias, a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them harder to defend against but also more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work’s artifact-aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness.

[CV-140] Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos CVPR2025

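[Quick read]: This paper addresses a problem in current egocentric video question-answering datasets: many questions can be answered from a few frames or by commonsense reasoning, without relying on the actual video content, so existing benchmarks cannot adequately assess a model's understanding of temporal dynamics. The paper shows that the high performance of Multi-Modal Large Language Models (MLLMs) on these benchmarks often relies only on text or a single frame as input rather than on genuine temporal information from the video. To address this limitation, it introduces the EgoTempo dataset, designed around tasks that require integrating information across the entire video, so that models must rely on temporal patterns rather than static cues or pre-existing knowledge. By introducing tasks that emphasize temporal understanding, EgoTempo aims to encourage further research and models that better capture the complexity of temporal dynamics.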

Link: https://arxiv.org/abs/2503.13646
Authors: Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, Federico Tombari
Affiliations: Google; Politecnico di Torino
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025. Dataset and code are available at this https URL

Abstract:Understanding fine-grained temporal dynamics is crucial in egocentric videos, where continuous streams capture frequent, close-up interactions with objects. In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only few frames or commonsense reasoning, without being necessarily grounded in the actual video. Our analysis shows that state-of-the-art Multi-Modal Large Language Models (MLLMs) on these benchmarks achieve remarkably high performance using just text or a single frame as input. To address these limitations, we introduce EgoTempo, a dataset specifically designed to evaluate temporal understanding in the egocentric domain. EgoTempo emphasizes tasks that require integrating information across the entire video, ensuring that models would need to rely on temporal patterns rather than static cues or pre-existing knowledge. Extensive experiments on EgoTempo show that current MLLMs still fall short in temporal reasoning on egocentric videos, and thus we hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics. Dataset and code are available at this https URL.

[CV-141] A Convex formulation for linear discriminant analysis

Link: https://arxiv.org/abs/2503.13623
Authors: Sai Vijay Kumar Surineela,Prathyusha Kanakamalla,Harigovind Harikumar,Tomojit Ghosh
Affiliations: Wright State University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Total pages 29 including references, six figures, seven tables. Submitted to an Elsevier journal

[CV-142] Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization

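[Quick read]: This paper addresses the Single Domain Generalization (SDG) task, in which a model trained on data from a single source domain degrades when facing diverse target-domain scenarios. It focuses on how to use synthetic data effectively to strengthen generalization while avoiding the performance loss caused by the mismatch between synthetic and real target-domain feature distributions. The key solution is a training framework called Discriminative Domain Reassembly and Soft-Fusion (DRSF), which mitigates distribution shift with two modules. First, the Discriminative Feature Decoupling and Reassembly (DFDR) module uses entropy-guided attention to recalibrate channel-level features, suppressing synthetic noise while preserving semantic consistency. Second, the Multi-pseudo-domain Soft Fusion (MDSF) module uses adversarial training with latent-space feature interpolation to create smooth feature transitions across domains. These designs let DRSF improve performance substantially at low computational overhead, and it can be integrated seamlessly with unsupervised domain adaptation paradigms, making it applicable to diverse real-world scenarios.

Link: https://arxiv.org/abs/2503.13617
Authors: Hao Li,Yubin Xiao,Ke Liang,Mengzhu Wang,Long Lan,Kenli Li,Xinwang Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 10 figures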

Abstract:Single Domain Generalization (SDG) aims to train models with consistent performance across diverse scenarios using data from a single source. While latent diffusion models (LDMs) show promise in augmenting limited source data, we demonstrate that directly using synthetic data can be detrimental due to significant feature distribution discrepancies between synthetic and real target domains, leading to performance degradation. To address this issue, we propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization. We employ LDMs to produce diverse pseudo-target domain samples and introduce two key modules to handle distribution bias. First, the Discriminative Feature Decoupling and Reassembly (DFDR) module uses entropy-guided attention to recalibrate channel-level features, suppressing synthetic noise while preserving semantic consistency. Second, the Multi-pseudo-domain Soft Fusion (MDSF) module uses adversarial training with latent-space feature interpolation, creating continuous feature transitions between domains. Extensive SDG experiments on object detection and semantic segmentation tasks demonstrate that DRSF achieves substantial performance gains with only marginal computational overhead. Notably, DRSF’s plug-and-play architecture enables seamless integration with unsupervised domain adaptation paradigms, underscoring its broad applicability in addressing diverse and real-world domain challenges.

[CV-143] Next-Scale Autoregressive Models are Zero-Shot Single-Image Object View Synthesizers

Link: https://arxiv.org/abs/2503.13588
Authors: Shiran Yuan,Hao Zhao
Affiliations: UC Berkeley; Tsinghua University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Full codebase, training set, and eval benchmark at this https URL

[CV-144] Seeing the Future Perceiving the Future: A Unified Driving World Model for Future Generation and Perception

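[Quick read]: This paper addresses the limitation that existing models for future scene generation and perception focus only on pixel-level prediction or on geometric reasoning, and proposes a unified world model called UniFuture. Its key is a Dual-Latent Sharing scheme that maps image and depth sequences into a shared latent space, combined with a Multi-scale Latent Interaction mechanism that refines image and depth features bidirectionally at multiple scales, ensuring that the predicted future scene is coherent in both appearance (RGB images) and geometry (depth). With this approach, UniFuture generates highly consistent future image-depth pairs from only the current image as input, and experiments on the nuScenes dataset confirm its strong performance on generation and perception tasks.

Link: https://arxiv.org/abs/2503.13587
Authors: Dingkang Liang,Dingyuan Zhang,Xin Zhou,Sifan Tu,Tianrui Feng,Xiaofan Li,Yumeng Zhang,Mingyang Du,Xiao Tan,Xiang Bai
Affiliations: Huazhong University of Science and Technology; Baidu Inc., China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The project page is at this https URL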

Abstract:We present UniFuture, a simple yet effective driving world model that seamlessly integrates future scene generation and perception within a single framework. Unlike existing models focusing solely on pixel-level future prediction or geometric reasoning, our approach jointly models future appearance (i.e., RGB image) and geometry (i.e., depth), ensuring coherent predictions. Specifically, during the training, we first introduce a Dual-Latent Sharing scheme, which transfers image and depth sequence in a shared latent space, allowing both modalities to benefit from shared feature learning. Additionally, we propose a Multi-scale Latent Interaction mechanism, which facilitates bidirectional refinement between image and depth features at multiple spatial scales, effectively enhancing geometry consistency and perceptual alignment. During testing, our UniFuture can easily predict high-consistency future image-depth pairs by only using the current image as input. Extensive experiments on the nuScenes dataset demonstrate that UniFuture outperforms specialized models on future generation and perception tasks, highlighting the advantages of a unified, structurally-aware world model. The project page is at this https URL.

[CV-145] A Comprehensive Survey on Visual Concept Mining in Text-to-image Diffusion Models

[Quick read]: This paper addresses the insufficient concept controllability of text-to-image diffusion models, which stems from the inherent limitations of textual signals. The key solution is Visual Concept Mining (VCM): using reference images to extract visual concept representations that complement the text input and so improve controllability. The authors categorize existing research into four key areas, Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination, analyze their foundational principles, and identify challenges and future research directions.

Link: https://arxiv.org/abs/2503.13576
Authors: Ziqiang Li,Jun Li,Lizhi Xiong,Zhangjie Fu,Zechao Li
Affiliations: School of Computer Science, Nanjing University of Information Science and Technology; School of Computer Science, Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Text-to-image diffusion models have made significant advancements in generating high-quality, diverse images from text prompts. However, the inherent limitations of textual signals often prevent these models from fully capturing specific concepts, thereby reducing their controllability. To address this issue, several approaches have incorporated personalization techniques, utilizing reference images to mine visual concept representations that complement textual inputs and enhance the controllability of text-to-image diffusion models. Despite these advances, a comprehensive, systematic exploration of visual concept mining remains limited. In this paper, we categorize existing research into four key areas: Concept Learning, Concept Erasing, Concept Decomposition, and Concept Combination. This classification provides valuable insights into the foundational principles of Visual Concept Mining (VCM) techniques. Additionally, we identify key challenges and propose future research directions to propel this important and interesting field forward.

[CV-146] Online Signature Verification based on the Lagrange formulation with 2D and 3D robotic models

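[Quick read]: This paper addresses the challenge of inferring the writer's arm pose, kinematics, and dynamics from digitizer data. The key to the solution is a new set of features based on the dynamics of online signatures. The features are derived through a Lagrangian formulation that yields sequences of generalized coordinates and torques for 2D and 3D robotic arm models, and kinematic and dynamic features are then combined. The features prove markedly effective for online automatic signature verification and reach state-of-the-art performance when integrated into deep learning models.

Link: https://arxiv.org/abs/2503.13573
Authors: Moises Diaz,Miguel A. Ferrer,Juan M. Gil,Rafael Rodriguez,Peirong Zhang,Lianwen Jin
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: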

Abstract:Online Signature Verification commonly relies on function-based features, such as time-sampled horizontal and vertical coordinates, as well as the pressure exerted by the writer, obtained through a digitizer. Although inferring additional information about the writer's arm pose, kinematics, and dynamics based on digitizer data can be useful, it constitutes a challenge. In this paper, we tackle this challenge by proposing a new set of features based on the dynamics of online signatures. These new features are inferred through a Lagrangian formulation, obtaining the sequences of generalized coordinates and torques for 2D and 3D robotic arm models. By combining kinematic and dynamic robotic features, our results demonstrate their significant effectiveness for online automatic signature verification and achieve state-of-the-art results when integrated into deep learning models.
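
To make "generalized coordinates for a 2D robotic arm model" concrete, here is a minimal sketch of how one digitizer sample (x, y) could be mapped to two joint angles of a planar two-link arm. This is textbook inverse kinematics, not the authors' model: the function names, the link lengths `l1` and `l2`, and the choice of elbow solution are all assumptions made for illustration.

```python
import math

def two_link_ik(x, y, l1=0.30, l2=0.35):
    """Joint angles (q1, q2) in radians that place the arm tip at (x, y).

    l1 and l2 are hypothetical link lengths in metres.
    """
    r2 = x * x + y * y
    c2 = (r2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= c2 <= 1.0:
        raise ValueError("point is outside the reachable workspace")
    q2 = math.acos(c2)  # picks one of the two possible elbow configurations
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

def two_link_fk(q1, q2, l1=0.30, l2=0.35):
    """Forward kinematics: tip position (x, y) for joint angles (q1, q2)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y
```

Applying `two_link_ik` to every time-sampled pen position would give a sequence of joint angles; the torques mentioned in the abstract would additionally require a dynamic model of the arm, which this sketch does not attempt.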

[CV-147] Context-aware Multimodal AI Reveals Hidden Pathways in Five Centuries of Art Evolution

[Quick read]: This paper asks how multimodal generative AI can be used to represent artworks in latent space, and in particular how well it can distinguish artistic periods, styles, and artists. The creative capabilities of generative AI have been widely studied, but its ability to mine latent semantic information from artworks has received less attention. The key solution is to use the generative model Stable Diffusion to extract two types of latent information, formal (e.g., colors) and contextual (e.g., subject), from a large collection of Western paintings; the analysis finds that contextual information distinguishes artistic characteristics more effectively than formal elements. In addition, by injecting historical context keywords, the generative experiment reproduces the evolutionary trajectory of art, underscoring the interaction between society and art and extending traditional form-based art analysis.

Link: https://arxiv.org/abs/2503.13531
Authors: Jin Kim,Byunghwee Lee,Taekho You,Jinhyuk Yun
Affiliations: Department of Intelligent Semiconductors, Soongsil University; Korea Advanced Institute of Science and Technology; Luddy School of Informatics, Computing, and Engineering, Indiana University; Institute for Social Data Science, Pohang University of Science and Technology; Center for Digital Humanities & Computational Social Sciences, Korea Advanced Institute of Science and Technology; School of AI Convergence, Soongsil University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 30 pages, 4 figures. Some example paintings are blurred to avoid potential copyright violations

Abstract:The rise of multimodal generative AI is transforming the intersection of technology and art, offering deeper insights into large-scale artwork. Although its creative capabilities have been widely explored, its potential to represent artwork in latent spaces remains underexamined. We use cutting-edge generative AI, specifically Stable Diffusion, to analyze 500 years of Western paintings by extracting two types of latent information with the model: formal aspects (e.g., colors) and contextual aspects (e.g., subject). Our findings reveal that contextual information differentiates between artistic periods, styles, and individual artists more successfully than formal elements. Additionally, using contextual keywords extracted from paintings, we show how artistic expression evolves alongside societal changes. Our generative experiment, infusing prospective contexts into historical artworks, successfully reproduces the evolutionary trajectory of artworks, highlighting the significance of mutual interaction between society and art. This study demonstrates how multimodal AI expands traditional formal analysis by integrating temporal, cultural, and historical contexts.

[CV-148] CoCMT: Communication-Efficient Cross-Modal Transformer for Collaborative Perception

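[Quick read]: This paper addresses the high communication bandwidth that multi-agent collaborative perception systems need because they transmit intermediate feature maps full of non-critical information. To improve communication efficiency while preserving perception capability, it proposes CoCMT, an object-query-based collaboration framework that selectively extracts and transmits essential features. Within CoCMT, the Efficient Query Transformer (EQFormer) fuses multi-agent object queries, and synergistic deep supervision strengthens positive reinforcement between stages and improves overall performance. Experiments show that CoCMT outperforms state-of-the-art methods on the OPV2V and V2V4Real datasets while sharply reducing communication needs: on V2V4Real it requires only 0.416 Mb of bandwidth (83 times less than SOTA methods) while improving AP70 by 1.1%. This makes practical collaborative perception deployment possible in bandwidth-constrained environments without sacrificing detection accuracy.

Link: https://arxiv.org/abs/2503.13504
Authors: Rujia Wang,Xiangbo Gao,Hao Xiang,Runsheng Xu,Zhengzhong Tu
Affiliations: Computer Science and Engineering, Texas A&M University, USA; University of California, Los Angeles, USA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: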

Abstract:Multi-agent collaborative perception enhances each agent's perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long-range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird's-eye view (BEV) representations, which contain a significant amount of non-critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi-agent object queries and implement a synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on OPV2V and V2V4Real datasets show CoCMT outperforms state-of-the-art methods while drastically reducing communication needs. On V2V4Real, our model (Top-50 object queries) requires only 0.416 Mb bandwidth, 83 times less than SOTA methods, while improving AP70 by 1.1 percent. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy.
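
The bandwidth figures can be unpacked with a small sketch. The abstract only says that the model transmits the "Top-50 object queries"; how CoCMT ranks its queries is not stated, so the score-based ranking below is an assumption, and both function names are hypothetical. The second helper simply multiplies the reported bandwidth by the reported reduction factor to recover the baseline bandwidth those numbers imply.

```python
def top_k_queries(queries, k=50):
    """Keep the k highest-scoring queries.

    Each query is a (score, feature_vector) pair; ranking by score is an
    assumption made for illustration, not a detail taken from the paper.
    """
    return sorted(queries, key=lambda q: q[0], reverse=True)[:k]

def implied_baseline_mb(reported_mb, reduction_factor):
    """Bandwidth of the compared method implied by 'N times less'."""
    return reported_mb * reduction_factor

# 0.416 Mb at an 83x reduction implies roughly 34.5 Mb for the baseline.
baseline = implied_baseline_mb(0.416, 83)
```

Transmitting a fixed number of queries makes the message size independent of the feature-map resolution, which is why this style of selection can cut bandwidth so sharply.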

[CV-149] AUTV: Creating Underwater Video Datasets with Pixel-wise Annotations

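[Quick read]: This paper addresses underwater video analysis, a task made difficult by the dynamic marine environment and camera motion. Existing training-free video generation techniques typically learn motion dynamics frame by frame and tend to produce noticeable motion interruptions and misalignments. To address this, the paper proposes AUTV, a framework for synthesizing marine video data with pixel-wise annotations. Its effectiveness is demonstrated by constructing two video datasets: UTV, a real-world dataset of 2,000 video-text pairs, and SUTV, a synthetic dataset of 10,000 videos with segmentation masks for marine objects. UTV provides diverse underwater videos with comprehensive annotations covering appearance, texture, camera intrinsics, lighting, and animal behavior, while SUTV can improve downstream underwater tasks such as video inpainting and video object segmentation.

Link: https://arxiv.org/abs/2503.12828
Authors: Quang Trung Truong,Wong Yuk Kwan,Duc Thanh Nguyen,Binh-Son Hua,Sai-Kit Yeung
Affiliations: Hong Kong University of Science and Technology; Deakin University; Trinity College Dublin
Categories: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments: under review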

Abstract:Underwater video analysis, hampered by the dynamic marine environment and camera motion, remains a challenging task in computer vision. Existing training-free video generation techniques, learning motion dynamics on a frame-by-frame basis, often produce poor results with noticeable motion interruptions and misalignments. To address these issues, we propose AUTV, a framework for synthesizing marine video data with pixel-wise annotations. We demonstrate the effectiveness of this framework by constructing two video datasets, namely UTV, a real-world dataset comprising 2,000 video-text pairs, and SUTV, a synthetic video dataset including 10,000 videos with segmentation masks for marine objects. UTV provides diverse underwater videos with comprehensive annotations including appearance, texture, camera intrinsics, lighting, and animal behavior. SUTV can be used to improve underwater downstream tasks, which are demonstrated in video inpainting and video object segmentation.

[CV-150] Efficient Domain Augmentation for Autonomous Driving Testing Using Diffusion Models ICSE2025

Link: https://arxiv.org/abs/2409.13661
Authors: Luciano Baresi,Davide Yi Xian Hu,Andrea Stocco,Paolo Tonella
Affiliations: Politecnico di Milano; Technical University of Munich; fortiss GmbH; Software Institute - USI
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for publication at the 47th International Conference on Software Engineering (ICSE 2025). This research was partially supported by project EMELIOT, funded by MUR under the PRIN 2020 program (n. 2020W3A5FY), by the Bavarian Ministry of Economic Affairs, Regional Development and Energy, by the TUM Global Incentive Fund, and by the EU Project Sec4AI4Sec (n. 101120393)

[CV-151] Weakly Supervised Spatial Implicit Neural Representation Learning for 3D MRI-Ultrasound Deformable Image Registration in HDR Prostate Brachytherapy

Link: https://arxiv.org/abs/2503.14395
Authors: Jing Wang,Ruirui Liu,Yu Lei,Michael J. Baine,Tian Liu,Yang Lei
Affiliations: Department of Radiation Oncology, Icahn School of Medicine at Mount Sinai, New York, NY; Department of Radiation Oncology, University of Nebraska Medical Center, Omaha, NE
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-152] Advancing Medical Representation Learning Through High-Quality Data

Link: https://arxiv.org/abs/2503.14377
Authors: Negin Baghbanzadeh,Adibvafa Fallahpour,Yasaman Parhizkar,Franklin Ogidi,Shuvendu Roy,Sajad Ashkezari,Vahid Reza Khazaie,Michael Colacci,Ali Etemad,Arash Afkanpour,Elham Dolatabadi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-153] Multi-Prototype Embedding Refinement for Semi-Supervised Medical Image Segmentation

Link: https://arxiv.org/abs/2503.14343
Authors: Yali Bi,Enyu Che,Yinan Chen,Yuanpeng He,Jingwei Qu
Affiliations: College of Computer and Information Science, Southwest University, Chongqing, P.R. China; School of Computer Science, Peking University, Beijing, P.R. China
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-154] RoMedFormer: A Rotary-Embedding Transformer Foundation Model for 3D Genito-Pelvic Structure Segmentation in MRI and CT

Link: https://arxiv.org/abs/2503.14304
Authors: Yuheng Li,Mingzhe Hu,Richard L.J. Qiu,Maria Thor,Andre Williams,Deborah Marshall,Xiaofeng Yang
Affiliations: Department of Radiation Oncology and Winship Cancer Institute, Emory University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-155] Image-Based Metrics in Ultrasound for Estimation of Global Speed-of-Sound

Link: https://arxiv.org/abs/2503.14094
Authors: Roman Denkin,Orcun Goksel
Affiliations: Centre for Interdisciplinary Mathematics, Uppsala University, Sweden; Medtech Science and Innovation Centre, Uppsala University, Sweden; Department of Information Technology, Uppsala University, Sweden
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

[CV-156] Shift Scale and Rotation Invariant Multiple Object Detection using Balanced Joint Transform Correlator

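[Quick read]: This paper addresses multiple object detection: when the input image contains several targets, the conventional Polar Mellin Transform (PMT) cannot properly extract shift-, scale-, and rotation-invariant signatures. To solve this, the paper proposes a Segmented PMT (SPMT), which segments the input image and applies the PMT to each segment separately, so that multiple objects in the same frame can be detected simultaneously. The key is a segmentation strategy that widens the applicability of the original method while keeping detection invariant and robust. Simulations confirm its effectiveness in a joint transform correlator, with multi-object detection capability and clear discrimination between matching and non-matching targets.

Link: https://arxiv.org/abs/2503.14034
Authors: Xi Shen,Julian Gamboa,Tabassom Hamidfar,Shamima Mitu,Selim M. Shahriar
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: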

Abstract:The Polar Mellin Transform (PMT) is a well-known technique that converts images into shift, scale and rotation invariant signatures for object detection using opto-electronic correlators. However, this technique cannot be properly applied when there are multiple targets in a single input. Here, we propose a Segmented PMT (SPMT) that extends this methodology for cases where multiple objects are present within the same frame. Simulations show that this SPMT can be integrated into an opto-electronic joint transform correlator to create a correlation system capable of detecting multiple objects simultaneously, presenting robust detection capabilities across various transformation conditions, with remarkable discrimination between matching and non-matching targets.

[CV-157] Striving for Simplicity: Simple Yet Effective Prior-Aware Pseudo-Labeling for Semi-Supervised Ultrasound Image Segmentation MICCAI2024

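[Quick read]: This paper addresses the challenges of automated segmentation in medical ultrasound imaging, particularly when labeled data are scarce. Existing methods rely on consistency regularization or pseudo-labeling, but they have grown overly complex and, without sufficient labels, tend to latch onto artifacts, producing anatomically implausible segmentations. The key innovation is a simple pseudo-labeling method combined with an adversarially learned shape prior. Using an encoder-twin-decoder network in which the shape prior acts as an implicit shape model, the method penalizes predictions that are anatomically implausible, rather than predictions that merely deviate from the ground truth. This simple and efficient approach achieves state-of-the-art performance on two benchmark datasets under different partition protocols, providing a strong baseline for semi-supervised medical image segmentation.

Link: https://arxiv.org/abs/2503.13987
Authors: Yaxiong Chen,Yujie Wang,Zixuan Zheng,Jingliang Hu,Yilei Shi,Shengwu Xiong,Xiao Xiang Zhu,Lichao Mou
Affiliations: MedAI Technology (Wuxi) Co. Ltd.
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024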

Abstract:Medical ultrasound imaging is ubiquitous, but manual analysis struggles to keep pace. Automated segmentation can help but requires large labeled datasets, which are scarce. Semi-supervised learning leveraging both unlabeled and limited labeled data is a promising approach. State-of-the-art methods use consistency regularization or pseudo-labeling but grow increasingly complex. Without sufficient labels, these models often latch onto artifacts or allow anatomically implausible segmentations. In this paper, we present a simple yet effective pseudo-labeling method with an adversarially learned shape prior to regularize segmentations. Specifically, we devise an encoder-twin-decoder network where the shape prior acts as an implicit shape model, penalizing anatomically implausible but not ground-truth-deviating predictions. Without bells and whistles, our simple approach achieves state-of-the-art performance on two benchmarks under different partition protocols. We provide a strong baseline for future semi-supervised medical image segmentation. Code is available at this https URL.

[CV-158] Fibonacci-Net: A Lightweight CNN model for Automatic Brain Tumor Classification

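[Quick read]: This paper addresses the performance bottleneck of automatic brain tumor classification on imbalanced Magnetic Resonance Imaging (MRI) datasets, where conventional Convolutional Neural Network (CNN) models are held back by class imbalance. It proposes a lightweight CNN model, Fibonacci-Net, together with a novel pooling technique. The key innovations are: (I) the number of filters in each convolutional layer is chosen according to the Fibonacci series; (II) depth-wise separable convolution (DWSC) layers are used in the last two blocks to considerably reduce computational complexity; (III) two parallel concatenations (skip connections), from the 2nd to the 4th block and from the 3rd to the 5th block, incorporate a novel Average-2Max pooling layer that augments features and thereby alleviates class imbalance to a certain extent. Experiments show that the method achieves 96.2% accuracy, 97.17% precision, 95.9% recall, 96.5% F1 score, and 99.9% specificity on the most challenging MRI dataset, which has 44 classes.

Link: https://arxiv.org/abs/2503.13928
Authors: Santanu Roy,Ashvath Suresh,Archit Gupta,Shubhi Tiwari,Palak Sahu,Prashant Adhikari,Yuvraj S. Shekhawat
Affiliations: Christ (Deemed to be University); NIIT University, Rajasthan, India
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: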

Abstract:This research proposes a very lightweight model “Fibonacci-Net” along with a novel pooling technique, for automatic brain tumor classification from imbalanced Magnetic Resonance Imaging (MRI) datasets. Automatic brain tumor detection from MRI datasets has garnered significant attention in the research community since the inception of Convolutional Neural Network (CNN) models. However, the performance of conventional CNN models is hindered by class imbalance problems. The novelties of this work are as follows: (I) A lightweight CNN model is proposed in which the number of filters in different convolutional layers is chosen according to the numbers of the Fibonacci series. (II) In the last two blocks of the proposed model, depth-wise separable convolution (DWSC) layers are employed to considerably reduce the computational complexity of the model. (III) Two parallel concatenations (or skip connections) are deployed from the 2nd to the 4th, and from the 3rd to the 5th convolutional block in the proposed Fibonacci-Net. This skip connection encompasses a novel Average-2Max pooling layer that produces two stacks of convolved output with slightly different statistics. Therefore, this parallel concatenation block works as an efficient feature augmenter inside the model, thus automatically alleviating the class imbalance problem to a certain extent. For validation, we have implemented the proposed framework on three MRI datasets which are highly class-imbalanced. (a) The first dataset has four classes, i.e., glioma tumor, meningioma tumor, pituitary tumor, and no-tumor. (b) The second and third MRI datasets have 15 and 44 classes, respectively. Experimental results reveal that, after employing the proposed Fibonacci-Net, we have achieved 96.2% accuracy, 97.17% precision, 95.9% recall, 96.5% F1 score, and 99.9% specificity on the most challenging “44-classes MRI dataset”.
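
To see why depth-wise separable convolution reduces cost, the sketch below compares weight counts for a standard convolution and a DWSC layer, using two consecutive Fibonacci numbers as channel widths. The abstract does not say which Fibonacci numbers Fibonacci-Net uses or what kernel size it adopts, so the 89 and 144 channel widths and the 3x3 kernel are hypothetical; biases are ignored and the function names are illustrative.

```python
def fibonacci(n):
    """First n Fibonacci numbers, starting 1, 1."""
    seq = [1, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def dwsc_params(k, c_in, c_out):
    """Weights in a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    return k * k * c_in + c_in * c_out

c_in, c_out = fibonacci(12)[-2:]        # 89 and 144, chosen only for illustration
standard = conv_params(3, c_in, c_out)  # 115344 weights
separable = dwsc_params(3, c_in, c_out) # 13617 weights
```

For these illustrative widths the separable layer needs roughly 8.5 times fewer weights, which is the kind of saving the abstract attributes to DWSC in the last two blocks.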

[CV-159] Subgroup Performance of a Commercial Digital Breast Tomosynthesis Model for Breast Cancer Detection

[Quick read]: This paper addresses the lack of subgroup performance evaluation for commercial AI models for Digital Breast Tomosynthesis (DBT). The key to the solution is a granular evaluation of the Lunit INSIGHT DBT model in a large retrospective study (163,449 screening mammography exams), with performance stratified across demographic, imaging, and pathologic subgroups to identify potential disparities and weaknesses. The model performs well overall (AUC = 0.91), but performance is significantly lower in specific subgroups such as non-invasive cancers, calcifications, and dense breast tissue, which highlights the need for detailed evaluation of model characteristics before clinical deployment.

Link: https://arxiv.org/abs/2503.13581
Authors: Beatrice Brown-Mulry,Rohan Satya Isaac,Sang Hyup Lee,Ambika Seth,KyungJee Min,Theo Dapamede,Frank Li,Aawez Mansuri,MinJae Woo,Christian Allison Fauria-Robinson,Bhavna Paryani,Judy Wawira Gichoya,Hari Trivedi
Affiliations: HITI Lab, Emory University, Atlanta, GA, USA; Lunit, Seoul, South Korea; Clemson University, Clemson, SC, USA; Emory University, Atlanta, GA, USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures (plus 7 figures in supplement), 3 tables (plus 1 table in supplement)

Abstract:While research has established the potential of AI models for mammography to improve breast cancer screening outcomes, there have not been any detailed subgroup evaluations performed to assess the strengths and weaknesses of commercial models for digital breast tomosynthesis (DBT) imaging. This study presents a granular evaluation of the Lunit INSIGHT DBT model on a large retrospective cohort of 163,449 screening mammography exams from the Emory Breast Imaging Dataset (EMBED). Model performance was evaluated in a binary context with various negative exam types (162,081 exams) compared against screen detected cancers (1,368 exams) as the positive class. The analysis was stratified across demographic, imaging, and pathologic subgroups to identify potential disparities. The model achieved an overall AUC of 0.91 (95% CI: 0.90-0.92) with a precision of 0.08 (95% CI: 0.08-0.08), and a recall of 0.73 (95% CI: 0.71-0.76). Performance was found to be robust across demographics, but cases with non-invasive cancers (AUC: 0.85, 95% CI: 0.83-0.87), calcifications (AUC: 0.80, 95% CI: 0.78-0.82), and dense breast tissue (AUC: 0.90, 95% CI: 0.88-0.91) were associated with significantly lower performance compared to other groups. These results highlight the need for detailed evaluation of model characteristics and vigilance in considering adoption of new tools for clinical deployment.
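
A precision of 0.08 next to an AUC of 0.91 can look contradictory, so it helps to back out the approximate confusion-matrix counts those figures imply. The sketch below uses only numbers from the abstract (1,368 positive and 162,081 negative exams, recall 0.73, precision 0.08). Because the published metrics are rounded, the resulting counts are rough estimates rather than the study's actual counts, and the function name is illustrative.

```python
def implied_counts(positives, negatives, recall, precision):
    """Approximate TP, FN, FP, TN implied by rounded recall and precision."""
    tp = recall * positives
    fn = positives - tp
    fp = tp * (1.0 - precision) / precision
    tn = negatives - fp
    return tp, fn, fp, tn

tp, fn, fp, tn = implied_counts(1368, 162081, recall=0.73, precision=0.08)
specificity = tn / (tn + fp)  # about 0.93 under these rounded inputs
```

Under these assumptions roughly 1,000 cancers are flagged alongside on the order of 11,500 false positives. Low precision is what one would expect when under 1% of exams are positive, even for a model with high specificity.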

[CV-160] MSWAL: 3D Multi-class Segmentation of Whole Abdominal Lesions Dataset

Link: https://arxiv.org/abs/2503.13560
Authors: Zhaodong Wu,Qiaochu Zhao,Ming Hu,Yulong Li,Haochen Xue,Kang Dang,Zhengyong Jiang,Angelos Stefanidis,Qiufeng Wang,Imran Razzak,Zongyuan Ge,Junjun He,Yu Qiao,Zhong Zheng,Feilong Tang,Jionglong Su
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-161] Feasibility study for reconstruction of knee MRI from one corresponding X-ray via CNN

[Quick read]: This paper addresses the problem of generating the corresponding three-dimensional Magnetic Resonance Imaging (MRI) volume from a single X-ray image. The key to the solution is a deep-learning-based approach in which the hidden variables of a trained Convolutional Auto-Encoder (CAE) model serve as the inputs of a generator model that converts the X-ray into 3D MRI.

Link: https://arxiv.org/abs/2503.13555
Authors: Zhe Wang,Aladine Chetouani,Rachid Jennane
Affiliations: IDP Laboratory, UMR CNRS 7013, University of Orleans
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generally, X-ray, as an inexpensive and popular medical imaging technique, is widely chosen by medical practitioners. With the development of medical technology, Magnetic Resonance Imaging (MRI), an advanced medical imaging technique, has already become a supplementary diagnostic option for the diagnosis of KOA. We propose in this paper a deep-learning-based approach for generating MRI from one corresponding X-ray. Our method uses the hidden variables of a Convolutional Auto-Encoder (CAE) model, trained for reconstructing X-ray image, as inputs of a generator model to provide 3D MRI.

[CV-162] Periodontal Bone Loss Analysis via Keypoint Detection With Heuristic Post-Processing

Link: https://arxiv.org/abs/2503.13477
Authors: Ryan Banks,Vishal Thengane,María Eugenia Guerrero,Nelly Maria García-Madueño,Yunpeng Li,Hongying Tang,Akhilanand Chaurasia
Affiliations: Unknown
Categories: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 31 pages, 7 tables, 5 figures, 3 equations, journal paper submitted to Computers in Biology and Medicine

[CV-163] Robust Detection of Extremely Thin Lines Using 0.2mm Piano Wire

Link: https://arxiv.org/abs/2503.13473
Authors: Jisoo Hong,Youngjin Jung,Jihwan Bae,Seungho Song,Sung-Woo Kang
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

[CV-164] Multimodal Lead-Specific Modeling of ECG for Low-Cost Pulmonary Hypertension Assessment

Link: https://arxiv.org/abs/2503.13470
Authors: Mohammod N. I. Suvon,Shuo Zhou,Prasun C. Tripathi,Wenrui Fan,Samer Alabed,Bishesh Khanal,Venet Osmani,Andrew J. Swift,Chen(Cherise)Chen,Haiping Lu
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

[CV-165] Conditional Electrocardiogram Generation Using Hierarchical Variational Autoencoders

Link: https://arxiv.org/abs/2503.13469
Authors: Ivan Sviridov,Konstantin Egorov
Affiliations: Sber AI Lab, Moscow, Russia
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 6 figures, 7 tables

Artificial Intelligence

[AI-0] Measuring AI Ability to Complete Long Tasks

Link: https://arxiv.org/abs/2503.14499
Authors: Thomas Kwa,Ben West,Joel Becker,Amy Deng,Katharyn Garcia,Max Hasin,Sami Jawhar,Megan Kinniment,Nate Rush,Sydney Von Arx,Ryan Bloom,Thomas Broadley,Haoxing Du,Brian Goodrich,Nikola Jurkovic,Luke Harold Miles,Seraphina Nix,Tao Lin,Neev Parikh,David Rein,Lucas Jun Koba Sato,Hjalmar Wijk,Daniel M. Ziegler,Elizabeth Barnes,Lawrence Chan
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models’ time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results – including their degree of external validity – and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
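
The extrapolation in the abstract is simple exponential growth, and a short sketch makes the arithmetic explicit. It assumes a clean doubling every seven months starting from a 50-minute horizon, both figures taken from the abstract; the function name is illustrative, and the abstract itself notes that the trend may have accelerated in 2024.

```python
def extrapolate_horizon(current_minutes, months_ahead, doubling_months=7.0):
    """Projected 50% time horizon in minutes, assuming steady exponential growth."""
    return current_minutes * 2.0 ** (months_ahead / doubling_months)

one_doubling = extrapolate_horizon(50, 7)   # 100 minutes after seven months
five_years = extrapolate_horizon(50, 60)    # roughly 19,000 minutes
five_years_hours = five_years / 60.0        # roughly 317 hours
```

Sixty months is about 8.6 doublings, a factor of roughly 380, which turns 50 minutes into about 317 hours of human task time. At around 40 working hours per week (an assumption, not a figure from the paper) that is on the order of one to two months of full-time work, consistent with the abstract's claim that month-long tasks come into range within five years.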

[AI-1] Engineering Scientific Assistants using Interactive Structured Induction of Programs

Link: https://arxiv.org/abs/2503.14488
Authors: Shraddha Surana,Ashwin Srinivasan
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:We are interested in the construction of software that can act as scientific assistants to domain specialists. It is expected that such assistants will be needed to accelerate the identification of ways to address complex problems requiring urgent solutions. In this paper, our focus is not on a specific scientific problem, but on the software-engineering of such ‘science accelerators’. Recent developments in ‘No Code’ techniques would seem to suggest that scientists can hypothesise solutions simply by conversing with a large language model (LLM). However, for complex scientific problems, this seems unlikely given the current state of LLM technology. What does appear feasible is that a software engineer can use LLMs to rapidly construct programs for use by a domain-specialist, including the specialist’s requirements expressed in natural language. We propose the design of an interactive form of ‘structured’ inductive programming in which a software-engineer and an LLM collaboratively construct an ‘assistant’ for a scientific data analysis. The paper describes a simple implementation called iStrucInd that adapts a ‘2-way Intelligibility’ protocol to implement the interaction between the software engineer and the LLM. We test the tool on two different non-trivial scientific data analysis tasks. Specifically, we compare the system constructed by iStrucInd against systems constructed manually and by Low Code/No Code methods along dimensions of: (a) program performance; (b) program quality; and (c) programming effort. The results show iStrucInd allows a software engineer to develop better programs faster, suggesting interactive structured induction can play a useful role in the rapid construction of scientific assistants.

[AI-2] Attribution Score Alignment in Explainable Data Management

Link: https://arxiv.org/abs/2503.14469
Authors: Felipe Azua,Leopoldo Bertossi
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)

Abstract:Different attribution-scores have been proposed to quantify the relevance of database tuples for a query answer from a database. Among them, we find Causal Responsibility, the Shapley Value, the Banzhaf Power-Index, and the Causal Effect. They have been analyzed in isolation, mainly in terms of computational properties. In this work, we start an investigation into the alignment of these scores on the basis of the queries at hand; that is, on whether they induce compatible rankings of tuples. We are able to identify vast classes of queries for which some pairs of scores are always aligned, and others for which they are not. It turns out that the presence of exogenous tuples makes a crucial difference in this regard.

[AI-3] VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms

Link: https://arxiv.org/abs/2503.14427
Authors: Seungwon Lim,Sungwoong Kim,Jihwan Yu,Sungjae Lee,Jiwan Chung,Youngjae Yu
Categories: Artificial Intelligence (cs.AI)

Abstract:Escape rooms present a unique cognitive challenge that demands exploration-driven planning: players should actively search their environment, continuously update their knowledge based on new discoveries, and connect disparate clues to determine which elements are relevant to their objectives. Motivated by this, we introduce VisEscape, a benchmark of 20 virtual escape rooms specifically designed to evaluate AI models under these challenging conditions, where success depends not only on solving isolated puzzles but also on iteratively constructing and refining spatial-temporal knowledge of a dynamically changing environment. On VisEscape, we observed that even state-of-the-art multimodal models generally fail to escape the rooms, showing considerable variation in their levels of progress and trajectories. To address this issue, we propose VisEscaper, which effectively integrates Memory, Feedback, and ReAct modules, demonstrating significant improvements by performing 3.7 times more effectively and 5.0 times more efficiently on average.

[AI-4] Iffy-Or-Not: Extending the Web to Support the Critical Evaluation of Fallacious Texts

Link: https://arxiv.org/abs/2503.14412
Authors: Gionnieve Lim,Juho Kim,Simon T. Perrault
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Social platforms have expanded opportunities for deliberation with the comments being used to inform one’s opinion. However, using such information to form opinions is challenged by unsubstantiated or false content. To enhance the quality of opinion formation and potentially confer resistance to misinformation, we developed Iffy-Or-Not (ION), a browser extension that seeks to invoke critical thinking when reading texts. With three features guided by argumentation theory, ION highlights fallacious content, suggests diverse queries to probe them with, and offers deeper questions to consider and chat with others about. From a user study (N=18), we found that ION encourages users to be more attentive to the content, suggests queries that align with or are preferable to their own, and poses thought-provoking questions that expand their perspectives. However, some participants expressed aversion to ION due to misalignments with their information goals and thinking predispositions. Potential backfiring effects with ION are discussed.

[AI-5] Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels

Link: https://arxiv.org/abs/2503.14376
Authors: Maximilian Beck,Korbinian Pöppel,Phillip Lippe,Sepp Hochreiter
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code available at: this https URL

Abstract:Linear RNNs with gating recently demonstrated competitive performance compared to Transformers in language modeling. Although their linear compute scaling in sequence length offers theoretical runtime advantages over Transformers, realizing these benefits in practice requires optimized custom kernels, as Transformers rely on the highly efficient Flash Attention kernels. Leveraging the chunkwise-parallel formulation of linear RNNs, Flash Linear Attention (FLA) shows that linear RNN kernels are faster than Flash Attention, by parallelizing over chunks of the input sequence. However, since the chunk size of FLA is limited, many intermediate states must be materialized in GPU memory. This leads to low arithmetic intensity and causes high memory consumption and IO cost, especially for long-context pre-training. In this work, we present Tiled Flash Linear Attention (TFLA), a novel kernel algorithm for linear RNNs, that enables arbitrary large chunk sizes by introducing an additional level of sequence parallelization within each chunk. First, we apply TFLA to the xLSTM with matrix memory, the mLSTM. Second, we propose an mLSTM variant with sigmoid input gate and reduced computation for even faster kernel runtimes at equal language modeling performance. In our speed benchmarks, we show that our new mLSTM kernels based on TFLA outperform highly optimized Flash Attention, Linear Attention and Mamba kernels, setting a new state of the art for efficient long-context sequence modeling primitives.

[AI-6] Revealing higher-order neural representations with generative artificial intelligence

Link: https://arxiv.org/abs/2503.14333
Authors: Hojjat Azimi Asrari,Megan A. K. Peters
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Abstract:Studies often aim to reveal how neural representations encode aspects of an observer’s environment, such as its contents or structure. These are "first-order" representations (FORs), because they're "about" the external world. A less-common target is "higher-order" representations (HORs), which are "about" FORs – their contents, stability, or uncertainty. HORs of uncertainty appear critically involved in adaptive behaviors including learning under uncertainty, influencing learning rates and internal model updating based on environmental feedback. However, HORs about uncertainty are unlikely to be direct "read-outs" of FOR characteristics, instead reflecting estimation processes which may be lossy, bias-prone, or distortive and which may also incorporate estimates of distributions of uncertainty the observer is likely to experience. While some research has targeted neural representations of "instantaneously" estimated uncertainty, how the brain represents distributions of expected uncertainty remains largely unexplored. Here, we propose a novel reinforcement learning (RL) based generative artificial intelligence (genAI) approach to explore neural representations of uncertainty distributions. We use existing functional magnetic resonance imaging data, where humans learned to ‘de-noise’ their brain states to achieve target neural patterns, to train denoising diffusion genAI models with RL algorithms to learn noise distributions similar to how humans might learn to do the same. We then explore these models’ learned noise-distribution HORs compared to control models trained with traditional backpropagation. Results reveal model-dependent differences in noise distribution representations – with the RL-based model offering much higher explanatory power for human behavior – offering an exciting path towards using genAI to explore neural noise-distribution HORs.

[AI-7] COPA: Comparing the Incomparable to Explore the Pareto Front

Link: https://arxiv.org/abs/2503.14321
Authors: Adrián Javaloy,Antonio Vergari,Isabel Valera
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 14 figures. Under submission

Abstract:In machine learning (ML), it is common to account for multiple objectives when, e.g., selecting a model to deploy. However, it is often unclear how one should compare, aggregate and, ultimately, trade-off these objectives, as they might be measured in different units or scales. For example, when deploying large language models (LLMs), we might not only care about their performance, but also their CO2 consumption. In this work, we investigate how objectives can be sensibly compared and aggregated to navigate their Pareto front. To do so, we propose to make incomparable objectives comparable via their CDFs, approximated by their relative rankings. This allows us to aggregate them while matching user-specific preferences, allowing practitioners to meaningfully navigate and search for models in the Pareto front. We demonstrate the potential impact of our methodology in diverse areas such as LLM selection, domain generalization, and AutoML benchmarking, where classical ways to aggregate and normalize objectives fail.
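The idea of making objectives comparable through rank-based CDF estimates can be illustrated with a small sketch. This is not the authors' implementation: the weighted sum used for aggregation is a simplification chosen here, and the model names, values, and weights are hypothetical.

```python
def rank_cdf(values, higher_is_better=True):
    """Map each value to its empirical CDF position in (0, 1], via its rank."""
    n = len(values)
    # Orient every objective so that a lower score is always better.
    oriented = [-v if higher_is_better else v for v in values]
    return [sum(1 for other in oriented if other <= v) / n for v in oriented]


def aggregate(models, weights):
    """Weighted sum of rank-normalized objectives; lower is better."""
    names = list(models)
    scores = {name: 0.0 for name in names}
    for objective, (weight, higher_is_better) in weights.items():
        column = [models[name][objective] for name in names]
        for name, u in zip(names, rank_cdf(column, higher_is_better)):
            scores[name] += weight * u
    return scores


# Hypothetical models: accuracy (higher is better), CO2 in kg (lower is better).
models = {
    "A": {"accuracy": 0.91, "co2_kg": 120.0},
    "B": {"accuracy": 0.89, "co2_kg": 40.0},
    "C": {"accuracy": 0.80, "co2_kg": 35.0},
}
# Hypothetical user preference: accuracy matters more than emissions.
weights = {"accuracy": (0.7, True), "co2_kg": (0.3, False)}

scores = aggregate(models, weights)
best = min(scores, key=scores.get)
print(best, scores)
```

Because both objectives are mapped to the same unitless (0, 1] scale before weighting, the units of accuracy and CO2 no longer influence the trade-off; only the relative rankings and the chosen weights do.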

[AI-8] CTSAC: Curriculum-Based Transformer Soft Actor-Critic for Goal-Oriented Robot Exploration ICRA

Link: https://arxiv.org/abs/2503.14254
Authors: Chunyu Yang,Shengben Bi,Yihui Xu,Xin Zhang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures; paper accepted by ICRA 2025

Abstract:With the increasing demand for efficient and flexible robotic exploration solutions, Reinforcement Learning (RL) is becoming a promising approach in the field of autonomous robotic exploration. However, current RL-based exploration algorithms often face limited environmental reasoning capabilities, slow convergence rates, and substantial challenges in Sim-To-Real (S2R) transfer. To address these issues, we propose a Curriculum Learning-based Transformer Reinforcement Learning Algorithm (CTSAC) aimed at improving both exploration efficiency and transfer performance. To enhance the robot’s reasoning ability, a Transformer is integrated into the perception network of the Soft Actor-Critic (SAC) framework, leveraging historical information to improve the farsightedness of the strategy. A periodic review-based curriculum learning is proposed, which enhances training efficiency while mitigating catastrophic forgetting during curriculum transitions. Training is conducted on the ROS-Gazebo continuous robotic simulation platform, with LiDAR clustering optimization to further reduce the S2R gap. Experimental results demonstrate the CTSAC algorithm outperforms the state-of-the-art non-learning and learning-based algorithms in terms of success rate and success rate-weighted exploration time. Moreover, real-world experiments validate the strong S2R transfer capabilities of CTSAC.

[AI-9] A Parallel Hybrid Action Space Reinforcement Learning Model for Real-world Adaptive Traffic Signal Control

Link: https://arxiv.org/abs/2503.14250
Authors: Yuxuan Wang,Meng Long,Qiang Wu,Wei Liu,Jiatian Pi,Xinmin Yang
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 9 figures, Reinforcement Learning

Abstract:Adaptive traffic signal control (ATSC) can effectively reduce vehicle travel times by dynamically adjusting signal timings but poses a critical challenge in real-world scenarios due to the complexity of real-time decision-making in dynamic and uncertain traffic conditions. The burgeoning field of intelligent transportation systems, bolstered by artificial intelligence techniques and extensive data availability, offers new prospects for the implementation of ATSC. In this study, we introduce a parallel hybrid action space reinforcement learning model (PH-DDPG) that optimizes traffic signal phase and duration simultaneously, eliminating the need for sequential decision-making seen in traditional two-stage models. Our model features a task-specific parallel hybrid action space tailored for adaptive traffic control, which directly outputs discrete phase selections and their associated continuous duration parameters concurrently, thereby inherently addressing dynamic traffic adaptation through unified parametric optimization. Furthermore, to ascertain the robustness and effectiveness of this approach, we executed ablation studies focusing on the utilization of a random action parameter mask within the critic network, which decouples the parameter space for individual actions, facilitating the use of preferable parameters for each action. The results from these studies confirm the efficacy of this method, distinctly enhancing real-world applicability.

[AI-10] GeoFlow-SLAM: A Robust Tightly-Coupled RGBD-Inertial Fusion SLAM for Dynamic Legged Robotics

Link: https://arxiv.org/abs/2503.14247
Authors: Tingyang Xiao,Xiaolin Zhou,Liu Liu,Wei Sui,Wei Feng,Jiaxiong Qiu,Xinjie Wang,Zhizhong Su
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:This paper presents GeoFlow-SLAM, a robust and effective Tightly-Coupled RGBD-inertial SLAM for legged robots operating in highly dynamic environments. By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges: feature matching and pose initialization failures during fast locomotion and visual feature scarcity in texture-less environments. Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU error in legged robots, integrating IMU/Legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is first introduced to improve the robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) on collected legged robots and open-source datasets. To further promote research and development, the open-source datasets and code will be made publicly available at this https URL

[AI-11] Trading-off Accuracy and Communication Cost in Federated Learning

Link: https://arxiv.org/abs/2503.14246
Authors: Mattia Jacopo Villani,Emanuele Natale,Frederik Mallmann-Trenn
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Leveraging the training-by-pruning paradigm introduced by Zhou et al., Isik et al. introduced a federated learning protocol that achieves a 34-fold reduction in communication cost. We achieve compression improvements of orders of magnitude over the state-of-the-art. The central idea of our framework is to encode the network weights $\vec{w}$ by a vector of trainable parameters $\vec{p}$, such that $\vec{w} = Q \cdot \vec{p}$, where $Q$ is a carefully generated sparse random matrix (that remains fixed throughout training). In such a framework, the previous work of Zhou et al. [NeurIPS’19] is retrieved when $Q$ is diagonal and $\vec{p}$ has the same dimension as $\vec{w}$. We instead show that $\vec{p}$ can effectively be chosen much smaller than $\vec{w}$, while retaining the same accuracy at the price of a decrease of the sparsity of $Q$. Since server and clients only need to share $\vec{p}$, such a trade-off leads to a substantial improvement in communication cost. Moreover, we provide theoretical insight into our framework and establish a novel link between training-by-sampling and random convex geometry.
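The weight encoding described in this abstract can be sketched in a few lines. The code below is an illustration only, not the paper's protocol: the sizes, the number of nonzeros per row, the Gaussian entries, and the idea that both sides regenerate Q from a shared seed are all assumptions introduced here.

```python
import random


def make_sparse_q(n_weights, n_params, nonzeros_per_row, seed=0):
    """Fixed sparse random matrix Q, stored as one {column: value} dict per row."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_weights):
        cols = rng.sample(range(n_params), nonzeros_per_row)
        rows.append({c: rng.gauss(0.0, 1.0) for c in cols})
    return rows


def expand(q_rows, p):
    """Compute w = Q p using only the nonzero entries of each row."""
    return [sum(value * p[col] for col, value in row.items()) for row in q_rows]


# Hypothetical sizes: 1000 network weights encoded by 50 trainable parameters.
n_weights, n_params = 1000, 50
q = make_sparse_q(n_weights, n_params, nonzeros_per_row=4)

# Only p would be trained and communicated; Q stays fixed.
p = [0.01 * i for i in range(n_params)]
w = expand(q, p)

compression = n_weights / n_params
print(len(w), compression)
```

In this sketch a party holding the same seed can rebuild Q locally, so each round only the short vector p has to travel. Raising `nonzeros_per_row` makes Q denser, which mirrors the trade-off the abstract describes between the size of p and the sparsity of Q.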

[AI-12] KG-IRAG: A Knowledge Graph-Based Iterative Retrieval-Augmented Generation Framework for Temporal Reasoning

Link: https://arxiv.org/abs/2503.14234
Authors: Ruiyi Yang,Hao Xue,Imran Razzak,Hakim Hacid,Flora D. Salim
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 14 pages, 4 figures

Abstract:Graph Retrieval-Augmented Generation (GraphRAG) has proven highly effective in enhancing the performance of Large Language Models (LLMs) on tasks that require external knowledge. By leveraging Knowledge Graphs (KGs), GraphRAG improves information retrieval for complex reasoning tasks, providing more precise and comprehensive retrieval and generating more accurate responses to QAs. However, most RAG methods fall short in addressing multi-step reasoning, particularly when both information extraction and inference are necessary. To address this limitation, this paper presents Knowledge Graph-Based Iterative Retrieval-Augmented Generation (KG-IRAG), a novel framework that integrates KGs with iterative reasoning to improve LLMs’ ability to handle queries involving temporal and logical dependencies. Through iterative retrieval steps, KG-IRAG incrementally gathers relevant data from external KGs, enabling step-by-step reasoning. The proposed approach is particularly suited for scenarios where reasoning is required alongside dynamic temporal data extraction, such as determining optimal travel times based on weather conditions or traffic patterns. Experimental results show that KG-IRAG improves accuracy in complex reasoning tasks by effectively integrating external knowledge with iterative, logic-based retrieval. Additionally, three new datasets: weatherQA-Irish, weatherQA-Sydney, and trafficQA-TFNSW, are formed to evaluate KG-IRAG’s performance, demonstrating its potential beyond traditional RAG applications.

[AI-13] Stochastic Trajectory Prediction under Unstructured Constraints ICRA2025

Link: https://arxiv.org/abs/2503.14203
Authors: Hao Ma,Zhiqiang Pu,Shijie Wang,Boyin Liu,Huimu Wang,Yanyan Liang,Jianqiang Yi
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by ICRA 2025

Abstract:Trajectory prediction facilitates effective planning and decision-making, while constrained trajectory prediction integrates regulation into prediction. Recent advances in constrained trajectory prediction focus on structured constraints by constructing optimization objectives. However, handling unstructured constraints is challenging due to the lack of differentiable formal definitions. To address this, we propose a novel method for constrained trajectory prediction using a conditional generative paradigm, named Controllable Trajectory Diffusion (CTD). The key idea is that any trajectory corresponds to a degree of conformity to a constraint. By quantifying this degree and treating it as a condition, a model can implicitly learn to predict trajectories under unstructured constraints. CTD employs a pre-trained scoring model to predict the degree of conformity (i.e., a score), and uses this score as a condition for a conditional diffusion model to generate trajectories. Experimental results demonstrate that CTD achieves high accuracy on the ETH/UCY and SDD benchmarks. Qualitative analysis confirms that CTD ensures adherence to unstructured constraints and can predict trajectories that satisfy combinatorial constraints.

[AI-14] Driving behavior recognition via self-discovery learning

Link: https://arxiv.org/abs/2503.14194
Authors: Yilin Wang
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Autonomous driving systems require a deep understanding of human driving behaviors to achieve higher intelligence and this http URL advancements in deep learning, challenges such as long-tail distribution due to scarce samples and confusion from similar behaviors hinder effective driving behavior this http URL methods often fail to address sample confusion adequately, as datasets frequently contain ambiguous samples that obscure unique semantic information.

[AI-15] Inferring Event Descriptions from Time Series with Language Models

Link: https://arxiv.org/abs/2503.14190
Authors: Mingtian Tan,Mike A. Merrill,Zack Gottesman,Tim Althoff,David Evans,Tom Hartvigsen
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 9 figures

Abstract:Time series data measure how environments change over time and drive decision-making in critical domains like finance and healthcare. When analyzing time series, we often seek to understand the underlying events occurring in the measured environment. For example, one might ask: What caused a sharp drop in the stock price? Events are often described with natural language, so we conduct the first study of whether Large Language Models (LLMs) can infer natural language events from time series. We curate a new benchmark featuring win probabilities collected from 4,200 basketball and American football games, featuring 1.7M timesteps with real value data and corresponding natural language events. Building on the recent wave of using LLMs on time series, we evaluate 16 LLMs and find that they demonstrate promising abilities to infer events from time series data. The open-weights DeepSeek-R1 32B model outperforms proprietary models like GPT-4o. Despite this impressive initial performance, we also find clear avenues to improve recent models, as we identify failures when altering the provided context, event sequence lengths, and evaluation strategy. (All resources needed to reproduce our work are available: this https URL)

[AI-16] Variable Time-Step MPC for Agile Multi-Rotor UAV Interception of Dynamic Targets

Link: https://arxiv.org/abs/2503.14184
Authors: Atharva Ghotavadekar,František Nekovář,Martin Saska,Jan Faigl
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:Agile trajectory planning can improve the efficiency of multi-rotor Uncrewed Aerial Vehicles (UAVs) in scenarios with combined task-oriented and kinematic trajectory planning, such as monitoring spatio-temporal phenomena or intercepting dynamic targets. Agile planning using existing non-linear model predictive control methods is limited by the number of planning steps as it becomes increasingly computationally demanding. That reduces the prediction horizon length, leading to a decrease in solution quality. Besides, the fixed time-step length limits the utilization of the available UAV dynamics in the target neighborhood. In this paper, we propose to address these limitations by introducing variable time steps and coupling them with the prediction horizon length. A simplified point-mass motion primitive is used to leverage the differential flatness of quadrotor dynamics and the generation of feasible trajectories in the flat output space. Based on the presented evaluation results and experimentally validated deployment, the proposed method increases the solution quality by enabling planning for long flight segments but allowing tightly sampled maneuvering.

[AI-17] Can LLMs Enable Verification in Mainstream Programming?

Link: https://arxiv.org/abs/2503.14183
Authors: Aleksandr Shefer,Igor Engel,Stanislav Alekseev,Daniil Berezun,Ekaterina Verbitskaia,Anton Podkopaev
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)

Abstract:Although formal methods are capable of producing reliable software, they have seen minimal adoption in everyday programming. Automatic code generation using large language models is becoming increasingly widespread, but it rarely considers producing strong correctness guarantees. In this study, we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus). To do so, we use manually curated datasets derived from the state-of-the-art Python benchmark, HumanEval. We also assess what types of information are sufficient to achieve good-quality results.

[AI-18] EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models

Link: https://arxiv.org/abs/2503.14162
Authors: Zongyun Zhang,Jiacheng Ruan,Xian Gao,Ting Liu,Yuzhuo Fu
Categories: Artificial Intelligence (cs.AI)

Abstract:Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing. Although existing zero-shot defect segmentation and detection methods have shown effectiveness, they cannot provide detailed descriptions of the defects. Furthermore, the application of large multi-modal models in IAD remains in its infancy, facing challenges in balancing question-answering (QA) performance and mask-based grounding capabilities, often owing to overfitting during the fine-tuning process. To address these challenges, we propose a novel approach that introduces a dedicated multi-modal defect localization module to decouple the dialog functionality from the core feature extraction. This decoupling is achieved through independent optimization objectives and tailored learning strategies. Additionally, we contribute to the first multi-modal industrial anomaly detection training dataset, named Defect Detection Question Answering (DDQA), encompassing a wide range of defect types and industrial scenarios. Unlike conventional datasets that rely on GPT-generated data, DDQA ensures authenticity and reliability and offers a robust foundation for model training. Experimental results demonstrate that our proposed method, Explainable Industrial Anomaly Detection Assistant (EIAD), achieves outstanding performance in defect detection and localization tasks. It not only significantly enhances accuracy but also improves interpretability. These advancements highlight the potential of EIAD for practical applications in industrial settings.

[AI-19] Inference-Time Intervention in Large Language Models for Reliable Requirement Verification

Link: https://arxiv.org/abs/2503.14130
Authors: Paul Darm,James Xie,Annalisa Riccardi
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Steering the behavior of Large Language Models (LLMs) remains a challenge, particularly in engineering applications where precision and reliability are critical. While fine-tuning and prompting methods can modify model behavior, they lack the dynamic and exact control necessary for engineering applications. Inference-time intervention techniques provide a promising alternative, allowing targeted adjustments to LLM outputs. In this work, we demonstrate how interventions enable fine-grained control for automating the usually time-intensive requirement verification process in Model-Based Systems Engineering (MBSE). Using two early-stage Capella SysML models of space missions with associated requirements, we apply the intervened LLMs to reason over a graph representation of the model to determine whether a requirement is fulfilled. Our method achieves robust and reliable outputs, significantly improving over both a baseline model and a fine-tuning approach. By identifying and modifying as few as one to three specialised attention heads, we can significantly change the model’s behavior. When combined with self-consistency, this allows us to achieve perfect precision on our holdout test set.

[AI-20] Sensory-driven microinterventions for improved health and wellbeing

Link: https://arxiv.org/abs/2503.14102
Authors: Youssef Abdalla,Elia Gatti,Mine Orlu,Marianna Obrist
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)

Abstract:The five senses are gateways to our wellbeing and their decline is considered a significant public health challenge which is linked to multiple conditions that contribute significantly to morbidity and mortality. Modern technology, with its ubiquitous nature and fast data processing has the ability to leverage the power of the senses to transform our approach to day to day healthcare, with positive effects on our quality of life. Here, we introduce the idea of sensory-driven microinterventions for preventative, personalised healthcare. Microinterventions are targeted, timely, minimally invasive strategies that seamlessly integrate into our daily life. This idea harnesses human’s sensory capabilities, leverages technological advances in sensory stimulation and real-time processing ability for sensing the senses. The collection of sensory data from our continuous interaction with technology - for example the tone of voice, gait movement, smart home behaviour - opens up a shift towards personalised technology-enabled, sensory-focused healthcare interventions, coupled with the potential of early detection and timely treatment of sensory deficits that can signal critical health insights, especially for neurodegenerative diseases such as Parkinson’s disease.

[AI-21] Theoretical Foundation of Flow-Based Time Series Generation: Provable Approximation, Generalization, and Efficiency

Link: https://arxiv.org/abs/2503.14076
Authors: Jiangxuan Long,Zhao Song,Chiwun Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 33 pages

Abstract:Recent studies suggest utilizing generative models instead of traditional auto-regressive algorithms for time series forecasting (TSF) tasks. These non-auto-regressive approaches involving different generative methods, including GAN, Diffusion, and Flow Matching for time series, have empirically demonstrated high-quality generation capability and accuracy. However, we still lack an appropriate understanding of how they achieve approximation and generalization. This paper presents the first theoretical framework from the perspective of flow-based generative models to address this gap in understanding. In particular, we provide our insights with strict guarantees from three perspectives: Approximation, Generalization, and Efficiency. In detail, our analysis makes the following contributions: (1) Assuming a general data model, the fitting of flow-based generative models is confirmed to converge to arbitrary error under the universal approximation of the Diffusion Transformer (DiT). (2) By introducing a polynomial-based regularization for flow matching, the generalization error can be bounded, owing to the generalization of polynomial approximation. (3) Treating sampling for generation as an optimization process, we demonstrate its fast convergence when updating with standard first-order gradient descent on some objective.

[AI-22] ON-Traffic: An Operator Learning Framework for Online Traffic Flow Estimation and Uncertainty Quantification from Lagrangian Sensors

Link: https://arxiv.org/abs/2503.14053
Authors: Jake Rap,Amritam Das
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*Comments:

Abstract:Accurate traffic flow estimation and prediction are critical for the efficient management of transportation systems, particularly under increasing urbanization. Traditional methods relying on static sensors often suffer from limited spatial coverage, while probe vehicles provide richer, albeit sparse and irregular data. This work introduces ON-Traffic, a novel deep operator Network and a receding horizon learning-based framework tailored for online estimation of spatio-temporal traffic state along with quantified uncertainty by using measurements from moving probe vehicles and downstream boundary inputs. Our framework is evaluated in both numerical and simulation datasets, showcasing its ability to handle irregular, sparse input data, adapt to time-shifted scenarios, and provide well-calibrated uncertainty estimates. The results demonstrate that the model captures complex traffic phenomena, including shockwaves and congestion propagation, while maintaining robustness to noise and sensor dropout. These advancements present a significant step toward online, adaptive traffic management systems.

[AI-23] COLSON: Controllable Learning-Based Social Navigation via Diffusion-Based Reinforcement Learning IROS2025

Link: https://arxiv.org/abs/2503.13934
Authors: Yuki Tomita,Kohei Matsumoto,Yuki Hyodo,Ryo Kurazume
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments: This work has been submitted to IROS 2025 for possible publication

Abstract:Mobile robot navigation in dynamic environments with pedestrian traffic is a key challenge in the development of autonomous mobile service robots. Recently, deep reinforcement learning-based methods have been actively studied and have outperformed traditional rule-based approaches owing to their optimization capabilities. Among these, methods that assume a continuous action space typically rely on a Gaussian distribution assumption, which limits the flexibility of generated actions. Meanwhile, the application of diffusion models to reinforcement learning has advanced, allowing for more flexible action distributions compared with Gaussian distribution-based approaches. In this study, we applied a diffusion-based reinforcement learning approach to social navigation and validated its effectiveness. Furthermore, by leveraging the characteristics of diffusion models, we propose an extension that enables post-training action smoothing and adaptation to static obstacle scenarios not considered during the training steps.

[AI-24] Learning Accurate Models on Incomplete Data with Minimal Imputation

Link: https://arxiv.org/abs/2503.13921
Authors: Cheng Zhen,Nischal Aryal,Arash Termehchy,Prayoga,Garrett Biwer,Sankalp Patil
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Missing data often exists in real-world datasets, requiring significant time and effort for imputation to learn accurate machine learning (ML) models. In this paper, we demonstrate that imputing all missing values is not always necessary to achieve an accurate ML model. We introduce the concept of minimal data imputation, which ensures accurate ML models trained over the imputed dataset. Implementing minimal imputation guarantees both minimal imputation effort and optimal ML models. We propose algorithms to find exact and approximate minimal imputation for various ML models. Our extensive experiments indicate that our proposed algorithms significantly reduce the time and effort required for data imputation.

[AI-25] Learning Bimanual Manipulation via Action Chunking and Inter-Arm Coordination with Transformers

Link: https://arxiv.org/abs/2503.13916
Authors: Tomohiro Motoda,Ryo Hanai,Ryoichi Nakajo,Masaki Murooka,Floris Erich,Yukiyasu Domae
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments: 6 pages, 5 figures, 1 table

Abstract:Robots that operate autonomously in human living environments need the ability to handle various tasks flexibly. One crucial element is coordinated bimanual movements that enable functions that are difficult to perform with one hand alone. In recent years, learning-based models that focus on the possibilities of bimanual movements have been proposed. However, the high degree of freedom of the robot makes it challenging to reason about control, and the left and right robot arms need to adjust their actions depending on the situation, making it difficult to realize more dexterous tasks. To address the issue, we focus on coordination and efficiency between both arms, particularly for synchronized actions. Therefore, we propose a novel imitation learning architecture that predicts cooperative actions. We differentiate the architecture for both arms and add an intermediate encoder layer, Inter-Arm Coordinated transformer Encoder (IACE), that facilitates synchronization and temporal alignment to ensure smooth and coordinated actions. To verify the effectiveness of our architectures, we perform distinctive bimanual tasks. The experimental results show that our model achieves a high success rate in the comparison and suggest a suitable architecture for the policy learning of bimanual manipulation.

[AI-26] KANITE: Kolmogorov-Arnold Networks for ITE estimation

Link: https://arxiv.org/abs/2503.13912
Authors: Eshan Mehendale,Abhinav Thorat,Ravi Kolla,Niranjan Pedanekar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
*Comments: 16 pages, 4 figures

Abstract:We introduce KANITE, a framework leveraging Kolmogorov-Arnold Networks (KANs) for Individual Treatment Effect (ITE) estimation under the multiple-treatments setting in causal inference. By utilizing KANs' unique ability to learn univariate activation functions, as opposed to the linear weights learned by Multi-Layer Perceptrons (MLPs), we improve the estimates of ITEs. The KANITE framework comprises two key architectures: 1. Integral Probability Metric (IPM) architecture: This employs an IPM loss in a specialized manner to effectively align towards ITE estimation across multiple treatments. 2. Entropy Balancing (EB) architecture: This uses weights for samples that are learned by optimizing entropy subject to balancing the covariates across treatment groups. Extensive evaluations on benchmark datasets demonstrate that KANITE outperforms state-of-the-art algorithms in both $\epsilon_{\text{PEHE}}$ and $\epsilon_{\text{ATE}}$ metrics. Our experiments highlight the advantages of KANITE in achieving improved causal estimates, emphasizing the potential of KANs to advance causal inference methodologies across diverse application areas.
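
The two metrics named in the abstract can be made concrete with a short sketch. It uses the definitions commonly found in the treatment-effect literature for a binary treatment: PEHE as the mean squared error of the estimated individual effects, and the ATE error as the absolute error of the estimated average effect. The abstract does not state which variant the paper uses (some works report the square root of PEHE, and the multiple-treatments setting presumably aggregates over treatment pairs), so the function names and the binary-treatment setup below are assumptions.

```python
def pehe(y0_true, y1_true, y0_pred, y1_pred):
    """Precision in Estimation of Heterogeneous Effect (mean squared ITE error)."""
    n = len(y0_true)
    total = 0.0
    for t0, t1, p0, p1 in zip(y0_true, y1_true, y0_pred, y1_pred):
        true_ite = t1 - t0   # true individual treatment effect
        pred_ite = p1 - p0   # estimated individual treatment effect
        total += (pred_ite - true_ite) ** 2
    return total / n


def ate_error(y0_true, y1_true, y0_pred, y1_pred):
    """Absolute error of the estimated average treatment effect."""
    n = len(y0_true)
    true_ate = sum(t1 - t0 for t0, t1 in zip(y0_true, y1_true)) / n
    pred_ate = sum(p1 - p0 for p0, p1 in zip(y0_pred, y1_pred)) / n
    return abs(pred_ate - true_ate)


if __name__ == "__main__":
    # Two units with true effects 1 and 3; the model predicts 2 for both.
    y0_true, y1_true = [0.0, 0.0], [1.0, 3.0]
    y0_pred, y1_pred = [0.0, 0.0], [2.0, 2.0]
    print(pehe(y0_true, y1_true, y0_pred, y1_pred))       # individual-level error
    print(ate_error(y0_true, y1_true, y0_pred, y1_pred))  # population-level error
```

The toy data show why both metrics are reported: a model can match the average effect exactly while still misestimating every individual effect. Both potential outcomes are needed to compute either metric, which is why they are evaluated on synthetic or semi-synthetic benchmarks.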

[AI-27] MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments

Link: https://arxiv.org/abs/2503.13882
Authors: Zhengsheng Guo,Linwei Zheng,Xinyang Chen,Xuefeng Bai,Kehai Chen,Min Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:While human cognition inherently retrieves information from diverse and specialized knowledge sources during decision-making processes, current Retrieval-Augmented Generation (RAG) systems typically operate through single-source knowledge retrieval, leading to a cognitive-algorithmic discrepancy. To bridge this gap, we introduce MoK-RAG, a novel multi-source RAG framework that implements a mixture of knowledge paths enhanced retrieval mechanism through functional partitioning of a large language model (LLM) corpus into distinct sections, enabling retrieval from multiple specialized knowledge paths. Applied to the generation of 3D simulated environments, our proposed MoK-RAG3D enhances this paradigm by partitioning 3D assets into distinct sections and organizing them based on a hierarchical knowledge tree structure. Different from previous methods that only use manual evaluation, we pioneered the introduction of automated evaluation methods for 3D scenes. Both automatic and human evaluations in our experiments demonstrate that MoK-RAG3D can assist Embodied AI agents in generating diverse scenes.

[AI-28] Bridging Social Psychology and LLM Reasoning: Conflict-Aware Meta-Review Generation via Cognitive Alignment

Link: https://arxiv.org/abs/2503.13879
Authors: Wei Chen,Han Ding,Meng Yuan,Zhao Zhang,Deqing Wang,Fuzhen Zhuang
Categories: Artificial Intelligence (cs.AI)
*Comments: 23 pages

Abstract:The rapid growth of scholarly submissions has overwhelmed traditional peer review systems, driving the need for intelligent automation to preserve scientific rigor. While large language models (LLMs) show promise in automating manuscript critiques, their ability to synthesize high-stakes meta-reviews, which require conflict-aware reasoning and consensus derivation, remains underdeveloped. Existing methods fail to effectively handle conflicting viewpoints within differing opinions, and often introduce additional cognitive biases, such as anchoring effects and conformity bias. To overcome these limitations, we propose the Cognitive Alignment Framework (CAF), a dual-process architecture that transforms LLMs into adaptive scientific arbitrators. By operationalizing Kahneman's dual-process theory, CAF introduces a three-step cognitive pipeline: review initialization, incremental integration, and cognitive alignment. Empirical validation shows that CAF outperforms existing LLM-based methods, with sentiment consistency gains reaching up to 19.47% and content consistency improving by as much as 12.95%.

[AI-29] Out-of-Distribution Generalization in Time Series: A Survey

Link: https://arxiv.org/abs/2503.13868
Authors: Xin Wu,Fei Teng,Xingwang Li,Ji Zhang,Tianrui Li,Qiang Duan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 20 pages, 8 figures, 5 tables. Work in Progress

Abstract:Time series frequently manifest distribution shifts, diverse latent features, and non-stationary learning dynamics, particularly in open and evolving environments. These characteristics pose significant challenges for out-of-distribution (OOD) generalization. While substantial progress has been made, a systematic synthesis of advancements remains lacking. To address this gap, we present the first comprehensive review of OOD generalization methodologies for time series, organized to delineate the field’s evolutionary trajectory and contemporary research landscape. We organize our analysis across three foundational dimensions: data distribution, representation learning, and OOD evaluation. For each dimension, we present several popular algorithms in detail. Furthermore, we highlight key application scenarios, emphasizing their real-world impact. Finally, we identify persistent challenges and propose future research directions. A detailed summary of the methods reviewed for the generalization of OOD in time series can be accessed at this https URL.

[AI-30] MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation

Link: https://arxiv.org/abs/2503.13856
Authors: Kai Chen,Xinfeng Li,Tianpei Yang,Hewei Wang,Wei Dong,Yang Gao
Categories: Artificial Intelligence (cs.AI)
*Comments: 24 pages

Abstract:Large Language Models (LLMs) have made significant progress in various fields. However, challenges remain in Multi-Disciplinary Team (MDT) medical consultations. Current research enhances reasoning through role assignment, task decomposition, and accumulation of medical experience. Multi-role collaboration in MDT consultations often results in excessively long dialogue histories. This increases the model’s cognitive burden and degrades both efficiency and accuracy. Some methods only store treatment histories. They do not extract effective experience or reflect on errors. This limits knowledge generalization and system evolution. We propose a multi-agent MDT medical consultation framework based on LLMs to address these issues. Our framework uses consensus aggregation and a residual discussion structure for multi-round consultations. It also employs a Correct Answer Knowledge Base (CorrectKB) and a Chain-of-Thought Knowledge Base (ChainKB) to accumulate consultation experience. These mechanisms enable the framework to evolve and continually improve diagnosis rationality and accuracy. Experimental results on the MedQA and PubMedQA datasets demonstrate that our framework achieves accuracies of 90.1% and 83.9%, respectively, and that the constructed knowledge bases generalize effectively across test sets from both datasets.

[AI-31] WebNav: An Intelligent Agent for Voice-Controlled Web Navigation

Link: https://arxiv.org/abs/2503.13843
Authors: Trisanth Srinivasan,Santosh Patapati
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*Comments:

Abstract:The increasing reliance on web interfaces presents many challenges for visually impaired users, showcasing the need for more advanced assistive technologies. This paper introduces WebNav, a voice-controlled web navigation agent that leverages a ReAct-inspired architecture and generative AI to provide this framework. WebNav comprises a hierarchical structure: a Digital Navigation Module (DIGNAV) for high-level strategic planning, an Assistant Module for translating abstract commands into executable actions, and an Inference Module for low-level interaction. A key component is a dynamic labeling engine, implemented as a browser extension, that generates real-time labels for interactive elements, creating a mapping between voice commands and Document Object Model (DOM) components. Preliminary evaluations show that WebNav outperforms traditional screen readers in response time and task completion accuracy for the visually impaired. Future work will focus on extensive user evaluations, benchmark development, and refining the agent's adaptive capabilities for real-world deployment.

[AI-32] Counterfactual experience augmented off-policy reinforcement learning

Link: https://arxiv.org/abs/2503.13842
Authors: Sunbowen Lee,Yicheng Gong,Chao Deng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*Comments: Accepted by Neurocomputing, this https URL

Abstract:Reinforcement learning control algorithms face significant challenges due to out-of-distribution and inefficient exploration problems. While model-based reinforcement learning enhances the agent's reasoning and planning capabilities by constructing virtual environments, training such virtual environments can be very complex. In order to build an efficient inference model and enhance the representativeness of learning data, we propose the Counterfactual Experience Augmentation (CEA) algorithm. CEA leverages variational autoencoders to model the dynamic patterns of state transitions and introduces randomness to model non-stationarity. This approach focuses on expanding the learning data in the experience pool through counterfactual inference and performs exceptionally well in environments that follow the bisimulation assumption. Because environments with bisimulation properties are usually represented by discrete observation and action spaces, we propose a sampling method based on maximum kernel density estimation entropy to extend CEA to various environments. By providing reward signals for counterfactual state transitions based on real information, CEA constructs a complete counterfactual experience to alleviate the out-of-distribution problem of the learning data, and outperforms general SOTA algorithms in environments with different properties. Finally, we discuss the similarities, differences and properties of generated counterfactual experiences and real experiences. The code is available at this https URL.

[AI-33] VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences

Link: https://arxiv.org/abs/2503.13817
Authors: Anukriti Singh,Amisha Bhaskar,Peihong Yu,Souradip Chakraborty,Ruthwik Dasyam,Amrit Bedi,Pratap Tokekar
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: 8 pages

Abstract:Designing reward functions for continuous-control robotics often leads to subtle misalignments or reward hacking, especially in complex tasks. Preference-based RL mitigates some of these pitfalls by learning rewards from comparative feedback rather than hand-crafted signals, yet scaling human annotations remains challenging. Recent work uses Vision-Language Models (VLMs) to automate preference labeling, but a single final-state image generally fails to capture the agent's full motion. In this paper, we present a two-part solution that both improves feedback accuracy and better aligns reward learning with the agent's policy. First, we overlay trajectory sketches on final observations to reveal the path taken, allowing VLMs to provide more reliable preferences, improving preference accuracy by approximately 15-20% in metaworld tasks. Second, we regularize reward learning by incorporating the agent's performance, ensuring that the reward model is optimized based on data generated by the current policy; this addition boosts episode returns by 20-30% in locomotion tasks. Empirical studies on metaworld demonstrate that our method achieves, for instance, around 70-80% success rate in all tasks, compared to below 50% for standard approaches. These results underscore the efficacy of combining richer visual representations with agent-aware reward regularization.
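
To illustrate what "learning rewards from comparative feedback" typically means, the sketch below implements the Bradley-Terry preference model that is a common choice in preference-based RL: the probability that one trajectory segment is preferred is a logistic function of the difference in summed predicted rewards, and the reward model is trained with a cross-entropy loss on the preference labels. The abstract does not say that VARP uses exactly this loss, and the agent-aware regularization term it describes is not reproduced here; the function names and label convention are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_prob(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    rewards_a and rewards_b are per-step rewards predicted by the reward model.
    """
    return sigmoid(sum(rewards_a) - sum(rewards_b))


def preference_loss(rewards_a, rewards_b, label, eps=1e-12):
    """Cross-entropy loss for one labeled pair.

    label is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    p = preference_prob(rewards_a, rewards_b)
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


if __name__ == "__main__":
    seg_a = [0.5, 0.7, 0.9]  # hypothetical predicted rewards for segment A
    seg_b = [0.1, 0.2, 0.1]  # hypothetical predicted rewards for segment B
    print(preference_prob(seg_a, seg_b))
    print(preference_loss(seg_a, seg_b, label=1.0))
```

In a VLM-labeled pipeline the `label` would come from the model's comparison of two rendered observations rather than from a human annotator, so label noise directly affects the learned reward.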

[AI-34] Automatic MILP Model Construction for Multi-Robot Task Allocation and Scheduling Based on Large Language Models

Link: https://arxiv.org/abs/2503.13813
Authors: Mingming Peng,Zhendong Chen,Jie Yang,Jin Huang,Zhengqi Shi,Qihao Liu,Xinyu Li,Liang Gao
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*Comments:

Abstract:With the accelerated development of Industry 4.0, intelligent manufacturing systems increasingly require efficient task allocation and scheduling in multi-robot systems. However, existing methods rely on domain expertise and face challenges in adapting to dynamic production constraints. Additionally, enterprises have high privacy requirements for production scheduling data, which prevents the use of cloud-based large language models (LLMs) for solution development. To address these challenges, there is an urgent need for an automated modeling solution that meets data privacy requirements. This study proposes a knowledge-augmented mixed integer linear programming (MILP) automated formulation framework, integrating local LLMs with domain-specific knowledge bases to generate executable code from natural language descriptions automatically. The framework employs a knowledge-guided DeepSeek-R1-Distill-Qwen-32B model to extract complex spatiotemporal constraints (82% average accuracy) and leverages a supervised fine-tuned Qwen2.5-Coder-7B-Instruct model for efficient MILP code generation (90% average accuracy). Experimental results demonstrate that the framework successfully achieves automatic modeling in the aircraft skin manufacturing case while ensuring data privacy and computational efficiency. This research provides a low-barrier and highly reliable technical path for modeling in complex industrial scenarios.

[AI-35] The Empty Chair: Using LLMs to Raise Missing Perspectives in Policy Deliberations

Link: https://arxiv.org/abs/2503.13812
Authors: Suyash Fulay,Deb Roy
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Deliberation is essential to well-functioning democracies, yet physical, economic, and social barriers often exclude certain groups, reducing representativeness and contributing to issues like group polarization. In this work, we explore the use of large language model (LLM) personas to introduce missing perspectives in policy deliberations. We develop and evaluate a tool that transcribes conversations in real-time and simulates input from relevant but absent stakeholders. We deploy this tool in a 19-person student citizens’ assembly on campus sustainability. Participants and facilitators found that the tool sparked new discussions and surfaced valuable perspectives they had not previously considered. However, they also noted that AI-generated responses were sometimes overly general. They raised concerns about overreliance on AI for perspective-taking. Our findings highlight both the promise and potential risks of using LLMs to raise missing points of view in group deliberation settings.

[AI-36] Empowering GraphRAG with Knowledge Filtering and Integration

Link: https://arxiv.org/abs/2503.13804
Authors: Kai Guo,Harry Shomer,Shenglai Zeng,Haoyu Han,Yu Wang,Jiliang Tang
Categories: Artificial Intelligence (cs.AI)
*Comments:

Abstract:In recent years, large language models (LLMs) have revolutionized the field of natural language processing. However, they often suffer from knowledge gaps and hallucinations. Graph retrieval-augmented generation (GraphRAG) enhances LLM reasoning by integrating structured knowledge from external graphs. However, we identify two key challenges that plague GraphRAG: (1) retrieving noisy and irrelevant information can degrade performance, and (2) excessive reliance on external knowledge suppresses the model's intrinsic reasoning. To address these issues, we propose GraphRAG-FI (Filtering and Integration), consisting of GraphRAG-Filtering and GraphRAG-Integration. GraphRAG-Filtering employs a two-stage filtering mechanism to refine retrieved information. GraphRAG-Integration employs a logits-based selection strategy to balance external knowledge from GraphRAG with the LLM's intrinsic reasoning, reducing over-reliance on retrievals. Experiments on knowledge graph QA tasks demonstrate that GraphRAG-FI significantly improves reasoning performance across multiple backbone models, establishing a more reliable and effective GraphRAG framework.

[AI-37] AI-Powered Prediction of Nanoparticle Pharmacokinetics: A Multi-View Learning Approach

Link: https://arxiv.org/abs/2503.13798
Authors: Amirhossein Khakpour,Lucia Florescu,Richard Tilley,Haibo Jiang,K. Swaminathan Iyer,Gustavo Carneiro
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:The clinical translation of nanoparticle-based treatments remains limited due to the unpredictability of nanoparticle (NP) pharmacokinetics, that is, how they distribute, accumulate, and clear from the body. Predicting these behaviours is challenging due to complex biological interactions and the difficulty of obtaining high-quality experimental datasets. Existing AI-driven approaches rely heavily on data-driven learning but fail to integrate crucial knowledge about NP properties and biodistribution mechanisms. We introduce a multi-view deep learning framework that enhances pharmacokinetic predictions by incorporating prior knowledge of key NP properties such as size and charge into a cross-attention mechanism, enabling context-aware feature selection and improving generalization despite small datasets. To further enhance prediction robustness, we employ an ensemble learning approach, combining deep learning with XGBoost (XGB) and Random Forest (RF), which significantly outperforms existing AI models. Our interpretability analysis reveals key physicochemical properties driving NP biodistribution, providing biologically meaningful insights into possible mechanisms governing NP behaviour in vivo rather than a black-box model. Furthermore, by bridging machine learning with physiologically based pharmacokinetic (PBPK) modelling, this work lays the foundation for data-efficient AI-driven drug discovery and precision nanomedicine.

[AI-38] Mapping the Trust Terrain: LLMs in Software Engineering – Insights and Perspectives

Link: https://arxiv.org/abs/2503.13793
Authors: Dipin Khati,Yijin Liu,David N. Palacio,Yixuan Zhang,Denys Poshyvanyk
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Applications of Large Language Models (LLMs) are rapidly growing in industry and academia for various software engineering (SE) tasks. As these models become more integral to critical processes, ensuring their reliability and trustworthiness becomes essential. Consequently, the concept of trust in these systems is becoming increasingly critical. Well-calibrated trust is important, as excessive trust can lead to security vulnerabilities and risks, while insufficient trust can hinder innovation. However, the landscape of trust-related concepts in LLMs in SE is relatively unclear, with concepts such as trust, distrust, and trustworthiness lacking clear conceptualizations in the SE community. To bring clarity to the current research status and identify opportunities for future work, we conducted a comprehensive review of 88 papers: a systematic literature review of 18 papers focused on LLMs in SE, complemented by an analysis of 70 papers from broader trust literature. Additionally, we conducted a survey study with 25 domain experts to gain insights into practitioners' understanding of trust and identify gaps between existing literature and developers' perceptions. The result of our analysis serves as a roadmap that covers trust-related concepts in LLMs in SE and highlights areas for future exploration.

[AI-39] Evaluating the Application of SOLID Principles in Modern AI Framework Architectures

Link: https://arxiv.org/abs/2503.13786
Authors: Jonesh Shrestha
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 5 pages, 1 figure, 12 references

Abstract:This research evaluates the extent to which modern AI frameworks, specifically TensorFlow and scikit-learn, adhere to the SOLID design principles: Single Responsibility, Open/Closed, Liskov Substitution, Interface Segregation, and Dependency Inversion. Analyzing the frameworks' architectural documentation and design philosophies, this research investigates architectural trade-offs when balancing software engineering best practices with AI-specific needs. I examined each framework's documentation, source code, and architectural components to evaluate their adherence to these principles. The results show that both frameworks adopt certain aspects of SOLID design principles but make intentional trade-offs to address performance, scalability, and the experimental nature of AI development. TensorFlow focuses on performance and scalability, sometimes sacrificing strict adherence to principles like Single Responsibility and Interface Segregation. By contrast, scikit-learn's design philosophy aligns more closely with SOLID principles through consistent interfaces and composition, with occasional deviations for performance optimizations and scalability. This research finds that applying SOLID principles in AI frameworks depends on context, as performance, scalability, and flexibility often require deviations from traditional software engineering principles. This research contributes to understanding how domain-specific constraints influence architectural decisions in modern AI frameworks and how these frameworks strategically adapt design choices to effectively balance these conflicting requirements.

[AI-40] Towards AI-assisted Academic Writing NAACL2025

Link: https://arxiv.org/abs/2503.13771
Authors: Daniel J. Liebling,Malcolm Kane,Madeleine Grunde-Mclaughlin,Ian J. Lang,Subhashini Venugopalan,Michael P. Brenner
Categories: Artificial Intelligence (cs.AI)
*Comments: accepted to NAACL 2025 Workshop on AI for Scientific Discovery

Abstract:We present components of an AI-assisted academic writing system including citation recommendation and introduction writing. The system recommends citations by considering the user’s current document context to provide relevant suggestions. It generates introductions in a structured fashion, situating the contributions of the research relative to prior work. We demonstrate the effectiveness of the components through quantitative evaluations. Finally, the paper presents qualitative research exploring how researchers incorporate citations into their writing workflows. Our findings indicate that there is demand for precise AI-assisted writing systems and simple, effective methods for meeting those needs.

[AI-41] From Autonomous Agents to Integrated Systems, A New Paradigm: Orchestrated Distributed Intelligence

Link: https://arxiv.org/abs/2503.13754
Authors: Krti Tallam
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
*Comments:

Abstract:The rapid evolution of artificial intelligence (AI) has ushered in a new era of integrated systems that merge computational prowess with human decision-making. In this paper, we introduce the concept of Orchestrated Distributed Intelligence (ODI), a novel paradigm that reconceptualizes AI not as isolated autonomous agents, but as cohesive, orchestrated networks that work in tandem with human expertise. ODI leverages advanced orchestration layers, multi-loop feedback mechanisms, and a high cognitive density framework to transform static, record-keeping systems into dynamic, action-oriented environments. Through a comprehensive review of multi-agent system literature, recent technological advances, and practical insights from industry forums, we argue that the future of AI lies in integrating distributed intelligence within human-centric workflows. This approach not only enhances operational efficiency and strategic agility but also addresses challenges related to scalability, transparency, and ethical decision-making. Our work outlines key theoretical implications and presents a practical roadmap for future research and enterprise innovation, aiming to pave the way for responsible and adaptive AI systems that drive sustainable innovation in human organizations.

[AI-42] A Circular Construction Product Ontology for End-of-Life Decision-Making

Link: https://arxiv.org/abs/2503.13708
Authors: Kwabena Adu-Duodu,Stanly Wilson,Yinhao Li,Aanuoluwapo Oladimeji,Talea Huraysi,Masoud Barati,Charith Perera,Ellis Solaiman,Omer Rana,Rajiv Ranjan,Tejal Shah
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
*Comments:

Abstract:Efficient management of end-of-life (EoL) products is critical for advancing circularity in supply chains, particularly within the construction industry, where EoL strategies are hindered by heterogeneous lifecycle data and data silos. Current tools like Environmental Product Declarations (EPDs) and Digital Product Passports (DPPs) are limited by their dependency on seamless data integration and interoperability, which remain significant challenges. To address these, we present the Circular Construction Product Ontology (CCPO), an applied framework designed to overcome semantic and data heterogeneity challenges in EoL decision-making for construction products. CCPO standardises vocabulary and facilitates data integration across supply chain stakeholders, enabling lifecycle assessments (LCA) and robust decision-making. By aggregating disparate data into a unified product provenance, CCPO enables automated EoL recommendations through customisable SWRL rules aligned with European standards and stakeholder-specific circularity SLAs, demonstrating its scalability and integration capabilities. The adopted circular product scenario depicts CCPO's application, while competency question evaluations show its superior performance in generating accurate EoL suggestions, highlighting its potential to greatly improve decision-making in circular supply chains and its applicability in real-world construction environments.

[AI-43] INPROVF: Leveraging Large Language Models to Repair High-level Robot Controllers from Assumption Violations ICLR2025

Link: https://arxiv.org/abs/2503.13660
Authors: Qian Meng,Jin Peng Zhou,Kilian Q. Weinberger,Hadas Kress-Gazit
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
*Comments: To appear in ICLR 2025 Workshop: VerifAI: AI Verification in the Wild; in submission to 2025 IEEE 21st International Conference on Automation Science and Engineering (CASE), Los Angeles, CA, USA: IEEE, Aug. 2025

Abstract:This paper presents INPROVF, an automatic framework that combines large language models (LLMs) and formal methods to speed up the repair process of high-level robot controllers. Previous approaches based solely on formal methods are computationally expensive and cannot scale to large state spaces. In contrast, INPROVF uses LLMs to generate repair candidates, and formal methods to verify their correctness. To improve the quality of these candidates, our framework first translates the symbolic representations of the environment and controllers into natural language descriptions. If a candidate fails the verification, INPROVF provides feedback on potential unsafe behaviors or unsatisfied tasks, and iteratively prompts LLMs to generate improved solutions. We demonstrate the effectiveness of INPROVF through 12 violations with various workspaces, tasks, and state space sizes.
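
The generate, verify, and feed-back cycle described in this abstract can be sketched generically as follows. This is only an illustration, not the authors' implementation: `repair_loop`, `toy_propose`, and `toy_verify` are hypothetical names, and the toy functions merely stand in for an LLM call and a formal verifier.

```python
from typing import Callable, Optional, Tuple


def repair_loop(
    propose: Callable[[str], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 5,
) -> Optional[str]:
    """Ask for candidates until one passes verification or the budget runs out."""
    feedback = ""  # no feedback is available before the first attempt
    for _ in range(max_rounds):
        candidate = propose(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
    return None  # no verified repair found within max_rounds


def toy_propose(feedback: str) -> str:
    # Hypothetical stand-in for an LLM: append whatever the verifier asked for.
    base = "goto(office)"
    return base if not feedback else f"{base}; {feedback}"


def toy_verify(candidate: str) -> Tuple[bool, str]:
    # Hypothetical stand-in for a formal check: require a safety clause.
    if "avoid(kitchen)" in candidate:
        return True, ""
    return False, "avoid(kitchen)"


if __name__ == "__main__":
    print(repair_loop(toy_propose, toy_verify))
```

In this toy run the first candidate fails, the verifier's message is passed back, and the second candidate passes. A real system would replace both toy functions with far richer components.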

[AI-44] Why Do Multi-Agent LLM Systems Fail?

Link: https://arxiv.org/abs/2503.13657
Authors: Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Categories: Artificial Intelligence (cs.AI)

Abstract:Despite growing enthusiasm for Multi-Agent Systems (MAS), where multiple LLM agents collaborate to accomplish tasks, their performance gains across popular benchmarks remain minimal compared to single-agent frameworks. This gap highlights the need to analyze the challenges hindering MAS effectiveness. In this paper, we present the first comprehensive study of MAS challenges. We analyze five popular MAS frameworks across over 150 tasks, involving six expert human annotators. We identify 14 unique failure modes and propose a comprehensive taxonomy applicable to various MAS frameworks. This taxonomy emerges iteratively from agreements among three expert annotators per study, achieving a Cohen’s Kappa score of 0.88. These fine-grained failure modes are organized into 3 categories, (i) specification and system design failures, (ii) inter-agent misalignment, and (iii) task verification and termination. To support scalable evaluation, we integrate MASFT with LLM-as-a-Judge. We also explore if identified failures could be easily prevented by proposing two interventions: improved specification of agent roles and enhanced orchestration strategies. Our findings reveal that identified failures require more complex solutions, highlighting a clear roadmap for future research. We open-source our dataset and LLM annotator.
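
For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Cohen's kappa for two annotators. The abstract does not say how agreement among three annotators was aggregated, so the pairwise form below is an assumption made for illustration, and `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

Kappa corrects raw agreement for the agreement expected by chance, so two annotators who agree on 75% of items can still score well below 0.75.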

[AI-45] Superalignment with Dynamic Human Values ICLR2025

Link: https://arxiv.org/abs/2503.13621
Authors: Florian Mai, David Kaczér, Nicholas Kluge Corrêa, Lucie Flek
Categories: Artificial Intelligence (cs.AI)
Comments: Published at the ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)

[AI-46] LLM Test Generation via Iterative Hybrid Program Analysis

Link: https://arxiv.org/abs/2503.13580
Authors: Sijia Gu, Noor Nashid, Ali Mesbah
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Automating unit test generation remains a significant challenge, particularly for complex methods in real-world projects. While Large Language Models (LLMs) have made strides in code generation, they struggle to achieve high branch coverage due to their limited ability to reason about intricate control flow structures. To address this limitation, we introduce Panta, a technique that emulates the iterative process human developers follow when analyzing code and constructing test cases. Panta integrates static control flow analysis and dynamic code coverage analysis to systematically guide LLMs in identifying uncovered execution paths and generating better test cases. By incorporating an iterative feedback-driven mechanism, our technique continuously refines test generation based on static and dynamic path coverage insights, ensuring more comprehensive and effective testing. Our empirical evaluation, conducted on classes with high cyclomatic complexity from open-source projects, demonstrates that Panta achieves 26% higher line coverage and 23% higher branch coverage compared to the state-of-the-art.

[AI-47] ASMR: Adaptive Skeleton-Mesh Rigging and Skinning via 2D Generative Prior

Link: https://arxiv.org/abs/2503.13579
Authors: Seokhyeon Hong, Soojin Choi, Chaelin Kim, Sihun Cha, Junyong Noh
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Eurographics 2025; Project Page this https URL

Abstract:Despite the growing accessibility of skeletal motion data, integrating it for animating character meshes remains challenging due to diverse configurations of both skeletons and meshes. Specifically, the body scale and bone lengths of the skeleton should be adjusted in accordance with the size and proportions of the mesh, ensuring that all joints are accurately positioned within the character mesh. Furthermore, defining skinning weights is complicated by variations in skeletal configurations, such as the number of joints and their hierarchy, as well as differences in mesh configurations, including their connectivity and shapes. While existing approaches have made efforts to automate this process, they hardly address the variations in both skeletal and mesh configurations. In this paper, we present a novel method for the automatic rigging and skinning of character meshes using skeletal motion data, accommodating arbitrary configurations of both meshes and skeletons. The proposed method predicts the optimal skeleton aligned with the size and proportion of the mesh as well as defines skinning weights for various mesh-skeleton configurations, without requiring explicit supervision tailored to each of them. By incorporating Diffusion 3D Features (Diff3F) as semantic descriptors of character meshes, our method achieves robust generalization across different configurations. To assess the performance of our method in comparison to existing approaches, we conducted comprehensive evaluations encompassing both quantitative and qualitative analyses, specifically examining the predicted skeletons, skinning weights, and deformation quality.

[AI-48] ExChanGeAI: An End-to-End Platform and Efficient Foundation Model for Electrocardiogram Analysis and Fine-tuning

Link: https://arxiv.org/abs/2503.13570
Authors: Lucas Bickmann, Lucas Plagwitz, Antonius Büscher, Lars Eckardt, Julian Varghese
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[AI-49] WMINet: A Wheel-Mounted Inertial Learning Approach For Mobile-Robot Positioning

Link: https://arxiv.org/abs/2503.13568
Authors: Gal Versano, Itzik Klein
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Autonomous mobile robots are widely used for navigation, transportation, and inspection tasks indoors and outdoors. In practical situations of limited satellite signals or poor lighting conditions, navigation depends only on inertial sensors. In such cases, the navigation solution rapidly drifts due to inertial measurement errors. In this work, we propose WMINet, a wheel-mounted inertial deep learning approach to estimate the mobile robot’s position based only on its inertial sensors. To that end, we merge two common practical methods to reduce inertial drift: a wheel-mounted approach and driving the mobile robot in periodic trajectories. Additionally, we enforce a wheelbase constraint to further improve positioning performance. To evaluate our proposed approach, we used the Rosbot-XL to record a wheel-mounted inertial dataset totaling 190 minutes, which is made publicly available. Our approach demonstrated a 66% improvement over state-of-the-art approaches. As a consequence, our approach enables navigation in challenging environments and bridges the pure inertial gap. This enables seamless robot navigation using only inertial sensors for short periods.

[AI-50] APF: Boosting adaptive-potential function reinforcement learning methods with a W-shaped network for high-dimensional games

Link: https://arxiv.org/abs/2503.13557
Authors: Yifei Chen, Lambert Schomaker
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 46 pages

Abstract: Studies in reward shaping for reinforcement learning (RL) have flourished in recent years due to its ability to speed up training. Our previous work proposed an adaptive potential function (APF) and showed that APF can accelerate the Q-learning with a Multi-layer Perceptron algorithm in the low-dimensional domain. This paper proposes to extend APF with an encoder (APF+) for RL state representation, allowing APF to be applied to pixel-based Atari games using a state-encoding method that projects the high-dimensional pixel frames of a game to low-dimensional embeddings. We design the state-representation encoder as a W-shaped network (W-Net), which lets us encode both the background and the moving entities in the game frames. Specifically, the embeddings derived from the pre-trained W-Net consist of two latent vectors: one represents the input state, and the other represents the deviation of the input state’s representation from itself. We then incorporate W-Net into APF to train a downstream Dueling Deep Q-Network (DDQN), obtain the APF-WNet-DDQN, and demonstrate its effectiveness in Atari game-playing tasks. To evaluate the APF+W-Net module in such high-dimensional tasks, we compare with two types of baseline methods: (i) the basic DDQN; and (ii) two encoder-replaced APF-DDQN methods where we replace W-Net by (a) an unsupervised state representation method called Spatiotemporal Deep Infomax (ST-DIM) and (b) a ground truth state representation provided by the Atari Annotated RAM Interface (ARI). The experiment results show that, out of 20 Atari games, APF-WNet-DDQN significantly outperforms DDQN (14/20 games) and APF-STDIM-DDQN (13/20 games). In comparison against the APF-ARI-DDQN, which employs embeddings taken directly from the detailed game-internal state information, the APF-WNet-DDQN achieves comparable performance.
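
As background for the reward-shaping idea mentioned above, the sketch below shows the classical potential-based form, in which the shaped reward is r + γΦ(s') − Φ(s). It uses a fixed, hand-written potential and is not the paper's adaptive potential function; `shaped_reward` and `discounted_return` are hypothetical helper names.

```python
def shaped_reward(reward, phi_s, phi_next, gamma=0.99):
    """Classical potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_next - phi_s


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    gamma = 0.9
    potentials = [0.0, 1.0, 3.0, 2.0]  # Phi(s_0) .. Phi(s_3), made-up values
    rewards = [0.0, 0.0, 1.0]          # environment rewards r_0 .. r_2
    shaped = [
        shaped_reward(r, potentials[t], potentials[t + 1], gamma)
        for t, r in enumerate(rewards)
    ]
    # The shaping terms telescope: the gap between the two returns depends
    # only on the potentials of the first and last states.
    gap = discounted_return(shaped, gamma) - discounted_return(rewards, gamma)
    print(gap, gamma ** 3 * potentials[-1] - potentials[0])
```

Because the shaping terms telescope along a trajectory, this form changes the learning signal at each step without changing which policy is optimal.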

[AI-51] LLMs Leaning in European Elections

Link: https://arxiv.org/abs/2503.13554
Authors: Federico Ricciuti
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

[AI-52] Towards Privacy-Preserving Data-Driven Education: The Potential of Federated Learning

Link: https://arxiv.org/abs/2503.13550
Authors: Mohammad Khalil, Ronas Shakya, Qinyi Liu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[AI-53] A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks

Link: https://arxiv.org/abs/2503.13549
Authors: Ronas Shakya, Farhad Vadiee, Mohammad Khalil
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

[AI-54] Fuzzy Rule-based Differentiable Representation Learning

Link: https://arxiv.org/abs/2503.13548
Authors: Wei Zhang, Zhaohong Deng, Guanjin Wang, Kup-Sze Choi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Representation learning has emerged as a crucial focus in machine and deep learning, involving the extraction of meaningful and useful features and patterns from the input data, thereby enhancing the performance of various downstream tasks such as classification, clustering, and prediction. Current mainstream representation learning methods primarily rely on non-linear data mining techniques such as kernel methods and deep neural networks to extract abstract knowledge from complex datasets. However, most of these methods are black-box, lacking transparency and interpretability in the learning process, which constrains their practical utility. To this end, this paper introduces a novel representation learning method grounded in an interpretable fuzzy rule-based model. Specifically, it is built upon the Takagi-Sugeno-Kang fuzzy system (TSK-FS) to initially map input data to a high-dimensional fuzzy feature space through the antecedent part of the TSK-FS. Subsequently, a novel differentiable optimization method is proposed for learning the consequent part, which can preserve the model’s interpretability and transparency while further exploring the nonlinear relationships within the data. This optimization method retains the essence of traditional optimization, with certain parts of the process parameterized, corresponding differentiable modules constructed, and a deep optimization process implemented. Consequently, this method not only enhances the model’s performance but also ensures its interpretability. Moreover, a second-order geometry preservation method is introduced to further improve the robustness of the proposed method. Extensive experiments conducted on various benchmark datasets validate the superiority of the proposed method, highlighting its potential for advancing representation learning methodologies.

[AI-55] CNCast: Leveraging 3D Swin Transformer and DiT for Enhanced Regional Weather Forecasting

Link: https://arxiv.org/abs/2503.13546
Authors: Hongli Liang (1), Yuanting Zhang (1), Qingye Meng (1), Shuangshuang He (1), Xingyuan Yuan (1) ((1) ColorfulClouds Technology Co., Ltd)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

[AI-56] Enhancing Visual Representation with Textual Semantics: Textual Semantics-Powered Prototypes for Heterogeneous Federated Learning

Link: https://arxiv.org/abs/2503.13543
Authors: Xinghao Wu, Jianwei Niu, Xuefeng Liu, Guogang Zhu, Jiayuan Zhang, Shaojie Tang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 figures

Abstract:Federated Prototype Learning (FedPL) has emerged as an effective strategy for handling data heterogeneity in Federated Learning (FL). In FedPL, clients collaboratively construct a set of global feature centers (prototypes), and let local features align with these prototypes to mitigate the effects of data heterogeneity. The performance of FedPL highly depends on the quality of prototypes. Existing methods assume that larger inter-class distances among prototypes yield better performance, and thus design different methods to increase these distances. However, we observe that while these methods increase prototype distances to enhance class discrimination, they inevitably disrupt essential semantic relationships among classes, which are crucial for model generalization. This raises an important question: how to construct prototypes that inherently preserve semantic relationships among classes? Directly learning these relationships from limited and heterogeneous client data can be problematic in FL. Recently, the success of pre-trained language models (PLMs) demonstrates their ability to capture semantic relationships from vast textual corpora. Motivated by this, we propose FedTSP, a novel method that leverages PLMs to construct semantically enriched prototypes from the textual modality, enabling more effective collaboration in heterogeneous data settings. We first use a large language model (LLM) to generate fine-grained textual descriptions for each class, which are then processed by a PLM on the server to form textual prototypes. To address the modality gap between client image models and the PLM, we introduce trainable prompts, allowing prototypes to adapt better to client tasks. Extensive experiments demonstrate that FedTSP mitigates data heterogeneity while significantly accelerating convergence.
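
The generic prototype mechanism described at the start of this abstract (class-wise feature centers that local features are pulled toward) can be sketched as below. This is an assumption-laden illustration of plain prototype learning with class means and a squared-distance penalty. It is not FedTSP, whose prototypes come from a language model, and `class_prototypes` and `alignment_loss` are hypothetical names.

```python
def class_prototypes(features, labels):
    """Mean feature vector per class (one common way to define a prototype)."""
    sums, counts = {}, {}
    for vec, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(vec)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], vec)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def alignment_loss(features, labels, prototypes):
    """Average squared distance between each feature and its class prototype."""
    total = 0.0
    for vec, y in zip(features, labels):
        total += sum((v - p) ** 2 for v, p in zip(vec, prototypes[y]))
    return total / len(features)


if __name__ == "__main__":
    feats = [[0.0, 0.0], [2.0, 2.0], [1.0, 5.0]]
    labels = ["cat", "cat", "dog"]
    protos = class_prototypes(feats, labels)
    print(protos, alignment_loss(feats, labels, protos))
```

In a federated setting the prototypes would be aggregated on a server and sent back to clients, a step this single-process sketch leaves out.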

[AI-57] HAR-DoReMi: Optimizing Data Mixture for Self-Supervised Human Activity Recognition Across Heterogeneous IMU Datasets

Link: https://arxiv.org/abs/2503.13542
Authors: Lulu Ban, Tao Zhu, Xiangqing Lu, Qi Qiu, Wenyong Han, Shuangjian Li, Liming Chen, Kevin I-Kai Wang, Mingxing Nie, Yaping Wan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Cross-dataset Human Activity Recognition (HAR) suffers from limited model generalization, hindering its practical deployment. To address this critical challenge, inspired by the success of DoReMi in Large Language Models (LLMs), we introduce a data mixture optimization strategy for pre-training HAR models, aiming to improve the recognition performance across heterogeneous datasets. However, directly applying DoReMi to the HAR field encounters new challenges due to the continuous, multi-channel and intrinsically heterogeneous characteristics of IMU sensor data. To overcome these limitations, we propose a novel framework, HAR-DoReMi, which introduces a masked reconstruction task based on Mean Squared Error (MSE) loss. By replacing the discrete language sequence prediction task, which relies on the Negative Log-Likelihood (NLL) loss, in the original DoReMi framework, the proposed framework is inherently more appropriate for handling the continuous and multi-channel characteristics of IMU data. In addition, HAR-DoReMi integrates the Mahony fusion algorithm into the self-supervised HAR pre-training, aiming to mitigate the heterogeneity of varying sensor orientation. This is achieved by estimating the sensor orientation within each dataset and facilitating alignment with a unified coordinate system, thereby improving the cross-dataset generalization ability of the HAR model. Experimental evaluation on multiple cross-dataset HAR transfer tasks demonstrates that HAR-DoReMi improves the accuracy by an average of 6.51% compared to the current state-of-the-art method, with only approximately 30% to 50% of the data usage. These results confirm the effectiveness of HAR-DoReMi in improving the generalization and data efficiency of pre-training HAR models, underscoring its significant potential to facilitate the practical deployment of HAR technology.
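
To make the masked-reconstruction objective concrete, here is a minimal sketch that masks whole time steps of a multi-channel window and scores a reconstruction with MSE on the masked positions only. The masking scheme (zeroing random time steps), the mask ratio, and the names `mask_window` and `masked_mse` are assumptions for illustration. The abstract does not specify these details.

```python
import random


def mask_window(window, mask_ratio=0.3, seed=0):
    """Zero out a random subset of time steps; return the masked copy and indices."""
    rng = random.Random(seed)
    n = len(window)
    k = max(1, int(round(mask_ratio * n)))
    masked_idx = sorted(rng.sample(range(n), k))
    hidden = set(masked_idx)
    masked = [
        [0.0] * len(row) if t in hidden else list(row)
        for t, row in enumerate(window)
    ]
    return masked, masked_idx


def masked_mse(target, prediction, masked_idx):
    """Mean squared error over all channels of the masked time steps only."""
    total, count = 0.0, 0
    for t in masked_idx:
        for y, y_hat in zip(target[t], prediction[t]):
            total += (y - y_hat) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # Four time steps, two channels (for example, two accelerometer axes).
    window = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    masked, idx = mask_window(window, mask_ratio=0.5, seed=1)
    # Using the masked input itself as a trivial "prediction" gives a non-zero loss.
    print(idx, masked_mse(window, masked, idx))
```

A real pre-training setup would feed the masked window to a model and backpropagate this loss, whereas the sketch only shows how the objective is computed on continuous data.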

[AI-58] MSCMHMST: A traffic flow prediction model based on Transformer

Link: https://arxiv.org/abs/2503.13540
Authors: Weiyang Geng, Yiming Pan, Zhecong Xing, Dongyu Liu, Rui Liu, Yuan Zhu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: This study proposes a hybrid model based on Transformers, named MSCMHMST, aimed at addressing key challenges in traffic flow prediction. Traditional single-method approaches show limitations in traffic prediction tasks, whereas hybrid methods, by integrating the strengths of different models, can provide more accurate and robust predictions. The MSCMHMST model introduces a multi-head, multi-scale attention mechanism, allowing the model to process different parts of the data in parallel and learn its intrinsic representations from multiple perspectives, thereby enhancing the model’s ability to handle complex situations. This mechanism enables the model to capture features at various scales effectively, understanding both short-term changes and long-term trends. Verified through experiments on the PeMS04/08 dataset with specific experimental settings, the MSCMHMST model demonstrated excellent robustness and accuracy in long-, medium-, and short-term traffic flow predictions. The results indicate that this model has significant potential, offering a new and effective solution for the field of traffic flow prediction.

[AI-59] From Demonstrations to Rewards: Alignment Without Explicit Human Preferences

Link: https://arxiv.org/abs/2503.13538
Authors: Siliang Zeng, Yao Liu, Huzefa Rangwala, George Karypis, Mingyi Hong, Rasool Fakoor
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: One of the challenges of aligning large models with human preferences lies in both the data requirements and the technical complexities of current approaches. Predominant methods, such as RLHF, involve multiple steps, each demanding distinct types of data, including demonstration data and preference data. In RLHF, human preferences are typically modeled through a reward model, which serves as a proxy to guide policy learning during the reinforcement learning stage, ultimately producing a policy aligned with human preferences. In this paper, we propose a fresh perspective on learning alignment based on inverse reinforcement learning principles, where the optimal policy is still derived from reward maximization. Instead of relying on preference data, however, we directly learn the reward model from demonstration data. This new formulation offers the flexibility to be applied even when only demonstration data is available, a capability that current RLHF methods lack, and it also shows that demonstration data offers more utility than what conventional wisdom suggests. Our extensive evaluation, based on a public reward benchmark, the HuggingFace Open LLM Leaderboard and MT-Bench, demonstrates that our approach compares favorably to state-of-the-art methods that rely solely on demonstration data.

[AI-60] Unlocking Learning Potentials: The Transformative Effect of Generative AI in Education Across Grade Levels

Link: https://arxiv.org/abs/2503.13535
Authors: Meijuan Xie, Liling Luo
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

[AI-61] The Status Quo and Future of AI-TPACK for Mathematics Teacher Education Students: A Case Study in Chinese Universities

Link: https://arxiv.org/abs/2503.13533
Authors: Meijuan Xie, Liling Luo
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract: As artificial intelligence (AI) technology becomes increasingly prevalent in the field of education, there is a growing need for mathematics teacher education students (MTES) to demonstrate proficiency in the integration of AI with the technological pedagogical content knowledge (AI-TPACK). To study the issue, we first devised a systematic AI-TPACK scale and tested it on 412 MTES from seven universities. Through descriptive statistical analyses, we found that the current status of AI-TPACK for MTES in China is at a basic, preliminary stage. Second, we compared MTES across three different grades on the six variables and found no discernible difference, which suggests that graduate studies do not promote the development of AI-TPACK competencies. Third, we proposed a new AI-TPACK structural equation model (AI-TPACK-SEM) to explore the impact of self-efficacy and teaching beliefs on AI-TPACK. Our findings indicate a positive correlation between self-efficacy and AI-TPACK. We also come to a conclusion that may be contrary to common perception: excessive teaching beliefs may impede the advancement of AI-TPACK. Overall, this paper reveals the current status of AI-TPACK for MTES in China for the first time, designs a dedicated SEM to study the effect of specific factors on AI-TPACK, and proposes some suggestions on future developments.

[AI-62] Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms

Link: https://arxiv.org/abs/2503.13530
Authors: Xiaojian Li, Yongkang Leng, Ruiqing Ding, Hangjie Mo, Shanlin Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The human-like reasoning capabilities exhibited by Large Language Models (LLMs) challenge the traditional neural network theory’s understanding of the flexibility of fixed-parameter systems. This paper proposes the “Cognitive Activation” theory, revealing the essence of LLMs’ reasoning mechanisms from the perspective of dynamic systems: the model’s reasoning ability stems from a chaotic process of dynamic information extraction in the parameter space. By introducing the Quasi-Lyapunov Exponent (QLE), we quantitatively analyze the chaotic characteristics of the model at different layers. Experiments show that the model’s information accumulation follows a nonlinear exponential law, and the Multilayer Perceptron (MLP) accounts for a higher proportion in the final output than the attention mechanism. Further experiments indicate that minor initial value perturbations will have a substantial impact on the model’s reasoning ability, confirming the theoretical analysis that large language models are chaotic systems. This research provides a chaos theory framework for the interpretability of LLMs’ reasoning and reveals potential pathways for balancing creativity and reliability in model design.
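
As background on what a Lyapunov exponent measures, the sketch below estimates the classical exponent of the logistic map, a textbook chaotic system. It is not the paper's Quasi-Lyapunov Exponent, whose definition the abstract does not give, and `logistic_lyapunov` is a hypothetical helper name.

```python
import math


def logistic_lyapunov(r, x0=0.2, burn_in=1000, steps=20000):
    """Estimate the Lyapunov exponent of x -> r * x * (1 - x)."""
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(steps):
        # |f'(x)| = |r * (1 - 2x)|; the floor avoids log(0) if x hits 0.5 exactly.
        total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300))
        x = r * x * (1.0 - x)
    return total / steps


if __name__ == "__main__":
    print("r=2.5:", logistic_lyapunov(2.5))  # settles on a stable fixed point
    print("r=3.9:", logistic_lyapunov(3.9))  # chaotic regime
```

A negative value means nearby trajectories converge, while a positive value means small perturbations grow exponentially, which is the sense in which the abstract speaks of sensitivity to minor initial perturbations.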

[AI-63] Towards a Digital Twin Modeling Method for Container Terminal Port

Link: https://arxiv.org/abs/2503.13511
Authors: Faouzi Hakimi (AMU), Tarek Khaled (LIS, LIRICA), Mohammed Al-Kharaz (LIS), Arthur Cartel Foahom Gouabou (AMU, LIS, I&M), Kenza Amzil (LISPEN)
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract: This paper introduces a novel strategy aimed at enhancing productivity and minimizing non-productive movements within container terminals, specifically focusing on container yards. It advocates for the implementation of a digital twin-based methodology to streamline the operations of stacking cranes (SCs) responsible for container handling. The proposed approach entails the creation of a virtual container yard that mirrors the physical yard within a digital twin system, facilitating real-time observation and validation. In addition, this article demonstrates through simulation the effectiveness of using a digital twin to reduce unproductive movements and improve productivity. It defines various operational strategies and takes into account different yard contexts, providing a comprehensive understanding of optimisation possibilities. By exploiting the capabilities of the digital twin, managers and operators are provided with crucial information on operational dynamics, enabling them to identify areas for improvement. This visualisation helps decision-makers to make informed choices about their stacking strategies, thereby improving the efficiency of overall container terminal operations. Overall, this paper presents a digital twin solution for container terminal operations, offering a powerful tool for optimising productivity and minimising inefficiencies.

[AI-64] Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection ICLR2025

Link: https://arxiv.org/abs/2503.13500
Authors: Yucheng Suo, Fan Ma, Kaixin Shen, Linchao Zhu, Yi Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICLR 2025

[AI-65] Leveraging Knowledge Graphs and LLMs for Context-Aware Messaging

Link: https://arxiv.org/abs/2503.13499
Authors: Rajeev Kumar, Harishankar Kumar, Kumari Shalini
Categories: Artificial Intelligence (cs.AI)

Abstract:Personalized messaging plays an essential role in improving communication in areas such as healthcare, education, and professional engagement. This paper introduces a framework that uses the Knowledge Graph (KG) to dynamically rephrase written communications by integrating individual and context-specific data. The knowledge graph represents individuals, locations, and events as critical nodes, linking entities mentioned in messages to their corresponding graph nodes. The extraction of relevant information, such as preferences, professional roles, and cultural norms, is then combined with the original message and processed through a large language model (LLM) to generate personalized responses. The framework demonstrates notable message acceptance rates in various domains: 42% in healthcare, 53% in education, and 78% in professional recruitment. By integrating entity linking, event detection, and language modeling, this approach offers a structured and scalable solution for context-aware, audience-specific communication, facilitating advanced applications in diverse fields.

[AI-66] Mobility-aware Seamless Service Migration and Resource Allocation in Multi-edge IoV Systems

Link: https://arxiv.org/abs/2503.13494
Authors: Zheyi Chen, Sijin Huang, Geyong Min, Zhaolong Ning, Jie Li, Yan Zhang
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)

Abstract:Mobile Edge Computing (MEC) offers low-latency and high-bandwidth support for Internet-of-Vehicles (IoV) applications. However, due to high vehicle mobility and finite communication coverage of base stations, it is hard to maintain uninterrupted and high-quality services without proper service migration among MEC servers. Existing solutions commonly rely on prior knowledge and rarely consider efficient resource allocation during the service migration process, making it hard to reach optimal performance in dynamic IoV environments. To address these important challenges, we propose SR-CL, a novel mobility-aware seamless Service migration and Resource allocation framework via Convex-optimization-enabled deep reinforcement Learning in multi-edge IoV systems. First, we decouple the Mixed Integer Nonlinear Programming (MINLP) problem of service migration and resource allocation into two sub-problems. Next, we design a new actor-critic-based asynchronous-update deep reinforcement learning method to handle service migration, where the delayed-update actor makes migration decisions and the one-step-update critic evaluates the decisions to guide the policy update. Notably, we theoretically derive the optimal resource allocation with convex optimization for each MEC server, thereby further improving system performance. Using the real-world datasets of vehicle trajectories and testbed, extensive experiments are conducted to verify the effectiveness of the proposed SR-CL. Compared to benchmark methods, the SR-CL achieves superior convergence and delay performance under various scenarios.

[AI-67] AI-driven control of bioelectric signalling for real-time topological reorganization of cells

Link: https://arxiv.org/abs/2503.13489
Authors: Gonçalo Hora de Carvalho
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Biological Physics (physics.bio-ph); Cell Behavior (q-bio.CB); Quantitative Methods (q-bio.QM)

Abstract: Understanding and manipulating bioelectric signaling could present a new wave of progress in developmental biology, regenerative medicine, and synthetic biology. Bioelectric signals, defined as voltage gradients across cell membranes caused by ionic movements, play a role in regulating crucial processes including cellular differentiation, proliferation, apoptosis, and tissue morphogenesis. Recent studies demonstrate the ability to modulate these signals to achieve controlled tissue regeneration and morphological outcomes in organisms such as planaria and frogs. However, significant knowledge gaps remain, particularly in predicting and controlling the spatial and temporal dynamics of membrane potentials (V_mem), understanding their regulatory roles in tissue and organ development, and exploring their therapeutic potential in diseases. In this work, we propose an experiment using a Deep Reinforcement Learning (DRL) framework together with lab automation techniques for real-time manipulation of bioelectric signals to guide tissue regeneration and morphogenesis. The proposed framework should interact continuously with biological systems, adapting strategies based on direct biological feedback. Combining DRL with real-time measurement techniques – such as optogenetics, voltage-sensitive dyes, fluorescent reporters, and advanced microscopy – could provide a comprehensive platform for precise bioelectric control, leading to improved understanding of bioelectric mechanisms in morphogenesis, quantitative bioelectric models, identification of minimal experimental setups, and advancements in bioelectric modulation techniques relevant to regenerative medicine and cancer therapy. Ultimately, this research aims to utilize bioelectric signaling to develop new biomedical and bioengineering applications.

[AI-68] Completeness of Datasets Documentation on ML/AI repositories: an Empirical Investigation

Link: https://arxiv.org/abs/2503.13463
Authors: Marco Rondina, Antonio Vetrò, Juan Carlos De Martin
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

[AI-69] Pauli Network Circuit Synthesis with Reinforcement Learning

Link: https://arxiv.org/abs/2503.14448
Authors: Ayushi Dubal, David Kremer, Simon Martiel, Victor Villar, Derek Wang, Juan Cruz-Benito
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)

[AI-70] Ensemble Knowledge Distillation for Machine Learning Interatomic Potentials

Link: https://arxiv.org/abs/2503.14293
Authors: Sakib Matin, Emily Shinkle, Yulia Pimonova, Galen T. Craven, Ying Wai Li, Kipton Barros, Nicholas Lubbers
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)

Abstract:Machine learning interatomic potentials (MLIPs) are a promising tool to accelerate atomistic simulations and molecular property prediction. The quality of MLIPs strongly depends on the quantity of available training data as well as the quantum chemistry (QC) level of theory used to generate that data. Datasets generated with high-fidelity QC methods, such as coupled cluster, are typically restricted to small molecules and may be missing energy gradients. With this limited quantity of data, it is often difficult to train good MLIP models. We present an ensemble knowledge distillation (EKD) method to improve MLIP accuracy when trained to energy-only datasets. In our EKD approach, first, multiple teacher models are trained to QC energies and then used to generate atomic forces for all configurations in the dataset. Next, a student MLIP is trained to both QC energies and to ensemble-averaged forces generated by the teacher models. We apply this workflow on the ANI-1ccx dataset which consists of organic molecules with configuration energies computed at the coupled cluster level of theory. The resulting student MLIPs achieve new state-of-the-art accuracy on the out-of-sample COMP6 benchmark and improved stability for molecular dynamics simulations. The EKD approach for MLIP is broadly applicable for chemical, biomolecular and materials science simulations.
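
The teacher-to-student step described in the abstract can be illustrated with a toy one-dimensional sketch. This is not the authors' code: the quadratic teacher models, the coefficient values, the squared-error loss, and the names `teacher_force`, `ensemble_force`, and `student_loss` are all assumptions made for illustration.

```python
# Toy 1D sketch of ensemble-averaged teacher forces (illustrative only).
# Each hypothetical "teacher" is a quadratic energy model
# E(x) = a*x**2 + b*x + c, stored as the tuple (a, b, c).

def teacher_force(coeffs, x):
    """Force is the negative energy gradient: F(x) = -dE/dx = -(2*a*x + b)."""
    a, b, _c = coeffs
    return -(2.0 * a * x + b)


def ensemble_force(teachers, x):
    """Average the forces predicted by all teachers at configuration x."""
    forces = [teacher_force(t, x) for t in teachers]
    return sum(forces) / len(forces)


def student_loss(pred_energy, pred_force, qc_energy, avg_force, force_weight=1.0):
    """Squared error on the QC energy plus a weighted squared error on the
    ensemble-averaged teacher force (the weighting scheme is an assumption)."""
    energy_term = (pred_energy - qc_energy) ** 2
    force_term = (pred_force - avg_force) ** 2
    return energy_term + force_weight * force_term


# Three hypothetical teachers that agree roughly but not exactly.
teachers = [(1.0, 0.0, 0.0), (1.2, 0.1, 0.0), (0.8, -0.1, 0.0)]
avg_force = ensemble_force(teachers, x=0.5)
print(avg_force)
```

The point of the sketch is only that the student sees a force label even though the reference dataset contains energies alone; a real MLIP would obtain forces by differentiating a neural network with respect to atomic positions.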

[AI-71] Strategic White Paper on AI Infrastructure for Particle, Nuclear, and Astroparticle Physics: Insights from JENA and EuCAIF

Link: https://arxiv.org/abs/2503.14192
Authors: Sascha Caron, Andreas Ipp, Gert Aarts, Gábor Bíró, Daniele Bonacorsi, Elena Cuoco, Caterina Doglioni, Tommaso Dorigo, Julián García Pardiñas, Stefano Giagu, Tobias Golling, Lukas Heinrich, Ik Siong Heng, Paula Gina Isar, Karolos Potamianos, Liliana Teodorescu, John Veitch, Pietro Vischia, Christoph Weniger
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph); Nuclear Theory (nucl-th)
Notes: 19 pages, 5 figures

[AI-72] Toward Large-Scale Distributed Quantum Long Short-Term Memory with Modular Quantum Computers

Link: https://arxiv.org/abs/2503.14088
Authors: Kuan-Cheng Chen, Samuel Yen-Chi Chen, Chen-Yu Liu, Kin K. Leung
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)

Abstract:In this work, we introduce a Distributed Quantum Long Short-Term Memory (QLSTM) framework that leverages modular quantum computing to address scalability challenges on Noisy Intermediate-Scale Quantum (NISQ) devices. By embedding variational quantum circuits into LSTM cells, the QLSTM captures long-range temporal dependencies, while a distributed architecture partitions the underlying Variational Quantum Circuits (VQCs) into smaller, manageable subcircuits that can be executed on a network of quantum processing units. We assess the proposed framework using nontrivial benchmark problems such as damped harmonic oscillators and Nonlinear Autoregressive Moving Average sequences. Our results demonstrate that the distributed QLSTM achieves stable convergence and improved training dynamics compared to classical approaches. This work underscores the potential of modular, distributed quantum computing architectures for large-scale sequence modelling, providing a foundation for the future integration of hybrid quantum-classical solutions into advanced Quantum High-performance computing (HPC) ecosystems.

[AI-73] Beyond holography: the entropic quantum gravity foundations of image processing

Link: https://arxiv.org/abs/2503.14048
Authors: Ginestra Bianconi
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Notes: (7 pages, 1 figure)

[AI-74] Optimizing ML Training with Metagradient Descent

Link: https://arxiv.org/abs/2503.13751
Authors: Logan Engstrom, Andrew Ilyas, Benjamin Chen, Axel Feldmann, William Moses, Aleksander Madry
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[AI-75] Convolutional neural network for early detection of lameness and irregularity in horses using an IMU sensor

Link: https://arxiv.org/abs/2503.13578
Authors: Benoît Savoini, Jonathan Bertolaccini, Stéphane Montavon, Michel Deriaz
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted at AMLDS 2025

[AI-76] Micro Text Classification Based on Balanced Positive-Unlabeled Learning

Link: https://arxiv.org/abs/2503.13562
Authors: Lin-Han Jia, Lan-Zhe Guo, Zhi Zhou, Si-Ye Han, Zi-Wen Li, Yu-Feng Li
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In real-world text classification tasks, negative texts often contain a minimal proportion of negative content, which is especially problematic in areas like text quality control, legal risk screening, and sensitive information interception. This challenge manifests at two levels: at the macro level, distinguishing negative texts is difficult due to the high similarity between coarse-grained positive and negative samples; at the micro level, the issue stems from extreme class imbalance and a lack of fine-grained labels. To address these challenges, we propose transforming the coarse-grained positive-negative (PN) classification task into an imbalanced fine-grained positive-unlabeled (PU) classification problem, supported by theoretical analysis. We introduce a novel framework, Balanced Fine-Grained Positive-Unlabeled (BFGPU) learning, which features a unique PU learning loss function that optimizes macro-level performance amidst severe imbalance at the micro level. The framework’s performance is further boosted by rebalanced pseudo-labeling and threshold adjustment. Extensive experiments on both public and real-world datasets demonstrate the effectiveness of BFGPU, which outperforms other methods, even in extreme scenarios where both macro and micro levels are highly imbalanced.

[AI-77] Survival Analysis with Machine Learning for Predicting Li-ion Battery Remaining Useful Life

Link: https://arxiv.org/abs/2503.13558
Authors: Jingyuan Xue, Longfei Wei, Fang Sheng, Yuxin Gao, Jianfei Zhang
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[AI-78] Advanced Deep Learning Methods for Protein Structure Prediction and Design

Link: https://arxiv.org/abs/2503.13522
Authors: Weikun Wu, Tianyang Wang, Yichao Zhang, Ningyuan Deng, Xinyuan Song, Ziqian Bi, Zheyu Yao, Keyu Chen, Ming Li, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Ming Liu, Li Zhang, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Yizhu Wen, Lawrence KQ Yan, Hongming Tseng, Yan Zhong, Yunze Wang, Ziyuan Qin, Bowen Jing, Junjie Yang, Jun Zhou, Chia Xin Liang, Junhao Song
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[AI-79] Event-Driven Implementation of a Physical Reservoir Computing Framework for superficial EMG-based Gesture Recognition

Link: https://arxiv.org/abs/2503.13492
Authors: Yuqi Ding, Elisa Donati, Haobo Li, Hadi Heidari
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Notes: 11 pages, 9 figures, journal

[AI-80] Onboard Terrain Classification via Stacked Intelligent Metasurface-Diffractive Deep Neural Networks from SAR Level-0 Raw Data ICLR2025

Link: https://arxiv.org/abs/2503.13488
Authors: Mengbing Liu, Xin Li, Jiancheng An, Chau Yuen
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Notes: Accepted to the Machine Learning for Remote Sensing (ML4RS) Workshop at ICLR 2025

[AI-81] Radar Pulse Deinterleaving with Transformer Based Deep Metric Learning

Link: https://arxiv.org/abs/2503.13476
Authors: Edward Gunn, Adam Hosford, Daniel Mannion, Jarrod Williams, Varun Chhabra, Victoria Nockles
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Preprint: Accepted to IEEE International Radar Conference 2025

[AI-82] Cross-Subject Depression Level Classification Using EEG Signals with a Sample Confidence Method

Link: https://arxiv.org/abs/2503.13475
Authors: ZhongYi Zhang, ChenYang Xu, LiXuan Zhao, HuiRang Hou, QingHao Meng
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

[AI-83] How Metacognitive Architectures Remember Their Own Thoughts: A Systematic Review

Link: https://arxiv.org/abs/2503.13467
Authors: Robin Nolte, Mihai Pomarlan, Ayden Janssen, Daniel Beßler, Kamyar Javanmardi, Sascha Jongebloed, Robert Porzel, John Bateman, Michael Beetz, Rainer Malaka
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Notes: 69 pages, 13 figures. In preparation for submission

Abstract:Inspired by human cognition, metacognition has gained significant attention for its potential to enhance autonomy, adaptability, and robust learning in artificial agents. Yet research on Computational Metacognitive Architectures (CMAs) remains fragmented: diverse theories, terminologies, and design choices have led to disjointed developments and limited comparability across systems. Existing overviews and surveys often remain at a broad, conceptual level, making it difficult to synthesize deeper insights into the underlying algorithms and representations, and their respective success. We address this gap by performing an explorative systematic review of how CMAs model, store, remember and process their metacognitive experiences, one of Flavell’s (1979) three foundational components of metacognition. Following this organizing principle, we identify 35 CMAs that feature episodic introspective data ranging from symbolic event traces to sub-symbolic arousal metrics. We consider different aspects - ranging from the underlying psychological theories to the content and structure of collected data, to the algorithms used and evaluation results - and derive a unifying perspective that allows us to compare in depth how different Computational Metacognitive Architectures (CMAs) leverage metacognitive experiences for tasks such as error diagnosis, self-repair, and goal-driven learning. Our findings highlight both the promise of metacognitive experiences - in boosting adaptability, explainability, and overall system performance - and the persistent lack of shared standards or evaluation benchmarks.

[AI-84] A novel Fourier Adjacency Transformer for advanced EEG emotion recognition

Link: https://arxiv.org/abs/2503.13465
Authors: Jinfeng Wang, Yanhao Huang, Sifan Song, Boqian Wang, Jionglong Su, Jiaman Ding
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:EEG emotion recognition faces significant hurdles due to noise interference, signal nonstationarity, and the inherent complexity of brain activity, which make accurate emotion classification difficult. In this study, we present the Fourier Adjacency Transformer, a novel framework that seamlessly integrates Fourier-based periodic analysis with graph-driven structural modeling. Our method first leverages novel Fourier-inspired modules to extract periodic features from embedded EEG signals, effectively decoupling them from aperiodic components. Subsequently, we employ an adjacency attention scheme to reinforce universal inter-channel correlation patterns, coupling these patterns with their sample-based counterparts. Empirical evaluations on SEED and DEAP datasets demonstrate that our method surpasses existing state-of-the-art techniques, achieving an improvement of approximately 6.5% in recognition accuracy. By unifying periodicity and structural insights, this framework offers a promising direction for future research in EEG emotion analysis.

Machine Learning

[LG-0] EnvBench: A Benchmark for Automated Environment Setup ICLR’25

Link: https://arxiv.org/abs/2503.14443
Authors: Aleksandra Eliseeva, Alexander Kovrigin, Ilia Kholkin, Egor Bogomolov, Yaroslav Zharov
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)
Notes: Accepted at the DL4Code workshop at ICLR’25

Abstract:Recent advances in Large Language Models (LLMs) have enabled researchers to focus on practical repository-level tasks in the software engineering domain. In this work, we consider a cornerstone task for automating work with software repositories: environment setup, i.e., the task of configuring a repository-specific development environment on a system. Existing studies on environment setup introduce innovative agentic strategies, but their evaluation is often based on small datasets that may not capture the full range of configuration challenges encountered in practice. To address this gap, we introduce a comprehensive environment setup benchmark EnvBench. It encompasses 329 Python and 665 JVM-based (Java, Kotlin) repositories, with a focus on repositories that present genuine configuration challenges, excluding projects that can be fully configured by simple deterministic scripts. To enable further benchmark extension and usage for model tuning, we implement two automatic metrics: a static analysis check for missing imports in Python and a compilation check for JVM languages. We demonstrate the applicability of our benchmark by evaluating three environment setup approaches, including a simple zero-shot baseline and two agentic workflows, that we test with two powerful LLM backbones, GPT-4o and GPT-4o-mini. The best approach manages to successfully configure 6.69% of repositories for Python and 29.47% of repositories for JVM, suggesting that EnvBench remains challenging for current approaches. Our benchmark suite is publicly available at this https URL. The dataset and experiment trajectories are available at this https URL.
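
A static check for missing imports, one of the two metrics named in the abstract, might look roughly like the sketch below. It is an illustration built on the standard `ast` and `importlib` modules, not the benchmark's own checker; the helper names and the sample source string are hypothetical.

```python
import ast
import importlib.util


def imported_top_level_modules(source):
    """Collect top-level module names from absolute imports in `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0): they refer to the repository
            # itself rather than to an installed dependency.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def missing_imports(source):
    """Return imported modules that cannot be located in this environment."""
    return {
        name
        for name in imported_top_level_modules(source)
        if importlib.util.find_spec(name) is None
    }


# Hypothetical file contents; the third import should not resolve anywhere.
sample = (
    "import os\n"
    "import json.decoder\n"
    "from definitely_not_installed_pkg_xyz import thing\n"
    "from . import sibling\n"
)
print(missing_imports(sample))
```

Because nothing is executed, a check of this kind is cheap to run over hundreds of repositories, at the cost of missing imports that are resolved dynamically at run time.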

[LG-1] Inducing Causal Structure for Interpretable Neural Networks Applied to Glucose Prediction for T1DM Patients

Link: https://arxiv.org/abs/2503.14442
Authors: Ana Esponera (1), Giovanni Cinnà (1 and 2) ((1) Medical Informatics Department from Amsterdam University Medical Center The Netherlands (2) Institute for Logic Language and Computation from University of Amsterdam The Netherlands)
Categories: Machine Learning (cs.LG)
Notes: 27 pages, 10 pages, to be published in the Proceedings of Machine Learning Research (PMLR), to be presented at the conference CLeaR 2025

Abstract:Causal abstraction techniques such as Interchange Intervention Training (IIT) have been proposed to infuse neural networks with expert knowledge encoded in causal models, but their application to real-world problems remains limited. This article explores the application of IIT in predicting blood glucose levels in Type 1 Diabetes Mellitus (T1DM) patients. The study utilizes an acyclic version of the simglucose simulator approved by the FDA to train a Multi-Layer Perceptron (MLP) model, employing IIT to impose causal relationships. Results show that the model trained with IIT effectively abstracted the causal structure and outperformed the standard one in terms of predictive performance across different prediction horizons (PHs) post-meal. Furthermore, the breakdown of the counterfactual loss can be leveraged to explain which parts of the causal mechanism are more or less effectively captured by the model. These preliminary results suggest the potential of IIT in enhancing predictive models in healthcare by effectively complying with expert knowledge.

[LG-2] Graph-CNNs for RF Imaging: Learning the Electric Field Integral Equations

Link: https://arxiv.org/abs/2503.14439
Authors: Kyriakos Stylianopoulos, Panagiotis Gavriilidis, Gabriele Gradoni, George C. Alexandropoulos
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Notes: Submitted to EUSIPCO 2025

Abstract:Radio-Frequency (RF) imaging concerns the digital recreation of the surfaces of scene objects based on the scattered field at distributed receivers. To solve this difficult inverse scattering problem, data-driven methods are often employed that extract patterns from similar training examples, while offering minimal latency. In this paper, we first provide an approximate yet fast electromagnetic model, which is based on the electric field integral equations, for data generation, and subsequently propose a Deep Neural Network (DNN) architecture to learn the corresponding inverse model. A graph-attention backbone allows for the system geometry to be passed to the DNN, where residual convolutional layers extract features about the objects, while a UNet head performs the final image reconstruction. Our quantitative and qualitative evaluations on two synthetic data sets of different characteristics showcase the performance gains of the proposed advanced architecture and its relative resilience to signal noise levels and various reception configurations.

[LG-3] Landscape Complexity for the Empirical Risk of Generalized Linear Models: Discrimination between Structured Data

Link: https://arxiv.org/abs/2503.14403
Authors: Theodoros G. Tsironis, Aris L. Moustakas
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (stat.ML)

Abstract:We use the Kac-Rice formula and results from random matrix theory to obtain the average number of critical points of a family of high-dimensional empirical loss functions, where the data are correlated d-dimensional Gaussian vectors, whose number has a fixed ratio with their dimension. The correlations are introduced to model the existence of structure in the data, as is common in current Machine-Learning systems. Under a technical hypothesis, our results are exact in the large-d limit, and characterize the annealed landscape complexity, namely the logarithm of the expected number of critical points at a given value of the loss. We first address in detail the landscape of the loss function of a single perceptron and then generalize it to the case where two competing data sets with different covariance matrices are present, with the perceptron seeking to discriminate between them. The latter model can be applied to understand the interplay between adversity and non-trivial data structure. For completeness, we also treat the case of a loss function used in training Generalized Linear Models in the presence of correlated input data.
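
The quantity named in the abstract can be written compactly as follows. The notation is an assumption made here, not taken from the paper: $\mathcal{N}_d(\ell)$ denotes the number of critical points at loss value $\ell$ in dimension $d$, and the $1/d$ normalization is the conventional choice for such high-dimensional limits.

```latex
% Annealed complexity: logarithm of the expected number of critical points.
\Sigma_{\mathrm{ann}}(\ell) \;=\; \lim_{d \to \infty} \frac{1}{d}\,
  \log \mathbb{E}\!\left[\mathcal{N}_d(\ell)\right]

% By Jensen's inequality the annealed value upper-bounds the quenched one:
\mathbb{E}\!\left[\log \mathcal{N}_d(\ell)\right]
  \;\le\; \log \mathbb{E}\!\left[\mathcal{N}_d(\ell)\right]
```

The second line is a general fact about expectations of logarithms and explains why an annealed computation is usually read as an upper bound on the typical number of critical points.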

[LG-4] Technical Report: Aggregation on Learnable Manifolds for Asynchronous Federated Optimization

Link: https://arxiv.org/abs/2503.14396
Authors: Archie Licudi
Categories: Machine Learning (cs.LG)
Notes: 22 pages, 3 figures. Preliminary technical project report

Abstract:In Federated Learning (FL), a primary challenge to the server-side aggregation of client models is device heterogeneity, in both loss landscape geometry and computational capacity. This issue can be particularly pronounced in clinical contexts where variations in data distribution (aggravated by class imbalance), infrastructure requirements, and sample sizes are common. We propose AsyncManifold, a novel asynchronous FL framework to address these issues by taking advantage of underlying solution space geometry, at each of the local training, delay-correction, and aggregation stages. Our proposal is accompanied by a convergence proof in a general form and, motivated by thorough exploratory studies of local behaviour, a proof-of-concept algorithm which performs aggregation along non-linear mode connections and hence avoids barriers to convergence that techniques based on linear interpolation will encounter.

[LG-5] On the clustering behavior of sliding windows

Link: https://arxiv.org/abs/2503.14393
Authors: Boris Alexeev, Wenyan Luo, Dustin G. Mixon, Yan X Zhang
Categories: Machine Learning (cs.LG)

Abstract:Things can go spectacularly wrong when clustering timeseries data that has been preprocessed with a sliding window. We highlight three surprising failures that emerge depending on how the window size compares with the timeseries length. In addition to computational examples, we present theoretical explanations for each of these failure modes.
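
For readers unfamiliar with the preprocessing step in question, a minimal sketch of sliding-window extraction is given below. The function name and the sample series are hypothetical, and the sketch shows only how the windows are built, not any of the failure modes the paper analyses.

```python
def sliding_windows(series, window):
    """Return every contiguous subsequence of length `window`, in order."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [series[i:i + window] for i in range(len(series) - window + 1)]


# Hypothetical time series with five samples.
series = [1, 2, 3, 4, 5]
for w in sliding_windows(series, window=3):
    print(w)
```

Consecutive windows share all but one sample, so the points handed to a clustering algorithm are strongly dependent on one another, which is one reason the relation between window size and series length matters.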

[LG-6] Evaluating Machine Learning Approaches for ASCII Art Generation

Link: https://arxiv.org/abs/2503.14375
Authors: Sai Coumar, Zachary Kingston
Categories: Graphics (cs.GR); Machine Learning (cs.LG)
Notes: 9 pages, 7 figures, 3 tables. Code available at this https URL

Abstract:Generating structured ASCII art using computational techniques demands a careful interplay between aesthetic representation and computational precision, requiring models that can effectively translate visual information into symbolic text characters. Although Convolutional Neural Networks (CNNs) have shown promise in this domain, the comparative performance of deep learning architectures and classical machine learning methods remains unexplored. This paper explores the application of contemporary ML and DL methods to generate structured ASCII art, focusing on three key criteria: fidelity, character classification accuracy, and output quality. We investigate deep learning architectures, including Multilayer Perceptrons (MLPs), ResNet, and MobileNetV2, alongside classical approaches such as Random Forests, Support Vector Machines (SVMs) and k-Nearest Neighbors (k-NN), trained on an augmented synthetic dataset of ASCII characters. Our results show that complex neural network architectures often fall short in producing high-quality ASCII art, whereas classical machine learning classifiers, despite their simplicity, achieve performance similar to CNNs. Our findings highlight the strength of classical methods in bridging model simplicity with output quality, offering new insights into ASCII art synthesis and machine learning on image data with low dimensionality.

[LG-7] Wasserstein-based Kernels for Clustering: Application to Power Distribution Graphs

Link: https://arxiv.org/abs/2503.14357
Authors: Alfredo Oneto, Blazhe Gjorgiev, Giovanni Sansavini
Categories: Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Many data clustering applications must handle objects that cannot be represented as vector data. In this context, the bag-of-vectors representation can be leveraged to describe complex objects through discrete distributions, and the Wasserstein distance can effectively measure the dissimilarity between them. Additionally, kernel methods can be used to embed data into feature spaces that are easier to analyze. Despite significant progress in data clustering, a method that simultaneously accounts for distributional and vectorial dissimilarity measures is still lacking. To tackle this gap, this work explores kernel methods and Wasserstein distance metrics to develop a computationally tractable clustering framework. The compositional properties of kernels allow the simultaneous handling of different metrics, enabling the integration of both vectors and discrete distributions for object representation. This approach is flexible enough to be applied in various domains, such as graph analysis and image processing. The framework consists of three main components. First, we efficiently approximate pairwise Wasserstein distances using multiple reference distributions. Second, we employ kernel functions based on Wasserstein distances and present ways of composing kernels to express different types of information. Finally, we use the kernels to cluster data and evaluate the quality of the results using scalable and distance-agnostic validity indices. A case study involving two datasets of 879 and 34,920 power distribution graphs demonstrates the framework’s effectiveness and efficiency.
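
The idea of composing a Wasserstein-based kernel with an ordinary vector kernel can be sketched in a few lines. The sketch makes several simplifying assumptions that are not from the paper: distributions are one-dimensional empirical samples of equal size (so the exact Wasserstein-1 distance reduces to comparing sorted values), the kernels are exponential and Gaussian, and the object layout with `bag` and `vec` keys is hypothetical. The paper's reference-distribution approximation is not reproduced.

```python
import math


def wasserstein_1d(xs, ys):
    """Exact Wasserstein-1 distance between two equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def wasserstein_kernel(xs, ys, gamma=1.0):
    """Exponential kernel on the Wasserstein distance."""
    return math.exp(-gamma * wasserstein_1d(xs, ys))


def gaussian_kernel(u, v, sigma=1.0):
    """Standard Gaussian (RBF) kernel on fixed-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def combined_kernel(obj_a, obj_b):
    """Product of the two kernels: one factor compares the bag-of-values
    part of each object, the other compares the plain vector part."""
    return (wasserstein_kernel(obj_a["bag"], obj_b["bag"])
            * gaussian_kernel(obj_a["vec"], obj_b["vec"]))


# Two hypothetical objects, each with a bag of values and a feature vector.
a = {"bag": [0.0, 1.0, 2.0], "vec": [1.0, 0.0]}
b = {"bag": [1.0, 2.0, 3.0], "vec": [1.0, 0.0]}
print(combined_kernel(a, b))
```

Multiplying kernels is one standard way to combine heterogeneous similarity measures; a weighted sum is another, and which composition suits a given dataset is a modelling choice.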

[LG-8] Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis

Link: https://arxiv.org/abs/2503.14356
Authors: Alexander Partin (1), Priyanka Vasanthakumari (1), Oleksandr Narykov (1), Andreas Wilke (1), Natasha Koussa (2), Sara E. Jones (2), Yitan Zhu (1), Jamie C. Overbeek (1), Rajeev Jain (1), Gayara Demini Fernando (3), Cesar Sanchez-Villalobos (4), Cristina Garcia-Cardona (5), Jamaludin Mohd-Yusof (5), Nicholas Chia (1), Justin M. Wozniak (1), Souparno Ghosh (3), Ranadip Pal (4), Thomas S. Brettin (1), M. Ryan Weil (2), Rick L. Stevens (1 and 6) ((1) Division of Data Science and Learning, Argonne National Laboratory, Lemont, IL, USA, (2) Frederick National Laboratory for Cancer Research, Cancer Data Science Initiatives, Cancer Research Technology Program, Frederick, MD, USA, (3) Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE, USA, (4) Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, USA, (5) Division of Computer, Computational and Statistical Sciences, Los Alamos National Laboratory, Los Alamos, NM, USA, (6) Department of Computer Science, The University of Chicago, Chicago, IL, USA)
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Notes: 18 pages, 9 figures

Abstract:Deep learning (DL) and machine learning (ML) models have shown promise in drug response prediction (DRP), yet their ability to generalize across datasets remains an open question, raising concerns about their real-world applicability. Due to the lack of standardized benchmarking approaches, model evaluations and comparisons often rely on inconsistent datasets and evaluation criteria, making it difficult to assess true predictive capabilities. In this work, we introduce a benchmarking framework for evaluating cross-dataset prediction generalization in DRP models. Our framework incorporates five publicly available drug screening datasets, six standardized DRP models, and a scalable workflow for systematic evaluation. To assess model generalization, we introduce a set of evaluation metrics that quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results), enabling a more comprehensive assessment of model transferability. Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments. While several models demonstrate relatively strong cross-dataset generalization, no single model consistently outperforms across all datasets. Furthermore, we identify CTRPv2 as the most effective source dataset for training, yielding higher generalization scores across target datasets. By sharing this standardized evaluation framework with the community, our study aims to establish a rigorous foundation for model comparison, and accelerate the development of robust DRP models for real-world applications.
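
The distinction between absolute and relative cross-dataset performance can be made concrete with a small sketch. The score table, the dataset labels `A`, `B`, `C`, the function name, and the particular definition of the relative drop are all assumptions for illustration; the paper's actual metrics may be defined differently.

```python
def cross_dataset_summary(scores, source):
    """Summarise how a model trained on `source` transfers to other datasets.

    `scores[source][target]` is a higher-is-better score for a model trained
    on `source` and evaluated on `target`.
    """
    within = scores[source][source]
    cross = [v for target, v in scores[source].items() if target != source]
    mean_cross = sum(cross) / len(cross)
    return {
        "within": within,                                # in-dataset baseline
        "mean_cross": mean_cross,                        # absolute performance
        "relative_drop": (within - mean_cross) / within,  # relative performance
    }


# Hypothetical scores, not results from the paper.
scores = {"A": {"A": 0.8, "B": 0.6, "C": 0.4}}
print(cross_dataset_summary(scores, "A"))
```

Reporting both numbers matters because a model with a high within-dataset score can still show a large relative drop, which is the pattern the abstract describes.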

[LG-9] End-to-End Optimal Detector Design with Mutual Information Surrogates

Link: https://arxiv.org/abs/2503.14342
Authors: Kinga Anna Wozniak, Stephen Mulligan, Jan Kieseler, Markus Klute, Francois Fleuret, Tobias Golling
Categories: Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)

Abstract:We introduce a novel approach for end-to-end black-box optimization of high energy physics (HEP) detectors using local deep learning (DL) surrogates. These surrogates approximate a scalar objective function that encapsulates the complex interplay of particle-matter interactions and physics analysis goals. In addition to a standard reconstruction-based metric commonly used in the field, we investigate the information-theoretic metric of mutual information. Unlike traditional methods, mutual information is inherently task-agnostic, offering a broader optimization paradigm that is less constrained by predefined targets. We demonstrate the effectiveness of our method in a realistic physics analysis scenario: optimizing the thicknesses of calorimeter detector layers based on simulated particle interactions. The surrogate model learns to approximate objective gradients, enabling efficient optimization with respect to energy resolution. Our findings reveal three key insights: (1) end-to-end black-box optimization using local surrogates is a practical and compelling approach for detector design, providing direct optimization of detector parameters in alignment with physics analysis goals; (2) mutual information-based optimization yields design choices that closely match those from state-of-the-art physics-informed methods, indicating that these approaches operate near optimality and reinforcing their reliability in HEP detector design; and (3) information-theoretic methods provide a powerful, generalizable framework for optimizing scientific instruments. By reframing the optimization process through an information-theoretic lens rather than domain-specific heuristics, mutual information enables the exploration of new avenues for discovery beyond conventional approaches.
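
Mutual information, the task-agnostic objective discussed in the abstract, has a simple closed form for discrete variables. The sketch below computes it from a table of joint counts; it illustrates the definition only and says nothing about how the authors estimate the quantity for detector simulations. The function name and the example tables are hypothetical.

```python
import math


def mutual_information(joint):
    """Mutual information in bits from a 2D table of joint counts."""
    total = sum(sum(row) for row in joint)
    n_cols = len(joint[0])
    p_x = [sum(row) / total for row in joint]
    p_y = [sum(row[j] for row in joint) / total for j in range(n_cols)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, count in enumerate(row):
            if count > 0:  # zero cells contribute nothing
                p_xy = count / total
                mi += p_xy * math.log2(p_xy / (p_x[i] * p_y[j]))
    return mi


# Independent variables carry no information about each other ...
print(mutual_information([[1, 1], [1, 1]]))
# ... while a perfectly correlated binary pair shares exactly one bit.
print(mutual_information([[1, 0], [0, 1]]))
```

Since the measure depends only on the joint distribution of two variables, it can in principle score a detector design without committing to a specific reconstruction target, which is the sense of "task-agnostic" in the abstract.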

[LG-10] Higher-Order Graphon Neural Networks: Approximation and Cut Distance ICLR2025

Link: https://arxiv.org/abs/2503.14338
Authors: Daniel Herbst, Stefanie Jegelka
Categories: Machine Learning (cs.LG)
Notes: 51 pages, 6 figures, ICLR 2025

Abstract:Graph limit models, like graphons for limits of dense graphs, have recently been used to study size transferability of graph neural networks (GNNs). While most literature focuses on message passing GNNs (MPNNs), in this work we attend to the more powerful higher-order GNNs. First, we extend the k-WL test for graphons (Böker, 2023) to the graphon-signal space and introduce signal-weighted homomorphism densities as a key tool. As an exemplary focus, we generalize Invariant Graph Networks (IGNs) to graphons, proposing Invariant Graphon Networks (IWNs) defined via a subset of the IGN basis corresponding to bounded linear operators. Even with this restricted basis, we show that IWNs of order k are at least as powerful as the k-WL test, and we establish universal approximation results for graphon-signals in L^p distances. This significantly extends the prior work of Cai & Wang (2022), showing that IWNs, a subset of their IGN-small, retain effectively the same expressivity as the full IGN basis in the limit. In contrast to their approach, our blueprint of IWNs also aligns better with the geometry of graphon space, for example facilitating comparability to MPNNs. We highlight that, while typical higher-order GNNs are discontinuous w.r.t. cut distance (which causes their lack of convergence and is inherently tied to the definition of k-WL), their transferability remains comparable to MPNNs.

[LG-11] Consumer-grade EEG-based Eye Tracking

Link: https://arxiv.org/abs/2503.14322
Authors: Tiago Vasconcelos Afonso, Florian Heinrichs
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Notes: Data descriptor, 13 pages, 8 figures, 5 tables

Abstract:Electroencephalography-based eye tracking (EEG-ET) leverages eye movement artifacts in EEG signals as an alternative to camera-based tracking. While EEG-ET offers advantages such as robustness in low-light conditions and better integration with brain-computer interfaces, its development lags behind traditional methods, particularly in consumer-grade settings. To support research in this area, we present a dataset comprising simultaneous EEG and eye-tracking recordings from 113 participants across 116 sessions, amounting to 11 hours and 45 minutes of recordings. Data was collected using a consumer-grade EEG headset and webcam-based eye tracking, capturing eye movements under four experimental paradigms with varying complexity. The dataset enables the evaluation of EEG-ET methods across different gaze conditions and serves as a benchmark for assessing feasibility with affordable hardware. Data preprocessing includes handling of missing values and filtering to enhance usability. In addition to the dataset, code for data preprocessing and analysis is available to support reproducibility and further research.

[LG-12] FeNeC: Enhancing Continual Learning via Feature Clustering with Neighbor- or Logit-Based Classification

Link: https://arxiv.org/abs/2503.14301
Authors: Kamil Książek, Hubert Jastrzębski, Bartosz Trojan, Krzysztof Pniaczek, Michał Karp, Jacek Tabor
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The ability of deep learning models to learn continuously is essential for adapting to new data categories and evolving data distributions. In recent years, approaches leveraging frozen feature extractors after an initial learning phase have been extensively studied. Many of these methods estimate per-class covariance matrices and prototypes based on backbone-derived feature representations. Within this paradigm, we introduce FeNeC (Feature Neighborhood Classifier) and FeNeC-Log, its variant based on the log-likelihood function. Our approach generalizes the existing concept by incorporating data clustering to capture greater intra-class variability. Utilizing the Mahalanobis distance, our models classify samples either through a nearest neighbor approach or trainable logit values assigned to consecutive classes. Our proposition may be reduced to the existing approaches in a special case while extending them with the ability of more flexible adaptation to data. We demonstrate that two FeNeC variants achieve competitive performance in scenarios where task identities are unknown and establish state-of-the-art results on several benchmarks.
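
The idea of classifying by Mahalanobis distance to several clusters per class can be sketched as follows. This is a simplified illustration under stated assumptions: points are 2-D, each cluster has its own hand-written covariance, and a sample takes the label of its single nearest cluster. The names `mahalanobis_sq_2d` and `classify` are hypothetical, and FeNeC's actual clustering, covariance estimation, neighbour voting, and logit-based variant are not reproduced.

```python
def mahalanobis_sq_2d(x, mean, cov):
    """Squared Mahalanobis distance for 2-D points with a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det

def classify(x, clusters):
    """Return the label of the cluster closest to x.

    clusters: list of (label, mean, cov); one class may contribute several clusters.
    """
    return min(clusters, key=lambda c: mahalanobis_sq_2d(x, c[1], c[2]))[0]

clusters = [
    ("cat", (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))),
    ("cat", (4.0, 0.0), ((1.0, 0.0), (0.0, 1.0))),   # second cluster of the same class
    ("dog", (2.0, 3.0), ((4.0, 0.0), (0.0, 0.25))),
]
print(classify((3.5, 0.5), clusters))
```

With a single cluster per class this reduces to a nearest-prototype classifier, which matches the abstract's remark that existing approaches appear as a special case.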

[LG-13] Unveiling the Role of Randomization in Multiclass Adversarial Classification: Insights from Graph Theory AISTATS2025

Link: https://arxiv.org/abs/2503.14299
Authors: Lucas Gnecco-Heredia, Matteo Sammut, Muni Sreenivas Pydi, Rafael Pinot, Benjamin Negrevergne, Yann Chevaleyre
Categories: Machine Learning (cs.LG)
Comments: 9 pages (main), 30 in total. Camera-ready version, accepted at AISTATS 2025

Abstract:Randomization as a means to improve the adversarial robustness of machine learning models has recently attracted significant attention. Unfortunately, much of the theoretical analysis so far has focused on binary classification, providing only limited insights into the more complex multiclass setting. In this paper, we take a step toward closing this gap by drawing inspiration from the field of graph theory. Our analysis focuses on discrete data distributions, allowing us to cast the adversarial risk minimization problems within the well-established framework of set packing problems. By doing so, we are able to identify three structural conditions on the support of the data distribution that are necessary for randomization to improve robustness. Furthermore, we are able to construct several data distributions where (contrary to binary classification) switching from a deterministic to a randomized solution significantly reduces the optimal adversarial risk. These findings highlight the crucial role randomization can play in enhancing robustness to adversarial attacks in multiclass classification.

[LG-14] Improved Scalable Lipschitz Bounds for Deep Neural Networks

Link: https://arxiv.org/abs/2503.14297
Authors: Usman Syed, Bin Hu
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:Computing tight Lipschitz bounds for deep neural networks is crucial for analyzing their robustness and stability, but existing approaches either produce relatively conservative estimates or rely on semidefinite programming (SDP) formulations (namely the LipSDP condition) that face scalability issues. Building upon ECLipsE-Fast, the state-of-the-art Lipschitz bound method that avoids SDP formulations, we derive a new family of improved scalable Lipschitz bounds that can be combined to outperform ECLipsE-Fast. Specifically, we leverage more general parameterizations of feasible points of LipSDP to derive various closed-form Lipschitz bounds, avoiding the use of SDP solvers. In addition, we show that our technique encompasses ECLipsE-Fast as a special case and leads to a much larger class of scalable Lipschitz bounds for deep neural networks. Our empirical study shows that our bounds improve ECLipsE-Fast, further advancing the scalability and precision of Lipschitz estimation for large neural networks.
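
For context, a common source of the "relatively conservative estimates" mentioned above is the product of per-layer spectral norms, which upper-bounds the Lipschitz constant of a feed-forward network whose activations are 1-Lipschitz (ReLU, for instance). The sketch below computes that baseline with power iteration. It is not ECLipsE-Fast or the paper's new bounds, and the function names are illustrative assumptions. Note that power iteration approaches the spectral norm from below, so an under-converged run could slightly understate the bound.

```python
import math

def spectral_norm(W, iters=200):
    """Estimate the largest singular value of W (list of rows) by power iteration on W^T W."""
    n_rows, n_cols = len(W), len(W[0])
    v = [1.0] * n_cols
    for _ in range(iters):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]                 # u = W v
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    u = [sum(w * x for w, x in zip(row, v)) for row in W]
    return math.sqrt(sum(x * x for x in u))

def naive_lipschitz_bound(weights):
    """Product of layer spectral norms: an upper bound when activations are 1-Lipschitz."""
    bound = 1.0
    for W in weights:
        bound *= spectral_norm(W)
    return bound

layers = [[[2.0, 0.0], [0.0, 1.0]], [[0.5, 0.0], [0.0, 3.0]]]
print(naive_lipschitz_bound(layers))
```

For these two diagonal layers the product of norms is 2 × 3 = 6, whereas the composed linear map diag(1, 3) has norm 3, which illustrates how loose the layer-wise product can be.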

[LG-15] Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

Link: https://arxiv.org/abs/2503.14286
Authors: Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, Samantha Work
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains for training both a model for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the “wasted inference” that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE’s baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
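
The abstract does not spell out the weighting rule, so the following is only one plausible reading of "asymmetric, tapered importance sampling", stated as an assumption: positive-reward samples are used without any importance correction, while negative-reward samples are weighted by the policy-to-behaviour probability ratio truncated at 1. The function name and signature are hypothetical; consult the paper for the actual definition.

```python
import math

def tapered_weight(logp_policy, logp_behavior, reward):
    """Weight applied to reward * grad(log pi) for one sampled sequence (assumed form).

    logp_policy: log-probability of the sequence under the policy being trained
    logp_behavior: log-probability under the policy that generated the data
    """
    if reward >= 0:
        return 1.0  # assumption: positive examples get no importance correction
    ratio = math.exp(logp_policy - logp_behavior)
    return min(ratio, 1.0)  # assumption: negative examples use the ratio truncated at 1

print(tapered_weight(-3.0, -2.0, reward=-1.0))  # ratio below 1, kept as is
print(tapered_weight(-1.0, -2.0, reward=-1.0))  # ratio above 1, truncated
```

Under this reading, the truncation keeps the push-down on negative examples bounded, and it fades once the policy already assigns them low probability, which is consistent with the stability claim in the abstract.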

[LG-16] XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants

Link: https://arxiv.org/abs/2503.14281
Authors: Adam Štorek, Mukur Gupta, Noopur Bhatt, Aditya Gupta, Janie Kim, Prashast Srivastava, Suman Jana
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:AI coding assistants are widely used for tasks like code generation, bug detection, and comprehension. These tools now require large and complex contexts, automatically sourced from various origins (across files, projects, and contributors), forming part of the prompt fed to underlying LLMs. This automatic context-gathering introduces new vulnerabilities, allowing attackers to subtly poison input to compromise the assistant’s outputs, potentially generating vulnerable code, overlooking flaws, or introducing critical errors. We propose a novel attack, Cross-Origin Context Poisoning (XOXO), that is particularly challenging to detect as it relies on adversarial code modifications that are semantically equivalent. Traditional program analysis techniques struggle to identify these correlations since the semantics of the code remain correct, making it appear legitimate. This allows attackers to manipulate code assistants into producing incorrect outputs, including vulnerabilities or backdoors, while shifting the blame to the victim developer or tester. We introduce a novel, task-agnostic black-box attack algorithm GCGS that systematically searches the transformation space using a Cayley Graph, achieving an 83.09% attack success rate on average across five tasks and eleven models, including GPT-4o and Claude 3.5 Sonnet v2 used by many popular AI coding assistants. Furthermore, existing defenses, including adversarial fine-tuning, are ineffective against our attack, underscoring the need for new security measures in LLM-powered coding tools.

[LG-17] Quantization-Free Autoregressive Action Transformer

Link: https://arxiv.org/abs/2503.14259
Authors: Ziyad Sheebaelhamd, Michael Tschannen, Michael Muehlebach, Claire Vernade
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space thereby limiting the capabilities of the generative model. We propose a quantization-free method instead that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We enhance our policy roll-outs by carefully studying sampling algorithms, further improving the results.

[LG-18] Persistent Homology-induced Graph Ensembles for Time Series Regressions

Link: https://arxiv.org/abs/2503.14240
Authors: Viet The Nguyen, Duy Anh Pham, An Thai Le, Jans Peter, Gunther Gust
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The effectiveness of Spatio-temporal Graph Neural Networks (STGNNs) in time-series applications is often limited by their dependence on fixed, hand-crafted input graph structures. Motivated by insights from the Topological Data Analysis (TDA) paradigm, in which real-world data exhibits multi-scale patterns, we construct several graphs using Persistent Homology Filtration, a mathematical framework describing the multiscale structural properties of data points. Then, we use the constructed graphs as an input to create an ensemble of Graph Neural Networks. The ensemble aggregates the signals from the individual learners via an attention-based routing mechanism, thus systematically encoding the inherent multiscale structures of data. Four different real-world experiments on seismic activity prediction and traffic forecasting (PEMS-BAY, METR-LA) demonstrate that our approach consistently outperforms single-graph baselines while providing interpretable insights.
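
The multi-scale graph construction can be pictured as thresholding pairwise distances at several scales, which yields the 1-skeletons of a Vietoris-Rips filtration. The sketch below assumes nodes are points with Euclidean distances and that the scales are given by hand; how the paper derives distances between time series or selects scales from persistence information is not covered, and `threshold_graphs` is a hypothetical helper name.

```python
import math

def threshold_graphs(points, thresholds):
    """For each scale eps, return the sorted edges (i, j) whose Euclidean distance is <= eps."""
    n = len(points)
    dists = {
        (i, j): math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
    return {eps: sorted(pair for pair, d in dists.items() if d <= eps) for eps in thresholds}

sensors = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
graphs = threshold_graphs(sensors, [1.0, 2.0, 3.0])
for eps, edges in graphs.items():
    print(eps, edges)
```

Each resulting edge list could then serve as the input graph of one member of the ensemble, with the graphs nested as the scale grows.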

[LG-19] Predicting Cardiopulmonary Exercise Testing Outcomes in Congenital Heart Disease Through Multi-modal Data Integration and Geometric Learning

Link: https://arxiv.org/abs/2503.14239
Authors: Muhammet Alkan, Gruschen Veldtman, Fani Deligianni
Categories: Machine Learning (cs.LG)
Comments: preprint for Scientific Reports

Abstract:Cardiopulmonary exercise testing (CPET) provides a comprehensive assessment of functional capacity by measuring key physiological variables including oxygen consumption (VO_2), carbon dioxide production (VCO_2), and pulmonary ventilation (VE) during exercise. Previous research has established that parameters such as peak VO_2 and the VE/VCO_2 ratio serve as robust predictors of mortality risk in chronic heart failure patients. In this study, we leverage CPET variables as surrogate mortality endpoints for patients with Congenital Heart Disease (CHD). To our knowledge, this represents the first successful implementation of an advanced machine learning approach that predicts CPET outcomes by integrating electrocardiograms (ECGs) with information derived from clinical letters. Our methodology began with extracting unstructured patient information (including intervention history, diagnoses, and medication regimens) from clinical letters using natural language processing techniques, organizing this data into a structured database. We then digitized ECGs to obtain quantifiable waveforms and established comprehensive data linkages. The core innovation of our approach lies in exploiting the Riemannian geometric properties of covariance matrices derived from both 12-lead ECGs and clinical text data to develop robust regression and classification models. Through extensive ablation studies, we demonstrated that the integration of ECG signals with clinical documentation, enhanced by covariance augmentation techniques in Riemannian space, consistently produced superior predictive performance compared to conventional approaches.

[LG-20] Decision Tree Induction Through LLMs via Semantically-Aware Evolution ICLR2025

Link: https://arxiv.org/abs/2503.14217
Authors: Tennison Liu, Nicolas Huynh, Mihaela van der Schaar
Categories: Machine Learning (cs.LG)
Comments: *Liu and Huynh contributed equally. Published as a conference paper at ICLR 2025

Abstract:Decision trees are a crucial class of models offering robust predictive performance and inherent interpretability across various domains, including healthcare, finance, and logistics. However, current tree induction methods often face limitations such as suboptimal solutions from greedy methods or prohibitive computational costs and limited applicability of exact optimization approaches. To address these challenges, we propose an evolutionary optimization method for decision tree induction based on genetic programming (GP). Our key innovation is the integration of semantic priors and domain-specific knowledge about the search space into the optimization algorithm. To this end, we introduce LLEGO, a framework that incorporates semantic priors into genetic search operators through the use of Large Language Models (LLMs), thereby enhancing search efficiency and targeting regions of the search space that yield decision trees with superior generalization performance. This is operationalized through novel genetic operators that work with structured natural language prompts, effectively utilizing LLMs as conditional generative models and sources of semantic knowledge. Specifically, we introduce fitness-guided crossover to exploit high-performing regions, and diversity-guided mutation for efficient global exploration of the search space. These operators are controlled by corresponding hyperparameters that enable a more nuanced balance between exploration and exploitation across the search space. Empirically, we demonstrate across various benchmarks that LLEGO evolves superior-performing trees compared to existing tree induction methods, and exhibits significantly more efficient search performance compared to conventional GP approaches.

[LG-21] Rolling Forward: Enhancing LightGCN with Causal Graph Convolution for Credit Bond Recommendation

Link: https://arxiv.org/abs/2503.14213
Authors: Ashraf Ghiye, Baptiste Barreau, Laurent Carlier, Michalis Vazirgiannis
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: 8 pages, published in the international conference for AI in Finance (ACM ICAIF’24)

Abstract:Graph Neural Networks have significantly advanced research in recommender systems over the past few years. These methods typically capture global interests using aggregated past interactions and rely on static embeddings of users and items over extended periods of time. While effective in some domains, these methods fall short in many real-world scenarios, especially in finance, where user interests and item popularity evolve rapidly over time. To address these challenges, we introduce a novel extension to Light Graph Convolutional Network (LightGCN) designed to learn temporal node embeddings that capture dynamic interests. Our approach employs causal convolution to maintain a forward-looking model architecture. By preserving the chronological order of user-item interactions and introducing a dynamic update mechanism for embeddings through a sliding window, the proposed model generates well-timed and contextually relevant recommendations. Extensive experiments on a real-world dataset from BNP Paribas demonstrate that our approach significantly enhances the performance of LightGCN while maintaining the simplicity and efficiency of its architecture. Our findings provide new insights into designing graph-based recommender systems in time-sensitive applications, particularly for financial product recommendations.

[LG-22] Layer-wise Adaptive Gradient Norm Penalizing Method for Efficient and Accurate Deep Learning KDD2024

Link: https://arxiv.org/abs/2503.14205
Authors: Sunwoo Lee
Categories: Machine Learning (cs.LG)
Comments: Published in KDD 2024

Abstract:Sharpness-aware minimization (SAM) is known to improve the generalization performance of neural networks. However, it is not widely used in real-world applications yet due to its expensive model perturbation cost. A few variants of SAM have been proposed to tackle such an issue, but they commonly do not alleviate the cost noticeably. In this paper, we propose a lightweight layer-wise gradient norm penalizing method that tackles the expensive computational cost of SAM while maintaining its superior generalization performance. Our study empirically proves that the gradient norm of the whole model can be effectively suppressed by penalizing the gradient norm of only a few critical layers. We also theoretically show that such a partial model perturbation does not harm the convergence rate of SAM, allowing them to be safely adapted in real-world applications. To demonstrate the efficacy of the proposed method, we perform extensive experiments comparing the proposed method to mini-batch SGD and the conventional SAM using representative computer vision and language modeling benchmarks.
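
To make the notion of "partial model perturbation" concrete, here is a toy sketch of a SAM-style update in which only a chosen subset of layers receives the ascent perturbation before the gradient is re-evaluated. The quadratic loss, the layer names, and the helper `partial_sam_step` are all illustrative assumptions; how the paper identifies critical layers and the exact form of its penalty are not reproduced.

```python
import math

def grad(params):
    """Gradient of the toy loss L = sum_k 0.5 * c_k * ||w_k||^2 with fixed per-layer curvature c_k."""
    curvature = {"layer1": 1.0, "layer2": 10.0}
    return {k: [curvature[k] * w for w in ws] for k, ws in params.items()}

def partial_sam_step(params, perturbed_layers, rho=0.05, lr=0.1):
    """One SAM-style step where only `perturbed_layers` are shifted along their gradient."""
    g = grad(params)
    # Norm taken over the perturbed layers only; fall back to 1.0 if it is zero.
    norm = math.sqrt(sum(x * x for k in perturbed_layers for x in g[k])) or 1.0
    shifted = {
        k: [w + rho * gx / norm for w, gx in zip(ws, g[k])] if k in perturbed_layers else list(ws)
        for k, ws in params.items()
    }
    g_shifted = grad(shifted)  # gradient evaluated at the perturbed point
    return {k: [w - lr * gx for w, gx in zip(ws, g_shifted[k])] for k, ws in params.items()}

params = {"layer1": [1.0, -1.0], "layer2": [0.5]}
print(partial_sam_step(params, perturbed_layers={"layer2"}))
```

In a real network the second gradient evaluation is the expensive part of SAM; restricting the perturbation to a few layers is one way such cost could be reduced, which is the motivation the abstract describes.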

[LG-23] Learning on LLM Output Signatures for gray-box LLM Behavior Analysis

Link: https://arxiv.org/abs/2503.14043
Authors: Guy Bar-Shalom, Fabrizio Frasca, Derek Lim, Yoav Gelberg, Yftah Ziser, Ran El-Yaniv, Gal Chechik, Haggai Maron
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited, particularly in detecting data contamination and hallucinations. While recently proposed probing techniques provide insights through activation analysis, they require “white-box” access to model internals, often unavailable. Current “gray-box” approaches typically analyze only the probability of the actual tokens in the sequence with simple task-specific heuristics. Importantly, these methods overlook the rich information contained in the full token distribution at each processing step. To address these limitations, we propose that gray-box analysis should leverage the complete observable output of LLMs, consisting of both the previously used token probabilities as well as the complete token distribution sequences - a unified data type we term LOS (LLM Output Signature). To this end, we develop a transformer-based approach to process LOS that theoretically guarantees approximation of existing techniques while enabling more nuanced analysis. Our approach achieves superior performance on hallucination and data contamination detection in gray-box settings, significantly outperforming existing baselines. Furthermore, it demonstrates strong transfer capabilities across datasets and LLMs, suggesting that LOS captures fundamental patterns in LLM behavior. Our code is available at: this https URL.
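
As a rough picture of the LOS data type described above, the sketch below pairs, for every generation step, the probability of the token that was actually produced with the full next-token distribution at that step. The dictionary layout and the name `build_los` are assumptions made for illustration; the paper's exact encoding and the transformer that consumes it are not shown.

```python
def build_los(step_distributions, generated_tokens):
    """Assemble a per-step record of (actual-token probability, full distribution).

    step_distributions: list of dicts mapping token -> probability, one per step
    generated_tokens: the token emitted at each step
    """
    signature = []
    for dist, token in zip(step_distributions, generated_tokens):
        signature.append({"actual_prob": dist[token], "distribution": dict(dist)})
    return signature

steps = [
    {"Paris": 0.7, "Lyon": 0.2, "Rome": 0.1},
    {".": 0.9, "!": 0.1},
]
los = build_los(steps, ["Paris", "."])
print([record["actual_prob"] for record in los])
```

Heuristics that look only at `actual_prob` use the first half of each record; the abstract's point is that the remaining distribution carries additional signal.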

[LG-24] Predicting Human Choice Between Textually Described Lotteries

Link: https://arxiv.org/abs/2503.14004
Authors: Eyal Marantz, Ori Plonsky
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Predicting human decision-making under risk and uncertainty is a long-standing challenge in cognitive science, economics, and AI. While prior research has focused on numerically described lotteries, real-world decisions often rely on textual descriptions. This study conducts the first large-scale exploration of human decision-making in such tasks using a large dataset of one-shot binary choices between textually described lotteries. We evaluate multiple computational approaches, including fine-tuning Large Language Models (LLMs), leveraging embeddings, and integrating behavioral theories of choice under risk. Our results show that fine-tuned LLMs, specifically RoBERTa and GPT-4o, outperform hybrid models that incorporate behavioral theory, challenging established methods in numerical settings. These findings highlight fundamental differences in how textual and numerical information influence decision-making and underscore the need for new modeling strategies to bridge this gap.

[LG-25] Empowering LLMs in Decision Games through Algorithmic Data Synthesis

Link: https://arxiv.org/abs/2503.13980
Authors: Haolin Wang, Xueyan Li, Yazhe Niu, Shuai Hu, Hongsheng Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have exhibited impressive capabilities across numerous domains, yet they often struggle with complex reasoning and decision-making tasks. Decision-making games, which inherently require multifaceted reasoning logic, serve as ideal sandboxes for evaluating and enhancing the reasoning abilities of LLMs. In this work, we first explore whether LLMs can master complex decision-making games through targeted post-training. To this end, we design data synthesis strategies and curate extensive offline datasets from two classic games, Doudizhu and Go. We further develop a suite of techniques to effectively incorporate this data into LLM training, resulting in two novel agents: Mastermind-Dou and Mastermind-Go. Our experimental results demonstrate that these Mastermind LLMs achieve competitive performance in their respective games. Additionally, we explore whether integrating decision-making data can enhance the general reasoning abilities of LLMs. Our findings suggest that such post-training improves certain aspects of reasoning, providing valuable insights for optimizing LLM data collection and synthesis strategies.

[LG-26] A CNN-based End-to-End Learning for RIS-assisted Communication System

Link: https://arxiv.org/abs/2503.13976
Authors: Nipuni Ginige, Nandana Rajatheva, Matti Latva-aho
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Reconfigurable intelligent surface (RIS) is an emerging technology that is used to improve the system performance in beyond 5G systems. In this letter, we propose a novel convolutional neural network (CNN)-based autoencoder to jointly optimize the transmitter, the receiver, and the RIS of a RIS-assisted communication system. The proposed system jointly optimizes the sub-tasks of the transmitter, the receiver, and the RIS such as encoding/decoding, channel estimation, phase optimization, and modulation/demodulation. Numerically we have shown that the bit error rate (BER) performance of the CNN-based autoencoder system is better than the theoretical BER performance of the RIS-assisted communication systems.

[LG-27] MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Link: https://arxiv.org/abs/2503.13964
Authors: Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Document Question Answering (DocQA) is a very common task. Existing methods using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single modality, failing to effectively integrate textual and visual cues. These approaches struggle with complex multi-modal reasoning, limiting their performance on real-world documents. We present MDocAgent (A Multi-Modal Multi-Agent Framework for Document Understanding), a novel RAG and multi-agent framework that leverages both text and image. Our system employs five specialized agents: a general agent, a critical agent, a text agent, an image agent and a summarizing agent. These agents engage in multi-modal context retrieval, combining their individual insights to achieve a more comprehensive understanding of the document’s content. This collaborative approach enables the system to synthesize information from both textual and visual components, leading to improved accuracy in question answering. Preliminary experiments on five benchmarks, such as MMLongBench and LongDocURL, demonstrate the effectiveness of MDocAgent, which achieves an average improvement of 12.1% over the current state-of-the-art method. This work contributes to the development of more robust and comprehensive DocQA systems capable of handling the complexities of real-world documents containing rich textual and visual information. Our data and code are available at this https URL.

[LG-28] Enhanced High-Dimensional Data Visualization through Adaptive Multi-Scale Manifold Embedding

Link: https://arxiv.org/abs/2503.13954
Authors: Tianhao Ni, Bingjie Li, Zhigang Yao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:To address the dual challenges of the curse of dimensionality and the difficulty in separating intra-cluster and inter-cluster structures in high-dimensional manifold embedding, we propose an Adaptive Multi-Scale Manifold Embedding (AMSME) algorithm. By introducing ordinal distance to replace traditional Euclidean distances, we theoretically demonstrate that ordinal distance overcomes the constraints of the curse of dimensionality in high-dimensional spaces, effectively distinguishing heterogeneous samples. We design an adaptive neighborhood adjustment method to construct similarity graphs that simultaneously balance intra-cluster compactness and inter-cluster separability. Furthermore, we develop a two-stage embedding framework: the first stage achieves preliminary cluster separation while preserving connectivity between structurally similar clusters via the similarity graph, and the second stage enhances inter-cluster separation through a label-driven distance reweighting. Experimental results demonstrate that AMSME significantly preserves intra-cluster topological structures and improves inter-cluster separation on real-world datasets. Additionally, leveraging its multi-resolution analysis capability, AMSME discovers novel neuronal subtypes in the mouse lumbar dorsal root ganglion scRNA-seq dataset, with marker gene analysis revealing their distinct biological roles.

[LG-29] Structured Knowledge Accumulation: An Autonomous Framework for Layer-Wise Entropy Reduction in Neural Learning

Link: https://arxiv.org/abs/2503.13942
Authors: Bouarfa Mahi Quantiota
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 16 pages, 6 figures

Abstract:We introduce the Structured Knowledge Accumulation (SKA) framework, which reinterprets entropy as a dynamic, layer-wise measure of knowledge alignment in neural networks. Instead of relying on traditional gradient-based optimization, SKA defines entropy in terms of knowledge vectors and their influence on decision probabilities across multiple layers. This formulation naturally leads to the emergence of activation functions such as the sigmoid as a consequence of entropy minimization. Unlike conventional backpropagation, SKA allows each layer to optimize independently by aligning its knowledge representation with changes in decision probabilities. As a result, total network entropy decreases in a hierarchical manner, allowing knowledge structures to evolve progressively. This approach provides a scalable, biologically plausible alternative to gradient-based learning, bridging information theory and artificial intelligence while offering promising applications in resource-constrained and parallel computing environments.

[LG-30] Reconstructing Cell Lineage Trees from Phenotypic Features with Metric Learning

Link: https://arxiv.org/abs/2503.13925
Authors: Da Kuang, Guanwen Qiu, Junhyong Kim
Categories: Machine Learning (cs.LG)
Comments:

Abstract:How a single fertilized cell gives rise to a complex array of specialized cell types in development is a central question in biology. The cells grow, divide, and acquire differentiated characteristics through poorly understood molecular processes. A key approach to studying developmental processes is to infer the tree graph of cell lineage division and differentiation histories, providing an analytical framework for dissecting individual cells’ molecular decisions during replication and differentiation. Although genetically engineered lineage-tracing methods have advanced the field, they are either infeasible or ethically constrained in many organisms. In contrast, modern single-cell technologies can measure high-content molecular profiles (e.g., transcriptomes) in a wide range of biological systems. Here, we introduce CellTreeQM, a novel deep learning method based on transformer architectures that learns an embedding space with geometric properties optimized for tree-graph inference. By formulating lineage reconstruction as a tree-metric learning problem, we have systematically explored supervised, weakly supervised, and unsupervised training settings and present a Lineage Reconstruction Benchmark to facilitate comprehensive evaluation of our learning method. We benchmarked the method on (1) synthetic data modeled via Brownian motion with independent noise and spurious signals and (2) lineage-resolved single-cell RNA sequencing datasets. Experimental results show that CellTreeQM recovers lineage structures with minimal supervision and limited data, offering a scalable framework for uncovering cell lineage relationships in challenging animal models. To our knowledge, this is the first method to cast cell lineage inference explicitly as a metric learning task, paving the way for future computational models aimed at uncovering the molecular dynamics of cell lineage. 
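
A standard way to quantify how "tree-like" a distance function is, and hence a natural ingredient for tree-metric learning, is the four-point condition: a metric is a tree metric exactly when, for every four points, the two largest of the three pairwise-sum combinations are equal. The sketch below measures the worst violation. It illustrates the classical condition only; whether and how CellTreeQM uses it in its loss is not stated in the abstract, and `four_point_violation` is a hypothetical helper.

```python
from itertools import combinations
import math

def four_point_violation(d, nodes):
    """Largest gap between the two biggest pairwise sums over all quadruples.

    d: function (a, b) -> distance, assumed to already be a metric.
    Returns 0 exactly when d satisfies the four-point condition on `nodes`.
    """
    worst = 0.0
    for a, b, c, e in combinations(nodes, 4):
        sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
        worst = max(worst, sums[2] - sums[1])
    return worst

# Leaves of a small tree: A and B share one parent, C and D another; all edges have length 1.
tree = {frozenset(p): v for p, v in {
    ("A", "B"): 2, ("C", "D"): 2,
    ("A", "C"): 3, ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3,
}.items()}
d_tree = lambda a, b: tree[frozenset((a, b))]

# Corners of a unit square under Euclidean distance: not a tree metric.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

print(four_point_violation(d_tree, ["A", "B", "C", "D"]))
print(four_point_violation(math.dist, square))
```

The tree example yields zero violation, while the square yields a positive gap because its two diagonals are longer than any pairing of sides.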

[LG-31] Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels

Link: https://arxiv.org/abs/2503.13917
Authors: Yujia Tong, Yuze Wang, Jingling Yuan, Chuang Hu
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 4 figures

Abstract:Model quantization enables efficient deployment of deep neural networks on edge devices through low-bit parameter representation, yet raises critical challenges for implementing machine unlearning (MU) under data privacy regulations. Existing MU methods designed for full-precision models fail to address two fundamental limitations in quantized networks: 1) Noise amplification from label mismatch during data processing, and 2) Gradient imbalance between forgotten and retained data during training. These issues are exacerbated by quantized models’ constrained parameter space and discrete optimization. We propose Q-MUL, the first dedicated unlearning framework for quantized models. Our method introduces two key innovations: 1) Similar Labels assignment replaces random labels with semantically consistent alternatives to minimize noise injection, and 2) Adaptive Gradient Reweighting dynamically aligns parameter update contributions from forgotten and retained data. Through systematic analysis of quantized model vulnerabilities, we establish theoretical foundations for these mechanisms. Extensive evaluations on benchmark datasets demonstrate Q-MUL’s superiority over existing approaches.

[LG-32] Incorporating Attributes and Multi-Scale Structures for Heterogeneous Graph Contrastive Learning

Link: https://arxiv.org/abs/2503.13911
Authors: Ruobing Jiang, Yacong Li, Haobing Liu, Yanwei Yu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Heterogeneous graphs (HGs) are composed of multiple types of nodes and edges, making them more effective in capturing the complex relational structures inherent in the real world. However, in real-world scenarios, labeled data is often difficult to obtain, which limits the applicability of semi-supervised approaches. Self-supervised learning aims to enable models to automatically learn useful features from data, effectively addressing the challenge of limited labeled data. In this paper, we propose a novel contrastive learning framework for heterogeneous graphs (ASHGCL), which incorporates three distinct views, focusing on node attributes, high-order structural information, and low-order structural information, respectively, to effectively capture attribute information, high-order structures, and low-order structures for node representation learning. Furthermore, we introduce an attribute-enhanced positive sample selection strategy that combines both structural information and attribute information, effectively addressing the issue of sampling bias. Extensive experiments on four real-world datasets show that ASHGCL outperforms state-of-the-art unsupervised baselines and even surpasses some supervised benchmarks.

[LG-33] Quantification of Uncertainties in Probabilistic Deep Neural Network by Implementing Boosting of Variational Inference

链接: https://arxiv.org/abs/2503.13909
作者: Pavia Bera,Sanjukta Bhanja
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Modern neural network architectures have achieved remarkable accuracies but remain highly dependent on their training data, often lacking interpretability in their learned mappings. While effective on large datasets, they tend to overfit on smaller ones. Probabilistic neural networks, such as those utilizing variational inference, address this limitation by incorporating uncertainty estimation through weight distributions rather than point estimates. However, standard variational inference often relies on a single-density approximation, which can lead to poor posterior estimates and hinder model performance. We propose Boosted Bayesian Neural Networks (BBNN), a novel approach that enhances neural network weight distribution approximations using Boosting Variational Inference (BVI). By iteratively constructing a mixture of densities, BVI expands the approximating family, enabling a more expressive posterior that leads to improved generalization and uncertainty estimation. While this approach increases computational complexity, it significantly enhances accuracy, an essential tradeoff, particularly in high-stakes applications such as medical diagnostics, where false negatives can have severe consequences. Our experimental results demonstrate that BBNN achieves ~5% higher accuracy compared to conventional neural networks while providing superior uncertainty quantification. This improvement highlights the effectiveness of leveraging a mixture-based variational family to better approximate the posterior distribution, ultimately advancing probabilistic deep learning.

[LG-34] Learning local neighborhoods of non-Gaussian graphical models: A measure transport approach AAAI2025

链接: https://arxiv.org/abs/2503.13899
作者: Sarah Liaw,Rebecca Morrison,Youssef Marzouk,Ricardo Baptista
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注: Accepted in AAAI 2025: 23 pages, 9 figures

点击查看摘要

Abstract:Identifying the Markov properties or conditional independencies of a collection of random variables is a fundamental task in statistics for modeling and inference. Existing approaches often learn the structure of a probabilistic graphical model, which encodes these dependencies, by assuming that the variables follow a distribution with a simple parametric form. Moreover, the computational cost of many algorithms scales poorly for high-dimensional distributions, as they need to estimate all the edges in the graph simultaneously. In this work, we propose a scalable algorithm to infer the conditional independence relationships of each variable by exploiting the local Markov property. The proposed method, named Localized Sparsity Identification for Non-Gaussian Distributions (L-SING), estimates the graph by using flexible classes of transport maps to represent the conditional distribution for each variable. We show that L-SING includes existing approaches, such as neighborhood selection with Lasso, as a special case. We demonstrate the effectiveness of our algorithm in both Gaussian and non-Gaussian settings by comparing it to existing methods. Lastly, we show the scalability of the proposed approach by applying it to high-dimensional non-Gaussian examples, including a biological dataset with more than 150 variables.

[LG-35] Multi-label feature selection based on binary hashing learning and dynamic graph constraints

链接: https://arxiv.org/abs/2503.13874
作者: Cong Guo,Changqin Huang,Wenhua Zhou,Xiaodi Huang
类目: Machine Learning (cs.LG)
*备注: 21 pages, 19 figures

点击查看摘要

Abstract:Multi-label learning poses significant challenges in extracting reliable supervisory signals from the label space. Existing approaches often employ continuous pseudo-labels to replace binary labels, improving supervisory information representation. However, these methods can introduce noise from irrelevant labels and lead to unreliable graph structures. To overcome these limitations, this study introduces a novel multi-label feature selection method called Binary Hashing and Dynamic Graph Constraint (BHDG), the first method to integrate binary hashing into multi-label learning. BHDG utilizes low-dimensional binary hashing codes as pseudo-labels to reduce noise and improve representation robustness. A dynamically constrained sample projection space is constructed based on the graph structure of these binary pseudo-labels, enhancing the reliability of the dynamic graph. To further enhance pseudo-label quality, BHDG incorporates label graph constraints and inner product minimization within the sample space. Additionally, an $\ell_{2,1}$-norm regularization term is added to the objective function to facilitate the feature selection process. The augmented Lagrangian multiplier (ALM) method is employed to optimize binary variables effectively. Comprehensive experiments on 10 benchmark datasets demonstrate that BHDG outperforms ten state-of-the-art methods across six evaluation metrics. BHDG achieves the highest overall performance ranking, surpassing the next-best method by an average of at least 2.7 ranks per metric, underscoring its effectiveness and robustness in multi-label feature selection.
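
As a small illustration of the $\ell_{2,1}$-norm regularizer mentioned in this abstract: for a weight matrix with one row per feature, the norm is the sum of the Euclidean norms of the rows, so penalizing it tends to push entire rows (that is, entire features) toward zero. The sketch below only computes the norm; the function name `l21_norm` and the example matrix are illustrative assumptions and are not taken from the paper.

```python
import math

def l21_norm(W):
    """Sum of the Euclidean (L2) norms of the rows of W (a list of rows)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)

# Hypothetical 3-feature, 2-label weight matrix; the all-zero row
# corresponds to a feature that would be discarded.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [5.0, 12.0]]
print(l21_norm(W))
```

Rows with a small norm contribute little to the penalty and are natural candidates for removal, which is why this norm is a common choice in feature selection objectives.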

[LG-36] Empirical Calibration and Metric Differential Privacy in Language Models

链接: https://arxiv.org/abs/2503.13872
作者: Pedro Faustini,Natasha Fernandes,Annabelle McIver,Mark Dras
类目: Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:NLP models trained with differential privacy (DP) usually adopt the DP-SGD framework, and privacy guarantees are often reported in terms of the privacy budget $\epsilon$. However, $\epsilon$ does not have any intrinsic meaning, and it is generally not possible to compare across variants of the framework. Work in image processing has therefore explored how to empirically calibrate noise across frameworks using Membership Inference Attacks (MIAs). However, this kind of calibration has not been established for NLP. In this paper, we show that MIAs offer little help in calibrating privacy, whereas reconstruction attacks are more useful. As a use case, we define a novel kind of directional privacy based on the von Mises-Fisher (VMF) distribution, a metric DP mechanism that perturbs angular distance rather than adding (isotropic) Gaussian noise, and apply this to NLP architectures. We show that, even though formal guarantees are incomparable, empirical privacy calibration reveals that each mechanism has different areas of strength with respect to utility-privacy trade-offs.

[LG-37] BurTorch: Revisiting Training from First Principles by Coupling Autodiff Math Optimization and Systems

链接: https://arxiv.org/abs/2503.13795
作者: Konstantin Burlachenko,Peter Richtárik
类目: Machine Learning (cs.LG); Mathematical Software (cs.MS)
*备注: 46 pages, 7 figures, 19 tables

点击查看摘要

Abstract:In this work, we introduce BurTorch, a compact high-performance framework designed to optimize Deep Learning (DL) training on single-node workstations through an exceptionally efficient CPU-based backpropagation (Rumelhart et al., 1986; Linnainmaa, 1970) implementation. Although modern DL frameworks rely on compiler-like optimizations internally, BurTorch takes a different path. It adopts a minimalist design and demonstrates that, in these circumstances, classical compiled programming languages can play a significant role in DL research. By eliminating the overhead of large frameworks and making efficient implementation choices, BurTorch achieves orders-of-magnitude improvements in performance and memory efficiency when computing $\nabla f(x)$ on a CPU. BurTorch features a compact codebase designed to achieve two key goals simultaneously. First, it provides a user experience similar to script-based programming environments. Second, it dramatically minimizes runtime overheads. In large DL frameworks, the primary source of memory overhead for relatively small computation graphs $f(x)$ is due to feature-heavy implementations. We benchmarked BurTorch against widely used DL frameworks in their execution modes: JAX (Bradbury et al., 2018), PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016); and several standalone libraries: Autograd (Maclaurin et al., 2015), Micrograd (Karpathy, 2020), Apple MLX (Hannun et al., 2023). For small compute graphs, BurTorch outperforms best-practice solutions by up to $\times 2000$ in runtime and reduces memory consumption by up to $\times 3500$. For a miniaturized GPT-3 model (Brown et al., 2020), BurTorch achieves up to a $\times 20$ speedup and reduces memory up to $\times 80$ compared to PyTorch.

[LG-38] A finite-sample bound for identifying partially observed linear switched systems from a single trajectory

链接: https://arxiv.org/abs/2503.13766
作者: Daniel Racz,Mihaly Petreczky,Balint Daroczy
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We derive a finite-sample probabilistic bound on the parameter estimation error of a system identification algorithm for Linear Switched Systems. The algorithm estimates Markov parameters from a single trajectory and applies a variant of the Ho-Kalman algorithm to recover the system matrices. Our bound guarantees statistical consistency under the assumption that the true system exhibits quadratic stability. The proof leverages the theory of weakly dependent processes. To the best of our knowledge, this is the first finite-sample bound for this algorithm in the single-trajectory setting.

[LG-39] Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems

链接: https://arxiv.org/abs/2503.13764
作者: Mohammad Partohaghighi,Roummel Marcia,YangQuan Chen
类目: Machine Learning (cs.LG)
*备注: Submitted to IEEE L-CSS

点击查看摘要

Abstract:Fractional-order stochastic gradient descent (FOSGD) leverages a fractional exponent to capture long-memory effects in optimization, yet its practical impact is often constrained by the difficulty of tuning and stabilizing this exponent. In this work, we introduce 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), a novel method that synergistically combines the Two-Scale Effective Dimension (2SED) algorithm with FOSGD to automatically calibrate the fractional exponent in a data-driven manner. By continuously gauging model sensitivity and effective dimensionality, 2SED dynamically adjusts the exponent to curb erratic oscillations and enhance convergence rates. Theoretically, we demonstrate how this dimension-aware adaptation retains the benefits of fractional memory while averting the sluggish or unstable behaviors frequently observed in naive fractional SGD. Empirical evaluations across multiple benchmarks confirm that our 2SED-driven fractional exponent approach not only converges faster but also achieves more robust final performance, suggesting broad applicability for fractional-order methodologies in large-scale machine learning and related domains.

[LG-40] Neural Edge Histogram Descriptors for Underwater Acoustic Target Recognition

链接: https://arxiv.org/abs/2503.13763
作者: Atharva Agashe,Davelle Carreiro,Alexandra Van Dine,Joshua Peeples
类目: Machine Learning (cs.LG); Sound (cs.SD)
*备注: 6 pages, 5 figures. This work has been accepted to IEEE OCEANS 2025

点击查看摘要

Abstract:Numerous maritime applications rely on the ability to recognize acoustic targets using passive sonar. While there is a growing reliance on pre-trained models for classification tasks, these models often require extensive computational resources and may not perform optimally when transferred to new domains due to dataset variations. To address these challenges, this work adapts the neural edge histogram descriptors (NEHD) method, originally developed for image classification, to classify passive sonar signals. We conduct a comprehensive evaluation of statistical and structural texture features, demonstrating that their combination achieves performance competitive with large pre-trained models. The proposed NEHD-based approach offers a lightweight and efficient solution for underwater target recognition, significantly reducing computational costs while maintaining accuracy.

[LG-41] Multi-modal Time Series Analysis: A Tutorial and Survey

链接: https://arxiv.org/abs/2503.13709
作者: Yushan Jiang,Kanghui Ning,Zijie Pan,Xuyang Shen,Jingchao Ni,Wenchao Yu,Anderson Schneider,Haifeng Chen,Yuriy Nevmyvaka,Dongjin Song
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal time series analysis has recently emerged as a prominent research area in data mining, driven by the increasing availability of diverse data modalities, such as text, images, and structured tabular data from real-world sources. However, effective analysis of multi-modal time series is hindered by data heterogeneity, modality gap, misalignment, and inherent noise. Recent advancements in multi-modal time series methods have exploited the multi-modal context via cross-modal interactions based on deep learning methods, significantly enhancing various downstream tasks. In this tutorial and survey, we present a systematic and up-to-date overview of multi-modal time series datasets and methods. We first state the existing challenges of multi-modal time series analysis and our motivations, with a brief introduction of preliminaries. Then, we summarize the general pipeline and categorize existing methods through a unified cross-modal interaction framework encompassing fusion, alignment, and transference at different levels (i.e., input, intermediate, output), where key concepts and ideas are highlighted. We also discuss the real-world applications of multi-modal analysis for both standard and spatial time series, tailored to general and specific domains. Finally, we discuss future research directions to help practitioners explore and exploit multi-modal time series. The up-to-date resources are provided in the GitHub repository: this https URL

[LG-42] Mitigating Spectral Bias in Neural Operators via High-Frequency Scaling for Physical Systems

链接: https://arxiv.org/abs/2503.13695
作者: Siavash Khodakarami,Vivek Oommen,Aniruddha Bora,George Em Karniadakis
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Neural operators have emerged as powerful surrogates for modeling complex physical problems. However, they suffer from spectral bias, making them oblivious to high-frequency modes, which are present in multiscale physical systems. Therefore, they tend to produce over-smoothed solutions, which is particularly problematic in modeling turbulence and for systems with intricate patterns and sharp gradients such as multi-phase flow systems. In this work, we introduce a new approach named high-frequency scaling (HFS) to mitigate spectral bias in convolution-based neural operators. By integrating HFS with proper variants of UNet neural operators, we demonstrate a higher prediction accuracy by mitigating spectral bias in single and two-phase flow problems. Unlike Fourier-based techniques, HFS is directly applied to the latent space, thus eliminating the computational cost associated with the Fourier transform. Additionally, we investigate alternative spectral bias mitigation through diffusion models conditioned on neural operators. While the diffusion model integrated with the standard neural operator may still suffer from significant errors, these errors are substantially reduced when the diffusion model is integrated with an HFS-enhanced neural operator.

[LG-43] PrETi: Predicting Execution Time in Early Stage with LLVM and Machine Learning

链接: https://arxiv.org/abs/2503.13679
作者: Risheng Xu,Philipp Sieweck,Hermann von Hasseln,Dirk Nowotka
类目: Performance (cs.PF); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce PrETi, a novel framework for predicting software execution time during the early stages of development. PrETi leverages an LLVM-based simulation environment to extract timing-related runtime information, such as the count of executed LLVM IR instructions. This information, combined with historical execution time data, is utilized to train machine learning models for accurate time prediction. To further enhance prediction accuracy, our approach incorporates simulations of cache accesses and branch prediction. The evaluations on public benchmarks demonstrate that PrETi achieves an average Absolute Percentage Error (APE) of 11.98%, surpassing state-of-the-art methods. These results underscore the effectiveness and efficiency of PrETi as a robust solution for early-stage timing analysis.
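
For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of the absolute percentage error and its average over a set of programs. It uses the standard textbook definition; whether the paper averages in exactly this way is an assumption, and the function names and timing values are invented for illustration.

```python
def absolute_percentage_error(predicted, measured):
    """APE in percent for one prediction; measured must be non-zero."""
    if measured == 0:
        raise ValueError("measured execution time must be non-zero")
    return 100.0 * abs(predicted - measured) / abs(measured)

def average_ape(pairs):
    """Mean APE over an iterable of (predicted, measured) pairs."""
    errors = [absolute_percentage_error(p, m) for p, m in pairs]
    return sum(errors) / len(errors)

# Hypothetical (predicted, measured) execution times in milliseconds.
timings = [(110.0, 100.0), (45.0, 50.0), (200.0, 200.0)]
print(average_ape(timings))
```

Because the error is normalized by the measured time, short-running and long-running programs contribute on the same scale.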

[LG-44] Spectrally-Corrected and Regularized QDA Classifier for Spiked Covariance Model

链接: https://arxiv.org/abs/2503.13582
作者: Wenya Luo,Hua Li,Zhidong Bai,Zhijun Liu
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Quadratic discriminant analysis (QDA) is a widely used method for classification problems, particularly preferable over Linear Discriminant Analysis (LDA) for heterogeneous data. However, QDA loses its effectiveness in high-dimensional settings, where the data dimension and sample size tend to infinity. To address this issue, we propose a novel QDA method utilizing spectral correction and regularization techniques, termed SR-QDA. The regularization parameters in our method are selected by maximizing the Fisher-discriminant ratio. We compare SR-QDA with QDA, regularized quadratic discriminant analysis (R-QDA), and several other competitors. The results indicate that SR-QDA performs exceptionally well, especially in moderate and high-dimensional situations. Empirical experiments across diverse datasets further support this conclusion.

[LG-45] When Should We Orchestrate Multiple Agents?

链接: https://arxiv.org/abs/2503.13577
作者: Umang Bhatt,Sanyam Kapoor,Mihir Upadhyay,Ilia Sucholutsky,Francesco Quinzan,Katherine M. Collins,Adrian Weller,Andrew Gordon Wilson,Muhammad Bilal Zafar
类目: Multiagent Systems (cs.MA); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Strategies for orchestrating the interactions between multiple agents, both human and artificial, can wildly overestimate performance and underestimate the cost of orchestration. We design a framework to orchestrate agents under realistic conditions, such as inference costs or availability constraints. We show theoretically that orchestration is only effective if there are performance or cost differentials between agents. We then empirically demonstrate how orchestration between multiple agents can be helpful for selecting agents in a simulated environment, picking a learning strategy in the infamous Rogers’ Paradox from social science, and outsourcing tasks to other agents during a question-answer task in a user study.

[LG-46] VeriContaminated: Assessing LLM-Driven Verilog Coding for Data Contamination

链接: https://arxiv.org/abs/2503.13572
作者: Zeng Wang,Minghao Shao,Jitendra Bhandari,Likhitha Mankali,Ramesh Karri,Ozgur Sinanoglu,Muhammad Shafique,Johann Knechtel
类目: Hardware Architecture (cs.AR); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have revolutionized code generation, achieving exceptional results on various established benchmarking frameworks. However, concerns about data contamination - where benchmark data inadvertently leaks into pre-training or fine-tuning datasets - raise questions about the validity of these evaluations. While this issue is known, limiting the industrial adoption of LLM-driven software engineering, hardware coding has received little to no attention regarding these risks. For the first time, we analyze state-of-the-art (SOTA) evaluation frameworks for Verilog code generation (VerilogEval and RTLLM), using established methods for contamination detection (CCD and Min-K% Prob). We cover SOTA commercial and open-source LLMs (CodeGen2.5, Minitron 4b, Mistral 7b, phi-4 mini, LLaMA-1,2,3.1, GPT-2,3.5,4o, Deepseek-Coder, and CodeQwen 1.5), in baseline and fine-tuned models (RTLCoder and Verigen). Our study confirms that data contamination is a critical concern. We explore mitigations and the resulting trade-offs for code quality vs fairness (i.e., reducing contamination toward unbiased benchmarking).

[LG-47] Dynamical Mode Recognition of Turbulent Flames in a Swirl-stabilized Annular Combustor by a Time-series Learning Approach

链接: https://arxiv.org/abs/2503.13559
作者: Tao Yang,Weiming Xu,Liangliang Xu,Peng Zhang
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Thermoacoustic instability in annular combustors, essential to aero engines and modern gas turbines, can severely impair operational stability and efficiency. Accurately recognizing and understanding various combustion modes is the prerequisite for understanding and controlling combustion instabilities. However, the high-dimensional spatial-temporal dynamics of turbulent flames typically pose considerable challenges to mode recognition. Based on the bidirectional temporal and nonlinear dimensionality reduction models, this study introduces a two-layer bidirectional long short-term memory variational autoencoder, Bi-LSTM-VAE model, to effectively recognize dynamical modes in annular combustion systems. Specifically, leveraging 16 pressure signals from a swirl-stabilized annular combustor, the model maps complex dynamics into a low-dimensional latent space while preserving temporal dependency and nonlinear behavior features through the recurrent neural network structure. The results show that the novel Bi-LSTM-VAE method enables a clear representation of combustion states in two-dimensional state space. Analysis of latent variable distributions reveals distinct patterns corresponding to a wide range of equivalence ratios and premixed fuel and air mass flow rates, offering novel insights into mode classification and transitions and highlighting this model’s potential for deciphering complex thermoacoustic phenomena.

[LG-48] Optimization on black-box function by parameter-shift rule

链接: https://arxiv.org/abs/2503.13545
作者: Vu Tuan Hai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning has been widely applied in many areas, but training a machine learning model is becoming increasingly difficult. A growing number of optimization problems are “black-box” problems, in which the relationship between model parameters and outcomes is uncertain or hard to trace. Optimizing black-box models that require a large number of query observations and parameters is currently difficult. To overcome the drawbacks of existing algorithms, we propose a zeroth-order method based on the parameter-shift rule, which originated in quantum computing and uses fewer parameters than previous methods.
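
To make the parameter-shift rule concrete: in its usual quantum-computing form, when an objective depends on a parameter $\theta$ as $a + b\cos\theta + c\sin\theta$, the derivative is obtained exactly from two shifted evaluations, $f'(\theta) = [f(\theta + \pi/2) - f(\theta - \pi/2)]/2$, with no access to the internals of $f$. The sketch below applies that rule to a toy sinusoidal objective and runs plain gradient descent. How the paper adapts the rule to general black-box functions is not stated in the abstract, so the objective, step size, and function names here are illustrative assumptions only.

```python
import math

def parameter_shift_grad(f, theta):
    """Gradient of f at theta from two shifted evaluations per parameter.

    Exact when f depends sinusoidally (period 2*pi) on each parameter.
    """
    shift = math.pi / 2
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += shift
        minus[i] -= shift
        grad.append((f(plus) - f(minus)) / 2.0)
    return grad

def toy_objective(theta):
    # Hypothetical black box; only function values are ever queried.
    return math.sin(theta[0]) + 2.0 * math.cos(theta[1])

theta = [0.3, 1.1]
for _ in range(200):
    g = parameter_shift_grad(toy_objective, theta)
    theta = [t - 0.1 * gi for t, gi in zip(theta, g)]
print(theta, toy_objective(theta))
```

Unlike a finite-difference estimate, the shift is not small, so for sinusoidal objectives the result carries no truncation error and is less sensitive to evaluation noise.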

[LG-49] Semi-Decision-Focused Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization ICLR2025

链接: https://arxiv.org/abs/2503.13544
作者: Juhyeong Kim
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP); Portfolio Management (q-fin.PM)
*备注: ICLR 2025 Advances in Financial AI Workshop

点击查看摘要

Abstract:I propose Semi-Decision-Focused Learning, a practical adaptation of Decision-Focused Learning for portfolio optimization. Rather than directly optimizing complex financial metrics, I employ simple target portfolios (Max-Sortino or One-Hot) and train models with a convex, cross-entropy loss. I further incorporate Deep Ensemble methods to reduce variance and stabilize performance. Experiments on two universes (one upward-trending and another range-bound) show consistent outperformance over baseline portfolios, demonstrating the effectiveness and robustness of my approach. Code is available at this https URL
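
The training signal described in this abstract can be sketched as follows: model scores are turned into portfolio weights with a softmax, and the loss is the cross-entropy between a simple target portfolio and those weights. The abstract does not define the One-Hot target, so the reading used here (all weight on the asset with the best realized return) is an assumption, as are the function names and the numbers.

```python
import math

def softmax(scores):
    """Map raw model scores to portfolio weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot_target(returns):
    """Assumed target: full weight on the best-performing asset."""
    best = max(range(len(returns)), key=lambda i: returns[i])
    return [1.0 if i == best else 0.0 for i in range(len(returns))]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between target weights and predicted weights."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

realized_returns = [0.01, 0.05, -0.02]   # hypothetical next-period returns
target = one_hot_target(realized_returns)
good_scores = [0.0, 2.0, -1.0]           # model favors the best asset
bad_scores = [2.0, 0.0, -1.0]            # model favors a worse asset
print(cross_entropy(target, softmax(good_scores)))
print(cross_entropy(target, softmax(bad_scores)))
```

The loss is convex in the scores, which is the practical appeal over optimizing a financial metric directly.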

[LG-50] FedTilt: Towards Multi-Level Fairness-Preserving and Robust Federated Learning

链接: https://arxiv.org/abs/2503.13537
作者: Binghui Zhang,Luis Mares De La Cruz,Binghui Wang
类目: Machine Learning (cs.LG)
*备注: 13 pages

点击查看摘要

[LG-51] Multi-output Classification for Compound Fault Diagnosis in Motor under Partially Labeled Target Domain

链接: https://arxiv.org/abs/2503.13534
作者: Wonjun Yi,Yong-Hwa Park
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study presents a novel multi-output classification (MOC) framework designed for domain adaptation in fault diagnosis, addressing challenges posed by partially labeled (PL) target domain datasets and coexisting faults in rotating machinery. Unlike conventional multi-class classification (MCC) approaches, the MOC framework independently classifies the severity of each fault, enhancing diagnostic accuracy. By integrating multi-kernel maximum mean discrepancy loss (MKMMD) and entropy minimization loss (EM), the proposed method improves feature transferability between source and target domains, while frequency layer normalization (FLN) effectively handles stationary vibration signals by leveraging mechanical characteristics. Experimental evaluations across six domain adaptation cases, encompassing PL scenarios, demonstrate the superior performance of the MOC approach over baseline methods in terms of macro F1 score.

[LG-52] The Role of Hyperparameters in Predictive Multiplicity

链接: https://arxiv.org/abs/2503.13506
作者: Mustafa Cavus,Katarzyna Woźnica,Przemysław Biecek
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:This paper investigates the critical role of hyperparameters in predictive multiplicity, where different machine learning models trained on the same dataset yield divergent predictions for identical inputs. These inconsistencies can seriously impact high-stakes decisions such as credit assessments, hiring, and medical diagnoses. Focusing on six widely used models for tabular data - Elastic Net, Decision Tree, k-Nearest Neighbor, Support Vector Machine, Random Forests, and Extreme Gradient Boosting - we explore how hyperparameter tuning influences predictive multiplicity, as expressed by the distribution of prediction discrepancies across benchmark datasets. Key hyperparameters such as lambda in Elastic Net, gamma in Support Vector Machines, and alpha in Extreme Gradient Boosting play a crucial role in shaping predictive multiplicity, often compromising the stability of predictions within specific algorithms. Our experiments on 21 benchmark datasets reveal that tuning these hyperparameters leads to notable performance improvements but also increases prediction discrepancies, with Extreme Gradient Boosting exhibiting the highest discrepancy and substantial prediction instability. This highlights the trade-off between performance optimization and prediction consistency, raising concerns about the risk of arbitrary predictions. These findings provide insight into how hyperparameter optimization leads to predictive multiplicity. While predictive multiplicity allows prioritizing domain-specific objectives such as fairness and reduces reliance on a single model, it also complicates decision-making, potentially leading to arbitrary or unjustified outcomes.
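
One simple way to quantify the prediction discrepancies discussed in this abstract is the fraction of inputs on which two models, for example the same algorithm under two hyperparameter settings, assign different labels. The sketch below computes that quantity for every pair in a set of models. The exact discrepancy measure used in the paper is not given in the abstract, so this definition, the function names, and the predictions are illustrative assumptions.

```python
from itertools import combinations

def discrepancy(preds_a, preds_b):
    """Fraction of inputs on which two models predict different labels."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and equal in length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def pairwise_discrepancies(all_preds):
    """Discrepancy for every pair of models, keyed by model names."""
    return {
        (m1, m2): discrepancy(all_preds[m1], all_preds[m2])
        for m1, m2 in combinations(sorted(all_preds), 2)
    }

# Hypothetical predictions from one algorithm under three settings.
predictions = {
    "setting_a": [0, 1, 1, 0],
    "setting_b": [0, 1, 0, 1],
    "setting_c": [0, 1, 1, 1],
}
print(pairwise_discrepancies(predictions))
```

Looking at the whole distribution of pairwise values, rather than a single number, shows whether disagreement is concentrated in a few settings or spread across all of them.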

[LG-53] Foundation Models for Spatio-Temporal Data Science: A Tutorial and Survey

链接: https://arxiv.org/abs/2503.13502
作者: Yuxuan Liang,Haomin Wen,Yutong Xia,Ming Jin,Bin Yang,Flora Salim,Qingsong Wen,Shirui Pan,Gao Cong
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-54] A Causal Inference Approach for Quantifying Research Impact

链接: https://arxiv.org/abs/2503.13485
作者: Keiichi Ochiai,Yutaka Matsuo
类目: Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning has had a great impact on various fields of computer science over the past decade by enabling data-driven representation learning. Because science and technology policy decisions for a nation can be based on the impact of each technology, quantifying research impact is an important task. The number of citations and the impact factor can be used to measure the impact of individual research. What would have happened without the research, however, is fundamentally a counterfactual phenomenon. Thus, we propose an approach based on causal inference to quantify the research impact of a specific technical topic. We apply difference-in-differences to bibliometric data to quantify the research impact. First, we identify papers of a specific technical topic using keywords or category tags from Microsoft Academic Graph, which is one of the largest academic publication datasets. Next, we build a paper citation network between technical fields. Then, we aggregate the cross-field citation count for each research field. Finally, the impact of a specific technical topic on each research field is estimated by applying difference-in-differences. Evaluation results show that deep learning significantly affects computer vision and natural language processing. In addition, deep learning significantly affects cross-field citation, especially from speech recognition to computer vision and from natural language processing to computer vision. Moreover, our method revealed that the impact of deep learning was 3.1 times the impact of interpretability for ML models.
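
The estimator named in this abstract can be illustrated with its simplest two-group, two-period form: the change in the treated group minus the change in the control group, which removes any trend shared by both groups. The sketch below is that textbook estimator only; the paper's actual regression specification is not described in the abstract, and the citation counts and function names are invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-period difference-in-differences estimate of a treatment effect."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical yearly cross-field citation counts before and after a topic emerges.
treated_pre, treated_post = [10, 12], [30, 34]   # field exposed to the topic
control_pre, control_post = [8, 10], [12, 14]    # comparable unexposed field
print(diff_in_diff(treated_pre, treated_post, control_pre, control_post))
```

The estimate is only meaningful under the parallel-trends assumption, namely that the two groups would have evolved alike in the absence of the treatment.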

[LG-55] EnQode: Fast Amplitude Embedding for Quantum Machine Learning Using Classical Data

链接: https://arxiv.org/abs/2503.14473
作者: Jason Han,Nicholas S. DiBrita,Younghyun Cho,Hengrui Luo,Tirthak Patel
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: EnQode will appear in the Proceedings of the Design Automation Conference (DAC), 2025

点击查看摘要

Abstract:Amplitude embedding (AE) is essential in quantum machine learning (QML) for encoding classical data onto quantum circuits. However, conventional AE methods suffer from deep, variable-length circuits that introduce high output error due to extensive gate usage and variable error rates across samples, resulting in noise-driven inconsistencies that degrade model accuracy. We introduce EnQode, a fast AE technique based on symbolic representation that addresses these limitations by clustering dataset samples and solving for cluster mean states through a low-depth, machine-specific ansatz. Optimized to reduce physical gates and SWAP operations, EnQode ensures all samples face consistent, low noise levels by standardizing circuit depth and composition. With over 90% fidelity in data mapping, EnQode enables robust, high-performance QML on noisy intermediate-scale quantum (NISQ) devices. Our open-source solution provides a scalable and efficient alternative for integrating classical data with quantum models.

[LG-56] Doubly robust identification of treatment effects from multiple environments ICLR

链接: https://arxiv.org/abs/2503.14459
作者: Piersilvio De Bartolomeis,Julia Kostin,Javier Abad,Yixin Wang,Fanny Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Accepted for presentation at the International Conference on Learning Representations (ICLR) 2025

点击查看摘要

[LG-57] Online Conformal Probabilistic Numerics via Adaptive Edge-Cloud Offloading

链接: https://arxiv.org/abs/2503.14453
作者: Qiushuo Hou,Sangwoo Park,Matteo Zecchin,Yunlong Cai,Guanding Yu,Osvaldo Simeone
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: This paper has been submitted to a conference

点击查看摘要

Abstract:Consider an edge computing setting in which a user submits queries for the solution of a linear system to an edge processor, which is subject to time-varying computing availability. The edge processor applies a probabilistic linear solver (PLS) so as to be able to respond to the user’s query within the allotted time and computing budget. Feedback to the user is in the form of an uncertainty set. Due to model misspecification, the uncertainty set obtained via a direct application of PLS does not come with coverage guarantees with respect to the true solution of the linear system. This work introduces a new method to calibrate the uncertainty sets produced by PLS with the aim of guaranteeing long-term coverage requirements. The proposed method, referred to as online conformal prediction-PLS (OCP-PLS), assumes sporadic feedback from cloud to edge. This enables the online calibration of uncertainty thresholds via online conformal prediction (OCP), an online optimization method previously studied in the context of prediction models. The validity of OCP-PLS is verified via experiments that bring insights into trade-offs between coverage, prediction set size, and cloud usage.
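
The calibration idea behind online conformal prediction can be sketched with a single scalar threshold: after each round with feedback, the threshold grows if the true value fell outside the set and shrinks slightly otherwise, so that the long-run miss rate tracks the target level. The sketch below uses the standard update $\lambda_{t+1} = \lambda_t + \eta(\mathrm{err}_t - \alpha)$ on a synthetic score stream. How OCP-PLS defines its scores and handles sporadic cloud feedback is not specified in the abstract, so everything concrete here (scores, step size, names) is an assumption.

```python
import random

def ocp_step(threshold, score, alpha, lr):
    """One online conformal update; returns the new threshold and the miss flag."""
    miss = 1.0 if score > threshold else 0.0   # true value fell outside the set
    return threshold + lr * (miss - alpha), miss

def run_ocp(scores, alpha=0.1, lr=0.05, threshold=0.5):
    """Run the update over a score stream; returns final threshold and miss rate."""
    misses = 0.0
    for s in scores:
        threshold, miss = ocp_step(threshold, s, alpha, lr)
        misses += miss
    return threshold, misses / len(scores)

# Synthetic nonconformity scores in [0, 1) standing in for solver errors.
rng = random.Random(0)
scores = [rng.random() for _ in range(5000)]
final_threshold, miss_rate = run_ocp(scores)
print(f"final threshold={final_threshold:.3f}, empirical miscoverage={miss_rate:.3f}")
```

Because the threshold stays bounded for bounded scores, the average miss rate is forced toward the target level regardless of how the scores are distributed, which is the source of the long-term coverage guarantee.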

[LG-58] Optimizing High-Dimensional Oblique Splits

Link: https://arxiv.org/abs/2503.14381
Authors: Chien-Ming Chi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Notes: 79 pages, 9 tables

[LG-59] C(NN)FD – Deep Learning Modelling of Multi-Stage Axial Compressors Aerodynamics

Link: https://arxiv.org/abs/2503.14369
Authors: Giuseppe Bruni,Sepehr Maleki,Senthil K Krishnababu
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)

Abstract:The field of scientific machine learning and its applications to numerical analyses such as CFD has recently experienced a surge in interest. While its viability has been demonstrated in different domains, it has not yet reached a level of robustness and scalability to make it practical for industrial applications in the turbomachinery field. The highly complex, turbulent, and three-dimensional flows of multi-stage axial compressors for gas turbine applications represent a remarkably challenging case. This is due to the high-dimensionality of the regression of the flow-field from geometrical and operational variables, and the high computational cost associated with the large scale of the CFD domains. This paper demonstrates the development and application of a generalized deep learning framework for predictions of the flow field and aerodynamic performance of multi-stage axial compressors, also potentially applicable to any type of turbomachinery. A physics-based dimensionality reduction unlocks the potential for flow-field predictions for large-scale domains, re-formulating the regression problem from an unstructured to a structured one. The relevant physical equations are used to define a multi-dimensional physical loss function. Compared to “black-box” approaches, the proposed framework has the advantage of physically explainable predictions of overall performance, as the corresponding aerodynamic drivers can be identified on a 0D/1D/2D/3D level. An iterative architecture is employed, improving the accuracy of the predictions, as well as estimating the associated uncertainty. The model is trained on a series of datasets including manufacturing and build variations, different geometries, compressor designs and operating conditions. This demonstrates the capability to predict the flow-field and the overall performance in a generalizable manner, with accuracy comparable to the benchmark.

[LG-60] Unified Analysis of Decentralized Gradient Descent: a Contraction Mapping Framework

Link: https://arxiv.org/abs/2503.14353
Authors: Erik G. Larsson,Nicolo Michelusi
Categories: Signal Processing (eess.SP); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Notes: submitted to the IEEE Open Journal of Signal Processing

[LG-61] Consumer-grade EEG-based Eye Tracking

Link: https://arxiv.org/abs/2503.14322
Authors: Tiago Vasconcelos Afonso,Florian Heinrichs
Categories: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Notes: Data descriptor, 13 pages, 8 figures, 5 tables

[LG-62] Automating Experimental Optics with Sample Efficient Machine Learning Methods

Link: https://arxiv.org/abs/2503.14260
Authors: Arindam Saha,Baramee Charoensombutamon,Thibault Michel,V. Vijendran,Lachlan Walker,Akira Furusawa,Syed M. Assad,Ben C. Buchler,Ping Koy Lam,Aaron D. Tranter
Categories: Optics (physics.optics); Machine Learning (cs.LG)

[LG-63] CINNAMON: A hybrid approach to change point detection and parameter estimation in single-particle tracking data

Link: https://arxiv.org/abs/2503.14253
Authors: Jakub Malinowski,Marcin Kostrzewa,Michał Balcerek,Weronika Tomczuk,Janusz Szwabiński
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract:Change point detection has become an important part of the analysis of single-particle tracking data, as it allows one to identify the moments at which the motion patterns of observed particles undergo significant changes. The segmentation of diffusive trajectories based on those moments may provide insight into various phenomena in soft condensed matter and biological physics. In this paper, we propose CINNAMON, a hybrid approach to classifying single-particle tracking trajectories, detecting change points within them, and estimating diffusion parameters in the segments between the change points. Our method is based on a combination of neural networks, feature-based machine learning, and statistical techniques. It has been benchmarked in the second Anomalous Diffusion Challenge. The method offers a high level of interpretability due to its analytical and feature-based components. A potential use of features from topological data analysis is also discussed.

[LG-64] Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applications

Link: https://arxiv.org/abs/2503.14121
Authors: Yizhou Xu,Antoine Maillard,Lenka Zdeborová,Florent Krzakala
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Information Theory (cs.IT); Machine Learning (cs.LG); Probability (math.PR)

Abstract:In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the high-dimensional limit: while previous works on this model primarily focused on the recovery of low-rank matrices, we consider in this work more general classes of structured signal matrices with potentially large rank, e.g. a product of two matrices of sizes proportional to the dimension. We provide rigorous asymptotic equations characterizing the Bayes-optimal learning performance from a number of samples which is proportional to the number of entries in the matrix. Our proof is composed of three key ingredients: (i) we prove universality properties to handle structured sensing matrices, related to the “Gaussian equivalence” phenomenon in statistical learning, (ii) we provide a sharp characterization of Bayes-optimal learning in generalized linear models with Gaussian data and structured matrix priors, generalizing previously studied settings, and (iii) we leverage previous works on the problem of matrix denoising. The generality of our results allows for a variety of applications: notably, we mathematically establish predictions obtained via non-rigorous methods from statistical physics in [ETB+24] regarding Bilinear Sequence Regression, a benchmark model for learning from sequences of tokens, and in [MTM+24] on Bayes-optimal learning in neural networks with quadratic activation function and width proportional to the dimension.
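The observation model in the first sentence can be sketched as y_i = <A_i, X> + noise, where <A, X> = Tr(A^T X). The sketch below only generates data under that model and does not implement any estimator or the paper's analysis. The Gaussian choice for the signal factors and sensing matrices, the sizes, and all function names are illustrative assumptions.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def frobenius_inner(a, b):
    """Tr(a^T b): the linear projection of b along direction a."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def gaussian_matrix(rows, cols, rng):
    """Matrix with independent standard Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def sense(signal, sensing_matrices, noise_std, rng):
    """Return y_i = <A_i, X> + Gaussian noise for each sensing matrix A_i."""
    return [frobenius_inner(a, signal) + rng.gauss(0.0, noise_std)
            for a in sensing_matrices]

rng = random.Random(0)
d, r = 4, 2  # hypothetical sizes; the inner dimension r scales with d
# Structured signal: a product of two matrices, as in the abstract's example.
signal = matmul(gaussian_matrix(d, r, rng), gaussian_matrix(r, d, rng))
num_samples = d * d  # proportional to the number of entries of the signal
sensing = [gaussian_matrix(d, d, rng) for _ in range(num_samples)]
observations = sense(signal, sensing, noise_std=0.1, rng=rng)
print(len(observations))
```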

[LG-65] PET-MAD, a universal interatomic potential for advanced materials modeling

Link: https://arxiv.org/abs/2503.14118
Authors: Arslan Mazitov,Filippo Bigi,Matthias Kellner,Paolo Pegolo,Davide Tisi,Guillaume Fraux,Sergey Pozdnyakov,Philip Loche,Michele Ceriotti
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)

Abstract:Machine-learning interatomic potentials (MLIPs) have greatly extended the reach of atomic-scale simulations, offering the accuracy of first-principles calculations at a fraction of the effort. Leveraging large quantum mechanical databases and expressive architectures, recent “universal” models deliver qualitative accuracy across the periodic table but are often biased toward low-energy configurations. We introduce PET-MAD, a generally applicable MLIP trained on a dataset combining stable inorganic and organic solids, systematically modified to enhance atomic diversity. Using a moderate but highly-consistent level of electronic-structure theory, we assess PET-MAD’s accuracy on established benchmarks and advanced simulations of six materials. PET-MAD rivals state-of-the-art MLIPs for inorganic solids, while also being reliable for molecules, organic materials, and surfaces. It is stable and fast, enabling, out-of-the-box, the near-quantitative study of thermal and quantum mechanical fluctuations, functional properties, and phase transitions. It can be efficiently fine-tuned to deliver full quantum mechanical accuracy with a minimal number of targeted calculations.

[LG-66] Towards Location-Specific Precipitation Projections Using Deep Neural Networks

Link: https://arxiv.org/abs/2503.14095
Authors: Bipin Kumar,Bhvisy Kumar Yadav,Soumypdeep Mukhopadhyay,Rakshit Rohan,Bhupendra Bahadur Singh,Rajib Chattopadhyay,Nagraju Chilukoti,Atul Kumar Sahai
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Notes: 21 pages, 9 figures

[LG-67] Semantic Communication in Dynamic Channel Scenarios: Collaborative Optimization of Dual-Pipeline Joint Source-Channel Coding and Personalized Federated Learning

Link: https://arxiv.org/abs/2503.14084
Authors: Xingrun Yan,Shiyuan Zuo,Yifeng Lyu,Rongfei Fan,Han Hu
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Semantic communication is designed to tackle issues like bandwidth constraints and high latency in communication systems. However, in complex network topologies with multiple users, the enormous combinations of client data and channel state information (CSI) pose significant challenges for existing semantic communication architectures. To improve the generalization ability of semantic communication models in complex scenarios while meeting the personalized needs of each user in their local environments, we propose a novel personalized federated learning framework with dual-pipeline joint source-channel coding based on channel awareness model (PFL-DPJSCCA). Within this framework, we present a method that achieves zero optimization gap for non-convex loss functions. Experiments conducted under varying SNR distributions validate the outstanding performance of our framework across diverse datasets.

[LG-68] Modular Distributed Nonconvex Learning with Error Feedback

Link: https://arxiv.org/abs/2503.14055
Authors: Guido Carnevale,Nicola Bastianello
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:In this paper, we design a novel distributed learning algorithm using stochastic compressed communications. In detail, we pursue a modular approach, merging ADMM and a gradient-based approach, benefiting from the robustness of the former and the computational efficiency of the latter. Additionally, we integrate a stochastic integral action (error feedback) enabling almost sure rejection of the compression error. We analyze the resulting method in nonconvex scenarios and guarantee almost sure asymptotic convergence to the set of stationary points of the problem. This result is obtained using system-theoretic tools based on stochastic timescale separation. We corroborate our findings with numerical simulations in nonconvex classification.
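The idea of error feedback can be illustrated with the classic deterministic mechanism: whatever the compressor drops is stored and added back to the next message, so no information is lost permanently. The sketch below shows only that mechanism with a top-k compressor. It is not the paper's stochastic, ADMM-based algorithm, and the names `top_k` and `ErrorFeedback` are hypothetical.

```python
def top_k(vector, k):
    """Keep the k largest-magnitude entries and zero the rest (a lossy compressor)."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]

class ErrorFeedback:
    """Accumulates the compression error and re-injects it into later messages."""

    def __init__(self, dim, k):
        self.residual = [0.0] * dim
        self.k = k

    def compress(self, message):
        # Add back what earlier rounds failed to transmit.
        corrected = [m + r for m, r in zip(message, self.residual)]
        sent = top_k(corrected, self.k)
        # Remember what was dropped this round.
        self.residual = [c - s for c, s in zip(corrected, sent)]
        return sent

ef = ErrorFeedback(dim=3, k=1)
first = ef.compress([1.0, 0.6, 0.2])
second = ef.compress([1.0, 0.6, 0.2])
print(first, second, ef.residual)
```

By construction, the messages sent so far plus the current residual always equal the sum of the uncompressed messages, which is the property that lets the compression error be rejected over time.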

[LG-69] Empirical risk minimization algorithm for multiclass classification of S.D.E. paths

Link: https://arxiv.org/abs/2503.14045
Authors: Christophe Denis(SAMM),Eddy Ella Mintsa(LAMA)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We address the multiclass classification problem for stochastic diffusion paths, assuming that the classes are distinguished by their drift functions, while the diffusion coefficient remains common across all classes. In this setting, we propose a classification algorithm that relies on the minimization of the $L^2$ risk. We establish rates of convergence for the resulting predictor. Notably, we introduce a margin assumption under which we show that our procedure can achieve fast rates of convergence. Finally, a simulation study highlights the numerical performance of our classification algorithm.

[LG-70] Causal Discovery from Data Assisted by Large Language Models

Link: https://arxiv.org/abs/2503.13833
Authors: Kamyar Barakati,Alexander Molak,Chris Nelson,Xiaohang Zhang,Ichiro Takeuchi,Sergei V. Kalinin
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Notes: 12 pages, 5 figures

[LG-71] ROCK: A variational formulation for occupation kernel methods in Reproducing Kernel Hilbert Spaces

Link: https://arxiv.org/abs/2503.13791
Authors: Victor Rielly,Kamel Lahouel,Chau Nguyen,Bruno Jedynak
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We present a Representer Theorem result for a large class of weak formulation problems. We provide examples of applications of our formulation both in traditional machine learning and numerical methods as well as in new and emerging techniques. Finally, we apply our formulation to generalize the multivariate occupation kernel (MOCK) method for learning dynamical systems from data, proposing the more general Riesz Occupation Kernel (ROCK) method. Our generalized methods are both more computationally efficient and more performant on most of the benchmarks we test against.

[LG-72] Bayesian Kernel Regression for Functional Data

Link: https://arxiv.org/abs/2503.13676
Authors: Minoru Kusaba,Megumi Iwayama,Ryo Yoshida
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-73] SRBB-Based Quantum State Preparation

Link: https://arxiv.org/abs/2503.13647
Authors: Giacomo Belli,Marco Mordacci,Michele Amoretti
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Notes: 9 pages, 8 figures, 6 tables

[LG-74] Quantum EigenGame for excited state calculation

Link: https://arxiv.org/abs/2503.13644
Authors: David Quiroga,Jason Han,Anastasios Kyrillidis
Categories: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes: Accepted at CPAL 2025, 28 pages

[LG-75] Classification of power quality events in the transmission grid: comparative evaluation of different machine learning models

Link: https://arxiv.org/abs/2503.13566
Authors: Umut Güvengir,Dilek Küçük,Serkan Buhan,Cuma Ali Mantaş,Murathan Yeniceli
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes: Presented at CIGRE SEERC 2023 Conference

[LG-76] Internet of Things-Based Smart Precision Farming in Soilless Agriculture: Opportunities and Challenges for Global Food Security

Link: https://arxiv.org/abs/2503.13528
Authors: Monica Dutta,Deepali Gupta,Sumegh Tharewal,Deepam Goyal,Jasminder Kaur Sandhu,Manjit Kaur,Ahmad Ali Alzubi,Jazem Mutared Alanazi
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

[LG-77] Positivity sets of hinge functions

Link: https://arxiv.org/abs/2503.13512
Authors: Josef Schicho,Ayush Kumar Tewari,Audie Warren
Categories: Machine Learning (stat.ML); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Symbolic Computation (cs.SC); Combinatorics (math.CO); Functional Analysis (math.FA)

Abstract:In this paper we investigate which subsets of the real plane are realisable as the set of points on which a one-layer ReLU neural network takes a positive value. In the case of cones we give a full characterisation of such sets. Furthermore, we give a necessary condition for any subset of $\mathbb{R}^d$. We give various examples of such one-layer neural networks.
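To make the object of study concrete, the sketch below evaluates a one-layer ReLU network on the plane and tests membership in its positivity set. The parametrisation f(x, y) = offset + sum_i a_i * ReLU(w_i . (x, y) + b_i) is an assumption about the setup, and the example network and function names are illustrative rather than taken from the paper.

```python
def relu(t):
    """Rectified linear unit."""
    return max(0.0, t)

def one_layer_relu(point, units, offset=0.0):
    """Evaluate offset + sum_i a_i * relu(w_i . point + b_i) on the plane.

    units is a list of (outer_weight, (w1, w2), bias) triples.
    """
    x, y = point
    return offset + sum(a * relu(w1 * x + w2 * y + b) for a, (w1, w2), b in units)

def is_positive(point, units, offset=0.0):
    """True when the point lies in the positivity set of the network."""
    return one_layer_relu(point, units, offset) > 0.0

# f(x, y) = relu(x) - 2 * relu(y): with all biases zero, f is positively
# homogeneous, so its positivity set {x > 0 and y < x / 2} is a cone.
cone_units = [(1.0, (1.0, 0.0), 0.0), (-2.0, (0.0, 1.0), 0.0)]

for p in [(1.0, 0.0), (1.0, 1.0), (-1.0, -1.0), (3.0, 1.0)]:
    print(p, is_positive(p, cone_units))
```

Scaling a point by any positive factor leaves its membership unchanged in this bias-free example, which is why such networks give rise to cones.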

[LG-78] Is Limited Participant Diversity Impeding EEG-based Machine Learning?

Link: https://arxiv.org/abs/2503.13497
Authors: Philipp Bomatter,Henry Gouk
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

[LG-79] Finger-to-Chest Style Transfer-assisted Deep Learning Method For Photoplethysmogram Waveform Restoration with Timing Preservation

Link: https://arxiv.org/abs/2503.13496
Authors: Sara Maria Pagotto,Federico Tognoni,Matteo Rossi,Dario Bovio,Caterina Salito,Luca Mainardi,Pietro Cerveri
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

[LG-80] TransECG: Leveraging Transformers for Explainable ECG Re-identification Risk Analysis

Link: https://arxiv.org/abs/2503.13495
Authors: Ziyu Wang,Elahe Khatibi,Kianoosh Kazemi,Iman Azimi,Sanaz Mousavi,Shaista Malik,Amir M. Rahmani
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Electrocardiogram (ECG) signals are widely shared across multiple clinical applications for diagnosis, health monitoring, and biometric authentication. While valuable for healthcare, they also carry unique biometric identifiers that pose privacy risks, especially when ECG data is shared across multiple entities. These risks are amplified in shared environments, where re-identification threats can compromise patient privacy. Existing deep learning re-identification models prioritize accuracy but lack explainability, making it challenging to understand how the unique biometric characteristics encoded within ECG signals are recognized and utilized for identification. Without these insights, despite high accuracy, developing secure and trustable ECG data-sharing frameworks remains difficult, especially in diverse, multi-source environments. In this work, we introduce TransECG, a Vision Transformer (ViT)-based method that uses attention mechanisms to pinpoint critical ECG segments associated with re-identification tasks like gender, age, and participant ID. Our approach demonstrates high accuracy (89.9% for gender, 89.9% for age, and 88.6% for ID re-identification) across four real-world datasets with 87 participants. Importantly, we provide key insights into ECG components such as the R-wave, QRS complex, and P-Q interval in re-identification. For example, in the gender classification, the R wave contributed 58.29% to the model’s attention, while in the age classification, the P-R interval contributed 46.29%. By combining high predictive performance with enhanced explainability, TransECG provides a robust solution for privacy-conscious ECG data sharing, supporting the development of secure and trusted healthcare data environments.

[LG-81] Analysis of Learning-based Offshore Wind Power Prediction Models with Various Feature Combinations

Link: https://arxiv.org/abs/2503.13493
Authors: Linhan Fang,Fan Jiang,Ann Mary Toms,Xingpeng Li
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Accurate wind speed prediction is crucial for designing and selecting sites for offshore wind farms. This paper investigates the effectiveness of various machine learning models in predicting offshore wind power for a site near the Gulf of Mexico by analyzing meteorological data. After collecting and preprocessing meteorological data, nine different input feature combinations were designed to assess their impact on wind power predictions at multiple heights. The results show that using wind speed as the output feature improves prediction accuracy by approximately 10% compared to using wind power as the output. In addition, the improvement of multi-feature input over single-feature input is not obvious, mainly due to the poor correlation among key features and the limited generalization ability of the models. These findings underscore the importance of selecting appropriate output features and highlight considerations for using machine learning in wind power forecasting, offering insights that could guide future wind power prediction models and conversion techniques.

[LG-82] FLP-XR: Future Location Prediction on Extreme Scale Maritime Data in Real-time

Link: https://arxiv.org/abs/2503.13491
Authors: George S. Theodoropoulos,Andreas Patakis,Andreas Tritsarolis,Yannis Theodoridis
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

[LG-83] Cascade of one-class classifier ensemble and dynamic naive Bayes classifier applied to the myoelectric-based upper limb prosthesis control with contaminated channels detection

Link: https://arxiv.org/abs/2503.13490
Authors: Pawel Trajdos,Marek Kurzynski
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

[LG-84] Statistical Study of Sensor Data and Investigation of ML-based Calibration Algorithms for Inexpensive Sensor Modules: Experiments from Cape Point

Link: https://arxiv.org/abs/2503.13487
Authors: Travis Barrett,Amit Kumar Mishra
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:In this paper we present the statistical analysis of data from inexpensive sensors. We also present the performance of machine learning algorithms when used for automatic calibration of such sensors. In this work we use a low-cost Non-Dispersive Infrared CO$_2$ sensor placed at a co-located site at Cape Point, South Africa (maintained by Weather South Africa). The collected low-cost sensor data and site truth data are investigated and compared. We compare and investigate the performance of Random Forest Regression, Support Vector Regression, 1D Convolutional Neural Network and 1D-CNN Long Short-Term Memory Network models as a method for automatic calibration and the statistical properties of these model predictions. In addition, we also investigate the drift in performance of these algorithms with time.

[LG-85] Machine learning for triage of strokes with large vessel occlusion using photoplethysmography biomarkers

Link: https://arxiv.org/abs/2503.13486
Authors: Márton Á. Goda,Helen Badge,Jasmeen Khan,Yosef Solewicz,Moran Davoodi,Rumbidzai Teramayi,Dennis Cordato,Longting Lin,Lauren Christie,Christopher Blair,Gagan Sharma,Mark Parsons,Joachim A. Behar
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

[LG-86] WVEmbs with its Masking: A Method For Radar Signal Sorting

Link: https://arxiv.org/abs/2503.13480
Authors: Xianan Hu,Fu Li,Kairui Niu,Peihan Qi,Zhiyong Liang
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)

[LG-87] A CGAN-LSTM-Based Framework for Time-Varying Non-Stationary Channel Modeling

Link: https://arxiv.org/abs/2503.13468
Authors: Keying Guo,Ruisi He,Mi Yang,Yuxin Zhang,Bo Ai,Haoxiang Zhang,Jiahui Han,Ruifeng Chen
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Notes: 11 pages, 7 figures

Information Retrieval

[IR-0] Towards a Barrier-free GeoQA Portal: Natural Language Interaction with Geospatial Data Using Multi-Agent LLMs and Semantic Search

Link: https://arxiv.org/abs/2503.14251
Authors: Yu Feng,Puzhen Zhang,Guohui Xiao,Linfang Ding,Liqiu Meng
Categories: Information Retrieval (cs.IR)

Abstract:A Barrier-Free GeoQA Portal: Enhancing Geospatial Data Accessibility with a Multi-Agent LLM Framework. Geoportals are vital for accessing and analyzing geospatial data, promoting open spatial data sharing and online geo-information management. Designed with GIS-like interaction and layered visualization, they often challenge non-expert users with complex functionalities and overlapping layers that obscure spatial relationships. We propose a GeoQA Portal using a multi-agent Large Language Model framework for seamless natural language interaction with geospatial data. Complex queries are broken into subtasks handled by specialized agents, retrieving relevant geographic data efficiently. Task plans are shown to users, boosting transparency. The portal supports default and custom data inputs for flexibility. Semantic search via word vector similarity aids data retrieval despite imperfect terms. Case studies, evaluations, and user tests confirm its effectiveness for non-experts, bridging GIS complexity and public access, and offering an intuitive solution for future geoportals.

[IR-1] A Comprehensive Survey on Cross-Domain Recommendation: Taxonomy, Progress, and Prospects

Link: https://arxiv.org/abs/2503.14110
Authors: Hao Zhang,Mingyue Cheng,Qi Liu,Junzhe Jiang,Xianquan Wang,Rujiao Zhang,Chenyi Lei,Enhong Chen
Categories: Information Retrieval (cs.IR)

Abstract:Recommender systems (RS) have become crucial tools for information filtering in various real-world scenarios. Cross-domain recommendation (CDR) has been widely explored in recent years as a way to provide better recommendation results in the target domain with the help of other domains. CDR technology has developed rapidly, yet there is a lack of a comprehensive survey summarizing recent works. Therefore, in this paper, we summarize the progress and prospects based on the main procedure of CDR, including Cross Domain Relevance, Cross Domain Interaction, Cross Domain Representation Enhancement and Model Optimization. To help researchers better understand and engage in this field, we also organize the applications and resources, and highlight several current important challenges and future directions of CDR. More details of the survey articles are available at this https URL Recommendation-Papers-and-Resources.
