arXiv Daily Papers | 2025-04-01

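This post lists the latest papers retrieved from arXiv.org on 2025-04-01. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: The daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.

Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.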

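[Quick Read]: This paper addresses the insufficient integration of reasoning and imagination in existing end-to-end agents operating in complex open-world environments, which limits the learning efficiency and generalization of their policies. Its key contribution is RIG (Reasoning and Imagination Generalist Policy), an end-to-end generalist policy that combines the two abilities. To train it, the authors build a data pipeline that progressively integrates and enriches the imagination and reasoning content in trajectories collected from existing agents. Jointly learning reasoning and next-image generation explicitly models the correlation between reasoning, action, and environment dynamics, yielding more than 17× better sample efficiency and stronger generalization than prior work. At inference time, RIG reasons about the next action, proposes a candidate action, and predicts its outcome, giving the agent a chance to review and self-correct before acting. This improves the robustness, generalization, and interoperability of the generalist policy.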

Link: https://arxiv.org/abs/2503.24388
Authors: Zhonghan Zhao, Wenwei Zhang, Haian Huang, Kuikun Liu, Jianfei Gao, Gaoang Wang, Kai Chen
Affiliations: Zhejiang University; Shanghai AI Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Reasoning before action and imagining potential outcomes (i.e., world models) are essential for embodied agents operating in complex open-world environments. Yet, prior work either incorporates only one of these abilities in an end-to-end agent or integrates multiple specialized models into an agent system, limiting the learning efficiency and generalization of the policy. Thus, this paper makes the first attempt to synergize Reasoning and Imagination in an end-to-end Generalist policy, termed RIG. To train RIG in an end-to-end manner, we construct a data pipeline that progressively integrates and enriches the content of imagination and reasoning in the trajectories collected from existing agents. The joint learning of reasoning and next image generation explicitly models the inherent correlation between reasoning, action, and dynamics of environments, and thus exhibits more than 17× sample efficiency improvements and generalization in comparison with previous works. During inference, RIG first reasons about the next action, produces potential action, and then predicts the action outcomes, which offers the agent a chance to review and self-correct based on the imagination before taking real actions. Experimental results show that the synergy of reasoning and imagination not only improves the robustness, generalization, and interoperability of generalist policy but also enables test-time scaling to enhance overall performance.

[NLP-1] Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models

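[Quick Read]: This survey addresses the trade-off between performance and computational cost in the reasoning of large language models (LLMs), termed the reasoning economy. System 2 reasoning improves task accuracy but incurs substantial computational overhead because it is slow and can include inefficient or unnecessary reasoning behaviors. System 1 reasoning is computationally efficient but tends to yield suboptimal performance. The survey analyzes the causes of reasoning inefficiency and the behavior of different reasoning patterns, then reviews potential solutions for achieving reasoning economy in both the post-training and test-time inference stages. It aims to offer actionable insights and highlight open challenges for improving LLM reasoning efficiency, and the authors maintain a public repository that tracks progress in this fast-moving field.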

Link: https://arxiv.org/abs/2503.24377
Authors: Rui Wang, Hongru Wang, Boyang Xue, Jianhui Pang, Shudong Liu, Yi Chen, Jiahao Qiu, Derek Fai Wong, Heng Ji, Kam-Fai Wong
Affiliations: The Chinese University of Hong Kong; University of Macau; The University of Hong Kong; Princeton University; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: In Progress; Paper list Repo: this https URL

Abstract:Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks, transitioning from fast and intuitive thinking (System 1) to slow and deep reasoning (System 2). While System 2 reasoning improves task accuracy, it often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning behaviors. In contrast, System 1 reasoning is computationally efficient but leads to suboptimal performance. Consequently, it is critical to balance the trade-off between performance (benefits) and computational costs (budgets), giving rise to the concept of reasoning economy. In this survey, we provide a comprehensive analysis of reasoning economy in both the post-training and test-time inference stages of LLMs, encompassing i) the cause of reasoning inefficiency, ii) behavior analysis of different reasoning patterns, and iii) potential solutions to achieve reasoning economy. By offering actionable insights and highlighting open challenges, we aim to shed light on strategies for improving the reasoning economy of LLMs, thereby serving as a valuable resource for advancing research in this evolving area. We also provide a public repository to continually track developments in this fast-evolving field.

[NLP-2] Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1

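[Quick Read]: This paper addresses the weak performance of multimodal large language models (MLLMs) on tasks that require both perception and logical reasoning. It introduces SEED-Bench-R1, a benchmark for systematically evaluating post-training methods for MLLMs in video understanding. The key to the approach is using reinforcement learning (RL) to improve data efficiency and performance on both in-distribution and out-of-distribution tasks, validated through multi-level generalization tests. The experiments show that RL improves visual perception, but the reasoning chains it produces are often less logically coherent, with inconsistent reasoning and overlooked visual cues. The authors analyze these problems and suggest future work on base-model reasoning, reward modeling, and RL robustness to noisy signals.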

Link: https://arxiv.org/abs/2503.24376
Authors: Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu
Affiliations: The University of Hong Kong; ARC Lab, Tencent PCG; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Technical Report (In Progress); Code released at: this https URL

Abstract:Recent advancements in Chain of Thought (COT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL’s data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.

[NLP-3] Effectively Controlling Reasoning Models through Thinking Intervention

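[Quick Read]: This paper asks how to gain finer-grained control over the reasoning behavior of large language models (LLMs). Its key proposal is Thinking Intervention, a paradigm that explicitly guides a model's internal reasoning process by strategically inserting or revising specific thinking tokens, allowing targeted intervention in model behavior.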

Link: https://arxiv.org/abs/2503.24370
Authors: Tong Wu, Chong Xiang, Jiachen T. Wang, Prateek Mittal
Affiliations: Princeton University; NVIDIA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reasoning-enhanced large language models (LLMs) explicitly generate intermediate reasoning steps prior to generating final answers, helping the model excel in complex problem-solving. In this paper, we demonstrate that this emerging generation framework offers a unique opportunity for more fine-grained control over model behavior. We propose Thinking Intervention, a novel paradigm designed to explicitly guide the internal reasoning processes of LLMs by strategically inserting or revising specific thinking tokens. We conduct comprehensive evaluations across multiple tasks, including instruction following on IFEval, instruction hierarchy on SEP, and safety alignment on XSTest and SORRY-Bench. Our results demonstrate that Thinking Intervention significantly outperforms baseline prompting approaches, achieving up to 6.7% accuracy gains in instruction-following scenarios, 15.4% improvements in reasoning about instruction hierarchies, and a 40.0% increase in refusal rates for unsafe prompts using open-source DeepSeek R1 models. Overall, our work opens a promising new research avenue for controlling reasoning LLMs.
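As a rough illustration of inserting or revising thinking tokens, the sketch below prefills the start of a model's reasoning segment with guidance written in the first person. The `<think>` tag default, both function names, and the flat string prompt are assumptions made for illustration. A real deployment would go through the model's own chat template, and this is not the authors' implementation.

```python
def insert_thinking_intervention(prompt: str, intervention: str,
                                 think_open: str = "<think>") -> str:
    """Return text the model would continue from: the prompt, an opened
    reasoning segment, and the injected guidance at its start."""
    # Assumption: the model marks its reasoning with `think_open` and keeps
    # generating from whatever text is already inside that segment.
    return f"{prompt}\n{think_open}\n{intervention.strip()}\n"


def revise_thinking(partial_reasoning: str, old: str, new: str) -> str:
    """Revise a span of reasoning that has already been generated."""
    return partial_reasoning.replace(old, new)


prefill = insert_thinking_intervention(
    "Summarize the report in exactly three bullet points.",
    "I must produce exactly three bullet points and nothing else.",
)
print(prefill)
```

The design point is that the intervention lives inside the reasoning segment rather than in the user prompt, so the model reads it as part of its own chain of thought.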

[NLP-4] Query and Conquer: Execution-Guided SQL Generation

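[Quick Read]: This paper aims to improve the accuracy of complex output generation in text-to-SQL tasks. The key idea is to use execution results to select the most semantically consistent query from multiple candidates, which makes SQL generation both more accurate and cheaper. With this method, smaller models surpass computationally intensive reasoning methods such as o1, o3-mini, and DeepSeek R1 while cutting inference cost by as much as 30 times. It integrates with existing models and offers a practical, scalable path to state-of-the-art SQL generation.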

Link: https://arxiv.org/abs/2503.24364
Authors: Łukasz Borchmann, Marek Wydmuch
Affiliations: Snowflake AI Research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We propose a novel approach for generating complex outputs that significantly improves accuracy in text-to-SQL tasks. Our method leverages execution results to select the most semantically consistent query from multiple candidates, enabling smaller, cost-effective models to surpass computationally intensive reasoning methods such as o1, o3-mini, and DeepSeek R1 while reducing inference cost by as much as 30 times. It integrates effortlessly with existing models, offering a practical and scalable pathway to state-of-the-art SQL generation.
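The idea of choosing a query by its execution behavior can be sketched with a simple majority vote over execution results. This is a simplification chosen for illustration, not the paper's exact selection rule: the helper names, the toy `orders` table, and the use of an in-memory SQLite database are all assumptions.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query; return a hashable, order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # failed candidates get no vote
    return tuple(sorted(rows, key=repr))


def select_by_execution(conn, candidates):
    """Pick the first candidate whose execution result is shared by the most candidates."""
    results = {sql: execute(conn, sql) for sql in candidates}
    votes = Counter(r for r in results.values() if r is not None)
    if not votes:
        return None
    best_result, _ = votes.most_common(1)[0]
    return next(sql for sql in candidates if results[sql] == best_result)


# Toy database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 25.0), (3, 40.0);
""")

# Candidates as a model might sample them for "how many orders exceed 20?".
candidates = [
    "SELECT COUNT(*) FROM orders WHERE amount > 20",
    "SELECT COUNT(id) FROM orders WHERE amount > 20.0",
    "SELECT COUNT(*) FROM orders WHERE amount > 30",
    "SELECT COUNT(*) FROM order WHERE amount > 20",  # invalid: fails to execute
]
chosen = select_by_execution(conn, candidates)
print(chosen)
```

Two syntactically different candidates agree on the result, so their shared answer outvotes the outlier, and the candidate that fails to execute is ignored.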

[NLP-5] SQuat: Subspace-orthogonal KV Cache Quantization

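[Quick Read]: This paper addresses the growing memory footprint of LLM decoding with a key-value (KV) cache, which stores the KV tensors of previously generated tokens. Existing methods compress KV tensors into low-bit representations, but quantization errors can accumulate as more tokens are generated and may lead to undesired outputs. The proposed solution is SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it then forces the difference between the (de)quantized and original keys to stay orthogonal to that subspace, which minimizes the effect of quantization error on the attention output. The method needs no model fine-tuning and no extra offline calibration dataset, and it rests on a theoretical framework developed by the authors. In numerical experiments it reduces peak memory by 2.17× to 2.82×, raises throughput by 2.45× to 3.60×, and achieves better benchmark scores.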

Link: https://arxiv.org/abs/2503.24358
Authors: Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
Affiliations: Red Hat AI Innovation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments:

Abstract: The key-value (KV) cache accelerates LLM decoding by storing KV tensors from previously generated tokens. It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism’s outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by 2.17× to 2.82×, improves throughput by 2.45× to 3.60×, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
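To see why orthogonality to the query subspace matters, the sketch below shows that an error vector orthogonal to every query leaves all query-key dot products, and therefore the attention logits, unchanged. It only demonstrates this principle: the toy vectors, the 0.5 rounding grid, and the helper names are assumptions, and the adjusted key here is no longer on the quantization grid, so this is not the SQuat algorithm itself.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def remove_subspace_component(error, basis):
    """Project `error` onto the orthogonal complement of span(basis)."""
    e = list(error)
    for b in basis:
        c = dot(e, b)
        e = [ei - c * bi for ei, bi in zip(e, b)]
    return e


# Toy queries and one key (hypothetical values).
queries = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
key = [0.37, -1.42, 0.88, 0.15]

# Naive quantization: round each coordinate to a grid with step 0.5.
naive = [round(x * 2) / 2 for x in key]
error = [n - k for n, k in zip(naive, key)]

# Keep only the part of the error that the queries cannot "see".
basis = orthonormal_basis(queries)
adjusted_error = remove_subspace_component(error, basis)
adjusted = [k + e for k, e in zip(key, adjusted_error)]

for q in queries:
    print(dot(q, key), dot(q, naive), dot(q, adjusted))
```

For each query, the third printed value matches the first up to floating-point error, while the naive rounding shifts the logit.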

[NLP-6] ORAL: Prompting Your Large-Scale LoRAs via Conditional Recurrent Diffusion

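[Quick Read]: This paper addresses a key limitation of existing methods for Low-Rank Adaptation (LoRA) of evolving large language models (LLMs): they cannot achieve scalability and controllability at the same time. It proposes ORAL, a framework built on conditional recurrent diffusion whose conditioning mechanism integrates the model architecture with textual task specifications. This lets it generate task-specific LoRA parameters that transfer across evolving versions of a foundation model. The approach scales to LLMs with billions of parameters while remaining controllable. In experiments on seven language tasks, four vision tasks, and three multimodal tasks, the LoRA parameters generated by ORAL match or exceed the performance of conventionally trained ones.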

Link: https://arxiv.org/abs/2503.24354
Authors: Rana Muhammad Shahroz Khan, Dongwen Tang, Pingzhi Li, Kai Wang, Tianlong Chen
Affiliations: The University of North Carolina at Chapel Hill; National University of Singapore
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Parameter generation has emerged as a novel paradigm for neural network development, offering an alternative to traditional neural network training by synthesizing high-quality model weights directly. In the context of Low-Rank Adaptation (LoRA) for evolving (i.e., constantly updated) large language models (LLMs), this approach promises efficient adaptation without costly retraining. However, existing methods face critical limitations in simultaneously achieving scalability and controllability. In this paper, we introduce ORAL, a novel conditional recurrent diffusion framework that addresses these challenges. ORAL incorporates a novel conditioning mechanism that integrates model architecture and textual task specifications, enabling the generation of task-specific LoRA parameters that can seamlessly transfer across evolving foundation models. Our approach successfully scales to billions-of-parameter LLMs and maintains controllability. Through extensive experiments across seven language tasks, four vision tasks, and three multimodal tasks using five pre-trained LLMs, we demonstrate that ORAL generates high-quality LoRA parameters that achieve comparable or superior performance to vanilla trained counterparts.

[NLP-7] BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models

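[Quick Read]: This paper addresses the evaluation of Bias, Ethics, Fairness, and Factuality in large language models (LLMs). Its key contribution is the BEATS framework and an accompanying bias benchmark that measures LLM performance across 29 distinct metrics. The metrics cover demographic, cognitive, and social biases, as well as ethical reasoning, group fairness, and factuality-related misinformation risk. Together they quantify the extent to which LLM responses may perpetuate societal prejudices and widen systemic inequities. The experiments show that 37.65% of outputs from industry-leading models contained some form of bias, which underscores the risk of using these models in critical decision-making systems. BEATS offers a scalable and statistically rigorous way to benchmark LLMs, diagnose the factors driving bias, and develop mitigation strategies, with the goal of supporting more socially responsible and ethically aligned AI models.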

Link: https://arxiv.org/abs/2503.24310
Authors: Alok Abhishek, Lisa Erickson, Tushar Bandopadhyay
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 32 pages, 33 figures, preprint version

Abstract:In this research, we introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs). Building upon the BEATS framework, we present a bias benchmark for LLMs that measure performance across 29 distinct metrics. These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality related misinformation risk. These metrics enable a quantitative assessment of the extent to which LLM generated responses may perpetuate societal prejudices that reinforce or expand systemic inequities. To achieve a high score on this benchmark a LLM must show very equitable behavior in their responses, making it a rigorous standard for responsible AI evaluation. Empirical results based on data from our experiment show that, 37.65% of outputs generated by industry leading models contained some form of bias, highlighting a substantial risk of using these models in critical decision making systems. BEATS framework and benchmark offer a scalable and statistically rigorous methodology to benchmark LLMs, diagnose factors driving biases, and develop mitigation strategies. With the BEATS framework, our goal is to help the development of more socially responsible and ethically aligned AI models.

[NLP-8] A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG

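[Quick Read]: This paper systematically compares three LLM-based approaches to analyzing mental health text: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. Using LLaMA 3, it evaluates them on emotion classification and mental health condition detection across two datasets. Fine-tuning achieves the highest accuracy (91% for emotion classification, 80% for mental health condition detection) but requires substantial computational resources and large training sets. Prompt engineering and RAG are more flexible to deploy but reach only moderate performance (40% to 68% accuracy). The results give practical guidance for LLM-based mental health applications and highlight the trade-offs between accuracy, computational requirements, and deployment flexibility.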

Link: https://arxiv.org/abs/2503.24307
Authors: Arshia Kermani, Veronica Perez-Rosas, Vangelis Metsis
Affiliations: Texas State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:This study presents a systematic comparison of three approaches for the analysis of mental health text using large language models (LLMs): prompt engineering, retrieval augmented generation (RAG), and fine-tuning. Using LLaMA 3, we evaluate these approaches on emotion classification and mental health condition detection tasks across two datasets. Fine-tuning achieves the highest accuracy (91% for emotion classification, 80% for mental health conditions) but requires substantial computational resources and large training sets, while prompt engineering and RAG offer more flexible deployment with moderate performance (40-68% accuracy). Our findings provide practical insights for implementing LLM-based solutions in mental health applications, highlighting the trade-offs between accuracy, computational requirements, and deployment flexibility.

[NLP-9] Is analogy enough to draw novel adjective-noun inferences?

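[Quick Read]: This paper asks whether the ability of humans and large language models (LLMs) to generalize to novel adjective-noun combinations can be explained by analogical reasoning rather than by a compositional mechanism. The authors build a model of analogical reasoning based on similarity over lexical items and also ask human participants to reason by analogy. The analogy strategy works well for a large proportion of the dataset, but there are novel combinations for which humans and LLMs derive convergent inferences that analogy does not handle well. The authors conclude that generalization in these cases cannot be fully reduced to analogy and likely involves composition.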

Link: https://arxiv.org/abs/2503.24293
Authors: Hayley Ross, Kathryn Davidson, Najoung Kim
Affiliations: Harvard University; Boston University
Categories: Computation and Language (cs.CL)
Comments: 8 pages (16 pages with appendix). Submitted to SCiL 2025

Abstract:Recent work (Ross et al., 2025, 2024) has argued that the ability of humans and LLMs respectively to generalize to novel adjective-noun combinations shows that they each have access to a compositional mechanism to determine the phrase’s meaning and derive inferences. We study whether these inferences can instead be derived by analogy to known inferences, without need for composition. We investigate this by (1) building a model of analogical reasoning using similarity over lexical items, and (2) asking human participants to reason by analogy. While we find that this strategy works well for a large proportion of the dataset of Ross et al. (2025), there are novel combinations for which both humans and LLMs derive convergent inferences but which are not well handled by analogy. We thus conclude that the mechanism humans and LLMs use to generalize in these cases cannot be fully reduced to analogy, and likely involves composition.

[NLP-10] Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

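[Quick Read]: This paper tackles the scalability, simplicity, and accessibility of large-scale reasoning-oriented reinforcement learning (RL) training. Its key finding is that a minimalist recipe, vanilla PPO with GAE (λ = 1, γ = 1) and straightforward rule-based rewards without any KL regularization, is enough to scale up both response length and benchmark performance, similar to what was observed for DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, Open-Reasoner-Zero achieves superior performance on AIME2024, MATH500, and GPQA Diamond while needing only a tenth of the training steps.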

Link: https://arxiv.org/abs/2503.24290
Authors: Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
Affiliations: StepFun; Tsinghua University; Open-Reasoner-Zero
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract: We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE (λ = 1, γ = 1) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, similar to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency, requiring only a tenth of the training steps compared to the DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
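The GAE setting named here has a simple consequence that a short sketch can make concrete: with λ = 1 and γ = 1, the advantage at each step reduces to the undiscounted return-to-go minus the value estimate. The function below is the standard GAE recursion for a single episode. The reward and value numbers are made up, and the terminal-state value of 0.0 is an assumption of this toy setup rather than a detail taken from the paper.

```python
def gae(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation for one episode.

    `values` has one more entry than `rewards`: the last entry is the
    value of the terminal state (0.0 in this toy example).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Sparse rule-based reward: 1.0 only at the final step (e.g. a correct answer).
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7, 0.9, 0.0]  # hypothetical critic outputs

adv = gae(rewards, values)  # lambda = 1, gamma = 1

# With lambda = gamma = 1 the TD residuals telescope to return-to-go minus value.
returns_to_go = [sum(rewards[t:]) for t in range(len(rewards))]
mc_adv = [g - v for g, v in zip(returns_to_go, values)]
print(adv, mc_adv)
```

In this setting every token of a response receives the full final reward as its return, which is one way to read the simplicity the abstract emphasizes.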

[NLP-11] Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning

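[Quick Read]: This paper addresses how to combine large language models (LLMs) with recommendation systems effectively, with the specific goal of improving LLM performance on recommendation tasks through closed-loop optimization. The proposed Rec-R1 framework optimizes LLM generation directly from the feedback of a fixed black-box recommendation model, rather than relying on synthetic supervised fine-tuning (SFT) data from proprietary models such as GPT-4o. This avoids the substantial cost of data distillation. The method is validated on product search and sequential recommendation, and it preserves the general-purpose capabilities of the LLM instead of impairing instruction following and reasoning.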

Link: https://arxiv.org/abs/2503.24289
Authors: Jiacheng Lin, Tian Wang, Kun Qian
Affiliations: University of Illinois Urbana-Champaign; Amazon
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:We propose Rec-R1, a general reinforcement learning framework that bridges large language models (LLMs) with recommendation systems through closed-loop optimization. Unlike prompting and supervised fine-tuning (SFT), Rec-R1 directly optimizes LLM generation using feedback from a fixed black-box recommendation model, without relying on synthetic SFT data from proprietary models such as GPT-4o. This avoids the substantial cost and effort required for data distillation. To verify the effectiveness of Rec-R1, we evaluate it on two representative tasks: product search and sequential recommendation. Experimental results demonstrate that Rec-R1 not only consistently outperforms prompting- and SFT-based methods, but also achieves significant gains over strong discriminative baselines, even when used with simple retrievers such as BM25. Moreover, Rec-R1 preserves the general-purpose capabilities of the LLM, unlike SFT, which often impairs instruction-following and reasoning. These findings suggest Rec-R1 as a promising foundation for continual task-specific adaptation without catastrophic forgetting.
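One way to picture feedback from a fixed black-box recommendation model is as a scalar reward computed from the ranking that model returns for LLM-generated text. The sketch below uses NDCG@k as that reward and a word-overlap ranker as a stand-in for the black box. The metric choice, the toy catalog, and every function name are assumptions for illustration and are not taken from the paper.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a ranked list of gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k."""
    gains = [1.0 if d in relevant_ids else 0.0 for d in ranked_ids[:k]]
    ideal = dcg([1.0] * min(k, len(relevant_ids)))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def toy_retriever(query, catalog, k=3):
    """Stand-in for the fixed black-box model: rank items by word overlap."""
    q = set(query.lower().split())
    ranked = sorted(catalog.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return [doc_id for doc_id, _ in ranked[:k]]


def reward(generated_query, catalog, relevant_ids, k=3):
    """Reward for one LLM generation: quality of the ranking it induces."""
    return ndcg_at_k(toy_retriever(generated_query, catalog, k), relevant_ids, k)


catalog = {
    "p1": "waterproof hiking boots",
    "p2": "leather office shoes",
    "p3": "trail running shoes",
    "p4": "wool hiking socks",
}
relevant = {"p1"}

r_vague = reward("shoes", catalog, relevant)
r_specific = reward("waterproof hiking boots", catalog, relevant)
print(r_vague, r_specific)
```

A reinforcement learning loop would then push the LLM toward generations like the second query, without ever needing gradients from the retriever.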

[NLP-12] MaintainCoder: Maintainable Code Generation Under Dynamic Requirements

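[Quick Read]: This paper addresses the neglect of maintainability in modern code generation. Existing systems have advanced in functional correctness and execution efficiency but adapt poorly when requirements change. The proposed solution, MaintainCoder, integrates the Waterfall model, design patterns, and multi-agent collaboration to systematically improve cohesion, reduce coupling, and increase adaptability, so that changing requirements can be handled with minimal rework. The paper also introduces the MaintainBench benchmark, on which MaintainCoder improves maintainability metrics by 14% to 30% over existing methods while achieving even higher correctness (pass@k).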

Link: https://arxiv.org/abs/2503.24260
Authors: Zhengren Wang, Rui Ling, Chufan Wang, Yongan Yu, Zhiyu Li, Feiyu Xiong, Wentao Zhang
Affiliations: Center for Data Science, Peking University; McGill University; Center for LLM, Institute for Advanced Algorithms Research, Shanghai
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments:

Abstract: Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real-world software development: maintainability. To handle dynamic requirements with minimal rework, we propose MaintainCoder as a pioneering solution. It integrates the Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, and improve adaptability. We also introduce MaintainBench, a benchmark comprising requirement changes and corresponding dynamic metrics on maintenance effort. Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves maintainability metrics by 14-30% with even higher correctness, i.e. pass@k. Our work not only provides the foundation of maintainable code generation, but also highlights the need for more holistic code quality research. Resources: this https URL.

[NLP-13] Enhancing Large Language Models (LLMs) for Telecommunications using Knowledge Graphs and Retrieval-Augmented Generation

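[Quick Read]: This paper addresses the challenges that large language models (LLMs) face in specialized domains such as telecommunications, which demand specialized expertise and adaptability to rapidly evolving standards. It proposes KG-RAG, a framework that combines a knowledge graph (KG) with retrieval-augmented generation (RAG). The KG captures structured telecom domain knowledge, and integrating it with RAG lets the LLM dynamically access and use the most relevant and up-to-date knowledge during response generation, which significantly improves accuracy, adaptability, and domain-specific comprehension.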

Link: https://arxiv.org/abs/2503.24245
Authors: Dun Yuan, Hao Zhou, Di Wu, Xue Liu, Hao Chen, Yan Xin, Jianzhong (Charlie) Zhang
Affiliations: School of Computer Science, McGill University; Department of Electrical and Computer Engineering, McGill University; Standards and Mobility Innovation Lab, Samsung Research America
Categories: Computation and Language (cs.CL)
Comments: This work has been accepted to ICC 2025 IEEE International Conference on Communications. copyright 2025 IEEE

Abstract:Large language models (LLMs) have made significant progress in general-purpose natural language processing tasks. However, LLMs are still facing challenges when applied to domain-specific areas like telecommunications, which demands specialized expertise and adaptability to evolving standards. This paper presents a novel framework that combines knowledge graph (KG) and retrieval-augmented generation (RAG) techniques to enhance LLM performance in the telecom domain. The framework leverages a KG to capture structured, domain-specific information about network protocols, standards, and other telecom-related entities, comprehensively representing their relationships. By integrating KG with RAG, LLMs can dynamically access and utilize the most relevant and up-to-date knowledge during response generation. This hybrid approach bridges the gap between structured knowledge representation and the generative capabilities of LLMs, significantly enhancing accuracy, adaptability, and domain-specific comprehension. Our results demonstrate the effectiveness of the KG-RAG framework in addressing complex technical queries with precision. The proposed KG-RAG model attained an accuracy of 88% for question answering tasks on a frequently used telecom-specific dataset, compared to 82% for the RAG-only and 48% for the LLM-only approaches.
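The combination of a knowledge graph with retrieval can be sketched as two steps: look up triples whose entities appear in the question, then place them in the prompt as grounding facts. The tiny triple list, the substring-matching retriever, and the prompt layout below are illustrative assumptions. The paper's actual graph construction and retrieval method are not described in this digest.

```python
# A tiny hand-written knowledge graph of (subject, predicate, object) triples.
TRIPLES = [
    ("5G NR", "standardized_by", "3GPP"),
    ("LTE", "standardized_by", "3GPP"),
    ("gNB", "is_base_station_of", "5G NR"),
    ("eNB", "is_base_station_of", "LTE"),
]


def retrieve_triples(question, triples):
    """Return triples whose subject or object is mentioned in the question.

    Naive case-insensitive substring matching; a real system would need
    entity linking to avoid accidental matches inside longer words.
    """
    q = question.lower()
    return [t for t in triples if t[0].lower() in q or t[2].lower() in q]


def build_prompt(question, triples):
    """Assemble an LLM prompt that lists the retrieved facts before the question."""
    facts = "\n".join(f"- {s} {p.replace('_', ' ')} {o}" for s, p, o in triples)
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nAnswer:"


question = "Which body standardizes 5G NR?"
hits = retrieve_triples(question, TRIPLES)
prompt = build_prompt(question, hits)
print(prompt)
```

Because the facts are looked up at query time, updating the graph changes the model's context without retraining, which is the adaptability argument the abstract makes.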

[NLP-14] What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

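[Quick Read]: This survey addresses the lack of a systematic understanding of test-time scaling (TTS). Despite the recent surge of work in this area, no comprehensive survey has provided a structured view of the topic. The authors propose a unified, multidimensional framework organized along four core dimensions: what to scale, how to scale, where to scale, and how well to scale. Building on this taxonomy, they review methods, application scenarios, and assessment aspects in detail and clarify the distinct functional role each technique plays within the broader TTS landscape. From this analysis they distill the main developmental trajectories of TTS and offer hands-on guidelines for practical deployment. They also identify open challenges and promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions.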

Link: https://arxiv.org/abs/2503.24235
Authors: Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, Chen Ma
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as “test-time computing,” has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended QA. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions.

[NLP-15] PAARS: Persona Aligned Agentic Retail Shoppers

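[Quick Read]: This paper addresses the cost and slowness of collecting behavioral data for decision making in e-commerce and proposes simulating human population behavior with agents powered by large language models (LLMs) as an alternative. Because LLMs are known to exhibit biases, such as brand bias, review rating bias, and limited representation of certain groups, they must be carefully benchmarked and aligned. The central goal is to synthesize an agent population and verify that it collectively approximates a real sample of humans.

The key to the solution is a framework that (i) creates synthetic shopping agents by automatically mining personas from anonymized historical shopping data, (ii) equips the agents with retail-specific tools to synthesize shopping sessions, and (iii) introduces a new alignment suite that measures distributional differences between humans and shopping agents at the group (population) level rather than the traditional individual level. Experiments show that personas improve performance on the alignment suite, although a gap to human behavior remains. The paper also demonstrates an initial application to automated agentic A/B testing, compares the findings with human results, and discusses applications, limitations, and challenges for future work.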

Link: https://arxiv.org/abs/2503.24228
Authors: Saab Mansour, Leonardo Perelli, Lorenzo Mainetti, George Davidson, Stefano D’Amato
Affiliations: Amazon
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:In e-commerce, behavioral data is collected for decision making which can be costly and slow. Simulation with LLM powered agents is emerging as a promising alternative for representing human population behavior. However, LLMs are known to exhibit certain biases, such as brand bias, review rating bias and limited representation of certain groups in the population, hence they need to be carefully benchmarked and aligned to user behavior. Ultimately, our goal is to synthesise an agent population and verify that it collectively approximates a real sample of humans. To this end, we propose a framework that: (i) creates synthetic shopping agents by automatically mining personas from anonymised historical shopping data, (ii) equips agents with retail-specific tools to synthesise shopping sessions and (iii) introduces a novel alignment suite measuring distributional differences between humans and shopping agents at the group (i.e. population) level rather than the traditional “individual” level. Experimental results demonstrate that using personas improves performance on the alignment suite, though a gap remains to human behaviour. We showcase an initial application of our framework for automated agentic A/B testing and compare the findings to human results. Finally, we discuss applications, limitations and challenges setting the stage for impactful future work.
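Measuring alignment at the population level means comparing distributions rather than matching individual agents to individual people. As a minimal sketch, the code below compares the distribution of purchased categories for a human sample and an agent sample using Jensen-Shannon divergence. The metric, the category labels, and the sample data are assumptions for illustration, since the digest does not say which measures the paper's alignment suite uses.

```python
import math
from collections import Counter


def distribution(samples, support):
    """Empirical probability of each outcome in `support`."""
    counts = Counter(samples)
    return [counts[x] / len(samples) for x in support]


def kl(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


support = ["electronics", "grocery", "apparel"]

# Hypothetical purchase categories for a human sample and two agent populations.
humans = ["grocery"] * 5 + ["apparel"] * 3 + ["electronics"] * 2
agents_no_persona = ["electronics"] * 7 + ["grocery"] * 2 + ["apparel"] * 1
agents_with_persona = ["grocery"] * 4 + ["apparel"] * 3 + ["electronics"] * 3

p_h = distribution(humans, support)
gap_no_persona = js_divergence(p_h, distribution(agents_no_persona, support))
gap_with_persona = js_divergence(p_h, distribution(agents_with_persona, support))
print(gap_no_persona, gap_with_persona)
```

A smaller divergence means the agent population as a whole behaves more like the human sample, even if no single agent mirrors a specific person.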

[NLP-16] BAR-Analytics: A Web-based Platform for Analyzing Information Spreading Barriers in News: Comparative Analysis Across Multiple Barriers and Events

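[Quick Read]: This paper addresses how to analyze news dissemination across geographical, economic, political, and cultural boundaries. The solution is BAR-Analytics, a web-based, open-source platform that integrates four analytical methods: propagation analysis, trend analysis, sentiment analysis, and temporal topic modeling. More than 350,000 articles were collected and analyzed, with metadata enrichment used to focus on economic disparities and geographical influences. The case studies are evaluated with coherence, sentiment polarity, topic frequency, and trend shifts as key metrics, which reveals distinct patterns in media coverage across different conflicts.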

Link: https://arxiv.org/abs/2503.24220
Authors: Abdul Sittar, Dunja Mladenic, Alenka Gucek, Marko Grobelnik
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 46 pages

Abstract:This paper presents BAR-Analytics, a web-based, open-source platform designed to analyze news dissemination across geographical, economic, political, and cultural boundaries. Using the Russian-Ukrainian and Israeli-Palestinian conflicts as case studies, the platform integrates four analytical methods: propagation analysis, trend analysis, sentiment analysis, and temporal topic modeling. Over 350,000 articles were collected and analyzed, with a focus on economic disparities and geographical influences using metadata enrichment. We evaluate the case studies using coherence, sentiment polarity, topic frequency, and trend shifts as key metrics. Our results show distinct patterns in news coverage: the Israeli-Palestinian conflict tends to have more negative sentiment with a focus on human rights, while the Russia-Ukraine conflict is more positive, emphasizing election interference. These findings highlight the influence of political, economic, and regional factors in shaping media narratives across different conflicts.

[NLP-17] MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing

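[Quick read]: This paper aims to support both object detection (OD) and visual grounding (VG) in remote sensing imagery. To this end, it proposes a unified framework that fine-tunes an open-set object detector on referring expression data, which supports conventional OD and establishes an intuitive prior for VG. The key to the solution is a graph representation of each image comprising object queries, class embeddings, and proposal locations, together with a task-aware architecture that processes this graph to perform VG. The architecture consists of a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and an object reasoning network combined with a soft selection mechanism for final referring object localization. The model performs strongly on the OPT-RSVG and DIOR-RSVG datasets, significantly outperforming existing methods while retaining classical OD capabilities.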

Link: https://arxiv.org/abs/2503.24219
Authors: Karim Radouane,Hanane Azzag,Mustapha Lebbah
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Then, our task-aware architecture processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: this https URL.
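
The abstract does not spell out how the soft selection step works. As a minimal sketch, assuming it amounts to a softmax over per-proposal scores followed by a probability-weighted average of the proposal boxes, it could look like the following. The function names, the `[x1, y1, x2, y2]` box format, and the example scores are illustrative assumptions, not the authors' code.

```python
import math

def softmax(scores):
    """Turn raw proposal scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_select(boxes, scores):
    """Hypothetical soft selection: average the boxes, weighted by probability.

    boxes  -- list of [x1, y1, x2, y2] proposals
    scores -- one raw score per proposal, as a reasoning network might emit
    """
    probs = softmax(scores)
    box = [sum(p * b[i] for p, b in zip(probs, boxes)) for i in range(4)]
    return box, probs

# Two near-duplicate proposals with high scores and one distant low-score proposal.
boxes = [[10.0, 10.0, 50.0, 50.0], [12.0, 8.0, 52.0, 48.0], [200.0, 200.0, 240.0, 240.0]]
scores = [4.0, 4.0, -6.0]
box, probs = soft_select(boxes, scores)
print(box, probs)
```

Unlike a hard argmax, this keeps the localization differentiable with respect to the scores, which is one common reason to prefer a soft selection during training.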

[NLP-18] Synthetic News Generation for Fake News Classification

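[Quick read]: This paper studies the generation and evaluation of synthetic fake news through fact-based manipulations using large language models (LLMs), and explores the use of synthetic data in fake news classification. The key to the solution is a methodology that extracts key facts from real articles, modifies them, and regenerates content to simulate fake news while maintaining coherence. The paper also introduces three evaluation metrics (coherence, dissimilarity, and correctness) to assess the quality of the generated content, and finds that fact verification features are the most promising for distinguishing synthetic fake news. The study further shows that transformer-based models, especially BERT, effectively leverage synthetic data for fake news detection, with improvements even at smaller proportions of synthetic data. Targeted improvements in synthetic data generation could further strengthen detection systems.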

Link: https://arxiv.org/abs/2503.24206
Authors: Abdul Sittar,Luka Golob,Mateja Smiljanic
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 8 figures

Abstract:This study explores the generation and evaluation of synthetic fake news through fact-based manipulations using large language models (LLMs). We introduce a novel methodology that extracts key facts from real articles, modifies them, and regenerates content to simulate fake news while maintaining coherence. To assess the quality of the generated content, we propose a set of evaluation metrics: coherence, dissimilarity, and correctness. The research also investigates the application of synthetic data in fake news classification, comparing traditional machine learning models with transformer-based models such as BERT. Our experiments demonstrate that transformer models, especially BERT, effectively leverage synthetic data for fake news detection, showing improvements with smaller proportions of synthetic data. Additionally, we find that fact verification features, which focus on identifying factual inconsistencies, provide the most promising results in distinguishing synthetic fake news. The study highlights the potential of synthetic data to enhance fake news detection systems, offering valuable insights for future research and suggesting that targeted improvements in synthetic data generation can further strengthen detection models.

[NLP-19] TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers Guidance

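[Quick read]: This paper addresses the increase in output tokens during inference, and the resulting rise in computational cost, that comes with enhanced reasoning in large language models (LLMs). It proposes TwT (Thinking without Tokens), which reduces inference-time cost through habitual reasoning distillation with multi-teacher guidance while maintaining high performance. Specifically, the method introduces Habitual Reasoning Distillation, which uses a teacher-guided compression strategy inspired by human cognition to internalize explicit reasoning into the model's habitual behavior. It also proposes Dual-Criteria Rejection Sampling (DCRS), which uses multiple teacher models to generate a high-quality and diverse distillation dataset, making the method suitable for unsupervised scenarios. Experiments show that TwT reduces inference cost while preserving strong performance, achieving up to a 13.6% improvement in accuracy with fewer output tokens compared to other distillation methods, and offering a practical solution for efficient LLM deployment.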

Link: https://arxiv.org/abs/2503.24198
Authors: Jingxian Xu,Mengyu Zhou,Weichang Liu,Hanbing Liu,Shi Han,Dongmei Zhang
Affiliations: Nankai University; Beijing Jiaotong University; Tsinghua University; Microsoft Research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have made significant strides in problem-solving by incorporating reasoning processes. However, this enhanced reasoning capability results in an increased number of output tokens during inference, leading to higher computational costs. To address this challenge, we propose TwT (Thinking without Tokens), a method that reduces inference-time costs through habitual reasoning distillation with multi-teachers’ guidance, while maintaining high performance. Our approach introduces a Habitual Reasoning Distillation method, which internalizes explicit reasoning into the model’s habitual behavior through a Teacher-Guided compression strategy inspired by human cognition. Additionally, we propose Dual-Criteria Rejection Sampling (DCRS), a technique that generates a high-quality and diverse distillation dataset using multiple teacher models, making our method suitable for unsupervised scenarios. Experimental results demonstrate that TwT effectively reduces inference costs while preserving superior performance, achieving up to a 13.6% improvement in accuracy with fewer output tokens compared to other distillation methods, offering a highly practical solution for efficient LLM deployment.
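
The abstract names two criteria for DCRS, quality and diversity, but gives no procedure. The sketch below is one possible reading under stated assumptions: each teacher output carries a quality score, low-quality outputs are rejected, and an output is also rejected if it is too similar to one already kept. The Jaccard word-overlap similarity, the thresholds, and all names are hypothetical and are not taken from the paper.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (illustrative choice)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def dual_criteria_filter(candidates, min_quality, max_similarity):
    """Keep candidates that pass both a quality and a diversity check.

    candidates -- list of (text, quality_score) pairs from several teachers
    """
    kept = []
    # Visit the best-scored candidates first so they win ties on diversity.
    for text, quality in sorted(candidates, key=lambda c: c[1], reverse=True):
        if quality < min_quality:
            continue  # criterion 1: reject low-quality samples
        if all(jaccard(text, k) <= max_similarity for k in kept):
            kept.append(text)  # criterion 2: reject near-duplicates
    return kept

candidates = [
    ("add the two numbers then divide by two", 0.9),
    ("add the two numbers then divide by 2", 0.8),
    ("guess the answer", 0.2),
    ("compute the midpoint of the interval directly", 0.7),
]
kept = dual_criteria_filter(candidates, min_quality=0.5, max_similarity=0.6)
print(kept)
```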

[NLP-20] Implicit In-Context Learning: Evidence from Artificial Language Experiments

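[Quick read]: This paper asks whether LLMs (large language models) exhibit human-like linguistic pattern recognition at inference time. The key to the solution is adapting three classic artificial language learning experiments, spanning morphology, morphosyntax, and syntax, to systematically evaluate implicit learning at inference time in two state-of-the-art OpenAI models (gpt-4o and o3-mini), revealing alignment between models and human behavior that is specific to the linguistic domain.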

Link: https://arxiv.org/abs/2503.24190
Authors: Xiaomeng Ma,Qihui Xu
Affiliations: AWS; Department of Psychology, Ohio State University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Humans acquire language through implicit learning, absorbing complex patterns without explicit awareness. While LLMs demonstrate impressive linguistic capabilities, it remains unclear whether they exhibit human-like pattern recognition during in-context learning at the inference level. We adapted three classic artificial language learning experiments spanning morphology, morphosyntax, and syntax to systematically evaluate implicit learning at the inference level in two state-of-the-art OpenAI models: gpt-4o and o3-mini. Our results reveal linguistic domain-specific alignment between models and human behaviors: o3-mini aligns better in morphology, while both models align in syntax.

[NLP-21] Multi-Task Learning for Extracting Menstrual Characteristics from Clinical Notes

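[Quick read]: This paper addresses the lack of detailed data on menstrual characteristics in structured medical records. To fill this gap, it proposes a novel natural language processing (NLP) pipeline to extract key menstrual cycle attributes: dysmenorrhea, regularity, flow volume, and intermenstrual bleeding. The key to the solution is the GatorTron model with multi-task prompt-based learning, enhanced by a hybrid retrieval preprocessing step that identifies relevant text segments. The approach reaches an average F1-score of 90%, and the retrieval step improves performance across all approaches by letting the model focus on the most relevant parts of clinical notes. These results indicate that combining multi-task learning with retrieval improves generalization and performance in automated extraction of menstrual characteristics, supporting women's health research.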

Link: https://arxiv.org/abs/2503.24116
Authors: Anna Shopova,Cristoph Lippert,Leslee J. Shaw,Eugenia Alleva
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Menstrual health is a critical yet often overlooked aspect of women’s healthcare. Despite its clinical relevance, detailed data on menstrual characteristics is rarely available in structured medical records. To address this gap, we propose a novel Natural Language Processing pipeline to extract key menstrual cycle attributes – dysmenorrhea, regularity, flow volume, and intermenstrual bleeding. Our approach utilizes the GatorTron model with Multi-Task Prompt-based Learning, enhanced by a hybrid retrieval preprocessing step to identify relevant text segments. It outperforms baseline methods, achieving an average F1-score of 90% across all menstrual characteristics, despite being trained on fewer than 100 annotated clinical notes. The retrieval step consistently improves performance across all approaches, allowing the model to focus on the most relevant segments of lengthy clinical notes. These results show that combining multi-task learning with retrieval improves generalization and performance across menstrual characteristics, advancing automated extraction from clinical notes and supporting women’s health research.

[NLP-22] TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

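[Quick read]: This paper addresses the lack of high-quality multimodal training data for telecom fraud detection, in particular data that integrates audio signals with reasoning-oriented textual analysis. The key to the solution is TeleAntiFraud-28k, an open-source audio-text slow-thinking dataset designed for automated telecom fraud analysis. The dataset is built through three strategies: (1) privacy-preserved text-truth sample generation from call recordings transcribed by automatic speech recognition (ASR), with text-to-speech (TTS) model regeneration to ensure real-world consistency; (2) semantic enhancement via large language model (LLM)-based self-instruction sampling to expand scenario coverage; (3) multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The authors also construct the TeleAntiFraud-Bench evaluation benchmark, contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, and open-source the data processing framework to enable community-driven dataset expansion. The work establishes a foundational framework for multimodal anti-fraud research while addressing key challenges in data privacy and scenario diversity.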

Link: https://arxiv.org/abs/2503.24115
Authors: Zhiming Ma,Peidong Wang,Minhua Huang,Jingpeng Wang,Kai Wu,Xiangzhao Lv,Yachun Pang,Yin Yang,Wenjie Tang,Yuchen Kang
Affiliations: China Mobile Internet Company Ltd., Guangzhou, Guangdong, China
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at this https URL.

[NLP-23] Grounding Agent Reasoning in Image Schemas: A Neurosymbolic Approach to Embodied Cognition

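[Quick read]: This paper addresses the difficulty that existing embodied AI systems have in capturing the fundamental conceptual structures humans naturally use to understand and interact with their environment. It proposes a new framework that bridges embodied cognition theory and agent systems by leveraging a formal characterization of image schemas, defined as recurring patterns of sensorimotor experience that structure human cognition. The key to the solution is customizing large language models (LLMs) to translate natural language descriptions into formal representations based on these sensorimotor patterns, yielding a neurosymbolic system that grounds the agent's understanding in fundamental conceptual structures. The authors argue that this approach improves both efficiency and interpretability while enabling more intuitive human-agent interaction through shared embodied understanding.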

Link: https://arxiv.org/abs/2503.24110
Authors: François Olivier,Zied Bouraoui
Affiliations: CRIL
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Despite advances in embodied AI, agent reasoning systems still struggle to capture the fundamental conceptual structures that humans naturally use to understand and interact with their environment. To address this, we propose a novel framework that bridges embodied cognition theory and agent systems by leveraging a formal characterization of image schemas, which are defined as recurring patterns of sensorimotor experience that structure human cognition. By customizing LLMs to translate natural language descriptions into formal representations based on these sensorimotor patterns, we will be able to create a neurosymbolic system that grounds the agent’s understanding in fundamental conceptual structures. We argue that such an approach enhances both efficiency and interpretability while enabling more intuitive human-agent interactions through shared embodied understanding.

[NLP-24] Is LLM the Silver Bullet to Low-Resource Languages Machine Translation?

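[Quick read]: This paper addresses the performance bottleneck that low-resource languages (LRLs) face in natural language processing, and in translation in particular, because of data scarcity and underrepresentation. It systematically evaluates the limitations of current large language models (LLMs) across 200 languages using benchmarks such as FLORES-200. The key to the solution is exploring alternative data sources, such as news articles and bilingual dictionaries, combined with knowledge distillation from large pre-trained models to significantly improve the translation ability of smaller models on LRLs. The paper also investigates various fine-tuning strategies and finds that incremental enhancements markedly reduce the performance gap for smaller LLMs on LRLs.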

Link: https://arxiv.org/abs/2503.24102
Authors: Yewei Song,Lujun Li,Cedric Lothritz,Saad Ezzini,Lama Sleem,Niccolo Gentile,Radu State,Tegawendé F. Bissyandé,Jacques Klein
Affiliations: University of Luxembourg; Luxembourg Institute of Science and Technology; King Fahd University of Petroleum and Minerals; Foyer S.A.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Low-Resource Languages (LRLs) present significant challenges in natural language processing due to their limited linguistic resources and underrepresentation in standard datasets. While recent advancements in Large Language Models (LLMs) and Neural Machine Translation (NMT) have substantially improved translation capabilities for high-resource languages, performance disparities persist for LRLs, particularly impacting privacy-sensitive and resource-constrained scenarios. This paper systematically evaluates the limitations of current LLMs across 200 languages using benchmarks such as FLORES-200. We also explore alternative data sources, including news articles and bilingual dictionaries, and demonstrate how knowledge distillation from large pre-trained models can significantly improve smaller LRL translations. Additionally, we investigate various fine-tuning strategies, revealing that incremental enhancements markedly reduce performance gaps on smaller LLMs.

[NLP-25] Artificial Conversations, Real Results: Fostering Language Detection with Synthetic Data

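[Quick read]: This paper addresses the high cost and time required to obtain high-quality training data, especially for non-English languages such as Italian. It proposes using large language models (LLMs) to generate synthetic data. The key to the solution is a pipeline for generating synthetic data and a systematic study of the factors that influence the validity of LLM-generated synthetic data, including prompt strategy, text length, and target position. An empirical analysis on a specific task, inclusive language detection in Italian job advertisements, supports the effectiveness of fine-tuning models on synthetic data.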

Link: https://arxiv.org/abs/2503.24062
Authors: Fatemeh Mohammadi,Tommaso Romano,Samira Maghool,Paolo Ceravolo
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Collecting high-quality training data is essential for fine-tuning Large Language Models (LLMs). However, acquiring such data is often costly and time-consuming, especially for non-English languages such as Italian. Recently, researchers have begun to explore the use of LLMs to generate synthetic datasets as a viable alternative. This study proposes a pipeline for generating synthetic data and a comprehensive approach for investigating the factors that influence the validity of synthetic data generated by LLMs by examining how model performance is affected by metrics such as prompt strategy, text length and target position in a specific task, i.e. inclusive language detection in Italian job advertisements. Our results show that, in most cases and across different metrics, the fine-tuned models trained on synthetic data consistently outperformed other models on both real and synthetic test datasets. The study discusses the practical implications and limitations of using synthetic data for language detection tasks with LLMs.

[NLP-26] Crossing Boundaries: Leveraging Semantic Divergences to Explore Cultural Novelty in Cooking Recipes

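[Quick read]: This paper addresses the problem of quantifying and understanding cultural differences within computational frameworks, in particular the lack of robust metrics for cultural novelty. It proposes an interdisciplinary framework that integrates knowledge from sociology and management. The key to the solution is GlobalFusion, a new dataset of 500 dishes and approximately 100,000 cooking recipes from over 150 countries that captures cultural adaptation. By introducing a set of Jensen-Shannon Divergence metrics for novelty, the authors use this dataset to analyze the textual divergences that arise when recipes from one community are modified by another with a different cultural background. The results show significant correlations between the proposed cultural novelty metrics and established cultural measures based on linguistic, religious, and geographical distances, suggesting that the framework can advance the understanding and measurement of cultural diversity in AI.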

Link: https://arxiv.org/abs/2503.24027
Authors: Florian Carichon,Romain Rampa,Golnoosh Farnadi
Affiliations: MILA, McGill University; École de technologie supérieure (ÉTS)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Novelty modeling and detection is a core topic in Natural Language Processing (NLP), central to numerous tasks such as recommender systems and automatic summarization. It involves identifying pieces of text that deviate in some way from previously known information. However, novelty is also a crucial determinant of the unique perception of relevance and quality of an experience, as it rests upon each individual’s understanding of the world. Social factors, particularly cultural background, profoundly influence perceptions of novelty and innovation. Cultural novelty arises from differences in salience and novelty as shaped by the distance between distinct communities. While cultural diversity has garnered increasing attention in artificial intelligence (AI), the lack of robust metrics for quantifying cultural novelty hinders a deeper understanding of these divergences. This gap limits quantifying and understanding cultural differences within computational frameworks. To address this, we propose an interdisciplinary framework that integrates knowledge from sociology and management. Central to our approach is GlobalFusion, a novel dataset comprising 500 dishes and approximately 100,000 cooking recipes capturing cultural adaptation from over 150 countries. By introducing a set of Jensen-Shannon Divergence metrics for novelty, we leverage this dataset to analyze textual divergences when recipes from one community are modified by another with a different cultural background. The results reveal significant correlations between our cultural novelty metrics and established cultural measures based on linguistic, religious, and geographical distances. Our findings highlight the potential of our framework to advance the understanding and measurement of cultural diversity in AI.
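
To make the Jensen-Shannon Divergence mentioned above concrete, here is a minimal sketch that compares the word distributions of two recipe texts. The abstract does not say which distributions the paper's metrics are computed over, so the unigram counts, the toy recipes, and the function names are assumptions for illustration only.

```python
import math
from collections import Counter

def distribution(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy example: an original recipe and a culturally adapted variant.
original = "boil rice add soy sauce and ginger".split()
adapted = "boil rice add butter and cheese".split()
vocab = sorted(set(original) | set(adapted))
score = jsd(distribution(original, vocab), distribution(adapted, vocab))
print(round(score, 3))
```

Because the mixture `m` is nonzero wherever either input is nonzero, the divergence stays finite even when the two recipes share few words, which is a practical advantage over plain KL divergence.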

[NLP-27] You Cannot Feed Two Birds with One Score: the Accuracy-Naturalness Tradeoff in Translation

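[Quick read]: This paper addresses the problem that machine translation systems are usually evaluated with a single score (such as BLEU), which cannot fully reflect system performance, especially the tradeoff between accuracy and naturalness. A single score cannot simultaneously measure how well a translation preserves the meaning of the source text and how natural it reads in the target language.

The key to the solution: building on recent advances in information theory, the paper proves mathematically that a tradeoff exists between accuracy and naturalness, and demonstrates it empirically on the submissions to the WMT24 shared task. It also explains why optimizing for a specific accuracy metric (such as BLEU) can degrade naturalness. The authors advocate a new evaluation approach: rather than comparing systems using a single number, compare them on an accuracy-naturalness plane.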

Link: https://arxiv.org/abs/2503.24013
Authors: Gergely Flamich,David Vilar,Jan-Thorsten Peter,Markus Freitag
Affiliations: Imperial College London; Google
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The goal of translation, be it by human or by machine, is, given some text in a source language, to produce text in a target language that simultaneously 1) preserves the meaning of the source text and 2) achieves natural expression in the target language. However, researchers in the machine translation community usually assess translations using a single score intended to capture semantic accuracy and the naturalness of the output simultaneously. In this paper, we build on recent advances in information theory to mathematically prove and empirically demonstrate that such single-score summaries do not and cannot give the complete picture of a system’s true performance. Concretely, we prove that a tradeoff exists between accuracy and naturalness and demonstrate it by evaluating the submissions to the WMT24 shared task. Our findings help explain well-known empirical phenomena, such as the observation that optimizing translation systems for a specific accuracy metric (like BLEU) initially improves the system’s naturalness, while “overfitting” the system to the metric can significantly degrade its naturalness. Thus, we advocate for a change in how translations are evaluated: rather than comparing systems using a single number, they should be compared on an accuracy-naturalness plane.
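
One simple way to compare systems on a two-dimensional plane instead of a single number is Pareto dominance. The sketch below is an illustration of that idea only; the abstract does not say the authors use Pareto fronts, and the system names and scores are made up.

```python
def dominates(a, b):
    """True if point a is at least as good as b on both axes and better on one.

    Points are (accuracy, naturalness) pairs where higher is better.
    """
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(systems):
    """Names of systems that no other system dominates."""
    return sorted(
        name
        for name, point in systems.items()
        if not any(dominates(other, point)
                   for other_name, other in systems.items()
                   if other_name != name)
    )

# Hypothetical (accuracy, naturalness) scores for four systems.
systems = {
    "A": (0.90, 0.60),  # most accurate, less natural
    "B": (0.80, 0.85),  # most natural, less accurate
    "C": (0.75, 0.70),  # worse than B on both axes
    "D": (0.85, 0.55),  # worse than A on both axes
}
front = pareto_front(systems)
print(front)
```

Here neither A nor B can be called better than the other without choosing how to weigh the two axes, which is exactly the information a single score hides.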

[NLP-28] Comparing representations of long clinical texts for the task of patient note-identification

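[Quick read]: This paper addresses patient-note identification, that is, accurately matching an anonymized clinical note to its corresponding patient, where the patient is represented by a set of related notes. The task has broad applications, including duplicate record detection and patient similarity analysis. To obtain robust patient-level representations, the paper explores several embedding methods (Hierarchical Attention Networks (HAN), three-level Hierarchical Transformer Networks (HTN), LongFormer, and advanced BERT-based models), focusing on how effectively they process medium-to-long clinical texts. It also studies different pooling strategies (mean, max, and mean_max) and the impact of sliding windows on model performance. The key finding is that BERT-based embeddings combined with mean_max pooling best capture the nuances of lengthy clinical notes and give the best patient-note identification performance. Reproducing the results on both the MIMIC dataset and the Necker hospital data warehouse indicates that the approach generalizes to real-world applications.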

Link: https://arxiv.org/abs/2503.24006
Authors: Safa Alsaidi,Marc Vincent,Olivia Boyer,Nicolas Garcelon,Miguel Couceiro,Adrien Coulet
Affiliations: Inria, Inserm, UPC, HeKA U1346 (Paris, France); Data Science Platform, Imagine Institute, INSERM U1163, UPC, Paris, France; Néphrologie Pédiatrique, Centre de Référence MARHEA, Hôpital Universitaire Necker-Enfants Malades, Assistance Publique - Hôpitaux de Paris (APHP), Institut Imagine, INSERM U1163, UPC, Paris, France; Université de Lorraine, CNRS, LORIA, Nancy, France; INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we address the challenge of patient-note identification, which involves accurately matching an anonymized clinical note to its corresponding patient, represented by a set of related notes. This task has broad applications, including duplicate records detection and patient similarity analysis, which require robust patient-level representations. We explore various embedding methods, including Hierarchical Attention Networks (HAN), three-level Hierarchical Transformer Networks (HTN), LongFormer, and advanced BERT-based models, focusing on their ability to process medium-to-long clinical texts effectively. Additionally, we evaluate different pooling strategies (mean, max, and mean_max) for aggregating word-level embeddings into patient-level representations and we examine the impact of sliding windows on model performance. Our results indicate that BERT-based embeddings outperform traditional and hierarchical models, particularly in processing lengthy clinical notes and capturing nuanced patient representations. Among the pooling strategies, mean_max pooling consistently yields the best results, highlighting its ability to capture critical features from clinical notes. Furthermore, the reproduction of our results on both the MIMIC dataset and the Necker hospital data warehouse illustrates the generalizability of these approaches to real-world applications, emphasizing the importance of both embedding methods and aggregation strategies in optimizing patient-note identification and enhancing patient-level modeling.
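
The pooling and sliding-window ideas above can be sketched in a few lines. This assumes that mean_max means concatenating the mean-pooled and max-pooled vectors, which is a common convention but is not confirmed by the abstract; the window size, stride, and function names are likewise illustrative.

```python
def sliding_windows(tokens, size, stride):
    """Split a long token list into overlapping windows of at most `size` tokens."""
    if len(tokens) <= size:
        return [tokens]
    windows = [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, stride)]
    # Add a final window if the stride left the tail of the text uncovered.
    if (len(tokens) - size) % stride != 0:
        windows.append(tokens[-size:])
    return windows

def mean_max_pool(vectors):
    """Concatenate the element-wise mean and element-wise max of the vectors."""
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    peak = [max(v[d] for v in vectors) for d in range(dim)]
    return mean + peak  # output has twice the input dimension

# Toy word-level embeddings (2 dimensions) pooled into one representation.
embeddings = [[1.0, 4.0], [3.0, 2.0]]
pooled = mean_max_pool(embeddings)
windows = sliding_windows(list(range(11)), size=4, stride=3)
print(pooled, windows)
```

In a full pipeline, each window would be embedded by the encoder and the resulting vectors pooled the same way into a note-level or patient-level representation.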

[NLP-29] BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation

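[Quick read]: This paper addresses a limitation of current multimodal emotion recognition in conversation (MERC): existing methods based on multimodal large language models (MLLMs) focus mainly on the speaker's textual or vocal characteristics and overlook the behavior information available in video, such as facial expression, body language, and posture. These behavioral signals give the model cues for more accurate emotion prediction. The key contribution is a behavior-aware MLLM-based framework (BeMERC) that incorporates the speaker's subtle facial micro-expressions, body language, and posture into a vanilla MLLM-based MERC model, and so better models emotional dynamics during a conversation. BeMERC also adopts a two-stage instruction tuning strategy to extend the model to the conversation scenario and train a MERC predictor end to end. Experiments show that BeMERC outperforms state-of-the-art methods on two benchmark datasets, and the paper discusses in detail the significance of video-derived behavior information for MERC.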

Link: https://arxiv.org/abs/2503.23990
Authors: Yumeng Fu,Junjie Wu,Zhongjie Wang,Meishan Zhang,Yulin Wu,Bingquan Liu
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multimodal emotion recognition in conversation (MERC), the task of identifying the emotion label for each utterance in a conversation, is vital for developing empathetic machines. Current MLLM-based MERC studies focus mainly on capturing the speaker’s textual or vocal characteristics, but ignore the significance of video-derived behavior information. Different from text and audio inputs, learning from videos with rich facial expression, body language and posture provides emotion trigger signals to the models for more accurate emotion predictions. In this paper, we propose a novel behavior-aware MLLM-based framework (BeMERC) to incorporate the speaker’s behaviors, including subtle facial micro-expression, body language and posture, into a vanilla MLLM-based MERC model, thereby facilitating the modeling of emotional dynamics during a conversation. Furthermore, BeMERC adopts a two-stage instruction tuning strategy to extend the model to the conversation scenario for end-to-end training of a MERC predictor. Experiments demonstrate that BeMERC outperforms the state-of-the-art methods on two benchmark datasets, and also provides a detailed discussion on the significance of video-derived behavior information in MERC.

[NLP-30] Model Hemorrhage and the Robustness Limits of Large Language Models

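[Quick read]: This paper addresses the significant performance degradation that large language models (LLMs) commonly undergo when modified for deployment through quantization, pruning, or decoding strategy adjustments. It defines this phenomenon as "model hemorrhage": performance decline caused by parameter alterations and architectural changes. Through systematic analysis of various LLM frameworks, the study identifies key vulnerability patterns and finds that transformer architectures exhibit inherent robustness thresholds that determine hemorrhage severity across modification types. The key to the solution is three mitigation strategies: gradient-aware pruning preserves critical weight pathways, dynamic quantization scaling maintains activation integrity, and decoding calibration aligns generation trajectories with the original model's distribution. These provide foundational metrics for evaluating model stability during adaptation and practical guidelines for maintaining performance while enabling efficient LLM deployment. The study advances understanding of neural network resilience under architectural transformations, particularly for large-scale language models.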

Link: https://arxiv.org/abs/2503.23924
Authors: Ziyang Ma,Zuchao Li,Lefei Zhang,Gui-Song Xia,Bo Du,Liangpei Zhang,Dacheng Tao
Affiliations: Wuhan University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 33 pages, 18 figures

Abstract:Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment through quantization, pruning, or decoding strategy adjustments. We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes. Through systematic analysis of various LLM frameworks, we identify key vulnerability patterns: layer expansion frequently disrupts attention mechanisms, compression techniques induce information loss cascades, and decoding adjustments amplify prediction divergences. Our investigation reveals transformer architectures exhibit inherent robustness thresholds that determine hemorrhage severity across modification types. We propose three mitigation strategies: gradient-aware pruning preserves critical weight pathways, dynamic quantization scaling maintains activation integrity, and decoding calibration aligns generation trajectories with original model distributions. This work establishes foundational metrics for evaluating model stability during adaptation, providing practical guidelines for maintaining performance while enabling efficient LLM deployment. Our findings advance understanding of neural network resilience under architectural transformations, particularly for large-scale language models.

[NLP-31] Entropy-Based Adaptive Weighting for Self-Training

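[Quick read]: This paper addresses how to make effective use of self-generated reasoning paths to improve the mathematical problem-solving ability of large language models, focusing on how best to use the data produced during self-training. The key is Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive weighting strategy that uses a mapping function with a tunable parameter to adjust the weights of training data and prioritize data on which the model is uncertain. This guides the model to focus on more informative and challenging examples, improving its reasoning ability. Experiments show that EAST achieves around a 1% gain over the backbone model on MATH, where the vanilla self-training method yields virtually no improvement, and a further 1-2% boost over the vanilla method on GSM8K.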

Link: https://arxiv.org/abs/2503.23913
Authors: Xiaoxuan Wang,Yihe Deng,Mingyu Derek Ma,Wei Wang
Affiliations: University of California Los Angeles
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The mathematical problem-solving capabilities of large language models have become a focal point of research, with growing interest in leveraging self-generated reasoning paths as a promising way to refine and enhance these models. These paths capture step-by-step logical processes while requiring only the correct answer for supervision. The self-training method has been shown to be effective in reasoning tasks while eliminating the need for external models and manual annotations. However, optimizing the use of self-generated data for model training remains an open challenge. In this work, we propose Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive weighting strategy designed to prioritize uncertain data during self-training. Specifically, EAST employs a mapping function with a tunable parameter that controls the sharpness of the weighting, assigning higher weights to data where the model exhibits greater uncertainty. This approach guides the model to focus on more informative and challenging examples, thereby enhancing its reasoning ability. We evaluate our approach on GSM8K and MATH benchmarks. Empirical results show that, while the vanilla method yields virtually no improvement (0%) on MATH, EAST achieves around a 1% gain over the backbone model. On GSM8K, EAST attains a further 1-2% performance boost compared to the vanilla method.
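
As a rough illustration of entropy-based weighting, the sketch below measures uncertainty as the entropy of the final answers in a question's sampled responses, then maps entropy to a training weight through a power function with a tunable exponent. The abstract only says that EAST uses "a mapping function with a tunable parameter", so the power mapping, the normalization to mean 1, and all names here are assumptions rather than the paper's formulation.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in nats) of the answers sampled for one question."""
    n = len(answers)
    return sum((c / n) * math.log(n / c) for c in Counter(answers).values())

def entropy_weights(entropies, sharpness):
    """Hypothetical mapping: weight is proportional to entropy ** sharpness.

    sharpness = 0 gives uniform weights (plain self-training); larger values
    concentrate weight on the most uncertain questions. Weights average to 1.
    """
    raw = [h ** sharpness for h in entropies]
    total = sum(raw)
    if total == 0:
        return [1.0] * len(entropies)
    return [r * len(raw) / total for r in raw]

# Sampled final answers for two questions: one consistent, one uncertain.
samples = [["8", "8", "8", "8"], ["8", "7", "8", "9"]]
entropies = [answer_entropy(s) for s in samples]
uniform = entropy_weights(entropies, sharpness=0)
focused = entropy_weights(entropies, sharpness=1)
print(entropies, uniform, focused)
```

The resulting weights would multiply each question's loss term during fine-tuning, so that consistently answered questions contribute less than uncertain ones.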

[NLP-32] Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset ACL2025

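[Quick read]: This paper addresses the unreliability of explanations generated by Large-Language Models (LLMs), which makes it difficult for users to distinguish good explanations from bad ones. To address this, it presents Rubrik's CUBE: an education-inspired rubric and a dataset of 26k explanations, written and then quality-annotated with the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset covers two reasoning tasks and two language tasks, providing the diversity needed to test the rubric. The study finds that explanations are influenced by both task and perceived difficulty, and that low quality stems primarily from a lack of conciseness in LLM-generated explanations rather than from cohesion and word choice. The key to the solution is the rubric and its accompanying dataset, which allow LLM-generated explanations to be evaluated systematically.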

Link: https://arxiv.org/abs/2503.23899
Authors: Diana Galvan-Sosa,Gabrielle Gaudeau,Pride Kavumba,Yunmeng Li,Hongyi Gu,Zheng Yuan,Keisuke Sakaguchi,Paula Buttery
Affiliations: University of Cambridge; Tohoku University; RIKEN; King’s College London
Categories: Computation and Language (cs.CL)
Comments: 9 main pages (21 appendix pages), 7 figures, submitted to ACL 2025

Abstract:The performance and usability of Large-Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik’s CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice. The full dataset, rubric, and code will be made available upon acceptance.

[NLP-33] Better wit than wealth: Dynamic Parametric Retrieval Augmented Generation for Test-time Knowledge Enhancement

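[Quick read]: This paper addresses two core challenges of retrieval-augmented generation (RAG): inference cost that grows significantly with context length, and RAG hallucination. It also addresses the practical limits of Parametric RAG (PRAG), namely high training and storage costs and limited generalization ability. The key to the solution is Dynamic Parametric RAG (DyPRAG), a framework that uses a lightweight parameter translator model to efficiently convert documents into parametric knowledge. DyPRAG generates parametric knowledge dynamically to enhance the knowledge of large language models (LLMs), fusing knowledge and resolving knowledge conflicts at test time in a plug-and-play manner, while reducing inference, training, and storage costs, which makes for a practical and effective RAG paradigm.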

Link: https://arxiv.org/abs/2503.23895
Authors: Yuqiao Tan,Shizhu He,Huanxuan Liao,Jun Zhao,Kang Liu
Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources and incorporating them into the context. While it improves reliability by providing factual texts, it significantly increases inference costs as context length grows and introduces challenging issue of RAG hallucination, primarily caused by the lack of corresponding parametric knowledge in LLMs. An efficient solution is to enhance the knowledge of LLMs at test-time. Parametric RAG (PRAG) addresses this by embedding document into LLMs parameters to perform test-time knowledge enhancement, effectively reducing inference costs through offline training. However, its high training and storage costs, along with limited generalization ability, significantly restrict its practical adoption. To address these challenges, we propose Dynamic Parametric RAG (DyPRAG), a novel framework that leverages a lightweight parameter translator model to efficiently convert documents into parametric knowledge. DyPRAG not only reduces inference, training, and storage costs but also dynamically generates parametric knowledge, seamlessly enhancing the knowledge of LLMs and resolving knowledge conflicts in a plug-and-play manner at test-time. Extensive experiments on multiple datasets demonstrate the effectiveness and generalization capabilities of DyPRAG, offering a powerful and practical RAG paradigm which enables superior knowledge fusion and mitigates RAG hallucination in real-world applications. Our code is available at this https URL.

[NLP-34] SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development

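Quick read: This paper addresses the difficulty of obtaining high-quality speech dialogue datasets: human recordings are costly and raise privacy concerns, while synthetic data often lacks conversational authenticity. The key solution is SpeechDialogueFactory, a production-ready framework that generates natural speech dialogues efficiently through a comprehensive pipeline of metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and natural speech synthesis with voice cloning. The system also provides an interactive UI for detailed sample inspection and a high-throughput batch synthesis mode.

Link: https://arxiv.org/abs/2503.23848
Authors: Minghan Wang, Ye Bai, Yuxia Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Affiliations: Department of Data Science & AI, Monash University; MBZUAI
Categories: Computation and Language (cs.CL)
Comments: none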

Abstract:High-quality speech dialogue datasets are crucial for Speech-LLM development, yet existing acquisition methods face significant limitations. Human recordings incur high costs and privacy concerns, while synthetic approaches often lack conversational authenticity. To address these challenges, we introduce SpeechDialogueFactory, a production-ready framework for generating natural speech dialogues efficiently. Our solution employs a comprehensive pipeline including metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and natural speech synthesis with voice cloning. Additionally, the system provides an interactive UI for detailed sample inspection and a high-throughput batch synthesis mode. Evaluations show that dialogues generated by our system achieve a quality comparable to human recordings while significantly reducing production costs. We release our work as an open-source toolkit, alongside example datasets available in English and Chinese, empowering researchers and developers in Speech-LLM research and development.

[NLP-35] Expanding RL with Verifiable Rewards Across Diverse Domains

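Quick read: This paper extends reinforcement learning with verifiable rewards (RLVR) to broader domains such as medicine, chemistry, psychology, and economics. The authors observe that when objective reference answers exist, different large language models (LLMs) agree closely in their binary judgments, which challenges the need for large-scale annotation to train domain-specific reward models. To overcome the limitations of binary rewards on unstructured reference answers, they add model-based soft scoring to make RLVR more flexible. The key solution is a distilled generative reward model that serves as an effective cross-domain verifier and provides reliable reward signals without domain-specific annotation. Fine-tuning a base 7B model against this reward model with various RL algorithms yields policies that, in free-form answer settings, outperform state-of-the-art open-source aligned LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large margin. This also strengthens the robustness and scalability of RLVR and highlights its potential for real-world applications with noisy or weak labels.

Link: https://arxiv.org/abs/2503.23829
Authors: Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, Dong Yu
Affiliations: Tencent AI Lab
Categories: Computation and Language (cs.CL)
Comments: none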

Abstract:Reinforcement learning (RL) with verifiable rewards (RLVR) has shown promising results in mathematical reasoning and coding tasks where well-structured reference answers are available. However, its applicability to broader domains remains underexplored. In this work, we study the extension of RLVR to more diverse domains such as medicine, chemistry, psychology, and economics. We observe high agreement in binary judgments across different large language models (LLMs) when objective reference answers exist, which challenges the necessity of large-scale annotation for training domain-specific reward models. To address the limitations of binary rewards when handling unstructured reference answers, we further incorporate model-based soft scoring into RLVR to improve its flexibility. Our experiments show that a distilled generative reward model can serve as an effective cross-domain verifier, providing reliable reward signals for RL without requiring domain-specific annotations. By fine-tuning a base 7B model using various RL algorithms against our reward model, we obtain policies that outperform state-of-the-art open-source aligned LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large margin, across domains in free-form answer settings. This also strengthens RLVR’s robustness and scalability, highlighting its potential for real-world applications with noisy or weak labels.
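
The contrast between binary and soft rewards can be illustrated with a minimal Python sketch. It assumes, hypothetically, that the verifier returns a probability `p_correct` that a response matches the reference answer; the function names and the 0.5 threshold are illustrative and are not taken from the paper.

```python
def binary_reward(p_correct: float, threshold: float = 0.5) -> float:
    """Hard 0/1 reward: collapse the verifier's judgment to a single bit."""
    return 1.0 if p_correct >= threshold else 0.0


def soft_reward(p_correct: float) -> float:
    """Soft reward: pass the verifier's confidence through, clipped to [0, 1]."""
    return min(1.0, max(0.0, p_correct))


# Hypothetical verifier confidences for three free-form answers.
confidences = [0.95, 0.55, 0.10]
binary = [binary_reward(p) for p in confidences]
soft = [soft_reward(p) for p in confidences]
```

In this toy setup the soft score keeps the distinction between a confident match (0.95) and a borderline one (0.55), which the binary reward discards.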

[NLP-36] Did ChatGPT or Copilot use alter the style of internet news headlines? A time series regression analysis

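Quick read: This paper asks whether the release of large language models (LLMs) such as ChatGPT and Copilot coincided with a change in the writing style of headlines and links on news websites worldwide. The key method is interrupted time series analysis, applied separately to each of 175 natural language processing (NLP) features computed over a dataset of 451 million headlines/links, to test for statistically significant sustained changes after the release dates. Of these features, 44 showed no significant sustained change, and 91 others changed significantly but were excluded because they were also significant at earlier control release dates (GPT-1/2/3, Gopher). The initial analysis suggests that these models have had only a limited effect on headline/link style, and only with respect to some NLP measures.

Link: https://arxiv.org/abs/2503.23811
Authors: Chris Brogly, Connor McElroy
Affiliations: unknown
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: none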

Abstract:The release of advanced Large Language Models (LLMs) such as ChatGPT and Copilot is changing the way text is created and may influence the content that we find on the web. This study investigated whether the release of these two popular LLMs coincided with a change in writing style in headlines and links on worldwide news websites. 175 NLP features were obtained for each text in a dataset of 451 million headlines/links. An interrupted time series analysis was applied for each of the 175 NLP features to evaluate whether there were any statistically significant sustained changes after the release dates of ChatGPT and/or Copilot. There were a total of 44 features that did not appear to have any significant sustained change after the release of ChatGPT/Copilot. A total of 91 other features did show significant change with ChatGPT and/or Copilot although significance with earlier control LLM release dates (GPT-1/2/3, Gopher) removed them from consideration. This initial analysis suggests these language models may have had a limited impact on the style of individual news headlines/links, with respect to only some NLP measures.
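
For readers unfamiliar with interrupted time series analysis, the following is a minimal sketch of its usual segmented-regression form, fitted by ordinary least squares in plain Python. The model form, the function names, and the synthetic series are assumptions for illustration; the paper's exact specification and its significance tests are not reproduced here.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_interrupted_time_series(values, interruption):
    """Least-squares fit of y = b0 + b1*t + b2*post + b3*(t - interruption)*post."""
    rows = []
    for t in range(len(values)):
        post = 1.0 if t >= interruption else 0.0
        rows.append([1.0, float(t), post, post * (t - interruption)])
    size = 4
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(size)]
    b0, b1, b2, b3 = solve_linear_system(xtx, xty)
    return {"baseline_level": b0, "baseline_slope": b1,
            "level_change": b2, "slope_change": b3}


# Synthetic feature series: level jumps by 3.0 and slope rises by 0.25 at t = 10.
series = [2.0 + 0.5 * t for t in range(10)]
series += [2.0 + 0.5 * t + 3.0 + 0.25 * (t - 10) for t in range(10, 20)]
fit = fit_interrupted_time_series(series, interruption=10)
```

Here `level_change` captures an immediate shift at the interruption and `slope_change` a sustained change in trend; a real analysis would also test whether those coefficients differ significantly from zero.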

[NLP-37] Get the Agents Drunk: Memory Perturbations in Autonomous Agent-based Recommender Systems

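Quick read: This paper studies the robustness of large language model-based agents in recommender systems (Agent4RSs), focusing on attacks on and defenses of their memory mechanisms. It proposes DrunkAgent, a practical attack framework that perturbs agents' memories, both to expose the limitations of Agent4RSs and to inform improvements to their security and robustness. The attack targets the black-box setting, which is more realistic, and is designed to be stealthy to maximize impact. DrunkAgent consists of a generation module, a strategy module, and a surrogate module: the generation module produces effective adversarial textual triggers; the strategy module is designed to "get the target agents drunk" so that they cannot effectively update their memories during interaction, which lets the triggers work best; and both modules are optimized on the surrogate module to improve the transferability and imperceptibility of the attack. Extensive experiments on several real-world datasets demonstrate the effectiveness of DrunkAgent.

Link: https://arxiv.org/abs/2503.23804
Authors: Shiyi Yang, Zhibo Hu, Chen Wang, Tong Yu, Xiwei Xu, Liming Zhu, Lina Yao
Affiliations: The University of New South Wales; Data61, CSIRO; Adobe Research
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments: none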

Abstract:Large language model-based agents are increasingly used in recommender systems (Agent4RSs) to achieve personalized behavior modeling. Specifically, Agent4RSs introduces memory mechanisms that enable the agents to autonomously learn and self-evolve from real-world interactions. However, to the best of our knowledge, how robust Agent4RSs are remains unexplored. As such, in this paper, we propose the first work to attack Agent4RSs by perturbing agents’ memories, not only to uncover their limitations but also to enhance their security and robustness, ensuring the development of safer and more reliable AI agents. Given the security and privacy concerns, it is more practical to launch attacks under a black-box setting, where the accurate knowledge of the victim models cannot be easily obtained. Moreover, the practical attacks are often stealthy to maximize the impact. To this end, we propose a novel practical attack framework named DrunkAgent. DrunkAgent consists of a generation module, a strategy module, and a surrogate module. The generation module aims to produce effective and coherent adversarial textual triggers, which can be used to achieve attack objectives such as promoting the target items. The strategy module is designed to 'get the target agents drunk' so that their memories cannot be effectively updated during the interaction process. As such, the triggers can play the best role. Both of the modules are optimized on the surrogate module to improve the transferability and imperceptibility of the attacks. By identifying and analyzing the vulnerabilities, our work provides critical insights that pave the way for building safer and more resilient Agent4RSs. Extensive experiments across various real-world datasets demonstrate the effectiveness of DrunkAgent.

[NLP-38] Adaptive Layer-skipping in Pre-trained LLMs

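Quick read: This paper asks how computational demands vary across the generation of different tokens in large language models (LLMs), a question that existing layer-skipping methods for accelerating token generation have overlooked. It introduces FlexiDepth, a method that dynamically adjusts the number of Transformer layers used during text generation. The key is a plug-in router and adapter that enable adaptive layer-skipping without modifying the LLM's original parameters. Applied to Llama-3-8B, FlexiDepth skips 8 of 32 layers while maintaining 100% of benchmark performance. The experiments also show that computational demands vary by token type: repetitive tokens or fixed phrases need fewer layers, whereas tokens involving computation or high uncertainty need more. This adaptive allocation pattern aligns with human intuition.

Link: https://arxiv.org/abs/2503.23798
Authors: Xuan Luo, Weizhi Wang, Xifeng Yan
Affiliations: Department of Computer Science, UC Santa Barbara
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none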

Abstract:Various layer-skipping methods have been proposed to accelerate token generation in large language models (LLMs). However, they have overlooked a fundamental question: How do computational demands vary across the generation of different tokens? In this work, we introduce FlexiDepth, a method that dynamically adjusts the number of Transformer layers used in text generation. By incorporating a plug-in router and adapter, FlexiDepth enables adaptive layer-skipping in LLMs without modifying their original parameters. Introducing FlexiDepth to Llama-3-8B model achieves layer skipping of 8 layers out of 32, and meanwhile maintains the full 100% benchmark performance. Experimental results with FlexiDepth demonstrate that computational demands in LLMs significantly vary based on token type. Specifically, generating repetitive tokens or fixed phrases requires fewer layers, whereas producing tokens involving computation or high uncertainty requires more layers. Interestingly, this adaptive allocation pattern aligns with human intuition. To advance research in this area, we open sourced FlexiDepth and a dataset documenting FlexiDepth’s layer allocation patterns for future exploration.
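
The idea of routing a token past some layers can be sketched as a toy in plain Python. Everything here is an assumption for illustration: the linear-plus-sigmoid router, the fixed 0.5 threshold, and the `add_one` stand-in layer are hypothetical, and the adapter mentioned in the abstract is not modeled. It is not FlexiDepth's actual implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward_with_layer_skipping(hidden, layers, router_weights, threshold=0.5):
    """Run a token's hidden state through only the layers whose router gate opens.

    hidden: list of floats; layers: callables mapping a hidden state to a new one;
    router_weights: one weight vector per layer (hypothetical linear router).
    """
    layers_used = 0
    for layer, weights in zip(layers, router_weights):
        gate = sigmoid(sum(w * h for w, h in zip(weights, hidden)))
        if gate > threshold:
            hidden = layer(hidden)
            layers_used += 1
        # Otherwise the layer is skipped and the hidden state passes through
        # unchanged (the adapter from the abstract is omitted in this toy).
    return hidden, layers_used


def add_one(state):
    """Stand-in for a Transformer layer."""
    return [value + 1.0 for value in state]


output, used = forward_with_layer_skipping(
    hidden=[1.0, 1.0],
    layers=[add_one, add_one, add_one],
    router_weights=[[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0]],
)
```

In this example the second router produces a gate below the threshold, so only two of the three layers run for this token.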

[NLP-39] WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization

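Quick read: This paper examines how Winograd schema challenges can be used to evaluate common sense reasoning in large language models (LLMs), and whether existing benchmarks such as WinoGrande overestimate that ability. The key contribution is WinoWhat, a new parallel corpus in which each instance of the WinoGrande validation set is paraphrased, together with an evaluation across five common sense knowledge categories that shows which types of knowledge are harder for LLMs. The authors also match benchmark instances to LLM training data and build two test suites to check for benchmark memorization. All models perform significantly worse on WinoWhat, which implies that WinoGrande overestimates LLM reasoning ability, while memorization has only a minimal effect on performance.

Link: https://arxiv.org/abs/2503.23779
Authors: Ine Gevers, Victor De Marez, Luna De Bruyne, Walter Daelemans
Affiliations: CLiPS; University of Antwerp
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none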

Abstract:In this study, we take a closer look at how Winograd schema challenges can be used to evaluate common sense reasoning in LLMs. Specifically, we evaluate generative models of different sizes on the popular WinoGrande benchmark. We release WinoWhat, a new corpus, in which each instance of the WinoGrande validation set is paraphrased. Additionally, we evaluate the performance on the challenge across five common sense knowledge categories, giving more fine-grained insights on what types of knowledge are more challenging for LLMs. Surprisingly, all models perform significantly worse on WinoWhat, implying that LLM reasoning capabilities are overestimated on WinoGrande. To verify whether this is an effect of benchmark memorization, we match benchmark instances to LLM training data and create two test-suites. We observe that memorization has a minimal effect on model performance on WinoGrande.

[NLP-40] CONGRAD: Conflicting Gradient Filtering for Multilingual Preference Alignment

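Quick read: This paper addresses the performance degradation caused by negative interference between languages in multilingual preference alignment, a known issue in multilingual training that remains underexplored in this setting. The key solution is CONGRAD, a scalable and effective filtering method that selects high-quality preference samples with minimal gradient conflicts across languages. It uses gradient surgery to retain samples aligned with an aggregated multilingual update direction, combined with a sublinear gradient compression strategy that reduces memory overhead during gradient accumulation. Integrated into a self-rewarding framework and evaluated on LLaMA3-8B and Gemma2-2B across 10 languages, CONGRAD consistently outperforms strong baselines on both seen and unseen languages, with minimal alignment tax.

Link: https://arxiv.org/abs/2503.23777
Authors: Jiangnan Li, Thuy-Trang Vu, Christian Herold, Amirhossein Tebbifakhr, Shahram Khadivi, Gholamreza Haffari
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: none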

Abstract:Naive joint training of large language models (LLMs) for multilingual preference alignment can suffer from negative interference. This is a known issue in multilingual training, where conflicting objectives degrade overall performance. However, the impact of this phenomenon in the context of multilingual preference alignment remains largely underexplored. To address this issue, we propose CONGRAD, a scalable and effective filtering method that selects high-quality preference samples with minimal gradient conflicts across languages. Our method leverages gradient surgery to retain samples aligned with an aggregated multilingual update direction. Additionally, we incorporate a sublinear gradient compression strategy that reduces memory overhead during gradient accumulation. We integrate CONGRAD into self-rewarding framework and evaluate on LLaMA3-8B and Gemma2-2B across 10 languages. Results show that CONGRAD consistently outperforms strong baselines in both seen and unseen languages, with minimal alignment tax.
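
The filtering idea, keeping samples whose gradients agree with an aggregated multilingual direction, can be sketched in plain Python. This is a simplified illustration under stated assumptions: plain averaging of per-language mean gradients stands in for the paper's gradient-surgery aggregation, gradient compression is omitted, and the function names and toy gradients are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def filter_conflicting_samples(samples):
    """Keep indices of samples whose gradient agrees with the aggregate direction.

    samples: list of (language, gradient) pairs, gradient being a list of floats.
    """
    by_language = {}
    for language, gradient in samples:
        by_language.setdefault(language, []).append(gradient)
    # Plain averaging stands in for the paper's gradient-surgery aggregation.
    language_means = [mean_vector(grads) for grads in by_language.values()]
    aggregate = mean_vector(language_means)
    return [i for i, (_, g) in enumerate(samples) if cosine(g, aggregate) > 0.0]


samples = [
    ("en", [1.0, 0.0]),
    ("en", [1.0, 0.1]),
    ("de", [0.9, 0.2]),
    ("de", [-1.0, 0.0]),  # points against the other samples
]
kept = filter_conflicting_samples(samples)
```

The last sample has a negative cosine with the aggregate direction, so it is the only one filtered out.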

[NLP-41] Texture or Semantics? Vision-Language Models Get Lost in Font Recognition

Quick read: This paper asks whether modern Vision-Language Models (VLMs) can truly recognize fonts. To investigate, it introduces the Font Recognition Benchmark (FRB), a compact and well-structured dataset of 15 commonly used fonts with an easy version and a hard version; in the hard version each text sample consists of the font names themselves, which introduces a Stroop effect that challenges model perception. An extensive evaluation of various VLMs shows that current models have limited font recognition ability, that few-shot learning and Chain-of-Thought (CoT) prompting bring minimal gains in accuracy, and that attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.

Link: https://arxiv.org/abs/2503.23768
Authors: Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang
Affiliations: University of California, San Diego; ByteDance; The University of Queensland; University of Southern California; University at Buffalo; University of California, Merced
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: none

Abstract:Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.

[NLP-42] Towards a cognitive architecture to enable natural language interaction in co-constructive task learning

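Quick read: This paper examines which characteristics a cognitive architecture must have to leverage the benefits of natural language in Co-Constructive Task Learning (CCTL). For context, it first discusses Interactive Task Learning (ITL), the mechanisms of the human memory system, and the significance of natural language and multimodality. It then analyzes the capabilities of current cognitive architectures to inform a concept of CCTL grounded in multiple sources. The key step is integrating insights from several research domains into a unified framework for CCTL in Human-Robot Interaction (HRI). Finally, it identifies the remaining challenges and requirements for achieving CCTL.

Link: https://arxiv.org/abs/2503.23760
Authors: Manuel Scheibl, Birte Richter, Alissa Müller, Michael Beetz, Britta Wrede
Affiliations: Medical Assistance Systems Group, Medical School OWL, Bielefeld University; Machine Learning Group, Technical Faculty, Bielefeld University; AICOR, Bremen University
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 8 pages, 5 figures, submitted to: IEEE RO-MAN 2025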

Abstract:This research addresses the question, which characteristics a cognitive architecture must have to leverage the benefits of natural language in Co-Constructive Task Learning (CCTL). To provide context, we first discuss Interactive Task Learning (ITL), the mechanisms of the human memory system, and the significance of natural language and multi-modality. Next, we examine the current state of cognitive architectures, analyzing their capabilities to inform a concept of CCTL grounded in multiple sources. We then integrate insights from various research domains to develop a unified framework. Finally, we conclude by identifying the remaining challenges and requirements necessary to achieve CCTL in Human-Robot Interaction (HRI).

[NLP-43] Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model

Quick read: This paper proposes the Short-video Propagation Influence Rating (SPIR) task and advances it from both the dataset and the method perspectives. First, it presents the Cross-platform Short-Video (XS-Video) dataset, which contains 117,720 videos, 381,926 samples, and 535 topics from the five largest Chinese platforms, annotated with propagation influence levels from 0 to 9. According to the authors, it is the first large-scale short-video dataset that contains cross-platform data or provides all of the views, likes, shares, collects, fans, comments, and comment content. Second, it proposes NetGPT, a Large Graph Model (LGM) built on a novel three-stage training mechanism that bridges heterogeneous graph-structured data with the reasoning ability and knowledge of Large Language Models (LLMs). NetGPT can comprehend and analyze the short-video propagation graph and predict the long-term propagation influence of short videos.

Link: https://arxiv.org/abs/2503.23746
Authors: Dizhan Xue, Jing Cui, Shengsheng Qian, Chuanrui Hu, Changsheng Xu
Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; School of Computer Science and Engineering, Tianjin University of Technology; Qihoo 360 AI lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Comments: none

Abstract:Short-video platforms have gained immense popularity, captivating the interest of millions, if not billions, of users globally. Recently, researchers have highlighted the significance of analyzing the propagation of short-videos, which typically involves discovering commercial values, public opinions, user behaviors, etc. This paper proposes a new Short-video Propagation Influence Rating (SPIR) task and aims to promote SPIR from both the dataset and method perspectives. First, we propose a new Cross-platform Short-Video (XS-Video) dataset, which aims to provide a large-scale and real-world short-video propagation network across various platforms to facilitate the research on short-video propagation. Our XS-Video dataset includes 117,720 videos, 381,926 samples, and 535 topics across 5 biggest Chinese platforms, annotated with the propagation influence from level 0 to 9. To the best of our knowledge, this is the first large-scale short-video dataset that contains cross-platform data or provides all of the views, likes, shares, collects, fans, comments, and comment content. Second, we propose a Large Graph Model (LGM) named NetGPT, based on a novel three-stage training mechanism, to bridge heterogeneous graph-structured data with the powerful reasoning ability and knowledge of Large Language Models (LLMs). Our NetGPT can comprehend and analyze the short-video propagation graph, enabling it to predict the long-term propagation influence of short-videos. Comprehensive experimental results evaluated by both classification and regression metrics on our XS-Video dataset indicate the superiority of our method for SPIR.

[NLP-44] LANID: LLM-assisted New Intent Discovery COLING2024 LREC

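Quick read: This paper addresses New Intent Discovery (NID) for Task-oriented Dialogue Systems (TODS), which must identify novel intents while still recognizing existing ones. Previous approaches suffer from inadequate semantic representation or depend on external knowledge that is often not scalable or flexible. Large Language Models (LLMs) show strong zero-shot capabilities, but their scale can be impractical for real-world applications with extensive queries. The proposed framework, LANID, uses LLM guidance to enhance the semantic representation of lightweight NID encoders. Specifically, LANID uses the K-nearest neighbors and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms to sample selective utterance pairs from the training set, then queries an LLM to determine the relationship between each pair. The resulting data defines a contrastive fine-tuning task that trains a small encoder with a contrastive triplet loss. Experiments on three NID datasets show that the method surpasses strong baselines in both unsupervised and semi-supervised settings.

Link: https://arxiv.org/abs/2503.23740
Authors: Lu Fan, Jiashu Pu, Rongsheng Zhang, Xiao-Ming Wu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Published in LREC-COLING 2024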

Abstract:Task-oriented Dialogue Systems (TODS) often face the challenge of encountering new intents. New Intent Discovery (NID) is a crucial task that aims to identify these novel intents while maintaining the capability to recognize existing ones. Previous efforts to adapt TODS to new intents have struggled with inadequate semantic representation or have depended on external knowledge, which is often not scalable or flexible. Recently, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities; however, their scale can be impractical for real-world applications that involve extensive queries. To address the limitations of existing NID methods by leveraging LLMs, we propose LANID, a framework that enhances the semantic representation of lightweight NID encoders with the guidance of LLMs. Specifically, LANID employs the K-nearest neighbors and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms to sample selective utterance pairs from the training set. It then queries an LLM to ascertain the relationships between these pairs. The data produced from this process is utilized to design a contrastive fine-tuning task, which is then used to train a small encoder with a contrastive triplet loss. Our experimental results demonstrate the efficacy of the proposed method across three distinct NID datasets, surpassing strong baselines in both unsupervised and semi-supervised settings. Our code is available at this https URL.
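
Two ingredients of this pipeline, nearest-neighbor pair sampling and a triplet loss, can be sketched in plain Python. The sketch assumes Euclidean distance and the standard margin-based triplet loss, which may differ from the paper's exact formulation; the DBSCAN step and the LLM query are left out, and the embeddings and pair labels are made-up toy values.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(index, embeddings, k):
    """Return indices of the k embeddings closest to embeddings[index]."""
    others = [i for i in range(len(embeddings)) if i != index]
    others.sort(key=lambda i: euclidean(embeddings[index], embeddings[i]))
    return others[:k]


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss on Euclidean distances."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D utterance embeddings (hypothetical values).
embeddings = [[0.0, 0.0], [1.0, 0.0], [1.5, 0.0], [5.0, 0.0]]
candidate_pairs = [(0, j) for j in nearest_neighbors(0, embeddings, k=2)]

# Suppose the LLM labels (0, 1) as same-intent and (0, 2) as different-intent.
loss = triplet_loss(embeddings[0], embeddings[1], embeddings[2])
```

A positive loss means the different-intent utterance is not yet at least `margin` farther from the anchor than the same-intent one, so training would push the two apart.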

[NLP-45] AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization CVPR2025

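Quick read: This paper addresses model merging for Multimodal Large Language Models (MLLMs), whose inherent heterogeneity (differences in model architecture and asymmetry in the parameter space) poses challenges for existing merging methods, which mainly target homogeneous models with identical architecture. The key solution is AdaMMS, a model merging method that works in three steps: mapping, merging, and searching. It first designs a mapping function between models so that merging can be applied to MLLMs with different architectures; it then applies linear interpolation to the model weights to adapt to the asymmetry of heterogeneous MLLMs; finally, it uses an unsupervised hyper-parameter selection method to choose the merging coefficient. The method requires no labeled data and outperforms previous model merging methods on various vision-language benchmarks.

Link: https://arxiv.org/abs/2503.23733
Authors: Yiyang Du, Xiaochen Wang, Chi Chen, Jiabo Ye, Yiru Wang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Zhifang Sui, Maosong Sun, Yang Liu
Affiliations: Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University; State Key Laboratory of Multimedia Information Processing, Peking University; School of Software Microelectronics, Peking University; Institute of Intelligent Computing, Alibaba Group; Shanghai Artificial Intelligence Laboratory; ModelTC Open Source Organization
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025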

Abstract:Recently, model merging methods have demonstrated powerful strengths in combining abilities on various tasks from multiple Large Language Models (LLMs). While previous model merging methods mainly focus on merging homogeneous models with identical architecture, they meet challenges when dealing with Multimodal Large Language Models (MLLMs) with inherent heterogeneous property, including differences in model architecture and the asymmetry in the parameter space. In this work, we propose AdaMMS, a novel model merging method tailored for heterogeneous MLLMs. Our method tackles the challenges in three steps: mapping, merging and searching. Specifically, we first design mapping function between models to apply model merging on MLLMs with different architecture. Then we apply linear interpolation on model weights to actively adapt the asymmetry in the heterogeneous MLLMs. Finally in the hyper-parameter searching step, we propose an unsupervised hyper-parameter selection method for model merging. As the first model merging method capable of merging heterogeneous MLLMs without labeled data, extensive experiments on various model combinations demonstrated that AdaMMS outperforms previous model merging methods on various vision-language benchmarks.
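
The mapping and merging steps can be illustrated with a minimal sketch of linear weight interpolation between two models whose parameter names differ. The dictionary representation, the parameter names, and the rule of copying unmapped parameters from the base model are assumptions for illustration; the unsupervised search for the coefficient `alpha` is not shown.

```python
def merge_weights(base, other, name_mapping, alpha):
    """Linearly interpolate mapped parameters: (1 - alpha) * base + alpha * other.

    base, other: dicts from parameter name to a flat list of floats.
    name_mapping: dict from a base parameter name to its counterpart in `other`.
    Parameters without a counterpart are copied from `base` unchanged.
    """
    merged = {}
    for name, values in base.items():
        counterpart = name_mapping.get(name)
        if counterpart is None or counterpart not in other:
            merged[name] = list(values)
            continue
        merged[name] = [
            (1.0 - alpha) * b + alpha * o
            for b, o in zip(values, other[counterpart])
        ]
    return merged


# Hypothetical parameter names for two models with different architectures.
base_model = {"decoder.layer0.weight": [1.0, 2.0], "vision_proj.weight": [0.5]}
other_model = {"lm.block0.w": [3.0, 4.0]}
mapping = {"decoder.layer0.weight": "lm.block0.w"}
merged = merge_weights(base_model, other_model, mapping, alpha=0.5)
```

With `alpha=0.0` the result equals the base model, and larger values move the shared parameters toward the other model, which is why choosing the coefficient matters.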

[NLP-46] KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language CVPR

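Quick read: This paper addresses two problems with existing evaluation methods for Vision-Language Models (VLMs): most either require the model to choose from pre-determined responses, which sacrifices open-endedness, or rely on a judge model, which makes the evaluation subjective and unreliable. It also notes a lack of VLM benchmarks in Korean, which are needed as a separate metric from the more common English benchmarks because generative language model performance can differ significantly by language. The paper presents KOFFVQA, a general-purpose free-form visual question answering benchmark in Korean, consisting of 275 carefully crafted questions, each paired with an image and grading criteria, covering 10 aspects of VLM performance. The grading criteria let the judge model grade each response against a pre-determined set of rules, which removes the unreliability. Because the criteria are objective, even a small open-source model can evaluate other models on the benchmark reliably. Experiments confirm that grading with pre-determined criteria is much more reliable than existing methods.

Link: https://arxiv.org/abs/2503.23730
Authors: Yoonshik Kim, Jaeyoon Jung
Affiliations: MAUM AI Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to CVPRW 2025, Workshop on Benchmarking and Expanding AI Multimodal Approaches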

Abstract:The recent emergence of Large Vision-Language Models(VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at this https URL

[NLP-47] Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models

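Quick read: This paper asks whether instruction tuning still needs human-originated signals, or whether data synthesized solely from Large Language Models (LLMs) is enough. It answers in the affirmative: the authors build instruction-tuning datasets by simply pairing human-written instructions with LLM-generated responses, and LLMs fine-tuned on these datasets consistently outperform those fine-tuned on existing ones. The construction approach adapts easily to other languages; datasets built for Japanese lead to state-of-the-art performance. Analyses suggest that instruction tuning in a new language lets LLMs follow instructions, while the tuned models show a notable lack of culture-specific knowledge in that language.

Link: https://arxiv.org/abs/2503.23714
Authors: Youmi Ma, Sakae Mizuki, Kazuki Fujii, Taishi Nakamura, Masanari Ohi, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Koki Maeda, Kakeru Hattori, Takumi Okamoto, Shigeki Ishida, Rio Yokota, Hiroya Takamura, Naoaki Okazaki
Affiliations: Department of Computer Science, School of Computing, Institute of Science Tokyo; National Institute of Advanced Industrial Science and Technology; National Institute of Informatics
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 5 figures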

Abstract:Instruction tuning is crucial for enabling Large Language Models (LLMs) to solve real-world tasks. Prior work has shown the effectiveness of instruction-tuning data synthesized solely from LLMs, raising a fundamental question: Do we still need human-originated signals for instruction tuning? This work answers the question affirmatively: we build state-of-the-art instruction-tuning datasets sourced from human-written instructions, by simply pairing them with LLM-generated responses. LLMs fine-tuned on our datasets consistently outperform those fine-tuned on existing ones. Our data construction approach can be easily adapted to other languages; we build datasets for Japanese and confirm that LLMs tuned with our data reach state-of-the-art performance. Analyses suggest that instruction-tuning in a new language allows LLMs to follow instructions, while the tuned models exhibit a notable lack of culture-specific knowledge in that language. The datasets and fine-tuned models will be publicly available. Our datasets, synthesized with open-weight LLMs, are openly distributed under permissive licenses, allowing for diverse use cases.

[NLP-48] Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual Dual-Framing Analysis of U.S.-China Tensions

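Quick read: This study systematically analyzes geopolitical bias in 11 prominent Large Language Models (LLMs) across seven critical topics in U.S.-China relations. Using a bilingual (English and Chinese) and dual-framing (affirmative and reverse) methodology, the authors generated 19,712 prompts designed to detect ideological leanings in model outputs. Responses were scored on a normalized scale from -2 (strongly Pro-China) to +2 (strongly Pro-U.S.) and categorized by stance, neutrality, and refusal rates. The findings show significant and consistent ideological alignments correlated with the LLMs' geographic origins: U.S.-based models predominantly favored Pro-U.S. stances, while Chinese-origin models exhibited pronounced Pro-China biases. Language and prompt framing substantially influenced responses, with several LLMs reversing stance depending on prompt polarity or linguistic context. The study also introduces metrics for response consistency across languages and framing conditions, which identify variability and vulnerabilities in model behavior. The results can guide organizations and individuals in selecting LLMs that fit their operational priorities and geopolitical considerations, and they highlight prompt structures and linguistic variations that trigger distinct responses.

Link: https://arxiv.org/abs/2503.23688
Authors: William Guey, Pierrick Bougault, Vitor D. de Moura, Wei Zhang, Jose O. Gomes
Affiliations: unknown
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Preliminary version, 20 pages, 10 figures, 1 table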

Abstract:This study systematically analyzes geopolitical bias across 11 prominent Large Language Models (LLMs) by examining their responses to seven critical topics in U.S.-China relations. Utilizing a bilingual (English and Chinese) and dual-framing (affirmative and reverse) methodology, we generated 19,712 prompts designed to detect ideological leanings in model outputs. Responses were quantitatively assessed on a normalized scale from -2 (strongly Pro-China) to +2 (strongly Pro-U.S.) and categorized according to stance, neutrality, and refusal rates. The findings demonstrate significant and consistent ideological alignments correlated with the LLMs’ geographic origins; U.S.-based models predominantly favored Pro-U.S. stances, while Chinese-origin models exhibited pronounced Pro-China biases. Notably, language and prompt framing substantially influenced model responses, with several LLMs exhibiting stance reversals based on prompt polarity or linguistic context. Additionally, we introduced comprehensive metrics to evaluate response consistency across languages and framing conditions, identifying variability and vulnerabilities in model behaviors. These results offer practical insights that can guide organizations and individuals in selecting LLMs best aligned with their operational priorities and geopolitical considerations, underscoring the importance of careful model evaluation in politically sensitive applications. Furthermore, the research highlights specific prompt structures and linguistic variations that can strategically trigger distinct responses from models, revealing methods for effectively navigating and influencing LLM outputs.

[NLP-49] MKA: Leveraging Cross-Lingual Consensus for Model Abstention ICLR2025

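Quick read: This paper addresses the reliability of large language models (LLMs), in particular whether their outputs are factually accurate and whether they can properly calibrate confidence in their own responses. Its core aim is to use an LLM's multilingual knowledge to inform the decision to answer or abstain when prompted. The key to the solution is a multilingual calibration pipeline that quantifies the model's confidence and lets it abstain when uncertain. The study runs several multilingual models through the pipeline to profile them across languages and finds that the pipeline generally helps, for example improving accuracy by 71.2% for Bengali and by 15.5% for English. These results suggest the approach has potential for making LLMs more reliable.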

Link: https://arxiv.org/abs/2503.23687
Authors: Sharad Duwal
Institutions: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To appear in Building Trust Workshop at ICLR 2025

Abstract:Reliability of LLMs is questionable even as they get better at more tasks. A wider adoption of LLMs is contingent on whether they are usably factual. And if they are not, on whether they can properly calibrate their confidence in their responses. This work focuses on utilizing the multilingual knowledge of an LLM to inform its decision to abstain or answer when prompted. We develop a multilingual pipeline to calibrate the model’s confidence and let it abstain when uncertain. We run several multilingual models through the pipeline to profile them across different languages. We find that the performance of the pipeline varies by model and language, but that in general they benefit from it. This is evidenced by the accuracy improvement of 71.2% for Bengali over a baseline performance without the pipeline. Even a high-resource language like English sees a 15.5% improvement. These results hint at possible further improvements.
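To make the abstain-or-answer idea concrete, here is a minimal sketch. It assumes that confidence is the share of languages agreeing on the most common answer and that a fixed threshold triggers abstention; the abstract does not say how the paper's pipeline computes confidence, so both choices, and the function name, are hypothetical.

```python
from collections import Counter

def answer_or_abstain(answers_by_language, threshold=0.6):
    """Return the consensus answer, or None (abstain) when agreement is low.

    answers_by_language maps a language code to the model's answer to the same
    question asked in that language, already normalized to a common form.
    """
    if not answers_by_language:
        return None
    counts = Counter(answers_by_language.values())
    answer, votes = counts.most_common(1)[0]
    confidence = votes / len(answers_by_language)  # assumed confidence proxy
    return answer if confidence >= threshold else None
```

Raising `threshold` trades coverage for accuracy: the model answers fewer questions but only those on which its answers agree across more languages.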

[NLP-50] Large Language Models Pass the Turing Test

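Quick read: This paper evaluates several Large Language Models (LLMs) in a standard three-party Turing test to determine whether any artificial system can pass it. The approach is to run two randomised, controlled, pre-registered experiments in which participants hold time-limited conversations with a human and with one of four systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5), then judge which partner is human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time, significantly more often than the real human participants. LLaMa-3.1 was not significantly different from humans, and the baseline systems (ELIZA and GPT-4o) had win rates significantly below chance. The study provides the first empirical evidence that an artificial system passes a standard three-party Turing test, and it bears on debates about what kind of intelligence LLMs exhibit and about their likely social and economic impacts.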

Link: https://arxiv.org/abs/2503.23674
Authors: Cameron R. Jones, Benjamin K. Bergen
Institutions: UC San Diego
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time – not significantly more or less often than the humans they were being compared to – while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.

[NLP-51] WHERE and WHICH: Iterative Debate for Biomedical Synthetic Data Augmentation

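Quick read: This paper addresses the scarcity of high-quality data in Biomedical Natural Language Processing (BioNLP) tasks, which limits the ability of large language models to correctly understand relationships between biological entities (such as molecules and diseases, or drug interactions) and can lead to misinterpretation of biomedical documents. Current approaches generally rely on Synthetic Data Augmentation, but they often generate counterfactual data, disrupting meaningful word sets or producing sentences whose meaning deviates from the original context, and so fail to improve model performance.

The key to the solution is a rationale-based synthetic data augmentation method dedicated to the biomedical domain. Rather than relying only on lexicon similarity, the method measures bio-relation similarity so that augmented instances remain strongly correlated with the bio-relation, instead of simply increasing data diversity. In addition, a multi-agent reflection mechanism helps the model iteratively distinguish different usages of similar entities and avoid mis-replacement. Experiments on 9 common datasets from the BLURB and BigBIO benchmarks show consistent improvements across tasks, supporting the method's effectiveness in mitigating data scarcity and improving overall BioNLP model performance.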

Link: https://arxiv.org/abs/2503.23673
Authors: Zhengyi Zhao, Shubo Zhang, Bin Liang, Binyang Li, Kam-Fai Wong
Institutions: The Chinese University of Hong Kong; University of International Relations
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In Biomedical Natural Language Processing (BioNLP) tasks, such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high-quality data remains a significant challenge. This limitation hinders large language models from correctly understanding relationships between biological entities, such as molecules and diseases, or drug interactions, and further results in potential misinterpretation of biomedical documents. To address this issue, current approaches generally adopt the Synthetic Data Augmentation method, which involves similarity computation followed by word replacement, but counterfactual data are usually generated. As a result, these methods disrupt meaningful word sets or produce sentences with meanings that deviate substantially from the original context, rendering them ineffective in improving model performance. To this end, this paper proposes a biomedical-dedicated rationale-based synthetic data augmentation method. Beyond naive lexicon similarity, specific bio-relation similarity is measured so that each augmented instance keeps a strong correlation with the bio-relation, instead of simply increasing the diversity of augmented data. Moreover, a multi-agent reflection mechanism helps the model iteratively distinguish different usages of similar entities to avoid falling into the mis-replace trap. We evaluate our method on the BLURB and BigBIO benchmarks, which include 9 common datasets spanning four major BioNLP tasks. Our experimental results demonstrate consistent performance improvements across all tasks, highlighting the effectiveness of our approach in addressing the challenges associated with data scarcity and enhancing the overall performance of biomedical NLP models.

[NLP-52] CrossFormer: Cross-Segment Semantic Fusion for Document Segmentation

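Quick read: This paper addresses the loss of critical cross-segment semantic information in traditional text semantic segmentation methods, which preprocess documents into segments to fit input length limits. It proposes CrossFormer, a transformer-based model whose key innovation is a novel cross-segment fusion module that dynamically models latent semantic dependencies across document segments, substantially improving segmentation accuracy. CrossFormer can also replace rule-based chunking in Retrieval-Augmented Generation (RAG) systems, producing more semantically coherent chunks and improving overall efficacy. Comprehensive evaluations show state-of-the-art performance on public text semantic segmentation datasets and considerable gains on RAG benchmarks.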

Link: https://arxiv.org/abs/2503.23671
Authors: Tongke Ni, Yang Fan, Junru Zhou, Xiangping Wu, Qingcai Chen
Institutions: Harbin Institute of Technology (Shenzhen); Tencent Inc.
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures

Abstract:Text semantic segmentation involves partitioning a document into multiple paragraphs with continuous semantics based on the subject matter, contextual information, and document structure. Traditional approaches have typically relied on preprocessing documents into segments to address input length constraints, resulting in the loss of critical semantic information across segments. To address this, we present CrossFormer, a transformer-based model featuring a novel cross-segment fusion module that dynamically models latent semantic dependencies across document segments, substantially elevating segmentation accuracy. Additionally, CrossFormer can replace rule-based chunk methods within the Retrieval-Augmented Generation (RAG) system, producing more semantically coherent chunks that enhance its efficacy. Comprehensive evaluations confirm CrossFormer’s state-of-the-art performance on public text semantic segmentation datasets, alongside considerable gains on RAG benchmarks.

[NLP-53] The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR

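Quick read: This paper addresses data scarcity for code-switching, a phenomenon that must be handled to build user-friendly language technologies, and examines how code-switched data augmentation can improve natural language processing (NLP) tasks. Its focus is the relationship between synthetic data quality and task improvements. It extends prior work on machine translation (MT) with results on automatic speech recognition (ASR) and cascaded speech translation (ST) to test whether the findings generalize. The experiments cover a range of augmentation techniques, including lexical replacements, linguistic theories, and back-translation, and yield conclusions about the efficacy of each technique and the impact of quality on performance.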

Link: https://arxiv.org/abs/2503.23576
Authors: Injy Hamed, Ngoc Thang Vu, Nizar Habash
Institutions: MBZUAI; University of Stuttgart; New York University Abu Dhabi
Categories: Computation and Language (cs.CL)
Comments: Accepted to the Workshop on Computational Approaches to Linguistic Code-Switching (CALCS)

Abstract:Code-switching, the act of alternating between languages, emerged as a prevalent global phenomenon that needs to be addressed for building user-friendly language technologies. A main bottleneck in this pursuit is data scarcity, motivating research in the direction of code-switched data augmentation. However, current literature lacks comprehensive studies that enable us to understand the relation between the quality of synthetic data and improvements on NLP tasks. We extend previous research conducted in this direction on machine translation (MT) with results on automatic speech recognition (ASR) and cascaded speech translation (ST) to test generalizability of findings. Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. Based on the results of MT, ASR, and ST, we draw conclusions and insights regarding the efficacy of various augmentation techniques and the impact of quality on performance.

[NLP-54] When LLM Therapists Become Salespeople: Evaluating Large Language Models for Ethical Motivational Interviewing

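Quick read: This paper addresses the insufficient ethical awareness of Large Language Models (LLMs) when applying psychotherapy, in particular Motivational Interviewing (MI). The study finds that although LLMs have moderate to strong knowledge of MI, their ethical standards are not aligned with the MI spirit: they generate unethical responses and perform poorly at detecting unethical ones. The key contribution is a prompting strategy called "Chain-of-Ethic", which improves both ethical MI response generation and the detection of unethical responses. The findings highlight the need for safety evaluations and ethical guidelines when building LLM-powered psychotherapy tools.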

Link: https://arxiv.org/abs/2503.23566
Authors: Haein Kong, Seonghyeon Moon
Institutions: Rutgers University; Roblox
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have been actively applied in the mental health field. Recent research shows the promise of LLMs in applying psychotherapy, especially motivational interviewing (MI). However, there is a lack of studies investigating how language models understand MI ethics. Given the risks that malicious actors can use language models to apply MI for unethical purposes, it is important to evaluate their capability of differentiating ethical and unethical MI practices. Thus, this study investigates the ethical awareness of LLMs in MI with multiple experiments. Our findings show that LLMs have a moderate to strong level of knowledge in MI. However, their ethical standards are not aligned with the MI spirit, as they generated unethical responses and performed poorly in detecting unethical responses. We proposed a Chain-of-Ethic prompt to mitigate those risks and improve safety. Finally, our proposed strategy effectively improved ethical MI response generation and detection performance. These findings highlight the need for safety evaluations and guidelines for building ethical LLM-powered psychotherapy.

[NLP-55] NRC VAD Lexicon v2: Norms for Valence Arousal and Dominance for over 55k English Terms

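Quick read: This paper builds and extends the NRC VAD Lexicon v2, a lexicon of Valence, Arousal, and Dominance (VAD) ratings for English words and phrases, to support research across many fields. The problem it addresses is how to systematically quantify the affective properties of a large vocabulary and close coverage gaps in existing resources, adding about 25,000 single words and, for the first time, about 10,000 multi-word phrases. The key to the solution is human annotation that yields VAD ratings for more than 55,000 words and phrases, together with evidence that the associations are highly reliable, providing a basic resource for psychology, natural language processing, public health, digital humanities, and the social sciences.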

Link: https://arxiv.org/abs/2503.23547
Authors: Saif M. Mohammad
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D) (also referred to in social cognition research as Competence (C)). These dimensions impact various aspects of our lives from social competence and emotion regulation to success in the workplace and how we view the world. We present here the NRC VAD Lexicon v2, which has human ratings of valence, arousal, and dominance for more than 55,000 English words and phrases. Notably, it adds entries for ~25k additional words to v1.0. It also now includes for the first time entries for common multi-word phrases (~10k). We show that the associations are highly reliable. The lexicon enables a wide variety of research in psychology, NLP, public health, digital humanities, and social sciences. The NRC VAD Lexicon v2 is made freely available for research through our project webpage.

[NLP-56] Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

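Quick read: This paper addresses the performance drop of multilingual automatic speech recognition (ASR) systems on minority languages, where the models struggle with fine linguistic distinctions. The key to the solution is combining traditional statistical language models and large language models with fine-tuned Whisper models, and optimizing the language model parameters to improve performance in low-resource settings. The approach builds on the broad coverage of Whisper's pre-training data while improving its linguistic adaptability, and it substantially reduces word error rate, with improvements of up to 51% on in-distribution datasets and up to 34% on out-of-distribution sentences.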

Link: https://arxiv.org/abs/2503.23542
Authors: Xabier de Zuazo, Eva Navas, Ibon Saratxaga, Inma Hernáez Rioja
Institutions: HiTZ - University of the Basque Country - UPV/EHU
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 6 figures, includes supplementary materials. Will be submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

Abstract:Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvement across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting the results using transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at this http URL.
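One common way to combine an ASR model with a language model is to rescore beam hypotheses with a weighted sum of the two log-probabilities plus a length bonus. The sketch below shows that generic technique only; the abstract does not confirm this is the paper's exact integration, and `rescore`, `alpha`, `beta`, and `lm_logprob` are hypothetical names.

```python
def rescore(hypotheses, lm_logprob, alpha=0.5, beta=1.0):
    """Pick the hypothesis maximizing ASR score + weighted LM score + length bonus.

    hypotheses: list of (text, asr_logprob) pairs from the ASR beam.
    lm_logprob: callable returning a log-probability for a text (stand-in for
                a statistical or neural language model).
    alpha, beta: LM weight and per-word bonus; these are the kind of language
                 model parameters that would need tuning per language.
    """
    def combined(item):
        text, asr_logprob = item
        return asr_logprob + alpha * lm_logprob(text) + beta * len(text.split())

    return max(hypotheses, key=combined)[0]
```

With `alpha=0` the language model is ignored and the choice depends only on the ASR score and length, which gives a simple baseline when tuning the weights.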

[NLP-57] Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models

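Quick read: This paper addresses the difficulty Large Language Models (LLMs) have with tasks that require external knowledge, such as knowledge-intensive Multiple Choice Question Answering (MCQA). Existing methods typically require costly fine-tuning or retrieve noisy Knowledge Graph (KG) information. Recent approaches that use Graph Neural Networks (GNNs) to generate KG-based input embeddings as soft prompts are an improvement, but they do not account for question relevance and therefore produce noisy prompts. In MCQA, the absence of relevant KG knowledge for some answer options also remains a significant challenge. To address these issues, the paper proposes Question-Aware Knowledge Graph Prompting (QAP). Its key idea is to incorporate question embeddings into GNN aggregation to dynamically assess KG relevance, and to use global attention to capture inter-option relationships, enriching the soft prompts with inferred knowledge. Experiments show that QAP outperforms state-of-the-art methods across multiple datasets.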

Link: https://arxiv.org/abs/2503.23523
Authors: Haochen Liu, Song Wang, Chen Chen, Jundong Li
Institutions: University of Virginia
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) often struggle with tasks requiring external knowledge, such as knowledge-intensive Multiple Choice Question Answering (MCQA). Integrating Knowledge Graphs (KGs) can enhance reasoning; however, existing methods typically demand costly fine-tuning or retrieve noisy KG information. Recent approaches leverage Graph Neural Networks (GNNs) to generate KG-based input embedding prefixes as soft prompts for LLMs but fail to account for question relevance, resulting in noisy prompts. Moreover, in MCQA tasks, the absence of relevant KG knowledge for certain answer options remains a significant challenge. To address these issues, we propose Question-Aware Knowledge Graph Prompting (QAP), which incorporates question embeddings into GNN aggregation to dynamically assess KG relevance. QAP employs global attention to capture inter-option relationships, enriching soft prompts with inferred knowledge. Experimental results demonstrate that QAP outperforms state-of-the-art methods across multiple datasets, highlighting its effectiveness.

[NLP-58] If an LLM Were a Character Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

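Quick read: This paper examines the tension between the "stateless" nature of Large Language Models (LLMs) and the consistent, character-like behavior they begin to show in multi-turn, multi-agent interactions, and asks whether they have a human-like capacity for lifelong learning. Existing benchmarks often miss these dynamics because they focus on static, open-ended evaluation. To fill the gap, the paper introduces LIFESTATE-BENCH, a benchmark for assessing lifelong learning in LLMs. It features two episodic datasets, one based on Hamlet and one a synthetic script collection, and uses fact-checking tasks to probe self-awareness, episodic memory retrieval, and relationship tracking across parametric and non-parametric approaches. Experiments show that non-parametric methods significantly outperform parametric ones in managing stateful learning, but all models struggle with catastrophic forgetting as interactions extend, underscoring the need for further research on lifelong learning in LLMs.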

Link: https://arxiv.org/abs/2503.23514
Authors: Siqi Fan, Xiusheng Huang, Yiqun Yao, Xuezhi Fang, Kang Liu, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang
Institutions: Beijing Academy of Artificial Intelligence; University of Electronic Science and Technology of China; Institute of Computing Automation, Chinese Academy of Sciences; Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) can carry out human-like dialogue, but unlike humans, they are stateless due to the superposition property. However, during multi-turn, multi-agent interactions, LLMs begin to exhibit consistent, character-like behaviors, hinting at a form of emergent lifelong learning. Despite this, existing benchmarks often fail to capture these dynamics, primarily focusing on static, open-ended evaluations. To address this gap, we introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in LLMs. It features two episodic datasets: Hamlet and a synthetic script collection, rich in narrative structure and character interactions. Our fact checking evaluation probes models’ self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches. In experiments on models such as Llama3.1-8B, GPT-4-turbo, and DeepSeek R1, we demonstrate that nonparametric methods significantly outperform parametric ones in managing stateful learning. However, all models exhibit challenges with catastrophic forgetting as interactions extend, highlighting the need for further advancements in lifelong learning.

[NLP-59] RARE: Retrieval-Augmented Reasoning Modeling

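Quick read: This paper addresses knowledge hallucination and inadequate reasoning in Large Language Models (LLMs) for domain-specific intelligence under constrained parameter budgets. It proposes a new paradigm, Retrieval-Augmented Reasoning Modeling (RARE), which decouples knowledge storage from reasoning optimization. RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training: by injecting retrieved knowledge into training prompts, it shifts the learning objective from rote memorization to contextualized reasoning. This lets models avoid parameter-intensive memorization and prioritize higher-order cognitive processes. Experiments show that lightweight RARE-trained models outperform retrieval-augmented GPT-4 and Deepseek-R1 distilled counterparts.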

Link: https://arxiv.org/abs/2503.23513
Authors: Zhengren Wang, Jiayang Yu, Dongsheng Ma, Zhe Chen, Yu Wang, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Weinan E, Linpeng Tang, Wentao Zhang
Institutions: Peking University; Shanghai Jiao Tong University; Northeastern University; Nankai University; Institute for Advanced Algorithms Research, Shanghai; OriginHub Technology; MemTensor; Shanghai Artificial Intelligence Laboratory
Categories: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Domain-specific intelligence demands specialized knowledge and sophisticated reasoning for problem-solving, posing significant challenges for large language models (LLMs) that struggle with knowledge hallucination and inadequate reasoning capabilities under constrained parameter budgets. Inspired by Bloom’s Taxonomy in educational theory, we propose Retrieval-Augmented Reasoning Modeling (RARE), a novel paradigm that decouples knowledge storage from reasoning optimization. RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training. Specifically, by injecting retrieved knowledge into training prompts, RARE transforms learning objectives from rote memorization to contextualized reasoning application. It enables models to bypass parameter-intensive memorization and prioritize the development of higher-order cognitive processes. Our experiments demonstrate that lightweight RARE-trained models (e.g., Llama-3.1-8B) could achieve state-of-the-art performance, surpassing retrieval-augmented GPT-4 and Deepseek-R1 distilled counterparts. RARE establishes a paradigm shift where maintainable external knowledge bases synergize with compact, reasoning-optimized models, collectively driving more scalable domain-specific intelligence. Repo: this https URL

[NLP-60] SCORE: Story Coherence and Retrieval Enhancement for AI Narratives

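Quick read: This paper addresses the weakness of large language models (LLMs) in maintaining long-term coherence and emotional consistency in complex stories. The key to the solution is SCORE (Story Coherence and Retrieval Enhancement), a framework that integrates three core components: 1) Dynamic State Tracking (monitoring objects and characters via symbolic logic), 2) Context-Aware Summarization (hierarchical episode summaries for temporal progression), and 3) Hybrid Retrieval (combining TF-IDF keyword relevance with cosine similarity-based semantic embeddings). The system also uses a temporally aligned Retrieval-Augmented Generation (RAG) pipeline to validate contextual consistency. With this design, SCORE is reported to outperform baseline GPT models in coherence, emotional consistency, and hallucination reduction.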

Link: https://arxiv.org/abs/2503.23512
Authors: Qiang Yi, Yangfan He, Jianhui Wang, Xinyuan Song, Shiyao Qian, Miao Zhang, Li Sun, Tianyu Shi
Institutions: University of California, Berkeley; University of Minnesota - Twin Cities; University of Electronic Science and Technology of China; Emory University; University of Toronto; Tsinghua University; Amazon
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) excel at generating creative narratives but struggle with long-term coherence and emotional consistency in complex stories. To address this, we propose SCORE (Story Coherence and Retrieval Enhancement), a framework integrating three components: 1) Dynamic State Tracking (monitoring objects/characters via symbolic logic), 2) Context-Aware Summarization (hierarchical episode summaries for temporal progression), and 3) Hybrid Retrieval (combining TF-IDF keyword relevance with cosine similarity-based semantic embeddings). The system employs a temporally-aligned Retrieval-Augmented Generation (RAG) pipeline to validate contextual consistency. Evaluations show SCORE achieves 23.6% higher coherence (NCI-2.0 benchmark), 89.7% emotional consistency (EASM metric), and 41.8% fewer hallucinations versus baseline GPT models. Its modular design supports incremental knowledge graph construction for persistent story memory and multi-LLM backend compatibility, offering an explainable solution for industrial-scale narrative systems requiring long-term consistency.
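The Hybrid Retrieval component (TF-IDF keyword relevance combined with cosine similarity over embeddings) can be sketched in a few lines. This is an illustrative version under stated assumptions: the linear mixing weight, the whitespace tokenizer, the smoothed IDF, and precomputed embedding vectors are all assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def tfidf_score(query, doc, docs):
    """Sum of TF-IDF weights of query terms found in doc (whitespace tokens)."""
    tokens = doc.lower().split()
    tf = Counter(tokens)
    score = 0.0
    for term in set(query.lower().split()):
        df = sum(1 for d in docs if term in d.lower().split())
        if df and tf[term]:
            idf = math.log(len(docs) / df) + 1.0  # smoothed IDF (assumption)
            score += (tf[term] / len(tokens)) * idf
    return score

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_rank(query, query_vec, docs, doc_vecs, weight=0.5):
    """Rank document indices by weight * TF-IDF + (1 - weight) * cosine."""
    scores = [
        weight * tfidf_score(query, d, docs) + (1 - weight) * cosine(query_vec, v)
        for d, v in zip(docs, doc_vecs)
    ]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In practice `query_vec` and `doc_vecs` would come from a sentence-embedding model; here they are plain lists so the sketch stays self-contained.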

[NLP-61] Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies in Vision-Language Models ICLR2025

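Quick read: This paper addresses prompt optimization for vision-language models on multimodal reasoning tasks, aiming to improve performance without retraining the model. The key to the solution is an evolutionary algorithm for prompt updates: through "survival of the fittest" iteration, it guides the language model to discover progressively better problem-solving strategies on its own over several generations. The approach lets the model use tool calls (for example, explicitly triggering access to a Python interpreter via system-level XML tags) to generate relevant programs, yielding relative improvements of up to about 50% on select visual tasks. It also improves zero-shot generalization across tasks.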

Link: https://arxiv.org/abs/2503.23503
Authors: Sid Bharthulwar, John Rho, Katrina Brown
Institutions: Harvard College
Categories: Computation and Language (cs.CL)
Comments: Published at ICLR 2025 Workshop on Reasoning and Planning for LLMs

Abstract:We present a framework for optimizing prompts in vision-language models to elicit multimodal reasoning without model retraining. Using an evolutionary algorithm to guide prompt updates downstream of visual tasks, our approach improves upon baseline prompt-updating algorithms, which lack evolution-style “survival of the fittest” iteration. Crucially, we find this approach enables the language model to independently discover progressive problem-solving techniques across several evolution generations. For example, the model reasons that to “break down” visually complex spatial tasks, making a tool call to a Python interpreter to perform tasks (such as cropping, image segmentation, or saturation changes) would improve performance significantly. Our experimentation shows that explicitly evoking this “tool calling” call, via system-level XML <tool> ... </tool> tags, can effectively flag Python interpreter access for the same language model to generate relevant programs, generating advanced multimodal functionality. This functionality can be crystallized into a system-level prompt that induces improved performance at inference time, and our experimentation suggests up to approximately 50% relative improvement across select visual tasks. Downstream performance is trained and evaluated across subtasks from MathVista, M3CoT, and GeoBench-VLM datasets. Importantly, our approach shows that evolutionary prompt optimization guides language models towards self-reasoning discoveries, which result in improved zero-shot generalization across tasks.
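The "survival of the fittest" iteration over prompts can be sketched as a generic evolutionary loop. This is a minimal illustration, not the authors' algorithm: in the paper's setting `mutate` would presumably be a language-model call that rewrites a prompt and `fitness` would be accuracy on a visual task, whereas here both are hypothetical callables supplied by the caller.

```python
import random

def evolve_prompts(seed_prompts, mutate, fitness, generations=5,
                   survivors=2, children=3, rng=None):
    """Keep the fittest prompts each generation and mutate them into children.

    mutate(prompt, rng) -> new prompt; fitness(prompt) -> number (higher is better).
    Parents are carried over unchanged, so the best fitness never decreases.
    """
    rng = rng or random.Random(0)
    population = list(seed_prompts)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:survivors]
        offspring = [mutate(p, rng) for p in parents for _ in range(children)]
        population = parents + offspring
    return max(population, key=fitness)
```

Carrying parents into the next generation (elitism) is a design choice made here for simplicity; without it a good prompt could be lost to an unlucky mutation.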

[NLP-62] Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models ACL2025

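Quick read: This paper addresses the weakness of Large Language Models (LLMs) in systematic reasoning, in particular their poor performance on out-of-distribution problems. Post-training strategies based on reinforcement learning and chain-of-thought prompting improve performance, but their effect has not been well tested on tasks that require reasoning about relational compositions, especially qualitative spatial and temporal reasoning. The paper's key contribution is a class of task instances that allow precise measurement of generalization, which it uses to evaluate current LLMs and the resulting Large Reasoning Models (LRMs) and to expose their limitations in systematic reasoning.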

Link: https://arxiv.org/abs/2503.23487
Authors: Irtaza Khalid, Amir Masoud Nourollah, Steven Schockaert
Institutions: Cardiff University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to ACL 2025

Abstract:Large Language Models (LLMs) have been found to struggle with systematic reasoning. Even on tasks where they appear to perform well, their performance often depends on shortcuts, rather than on genuine reasoning abilities, leading them to collapse on out-of-distribution examples. Post-training strategies based on reinforcement learning and chain-of-thought prompting have recently been hailed as a step change. However, little is still known about the potential of the resulting “Large Reasoning Models” (LRMs) beyond problem solving in mathematics and programming, where finding genuine out-of-distribution problems can be difficult. In this paper, we focus on tasks that require systematic reasoning about relational compositions, especially for qualitative spatial and temporal reasoning. These tasks allow us to control the difficulty of problem instances, and measure in a precise way to what extent models can generalise. We find that the considered LLMs and LRMs perform poorly overall, albeit better than random chance.

[NLP-63] Order Independence With Finetuning ICLR2025

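Quick read: This paper addresses order dependence in Large Language Models (LLMs) on NLP tasks: reordering semantically identical tokens (such as the answer choices in a multiple-choice question) can lead to inconsistent predictions. It proposes a fine-tuning strategy that integrates Set-Based Prompting (SBP) into training, "pulling" set-formatted prompts closer to the model's training manifold. This removes order information from designated token subsets while avoiding the out-of-distribution degradation that the changed input format would otherwise cause. With SBP integrated through fine-tuning, the model significantly improves accuracy and robustness to answer-order permutations while preserving broader language modeling capabilities.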

Link: https://arxiv.org/abs/2503.23483
Authors: Katrina Brown, Reid McIlroy
Institutions: Harvard University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published as a Bi-Align workshop paper at ICLR 2025

Abstract:Large language models (LLMs) demonstrate remarkable performance on many NLP tasks, yet often exhibit order dependence: simply reordering semantically identical tokens (e.g., answer choices in multiple-choice questions) can lead to inconsistent predictions. Recent work proposes Set-Based Prompting (SBP) as a way to remove order information from designated token subsets, thereby mitigating positional biases. However, applying SBP on base models induces an out-of-distribution input format, which can degrade in-distribution performance. We introduce a fine-tuning strategy that integrates SBP into the training process, “pulling” these set-formatted prompts closer to the model’s training manifold. We show that SBP can be incorporated into a model via fine-tuning. Our experiments on in-distribution (MMLU) and out-of-distribution (CSQA, ARC Challenge) multiple-choice tasks show that SBP fine-tuning significantly improves accuracy and robustness to answer-order permutations, all while preserving broader language modeling capabilities. We discuss the broader implications of order-invariant modeling and outline future directions for building fairer, more consistent LLMs.
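Order dependence as described in the abstract can be measured by asking the same question under every ordering of the answer options and checking how often the prediction stays the same. The sketch below measures the problem only and does not implement SBP; `order_consistency` and `predict` are hypothetical names, with `predict` standing in for a model call.

```python
from collections import Counter
from itertools import permutations

def order_consistency(question, options, predict):
    """Fraction of option orderings on which `predict` picks the modal answer.

    predict(question, ordered_options) returns the chosen option text.
    A value of 1.0 means the prediction never changes with the ordering.
    """
    picks = [predict(question, list(order)) for order in permutations(options)]
    return Counter(picks).most_common(1)[0][1] / len(picks)
```

Because the number of orderings grows factorially, this exhaustive version only suits short option lists; sampling a fixed number of random permutations would be the obvious variant for longer ones.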

[NLP-64] Codehacks: A Dataset of Adversarial Tests for Competitive Programming Problems Obtained from Codeforces

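Quick read: This paper addresses latent errors that go undetected because of insufficient test cases (false negatives) in software testing, particularly for software generated by large language models. Its key contribution is Codehacks, a dataset of programming problems paired with corresponding "hacks" (error-inducing test cases). The data come from the Codeforces online judge platform and comprise 288,617 hacks for 5,578 programming problems, along with the source code of 2,196 submitted solutions that can be broken by their corresponding hacks. The dataset supports data-driven creation of test suites that expose latent defects, improving the reliability of software evaluation.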

Link: https://arxiv.org/abs/2503.23466
Authors: Max Hort, Leon Moonen
Institutions: Simula Research Laboratory
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for publication at the 18th IEEE International Conference on Software Testing, Verification and Validation (ICST 2025)

Abstract:Software is used in critical applications in our day-to-day life and it is important to ensure its correctness. One popular approach to assess correctness is to evaluate software on tests. If a test fails, it indicates a fault in the software under test; if all tests pass correctly, one may assume that the software is correct. However, the reliability of these results depends on the test suite considered, and there is a risk of false negatives (i.e. software that passes all available tests but contains bugs because some cases are not tested). Therefore, it is important to consider error-inducing test cases when evaluating software. To support data-driven creation of such a test-suite, which is especially of interest for testing software synthesized from large language models, we curate a dataset (Codehacks) of programming problems together with corresponding error-inducing test cases (i.e., “hacks”). This dataset is collected from the wild, in particular, from the Codeforces online judge platform. The dataset comprises 288,617 hacks for 5,578 programming problems, each with a natural language description, as well as the source code for 2,196 submitted solutions to these problems that can be broken with their corresponding hacks.
Keywords: competitive programming, language model, dataset

[NLP-65] Semantic-Preserving Transformations as Mutation Operators: A Study on Their Effectiveness in Defect Detection

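Quick read: This paper explores whether semantic-preserving transformations, analogous to mutation operators, can improve the performance of defect detection tools in the testing stage. Existing work mainly augments training data to make models more robust on semantically identical code, and has not considered using such transformations while the tools are being applied, a concept closely related to metamorphic testing. The paper collects and assesses existing implementations of semantic-preserving transformations and studies the effectiveness of three ensemble strategies combined with defect detection models (VulBERTa and PLBART). The results show that reusing shared semantic-preserving transformations is difficult, that some transformations wrongly change code semantics, and that the approach did not increase the accuracy of the defect detection models.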

链接: https://arxiv.org/abs/2503.23448
作者: Max Hort,Linas Vidziunas,Leon Moonen
机构: Simula Research Laboratory (西蒙拉研究实验室)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted for publication in Mutation 2025 at the 18th IEEE International Conference on Software Testing, Verification and Validation (ICST 2025)

点击查看摘要

Abstract:Recent advances in defect detection use language models. Existing works enhanced the training data to improve the models’ robustness when applied to semantically identical code (i.e., predictions should be the same). However, the use of semantically identical code has not been considered for improving the tools during their application - a concept closely related to metamorphic testing. The goal of our study is to determine whether we can use semantic-preserving transformations, analogous to mutation operators, to improve the performance of defect detection tools in the testing stage. We first collect existing publications which implemented semantic-preserving transformations and share their implementation, such that we can reuse them. We empirically study the effectiveness of three different ensemble strategies for enhancing defect detection tools. We apply the collected transformations on the Devign dataset, considering vulnerabilities as a type of defect, and two fine-tuned large language models for defect detection (VulBERTa, PLBART). We found 28 publications with 94 different transformations. We choose to implement 39 transformations from four of the publications, but a manual check revealed that 23 out of 39 transformations change code semantics. Using the 16 remaining, correct transformations and three ensemble strategies, we were not able to increase the accuracy of the defect detection models. Our results show that reusing shared semantic-preserving transformations is difficult, sometimes even causing wrongful changes to the semantics.
Keywords: defect detection, language model, semantic-preserving transformation, ensemble
Cite as: arXiv:2503.23448 [cs.SE] (or arXiv:2503.23448v1 [cs.SE] for this version), https://doi.org/10.48550/arXiv.2503.23448
Submission history: [v1] Sun, 30 Mar 2025 14:00:22 UTC (631 KB), from Leon Moonen
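
To make the idea of an ensemble over semantic-preserving variants concrete, here is a minimal Python sketch. The abstract does not name its three ensemble strategies, so majority voting is an assumption, and `rename_variable`, `add_dead_code`, and `toy_detector` are hypothetical stand-ins rather than anything from the paper.

```python
from collections import Counter
from typing import Callable, List


def rename_variable(code: str) -> str:
    # Hypothetical transformation: rename the identifier `tmp` to `t0`.
    # A plain string replace can itself break semantics (e.g. on substrings),
    # which is the kind of pitfall the abstract reports.
    return code.replace("tmp", "t0")


def add_dead_code(code: str) -> str:
    # Hypothetical transformation: append a branch that never executes.
    return code + "\nif (0) { }"


def toy_detector(code: str) -> int:
    # Stand-in for a fine-tuned model: 1 = defective, 0 = clean.
    return 1 if "strcpy" in code else 0


def majority_vote(code: str,
                  detector: Callable[[str], int],
                  transformations: List[Callable[[str], str]]) -> int:
    # Classify the original and every transformed variant, then vote.
    variants = [code] + [t(code) for t in transformations]
    votes = Counter(detector(v) for v in variants)
    return votes.most_common(1)[0][0]
```

With an even number of variants a tie is possible; `most_common` then returns the label that was counted first, so a real setup would need an explicit tie-breaking rule.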

[NLP-66] Speculative End-Turn Detector for Efficient Speech Chatbot Assistant

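Quick read: This paper addresses end-turn detection (ETD) in spoken dialogue systems, i.e., distinguishing the end of a user's turn from a hesitation. Weakness on this task often causes premature or delayed responses that disrupt the flow of conversation. The paper proposes SpeculativeETD, a collaborative inference framework that pairs a lightweight GRU-based model, which detects non-speaking units in real time on the local device, with a high-performance Wav2vec-based model on the server that handles the harder classification. This balances efficiency and accuracy for real-time ETD in resource-constrained environments.

Link: https://arxiv.org/abs/2503.23439
Authors: Hyunjong Ok, Suho Yoo, Jaeho Lee
Affiliations: POSTECH; HJ AILAB
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Preprint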

Abstract:Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD) – the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects the non-speaking units in real-time on local devices, and a high-performance Wav2vec-based model running on the server to make a more challenging classification of distinguishing turn ends from mere pauses. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low. Datasets and code will be available after the review.
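
The division of labor between the local and server models can be sketched roughly as below. This is only an illustration of the control flow described in the abstract: both models are replaced by hypothetical callables (`local_is_silent`, `server_is_turn_end`), and the frame dictionaries are made up.

```python
from typing import Callable, Dict, List, Tuple

Frame = Dict[str, float]


def speculative_etd(frames: List[Frame],
                    local_is_silent: Callable[[Frame], bool],
                    server_is_turn_end: Callable[[Frame], bool]
                    ) -> Tuple[List[str], int]:
    """Label each frame and count how often the server model was called."""
    decisions: List[str] = []
    server_calls = 0
    for frame in frames:
        # Cheap on-device check: while the user is speaking, stay local.
        if not local_is_silent(frame):
            decisions.append("speaking")
            continue
        # Only non-speaking frames are escalated to the expensive model,
        # which separates a real turn end from a mere pause.
        server_calls += 1
        decisions.append("turn_end" if server_is_turn_end(frame) else "pause")
    return decisions, server_calls
```

The point of the split is that the server model runs only on the frames the local model flags, which is where the computational saving would come from.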

[NLP-67] CoRanking: Collaborative Ranking with Small and Large Ranking Agents

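Quick read: This paper tackles the efficiency problems that large language models (LLMs) face in listwise ranking, caused by their large parameter counts and the repetitive sliding-window process. It proposes CoRanking, a collaborative ranking framework that combines small and large ranking models. A small reranker first pre-ranks all candidate passages and brings the relevant ones to the top of the list (e.g., the top 20); the LLM listwise reranker then reranks only these top-ranked passages, which substantially improves overall efficiency. Because LLM listwise rerankers show significant positional bias with respect to the order of input passages, the paper also introduces a passage order adjuster, trained with reinforcement learning, that reorders the top passages from the small reranker to match the LLM's order preferences. Experiments on three IR benchmarks show that CoRanking reduces ranking latency by about 70% while achieving even better effectiveness than using the LLM listwise reranker alone.

Link: https://arxiv.org/abs/2503.23427
Authors: Wenhan Liu, Xinyu Ma, Yutao Zhu, Lixin Su, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Baidu Inc., Beijing, China
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: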

Abstract:Large Language Models (LLMs) have demonstrated superior listwise ranking performance. However, their superior performance often relies on large-scale parameters (e.g., GPT-4) and a repetitive sliding window process, which introduces significant efficiency challenges. In this paper, we propose CoRanking, a novel collaborative ranking framework that combines small and large ranking models for efficient and effective ranking. CoRanking first employs a small-size reranker to pre-rank all the candidate passages, bringing relevant ones to the top part of the list (e.g., top-20). Then, the LLM listwise reranker is applied to only rerank these top-ranked passages instead of the whole list, substantially enhancing overall ranking efficiency. Although this is more efficient, previous studies have revealed that the LLM listwise reranker has significant positional biases on the order of input passages. Directly feeding the top-ranked passages from the small reranker may result in sub-optimal performance of the LLM listwise reranker. To alleviate this problem, we introduce a passage order adjuster trained via reinforcement learning, which reorders the top passages from the small reranker to align with the LLM’s preferences of passage order. Extensive experiments on three IR benchmarks demonstrate that CoRanking significantly improves efficiency (reducing ranking latency by about 70%) while achieving even better effectiveness compared to using only the LLM listwise reranker.
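
The three-stage flow in the abstract (pre-rank, adjust order, LLM rerank on the head only) could look roughly like this. It is a sketch under assumptions: `overlap_score` is a toy scorer, and the order adjuster and LLM reranker are passed in as hypothetical callables instead of trained models.

```python
from typing import Callable, List


def overlap_score(query: str, passage: str) -> int:
    # Toy stand-in for the small reranker: count shared words.
    return len(set(query.split()) & set(passage.split()))


def coranking(query: str,
              passages: List[str],
              small_score: Callable[[str, str], float],
              order_adjuster: Callable[[str, List[str]], List[str]],
              llm_rerank: Callable[[str, List[str]], List[str]],
              top_k: int = 20) -> List[str]:
    # Stage 1: the small reranker scores every candidate passage.
    pre_ranked = sorted(passages,
                        key=lambda p: small_score(query, p),
                        reverse=True)
    head, tail = pre_ranked[:top_k], pre_ranked[top_k:]
    # Stage 2: reorder the head to suit the LLM's positional preferences
    # (trained with reinforcement learning in the paper; a callable here).
    head = order_adjuster(query, head)
    # Stage 3: the expensive LLM reranker only ever sees top_k passages.
    head = llm_rerank(query, head)
    return head + tail
```

Passing an identity function as `order_adjuster` recovers the plain two-stage pipeline, which makes it easy to compare with and without the adjuster.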

[NLP-68] What Makes an Evaluation Useful? Common Pitfalls and Best Practices

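Quick read: This paper addresses the lack of an agreed definition of a "good evaluation" in AI safety evaluation, and the need to support decisions on the safe use and development of AI systems. It presents a set of best practices drawn from prior work on model evaluation and illustrated with cybersecurity examples. It first covers the initial thought process that connects threat modeling to evaluation design, then describes the characteristics and parameters that make an evaluation useful, and finally discusses additional considerations when moving from specific evaluations to a comprehensive evaluation suite.

Link: https://arxiv.org/abs/2503.23424
Authors: Gil Gekker, Meirav Segal, Dan Lahav, Omer Nevo
Affiliations: Pattern Labs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: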

Abstract:Following the rapid increase in Artificial Intelligence (AI) capabilities in recent years, the AI community has voiced concerns regarding possible safety risks. To support decision-making on the safe use and development of AI systems, there is a growing need for high-quality evaluations of dangerous model capabilities. While several attempts to provide such evaluations have been made, a clear definition of what constitutes a “good evaluation” has yet to be agreed upon. In this practitioners’ perspective paper, we present a set of best practices for safety evaluations, drawing on prior work in model evaluation and illustrated through cybersecurity examples. We first discuss the steps of the initial thought process, which connects threat modeling to evaluation design. Then, we provide the characteristics and parameters that make an evaluation useful. Finally, we address additional considerations as we move from building specific evaluations to building a full and comprehensive evaluation suite.

[NLP-69] An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering

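Quick read: This paper addresses the accuracy limits that hallucination, i.e., factually inaccurate output, imposes on large language models (LLMs) in knowledge-intensive NLP tasks. Agentic frameworks such as Reasoning and Acting (ReAct) give the model access to external knowledge through retrieval, but LLMs often fail to remain faithful to the retrieved information, which matters most when they must reason over it. To mitigate this, the paper looks at training-free decoding strategies that improve faithfulness. Its core contribution is a systematic analysis of how combining ReAct with decoding methods (DeCoRe, DoLa, and CAD) affects the faithfulness of generated answers, showing that the combination helps on multi-hop question answering. For example, F1 on HotpotQA rises from 19.5 to 32.6 when using ReAct with DoLa.

Link: https://arxiv.org/abs/2503.23415
Authors: Alexander Murphy, Mohd Sanad Zaki Rizvi, Aden Haussmann, Ping Nie, Guifu Liu, Aryo Pradipta Gema, Pasquale Minervini
Affiliations: University of Edinburgh; Independent; Miniml.AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: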

Abstract:Large Language Models (LLMs) frequently produce factually inaccurate outputs - a phenomenon known as hallucination - which limits their accuracy in knowledge-intensive NLP tasks. Retrieval-augmented generation and agentic frameworks such as Reasoning and Acting (ReAct) can address this issue by giving the model access to external knowledge. However, LLMs often fail to remain faithful to retrieved information. Mitigating this is critical, especially if LLMs are required to reason about the retrieved information. Recent research has explored training-free decoding strategies to improve the faithfulness of model generations. We present a systematic analysis of how the combination of the ReAct framework and decoding strategies (i.e., DeCoRe, DoLa, and CAD) can influence the faithfulness of LLM-generated answers. Our results show that combining an agentic framework for knowledge retrieval with decoding methods that enhance faithfulness can increase accuracy on the downstream Multi-Hop Question Answering tasks. For example, we observe an F1 increase from 19.5 to 32.6 on HotpotQA when using ReAct and DoLa.
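
As a small illustration of what a training-free, faithfulness-oriented decoding step can look like, the sketch below applies the contrastive adjustment commonly associated with context-aware decoding (CAD): logits computed with the retrieved context are amplified against logits computed without it. The formula and the `alpha` weight reflect my understanding of CAD and are an assumption here; the abstract gives no formulas, and DeCoRe and DoLa work differently.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Numerically stable softmax over a small vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_aware_logits(with_ctx: List[float],
                         without_ctx: List[float],
                         alpha: float = 0.5) -> List[float]:
    # Assumed CAD-style rule: (1 + alpha) * logit(y | context, x)
    #                         - alpha * logit(y | x)
    # Tokens supported only by the model's prior are pushed down.
    return [(1 + alpha) * a - alpha * b
            for a, b in zip(with_ctx, without_ctx)]
```

With `alpha = 0` the adjustment disappears and decoding falls back to the ordinary context-conditioned distribution.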

[NLP-70] ToRL: Scaling Tool-Integrated RL

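Quick read: This paper studies how to train large language models (LLMs) to use computational tools autonomously. It proposes ToRL (Tool-Integrated Reinforcement Learning), a framework in which, unlike supervised fine-tuning, the model explores and discovers effective tool-use strategies through reinforcement learning. The key idea is to combine reward-driven learning with tool integration, from which behaviors such as strategic tool invocation, self-regulation of ineffective code, and dynamic switching between computational and analytical reasoning emerge. ToRL-7B reaches 43.3% accuracy on AIME 24.

Link: https://arxiv.org/abs/2503.23383
Authors: Xuefeng Li, Haoyang Zou, Pengfei Liu
Affiliations: Shanghai Jiao Tong University (SJTU)
Categories: Computation and Language (cs.CL)
Comments: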

Abstract:We introduce ToRL (Tool-Integrated Reinforcement Learning), a framework for training large language models (LLMs) to autonomously use computational tools via reinforcement learning. Unlike supervised fine-tuning, ToRL allows models to explore and discover optimal strategies for tool use. Experiments with Qwen2.5-Math models show significant improvements: ToRL-7B reaches 43.3% accuracy on AIME 24, surpassing reinforcement learning without tool integration by 14% and the best existing Tool-Integrated Reasoning (TIR) model by 17%. Further analysis reveals emergent behaviors such as strategic tool invocation, self-regulation of ineffective code, and dynamic adaptation between computational and analytical reasoning, all arising purely through reward-driven learning.

[NLP-71] FeRG-LLM: Feature Engineering by Reason Generation Large Language Models NAACL2025

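Quick read: Feature engineering for tabular data demands considerable human expertise and domain knowledge, which makes it labor-intensive. This paper proposes FeRG-LLM (Feature engineering by Reason Generation Large Language Models), an 8-billion-parameter language model that performs feature engineering automatically. The key is a set of two-stage conversational dialogues that let the model analyze a machine learning task and discover new features while exhibiting Chain-of-Thought (CoT) reasoning. These dialogues are used to fine-tune Llama 3.1 8B, and Direct Preference Optimization (DPO) is integrated to improve the quality of the new features and the model's performance. In the experiments, FeRG-LLM performs comparably to or better than Llama 3.1 70B on most datasets with fewer resources and shorter inference time; it outperforms other studies on classification tasks and performs well on regression tasks. Because it does not rely on cloud-hosted LLMs such as GPT-4, which incur extra API costs, it can be deployed locally, which addresses security concerns.

Link: https://arxiv.org/abs/2503.23371
Authors: Jeonghyun Ko, Gyeongyun Park, Donghoon Lee, Kyunam Lee
Affiliations: Korea University; SK Telecom
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL 2025 Findings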

Abstract:One of the key tasks in machine learning for tabular data is feature engineering. Although it is vital for improving the performance of models, it demands considerable human expertise and deep domain knowledge, making it a labor-intensive endeavor. To address this issue, we propose a novel framework, FeRG-LLM (Feature engineering by Reason Generation Large Language Models), a large language model designed to automatically perform feature engineering at an 8-billion-parameter scale. We have constructed two-stage conversational dialogues that enable language models to analyze machine learning tasks and discover new features, exhibiting their Chain-of-Thought (CoT) capabilities. We use these dialogues to fine-tune the Llama 3.1 8B model and integrate Direct Preference Optimization (DPO) to receive feedback improving the quality of new features and the model’s performance. Our experiments show that FeRG-LLM performs comparably to or better than Llama 3.1 70B on most datasets, while using fewer resources and achieving reduced inference time. It outperforms other studies in classification tasks and performs well in regression tasks. Moreover, since it does not rely on cloud-hosted LLMs like GPT-4 with extra API costs when generating features, it can be deployed locally, addressing security concerns.

[NLP-72] Large Language Models Are Better Logical Fallacy Reasoners with Counterargument Explanation and Goal-Aware Prompt Formulation NAACL2025

Quick read: This paper addresses logical fallacy detection, which remains difficult even for large language models (LLMs). It proposes a prompt formulation approach that applies in both supervised (fine-tuned) and unsupervised (zero-shot) settings. The key is to enrich the input text with implicit contextual information, namely counterarguments, explanations, and goals, to query the validity of each within the context of the argument, and to rank the queries by confidence score to inform classification. Evaluated on multiple datasets covering 29 fallacy types, the method improves substantially over state-of-the-art models, with F1 gains of up to 0.60 in zero-shot settings and up to 0.45 for fine-tuned models.

Link: https://arxiv.org/abs/2503.23363
Authors: Jiwon Jeong, Hyeju Jang, Hogun Park
Affiliations: Sungkyunkwan University; Indiana University Indianapolis
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to NAACL 2025 Findings

Abstract:The advancement of Large Language Models (LLMs) has greatly improved our ability to process complex language. However, accurately detecting logical fallacies remains a significant challenge. This study presents a novel and effective prompt formulation approach for logical fallacy detection, applicable in both supervised (fine-tuned) and unsupervised (zero-shot) settings. Our method enriches input text by incorporating implicit contextual information – counterarguments, explanations, and goals – which we query for validity within the context of the argument. We then rank these queries based on confidence scores to inform classification. We evaluate our approach across multiple datasets from 5 domains, covering 29 distinct fallacy types, using models from the GPT and LLaMA series. The results show substantial improvements over state-of-the-art models, with F1 score increases of up to 0.60 in zero-shot settings and up to 0.45 in fine-tuned models. Extensive analyses further illustrate why and how our method excels.

[NLP-73] Mixture of Routers

Quick read: This paper addresses flaws in the routing mechanism of Mixture-of-Experts (MoE) methods used to improve fine-tuning of large models, such as incorrect assignments and imbalanced expert allocation. It proposes Mixture of Routers (MoR), an efficient fine-tuning method that brings the Mixture-of-Experts idea into the routing mechanism itself: multiple sub-routers select experts jointly, and a learnable main router determines the weights of the sub-routers. MoR outperforms baseline models on most tasks, with an average improvement of 1%, and works as a plug-and-play, parameter-efficient fine-tuning method.

Link: https://arxiv.org/abs/2503.23362
Authors: Jia-Chen Zhang, Yu-Jie Xiong, Xi-He Qiu, Chun-Ming Xia, Fei Dai
Affiliations: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science; Fudan University, Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract:Supervised fine-tuning (SFT) is a milestone in aligning large language models with human instructions and adapting them to downstream tasks. In particular, Low-Rank Adaptation (LoRA) has gained widespread attention due to its parameter efficiency. However, its impact on improving the performance of large models remains limited. Recent studies suggest that combining LoRA with Mixture-of-Experts (MoE) can significantly enhance fine-tuning performance. MoE adapts to the diversity and complexity of datasets by dynamically selecting the most suitable experts, thereby improving task accuracy and efficiency. Despite impressive results, recent studies reveal issues in the MoE routing mechanism, such as incorrect assignments and imbalanced expert allocation. Inspired by the principles of Redundancy and Fault Tolerance Theory, we integrate the concept of Mixture of Experts into the routing mechanism and propose an efficient fine-tuning method called Mixture of Routers (MoR). It employs multiple sub-routers for joint selection and uses a learnable main router to determine the weights of the sub-routers. The results show that MoR outperforms baseline models on most tasks, achieving an average performance improvement of 1%. MoR can serve as a plug-and-play, parameter-efficient fine-tuning method suitable for a wide range of applications. Our code is available here: this https URL.
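
The routing idea, several sub-routers voting on experts with a main router weighting their votes, can be sketched as follows. How the paper actually combines the sub-router outputs is not stated in the abstract, so mixing softmax probabilities and then taking the top-k experts is an assumption, and all logits are passed in as plain lists rather than computed by learned layers.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_routers(sub_router_logits: List[List[float]],
                       main_router_logits: List[float],
                       top_k: int = 2) -> Tuple[List[float], List[int]]:
    # The main router turns its logits into one weight per sub-router.
    weights = softmax(main_router_logits)
    n_experts = len(sub_router_logits[0])
    combined = [0.0] * n_experts
    # Each sub-router proposes a distribution over experts; the proposals
    # are mixed using the main router's weights (assumed combination rule).
    for w, logits in zip(weights, sub_router_logits):
        for e, p in enumerate(softmax(logits)):
            combined[e] += w * p
    # Route the token to the top_k experts of the mixed distribution.
    chosen = sorted(range(n_experts),
                    key=lambda e: combined[e],
                    reverse=True)[:top_k]
    return combined, chosen
```

A single sub-router that misroutes a token is outvoted when the others agree, which is the redundancy argument the abstract makes.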

[NLP-74] Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base

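Quick read: This paper addresses the failure of large language models (LLMs) to retain factual knowledge faithfully, which leads to hallucinations and unreliable outputs. Because exhaustively evaluating a closed-weight model against a full-scale knowledge base is computationally prohibitive, it proposes stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies in closed-weight LLMs under a strict query budget.

SEA's key idea is to formulate error discovery as a stochastic optimization process: instead of naively probing every knowledge candidate, it iteratively retrieves new high-error candidates by their semantic similarity to previously observed failures. SEA also uses hierarchical retrieval at the document and paragraph levels and builds a relation directed acyclic graph (DAG) to model error propagation and identify systematic failure modes. Empirically, SEA uncovers more knowledge errors at a lower cost per error than existing methods, and human evaluation confirms the quality of the generated questions. The discovered errors reveal correlated failure patterns across LLM families and recurring deficits, which points to the need for better data coverage and targeted fine-tuning in future LLM development.

Link: https://arxiv.org/abs/2503.23361
Authors: Linxin Song, Xuwei Ding, Jieyu Zhang, Taiwei Shi, Ryotaro Shimizu, Rahul Gupta, Yang Liu, Jian Kang, Jieyu Zhao
Affiliations: University of Southern California; University of Wisconsin–Madison; University of Washington; ZOZO Research; Amazon, AGI; University of Rochester
Categories: Computation and Language (cs.CL)
Comments: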

Abstract:Large language models (LLMs) possess impressive linguistic capabilities but often fail to faithfully retain factual knowledge, leading to hallucinations and unreliable outputs. Understanding LLMs’ knowledge deficiencies by exhaustively evaluating against full-scale knowledge bases is computationally prohibitive, especially for closed-weight models. We propose stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies (errors) in closed-weight LLMs under a strict query budget. Rather than naively probing all knowledge candidates, SEA formulates error discovery as a stochastic optimization process: it iteratively retrieves new high-error candidates by leveraging the semantic similarity to previously observed failures. To further enhance search efficiency and coverage, SEA employs hierarchical retrieval across document and paragraph levels, and constructs a relation directed acyclic graph to model error propagation and identify systematic failure modes. Empirically, SEA uncovers 40.7x more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher, while reducing the cost-per-error by 599x and 9x, respectively. Human evaluation confirms the high quality of generated questions, while ablation and convergence analyses validate the contribution of each component in SEA. Further analysis on the discovered errors reveals correlated failure patterns across LLM families and recurring deficits, highlighting the need for better data coverage and targeted fine-tuning in future LLM development.
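
The core loop, spending a fixed query budget on candidates that resemble earlier failures, can be illustrated with a toy version. This sketch is a deterministic greedy simplification and an assumption on my part: the sampling step, the hierarchical document and paragraph retrieval, and the relation DAG from the abstract are all omitted, and `vectors` and `is_error` are hypothetical stand-ins for embeddings and for querying the model.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def error_ascent_sketch(vectors: Dict[str, Sequence[float]],
                        is_error: Callable[[str], bool],
                        seeds: List[str],
                        budget: int,
                        batch_size: int = 2) -> Tuple[List[str], List[str]]:
    queried: List[str] = []
    failures: List[str] = []

    def query(item: str) -> None:
        # One unit of budget: ask the model and record whether it failed.
        queried.append(item)
        if is_error(item):
            failures.append(item)

    for item in seeds[:budget]:
        query(item)
    while len(queried) < budget:
        pool = [i for i in vectors if i not in queried]
        if not pool or not failures:
            break
        # Rank unseen candidates by similarity to the closest known failure.
        pool.sort(key=lambda i: max(cosine(vectors[i], vectors[f])
                                    for f in failures),
                  reverse=True)
        for item in pool[:min(batch_size, budget - len(queried))]:
            query(item)
    return failures, queried
```

Compared with probing candidates in arbitrary order, the budget here is concentrated around regions where the model has already been seen to fail.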

[NLP-75] Not All LoRA Parameters Are Essential: Insights on Inference Necessity

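Quick read: This paper asks whether all fine-tuned Low-Rank Adaptation (LoRA) layers are necessary at inference time. Existing work focuses on reducing the number of fine-tuned parameters or optimizing the architecture, and leaves this question underexplored. The authors hypothesize that lower-layer LoRA modules play a more critical role in model reasoning and understanding. They propose a simple method that identifies a "boundary layer" separating the essential LoRA layers by analyzing a small set of validation samples, and drops all LoRA layers beyond this boundary during inference. Across three strong baselines and four widely used text generation datasets, the method yields consistent and significant improvements.

Link: https://arxiv.org/abs/2503.23360
Authors: Guanhua Chen, Yutong Yao, Ci-Jun Gao, Lidia S. Chao, Feng Wan, Derek F. Wong
Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Department of Electrical and Computer Engineering, University of Macau
Categories: Computation and Language (cs.CL)
Comments: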

Abstract:Current research on LoRA primarily focuses on minimizing the number of fine-tuned parameters or optimizing its architecture. However, the necessity of all fine-tuned LoRA layers during inference remains underexplored. In this paper, we investigate the contribution of each LoRA layer to the model’s ability to predict the ground truth and hypothesize that lower-layer LoRA modules play a more critical role in model reasoning and understanding. To address this, we propose a simple yet effective method to enhance the performance of large language models (LLMs) fine-tuned with LoRA. Specifically, we identify a “boundary layer” that distinguishes essential LoRA layers by analyzing a small set of validation samples. During inference, we drop all LoRA layers beyond this boundary. We evaluate our approach on three strong baselines across four widely-used text generation datasets. Our results demonstrate consistent and significant improvements, underscoring the effectiveness of selectively retaining critical LoRA layers during inference.
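
A toy version of the boundary-layer idea is sketched below: LoRA updates are kept for layers below a boundary and dropped above it, and the boundary is chosen on a few validation pairs. The scalar "layers" and the squared-error selection criterion are assumptions made for illustration; the abstract does not say how the boundary is scored.

```python
from typing import List, Tuple


def forward(x: float,
            base_weights: List[float],
            lora_deltas: List[float],
            boundary: int) -> float:
    # Each toy layer multiplies by a scalar weight. The LoRA update is
    # applied only to layers with index < boundary and dropped beyond it.
    for layer, (w, d) in enumerate(zip(base_weights, lora_deltas)):
        if layer < boundary:
            w = w + d
        x = w * x
    return x


def pick_boundary(val_pairs: List[Tuple[float, float]],
                  base_weights: List[float],
                  lora_deltas: List[float]) -> int:
    # Try every boundary (0 = no LoRA, n = all LoRA layers) and keep the
    # one with the lowest squared error on the validation samples.
    def loss(boundary: int) -> float:
        return sum((forward(x, base_weights, lora_deltas, boundary) - y) ** 2
                   for x, y in val_pairs)

    return min(range(len(base_weights) + 1), key=loss)
```

Setting the boundary to the number of layers reproduces ordinary LoRA inference, so the search can only match or improve the validation score of the full adapter.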

[NLP-76] A Scalable Framework for Evaluating Health Language Models

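Quick read: This paper addresses the inefficiency, high cost, and poor scalability of evaluating open-ended text responses from large language models (LLMs) in health applications. Current evaluation relies heavily on human experts, which introduces human factors and limits scale. The paper proposes Adaptive Precise Boolean rubrics, an evaluation framework that identifies gaps in model responses with a minimal set of targeted rubric questions. It builds on work that contrasts a smaller set of complex evaluation targets with a larger set of precise, granular targets answerable with simple boolean responses. Validated in metabolic health, the approach yields higher inter-rater agreement among expert and non-expert evaluators, and in automated assessment, than traditional Likert scales, while taking about half the evaluation time.

Link: https://arxiv.org/abs/2503.23339
Authors: Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, Javier L. Prieto, Daniel McDuff, Ahmed A. Metwally
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: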

Abstract:Large language models (LLMs) have emerged as powerful tools for analyzing complex datasets. Recent studies demonstrate their potential to generate useful, personalized responses when provided with patient-specific health information that encompasses lifestyle, biomarkers, and context. As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety. Current evaluation practices for open-ended text responses heavily rely on human experts. This approach introduces human factors and is often cost-prohibitive, labor-intensive, and hinders scalability, especially in complex domains like healthcare where response assessment necessitates domain expertise and considers multifaceted patient data. In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubrics questions. Our approach is based on recent work in more general evaluation settings that contrasts a smaller set of complex evaluation targets with a larger set of more precise, granular targets answerable with simple boolean responses. We validate this approach in metabolic health, a domain encompassing diabetes, cardiovascular disease, and obesity. Our results demonstrate that Adaptive Precise Boolean rubrics yield higher inter-rater agreement among expert and non-expert human evaluators, and in automated assessments, compared to traditional Likert scales, while requiring approximately half the evaluation time of Likert-based methods. This enhanced efficiency, particularly in automated evaluation and non-expert contributions, paves the way for more extensive and cost-effective evaluation of LLMs in health.

[NLP-77] Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics

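Quick read: This paper studies Multimodal Generative Recommendation (MGR): how to choose and use multiple modalities effectively within a generative recommendation framework. Most existing methods consider a single modality (usually text) and overlook both the multimodal nature of real-world data and its effect on generative recommendation models. The paper shows that these models are particularly sensitive to modality choice and face significant challenges when multiple modalities are available.

To address this, the paper introduces MGR-LF++, an enhanced late fusion framework that integrates multimodal information through contrastive modality alignment and special tokens that denote the different modalities. It improves performance by over 20% compared to single-modality alternatives.

Link: https://arxiv.org/abs/2503.23333
Authors: Jing Zhu, Mingxuan Ju, Yozen Liu, Danai Koutra, Neil Shah, Tong Zhao
Affiliations: University of Michigan, Ann Arbor; Snap
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: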

Abstract:Generative recommendation (GR) has become a powerful paradigm in recommendation systems that implicitly links modality and semantics to item representation, in contrast to previous methods that relied on non-semantic item identifiers in autoregressive models. However, previous research has predominantly treated modalities in isolation, typically assuming item content is unimodal (usually text). We argue that this is a significant limitation given the rich, multimodal nature of real-world data and the potential sensitivity of GR models to modality choices and usage. Our work aims to explore the critical problem of Multimodal Generative Recommendation (MGR), highlighting the importance of modality choices in GR frameworks. We reveal that GR models are particularly sensitive to different modalities and examine the challenges in achieving effective GR when multiple modalities are available. By evaluating design strategies for effectively leveraging multiple modalities, we identify key challenges and introduce MGR-LF++, an enhanced late fusion framework that employs contrastive modality alignment and special tokens to denote different modalities, achieving a performance improvement of over 20% compared to single-modality alternatives.

[NLP-78] SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science

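Quick read: Existing multi-stage automated data analytics pipelines rely on rigid, single-path workflows, which limits the exploration and integration of diverse strategies and often yields suboptimal predictions. This paper proposes SPIO (Sequential Plan Integration and Optimization), a framework that uses LLM-driven decision-making to orchestrate multi-agent planning across four modules: data preprocessing, feature engineering, modeling, and hyperparameter tuning. In each module, dedicated planning agents independently generate candidate strategies that cascade into the subsequent stages, and a plan optimization agent refines them by suggesting several optimized plans. Two variants are introduced: SPIO-S selects a single best solution path as determined by the LLM, and SPIO-E selects the top k candidate plans and ensembles them to maximize predictive performance. Experiments show that SPIO significantly outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2503.23314
Authors: Wonduk Seo, Juhyeon Lee, Yi Bu
Affiliations: Peking University; Enhans.ai; Peking University Chongqing Research Institute of Big Data
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Under Review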

Abstract:Large Language Models (LLMs) have revolutionized automated data analytics and machine learning by enabling dynamic reasoning and adaptability. While recent approaches have advanced multi-stage pipelines through multi-agent systems, they typically rely on rigid, single-path workflows that limit the exploration and integration of diverse strategies, often resulting in suboptimal predictions. To address these challenges, we propose SPIO (Sequential Plan Integration and Optimization), a novel framework that leverages LLM-driven decision-making to orchestrate multi-agent planning across four key modules: data preprocessing, feature engineering, modeling, and hyperparameter tuning. In each module, dedicated planning agents independently generate candidate strategies that cascade into subsequent stages, fostering comprehensive exploration. A plan optimization agent refines these strategies by suggesting several optimized plans. We further introduce two variants: SPIO-S, which selects a single best solution path as determined by the LLM, and SPIO-E, which selects the top k candidate plans and ensembles them to maximize predictive performance. Extensive experiments on Kaggle and OpenML datasets demonstrate that SPIO significantly outperforms state-of-the-art methods, providing a robust and scalable solution for automated data science tasks.

[NLP-79] Linguistic Loops and Geometric Invariants as a Way to Pre-Verbal Thought?

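Quick read: This paper introduces the concepts of linguistic transformation, linguistic loop, and semantic deficit, and uses Lie group theory and geometric techniques to define invariants that capture the structural properties of a whole linguistic loop. Its central aim is to apply tools from Lie theory and higher-dimensional geometry to language studies and to hint at a mathematical characterization of meta-linguistic or pre-verbal thought, i.e., the cognitive structures that precede language. The work opens a new line of research in language studies.

Link: https://arxiv.org/abs/2503.23311
Authors: Daniele Corradetti, Alessio Marrani
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages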

Abstract:In this work we introduce the concepts of linguistic transformation, linguistic loop and semantic deficit. By exploiting Lie group theoretical and geometric techniques, we define invariants that capture the structural properties of a whole linguistic loop. This result introduces a new line of research, employing tools from Lie theory and higher-dimensional geometry within language studies. But, even more intriguingly, our study hints at a mathematical characterization of meta-linguistic or pre-verbal thought, namely of those cognitive structures that precede language.

[NLP-80] Focus Directions Make Your Language Models Pay More Attention to Relevant Contexts

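Quick read: Long-context large language models (LLMs) are easily distracted by irrelevant context, and the reason remains poorly understood. This paper first identifies contextual heads, a special group of attention heads that control the overall attention of the LLM. It shows that distraction arises when contextual heads fail to allocate enough attention to relevant contexts, and that increasing attention to those contexts mitigates it. The paper further identifies focus directions, located in the key and query activations of these heads, which let them allocate more attention to relevant contexts without explicitly specifying which context is relevant. A comprehensive evaluation on long-context tasks finds that focus directions can help mitigate the poor task alignment of long-context LLMs.

Link: https://arxiv.org/abs/2503.23306
Authors: Youxiang Zhu, Ruochen Li, Danqing Wang, Daniel Haehn, Xiaohui Liang
Affiliations: University of Massachusetts Boston; Technische Universität München; Carnegie Mellon University
Categories: Computation and Language (cs.CL)
Comments: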

Abstract:Long-context large language models (LLMs) are prone to be distracted by irrelevant contexts. The reason for distraction remains poorly understood. In this paper, we first identify the contextual heads, a special group of attention heads that control the overall attention of the LLM. Then, we demonstrate that distraction arises when contextual heads fail to allocate sufficient attention to relevant contexts and can be mitigated by increasing attention to these contexts. We further identify focus directions, located at the key and query activations of these heads, which enable them to allocate more attention to relevant contexts without explicitly specifying which context is relevant. We comprehensively evaluate the effect of focus direction on various long-context tasks and find out focus directions could help to mitigate the poor task alignment of the long-context LLMs. We believe our findings could promote further research on long-context LLM alignment.

[NLP-81] Using Source-Side Confidence Estimation for Reliable Translation into Unfamiliar Languages ACL2025

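[Quick read]: This paper addresses the limited trustworthiness and explainability of target-language output in interactive machine translation systems. Confidence estimation in machine translation has traditionally focused on the target side, and source-side confidence has conventionally been obtained by projecting target word probabilities through word alignments. The paper proposes a direct, alignment-free approach that estimates source-side confidence by measuring how sensitive the target word probabilities are to changes in the source embeddings. Because it does not depend on word alignments, it can flag potential mistranslations more effectively. Experiments show that the method outperforms traditional alignment-based methods at detecting mistranslations.

Link: https://arxiv.org/abs/2503.23305
Authors: Kenneth J. Sible, David Chiang
Affiliations: University of Notre Dame
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures, 1 table. Submitted to ACL 2025 System Demonstrations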

Abstract:We present an interactive machine translation (MT) system designed for users who are not proficient in the target language. It aims to improve trustworthiness and explainability by identifying potentially mistranslated words and allowing the user to intervene to correct mistranslations. However, confidence estimation in machine translation has traditionally focused on the target side. Whereas the conventional approach to source-side confidence estimation would have been to project target word probabilities to the source side via word alignments, we propose a direct, alignment-free approach that measures how sensitive the target word probabilities are to changes in the source embeddings. Experimental results show that our method outperforms traditional alignment-based methods at detection of mistranslations.
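
The sensitivity idea can be sketched on a toy model. The snippet below is an illustration under strong assumptions, not the paper's system: the "translation model" is a hypothetical weighted pooling of source embeddings followed by a linear projection and softmax, and the gradient is estimated by central finite differences rather than backpropagation. It only shows how a per-source-token sensitivity score could be computed.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def target_probs(source_embs, weights, proj):
    """Toy model: weighted pooling of source embeddings, then projection."""
    dim = len(source_embs[0])
    pooled = [sum(w * e[d] for w, e in zip(weights, source_embs))
              for d in range(dim)]
    logits = [sum(row[d] * pooled[d] for d in range(dim)) for row in proj]
    return softmax(logits)

def sensitivity(source_embs, weights, proj, target_id, eps=1e-4):
    """Gradient norm of p(target word) with respect to each source embedding."""
    scores = []
    for i, emb in enumerate(source_embs):
        squared = 0.0
        for d in range(len(emb)):
            plus = [list(e) for e in source_embs]
            minus = [list(e) for e in source_embs]
            plus[i][d] += eps
            minus[i][d] -= eps
            grad = (target_probs(plus, weights, proj)[target_id]
                    - target_probs(minus, weights, proj)[target_id]) / (2 * eps)
            squared += grad * grad
        scores.append(math.sqrt(squared))
    return scores

# Hypothetical values chosen only to make the example run.
source_embs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [0.7, 0.2, 0.1]
proj = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

scores = sensitivity(source_embs, weights, proj, target_id=0)
print(scores)
```

Here the source token with the largest pooling weight gets the largest score. How such scores map to a confidence value is a design decision the abstract does not spell out.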

[NLP-82] Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions

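[Quick read]: This paper addresses sentiment analysis in Tamil-English code-mixed texts. It focuses on the challenges posed by grammatical inconsistencies, orthographic variations, and phonetic ambiguities, and examines the limitations of existing datasets and gaps in annotation, emphasizing the need for larger and more diverse corpora. To meet these challenges, it evaluates several transformer-based architectures, including XLM-RoBERTa, mT5, IndicBERT, and RemBERT, in low-resource, code-mixed settings. The core of the study is an analysis of performance metrics that shows how effective specific models are at multilingual sentiment classification. The authors conclude that further work on data augmentation, phonetic normalization, and hybrid modeling approaches is needed to improve accuracy.

Link: https://arxiv.org/abs/2503.23295
Authors: Mikhail Krasitskii, Olga Kolesnikova, Liliana Chanona Hernandez, Grigori Sidorov, Alexander Gelbukh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: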

Abstract:The sentiment analysis task in Tamil-English code-mixed texts has been explored using advanced transformer-based models. Challenges from grammatical inconsistencies, orthographic variations, and phonetic ambiguities have been addressed. The limitations of existing datasets and annotation gaps have been examined, emphasizing the need for larger and more diverse corpora. Transformer architectures, including XLM-RoBERTa, mT5, IndicBERT, and RemBERT, have been evaluated in low-resource, code-mixed environments. Performance metrics have been analyzed, highlighting the effectiveness of specific models in handling multilingual sentiment classification. The findings suggest that further advancements in data augmentation, phonetic normalization, and hybrid modeling approaches are required to enhance accuracy. Future research directions for improving sentiment analysis in code-mixed texts have been proposed.

[NLP-83] Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference DATE2025

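[Quick read]: This paper addresses the intolerable inference latency and high GPU memory usage caused by very long contexts. Existing methods optimize the key-value (KV) cache of LLMs with mixed-precision quantization at token granularity, but the search process is time-consuming and the computation is hardware-inefficient. The proposed solution, Cocktail, uses chunk-adaptive mixed-precision quantization to optimize the KV cache. It has two core modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search quickly determines the optimal bitwidth configuration of the KV cache chunks, based on similarity scores between the corresponding context chunks and the query, while maintaining model accuracy. Chunk-level KV cache computation reorders the cache chunks before quantization, avoiding the hardware inefficiency that mixed-precision quantization would otherwise cause. Experiments show that Cocktail outperforms existing KV cache quantization methods on various models and datasets.

Link: https://arxiv.org/abs/2503.23294
Authors: Wei Tao, Bin Zhang, Xiaoyang Qu, Jiguang Wan, Jianzong Wang
Affiliations: Huazhong University of Science and Technology; Ping An Technology (Shenzhen) Co., Ltd
Categories: Computation and Language (cs.CL)
Comments: Accepted by the Design, Automation, and Test in Europe 2025 (DATE 2025)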

Abstract:Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerable inference latency and GPU memory usage. Existing methods apply mixed-precision quantization to the key-value (KV) cache in LLMs based on token granularity, which is time-consuming in the search process and hardware-inefficient during computation. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks quickly based on the similarity scores between the corresponding context chunks and the query, maintaining the model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
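
The two modules can be illustrated with a rough sketch. Everything specific below is an assumption made for the example: cosine similarity as the similarity score, the thresholds `high` and `low`, and the bitwidths 16, 4 and 2. The paper's actual search procedure may differ. The sketch only shows the shape of the idea: assign a bitwidth per chunk from its similarity to the query, then reorder chunks so that chunks sharing a bitwidth are contiguous.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assign_bitwidths(chunk_embs, query_emb, high=0.8, low=0.3):
    """More similar chunks keep more bits (thresholds are hypothetical)."""
    bits = []
    for emb in chunk_embs:
        score = cosine(emb, query_emb)
        if score >= high:
            bits.append(16)
        elif score >= low:
            bits.append(4)
        else:
            bits.append(2)
    return bits

def reorder_by_bitwidth(bits):
    """Chunk indices grouped by bitwidth; the sort is stable within a group."""
    return sorted(range(len(bits)), key=lambda i: bits[i])

query_emb = [1.0, 0.0]
chunk_embs = [[1.0, 0.1], [0.0, 1.0], [0.5, 0.5], [0.9, 0.2]]

bits = assign_bitwidths(chunk_embs, query_emb)
order = reorder_by_bitwidth(bits)
print(bits)   # per-chunk bitwidths in original order
print(order)  # chunk indices after grouping by bitwidth
```

Grouping chunks by precision means each contiguous block can be processed with a single kernel configuration, which is the hardware motivation the abstract gives for reordering.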

[NLP-84] Extracting Patient History from Clinical Text: A Comparative Study of Clinical Large Language Models

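[Quick read]: This paper addresses the extraction of medical history entities (MHEs) related to a patient's chief complaint (CC), history of present illness (HPI), and past, family, and social history (PFSH) from free-text clinical notes, in order to structure them into standardized electronic health records (EHRs) and streamline downstream tasks such as continuity of care, medical coding, and quality metrics. The key solution is to use fine-tuned clinical large language models (cLLMs), deployed on premises to protect sensitive data. The study fine-tunes seven state-of-the-art cLLMs to recognize MHEs and assesses how integrating basic medical entities (BMEs) affects performance. It also analyzes the effect of note length, entity length, and segmentation, finding that longer entities are harder to identify and that well-organized segments with headings help extraction. Fine-tuned GatorTron and GatorTronS achieved the highest performance.

Link: https://arxiv.org/abs/2503.23281
Authors: Hieu Nghiem, Tuan-Dung Le, Suhao Chen, Thanh Thieu, Andrew Gin, Ellie Phuong Nguyen, Dursun Delen, Johnson Thomas, Jivan Lamichhane, Zhuqi Miao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: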

Abstract:Extracting medical history entities (MHEs) related to a patient’s chief complaint (CC), history of present illness (HPI), and past, family, and social history (PFSH) helps structure free-text clinical notes into standardized EHRs, streamlining downstream tasks like continuity of care, medical coding, and quality metrics. Fine-tuned clinical large language models (cLLMs) can assist in this process while ensuring the protection of sensitive data via on-premises deployment. This study evaluates the performance of cLLMs in recognizing CC/HPI/PFSH-related MHEs and examines how note characteristics impact model accuracy. We annotated 1,449 MHEs across 61 outpatient-related clinical notes from the MTSamples repository. To recognize these entities, we fine-tuned seven state-of-the-art cLLMs. Additionally, we assessed the models’ performance when enhanced by integrating problems, tests, treatments, and other basic medical entities (BMEs). We compared the performance of these models against GPT-4o in a zero-shot setting. To further understand the textual characteristics affecting model accuracy, we conducted an error analysis focused on note length, entity length, and segmentation. The cLLMs showed potential in reducing the time required for extracting MHEs by over 20%. However, detecting many types of MHEs remained challenging due to their polysemous nature and the frequent involvement of non-medical vocabulary. Fine-tuned GatorTron and GatorTronS, two of the most extensively trained cLLMs, demonstrated the highest performance. Integrating pre-identified BME information improved model performance for certain entities. Regarding the impact of textual characteristics on model performance, we found that longer entities were harder to identify, note length did not correlate with a higher error rate, and well-organized segments with headings were beneficial for the extraction.

[NLP-85] PromptDistill: Query-based Selective Token Retention in Intermediate Layers for Efficient Large Language Model Inference

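[Quick read]: This paper addresses the computational and memory costs that become a major bottleneck at inference time as large language models (LLMs) handle more complex tasks and longer documents. It proposes PromptDistill, a training-free method that improves inference efficiency while preserving generation quality. The key idea is to use attention interactions in early layers to identify and retain the most informative tokens, preserving their hidden states while reducing computation in later layers, so that the model can focus on essential context without fully processing every token. Unlike H2O and SnapKV, which compress only after the entire input has been processed, or GemFilter, which selects a fixed portion of the initial prompt without considering contextual dependencies, PromptDistill dynamically allocates computation to the most relevant tokens while maintaining global awareness of the input. Experiments show that PromptDistill significantly improves efficiency while having minimal impact on output quality.

Link: https://arxiv.org/abs/2503.23274
Authors: Weisheng Jin, Maojia Song, Tej Deep Pala, Yew Ken Chia, Amir Zadeh, Chuan Li, Soujanya Poria
Affiliations: Singapore University of Technology and Design; Lambda Labs
Categories: Computation and Language (cs.CL)
Comments: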

Abstract:As large language models (LLMs) tackle increasingly complex tasks and longer documents, their computational and memory costs during inference become a major bottleneck. To address this, we propose PromptDistill, a novel, training-free method that improves inference efficiency while preserving generation quality. PromptDistill identifies and retains the most informative tokens by leveraging attention interactions in early layers, preserving their hidden states while reducing the computational burden in later layers. This allows the model to focus on essential contextual information without fully processing all tokens. Unlike previous methods such as H2O and SnapKV, which perform compression only after processing the entire input, or GemFilter, which selects a fixed portion of the initial prompt without considering contextual dependencies, PromptDistill dynamically allocates computational resources to the most relevant tokens while maintaining a global awareness of the input. Experiments using our method and baseline approaches with base models such as LLaMA 3.1 8B Instruct, Phi 3.5 Mini Instruct, and Qwen2 7B Instruct on benchmarks including LongBench, InfBench, and Needle in a Haystack demonstrate that PromptDistill significantly improves efficiency while having minimal impact on output quality compared to the original models. With a single-stage selection strategy, PromptDistill effectively balances performance and efficiency, outperforming prior methods like GemFilter, H2O, and SnapKV due to its superior ability to retain essential information. Specifically, compared to GemFilter, PromptDistill achieves an overall 1% to 5% performance improvement while also offering better time efficiency. Additionally, we explore multi-stage selection, which further improves efficiency while maintaining strong generation performance.
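
A simplified picture of query-based token retention is sketched below. It assumes that, at some selection layer, we already have the attention scores from the final prompt token to every token, and it keeps the top `k` tokens in their original order. The function names and the top-k rule are illustrative choices, not taken from the paper.

```python
def select_tokens(attn_from_last_token, k):
    """Indices of the k most-attended tokens, restored to sequence order."""
    ranked = sorted(range(len(attn_from_last_token)),
                    key=lambda i: attn_from_last_token[i],
                    reverse=True)
    return sorted(ranked[:k])

def retain(hidden_states, keep):
    """Keep only the hidden states of the selected tokens for later layers."""
    return [hidden_states[i] for i in keep]

# Toy attention scores over five prompt tokens and one hidden vector each.
attn = [0.05, 0.40, 0.10, 0.30, 0.15]
hidden = [[0.0], [1.0], [2.0], [3.0], [4.0]]

keep = select_tokens(attn, k=2)
reduced = retain(hidden, keep)
print(keep, reduced)
```

Later layers would then run on `reduced` only, which is where the compute saving comes from. Keeping the selected tokens in sequence order preserves their relative positions.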

[NLP-86] Evaluating how LLM annotations represent diverse views on contentious topics

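[Quick read]: This paper asks whether large language models (LLMs), when used to label data, under-represent the views of particular groups. It studies how well LLMs represent diverse viewpoints on contentious tasks and whether their labels show bias related to annotator demographics. Across four annotation tasks on four datasets, the authors find that LLM agreement with annotators depends far more on the model, the prompt, and the level of disagreement among human annotators than on annotator demographics. This suggests that under-representing the views of particular groups is not a substantial concern when using LLMs to annotate data. The paper closes with a discussion of the implications for researchers and practitioners.

Link: https://arxiv.org/abs/2503.23243
Authors: Megan A. Brown, Shubham Atreja, Libby Hemphill, Patrick Y. Wu
Affiliations: School of Information, University of Michigan; Department of Computer Science, American University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: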

Abstract:Researchers have proposed the use of generative large language models (LLMs) to label data for both research and applied settings. This literature emphasizes the improved performance of LLMs relative to other natural language models, noting that LLMs typically outperform other models on standard metrics such as accuracy, precision, recall, and F1 score. However, previous literature has also highlighted the bias embedded in language models, particularly around contentious topics such as potentially toxic content. This bias could result in labels applied by LLMs that disproportionately align with majority groups over a more diverse set of viewpoints. In this paper, we evaluate how LLMs represent diverse viewpoints on these contentious tasks. Across four annotation tasks on four datasets, we show that LLMs do not show substantial disagreement with annotators on the basis of demographics. Instead, the model, prompt, and disagreement between human annotators on the labeling task are far more predictive of LLM agreement. Our findings suggest that when using LLMs to annotate data, under-representing the views of particular groups is not a substantial concern. We conclude with a discussion of the implications for researchers and practitioners.

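[NLP-87] Beyond speculation: Measuring the growing presence of LLM-generated texts in multilingual disinformation

[Quick read]: This paper addresses the divided debate over the role of multilingual text generated by large language models (LLMs) in potential disinformation misuse. Its contribution is to provide the first empirical evidence of LLM presence in recent real-world disinformation datasets, documenting the growth of machine-generated content after ChatGPT's release and revealing key patterns across languages, platforms, and time periods.

Link: https://arxiv.org/abs/2503.23242
Authors: Dominik Macko, Aashish Anantha Ramakrishnan, Jason Samuel Lucas, Robert Moro, Ivan Srba, Adaku Uchendu, Dongwon Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: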

Abstract:Increased sophistication of large language models (LLMs) and the consequent quality of generated multilingual text raise concerns about potential disinformation misuse. While humans struggle to distinguish LLM-generated content from human-written texts, the scholarly debate about their impact remains divided. Some argue that heightened fears are overblown due to natural ecosystem limitations, while others contend that specific “longtail” contexts face overlooked risks. Our study bridges this debate by providing the first empirical evidence of LLM presence in the latest real-world disinformation datasets, documenting the increase of machine-generated content following ChatGPT’s release, and revealing crucial patterns across languages, platforms, and time periods.

[NLP-88] Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance

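[Quick read]: This paper addresses two main problems with contrastive learning, the dominant training paradigm in information retrieval (IR): (1) every document not explicitly annotated as relevant is treated as an equally negative example, which ignores differences in actual relevance and hurts ranking; and (2) the objective is sensitive to annotation noise. To overcome these limitations, the paper forgoes real training documents and annotations altogether and uses open-source large language models (LLMs) to directly generate synthetic documents at several levels of relevance. This ranking context of graduated relevance is combined with an appropriate list-wise loss (Wasserstein distance) to better capture the ranking task. The key is to build finer-grained relevance levels from synthetic data and optimize with a loss suited to whole lists, which markedly improves dense retriever performance and makes it more robust to distribution shift.

Link: https://arxiv.org/abs/2503.23239
Authors: Reza Esfandiarpoor, George Zerveas, Ruochen Zhang, Macton Mgonzo, Carsten Eickhoff, Stephen H. Bach
Affiliations: Brown University; Microsoft; University of Tübingen
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code: this https URL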

Abstract:Recent advancements in large language models (LLMs) have allowed the augmentation of information retrieval (IR) pipelines with synthetic data in various ways. Yet, the main training paradigm remains: contrastive learning with binary relevance labels and the InfoNCE loss, where one positive document is compared against one or more negatives. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus (a) missing subtle nuances that are useful for ranking and (b) being susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and use open-source LLMs to directly generate synthetic documents that answer real user queries according to several different levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents of the same dataset, while being more robust to distribution shift and clearly outperforming it when evaluated zero-shot on the BEIR dataset collection.
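
The limitation the abstract attributes to InfoNCE can be checked with a few lines of Python. The sketch below implements the standard InfoNCE loss for one positive and several negatives (the scores are made up for illustration) and shows that the loss does not change when two negatives with different true relevance swap scores. It does not implement the paper's Wasserstein-based list-wise loss.

```python
import math

def info_nce(pos_score, neg_scores, temperature=1.0):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Suppose negative A is partly relevant and negative B is off-topic.
# Case 1: the model scores A high and B low (sensible ranking).
loss_sensible = info_nce(pos_score=2.0, neg_scores=[1.5, 0.1])
# Case 2: the model scores B high and A low (poor ranking of the negatives).
loss_swapped = info_nce(pos_score=2.0, neg_scores=[0.1, 1.5])

print(loss_sensible, loss_swapped)
```

Both calls return the same value up to floating-point rounding, because the objective sees negatives only as an unordered set. That is the motivation for training on graded relevance with a list-wise loss.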

[NLP-89] RECALL-MM: A Multimodal Dataset of Consumer Product Recalls for Risk Analysis using Computational Methods and Large Language Models

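[Quick read]: This paper addresses the underuse of product recall data as a source of insight into risks and hazards in the engineering design process. The key solution is a multimodal dataset, RECALL-MM, built from historical information in the United States Consumer Product Safety Commission (CPSC) recalls database and augmented with generative methods to support data-driven risk assessment. Interactive clustering maps embed all recalls into a shared latent space, and these visualization tools, together with a large language model (LLM), are used to identify product risks and guide design decisions. The approach emphasizes extracting patterns from historical recall data and using visual and textual information to predict potential hazards, with the aim of encouraging safer design practice. The study also notes that hazard prediction remains challenging in some cases, which underscores the importance of risk awareness throughout the design process.

Link: https://arxiv.org/abs/2503.23213
Authors: Diana Bolanos, Mohammadmehdi Ataei, Daniele Grandi, Kosa Goucher-Lambert
Affiliations: Department of Mechanical Engineering, University of California; Autodesk Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: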

Abstract:Product recalls provide valuable insights into potential risks and hazards within the engineering design process, yet their full potential remains underutilized. In this study, we curate data from the United States Consumer Product Safety Commission (CPSC) recalls database to develop a multimodal dataset, RECALL-MM, that informs data-driven risk assessment using historical information, and augment it using generative methods. Patterns in the dataset highlight specific areas where improved safety measures could have significant impact. We extend our analysis by demonstrating interactive clustering maps that embed all recalls into a shared latent space based on recall descriptions and product names. Leveraging these data-driven tools, we explore three case studies to demonstrate the dataset’s utility in identifying product risks and guiding safer design decisions. The first two case studies illustrate how designers can visualize patterns across recalled products and situate new product ideas within the broader recall landscape to proactively anticipate hazards. In the third case study, we extend our approach by employing a large language model (LLM) to predict potential hazards based solely on product images. This demonstrates the model’s ability to leverage visual context to identify risk factors, revealing strong alignment with historical recall data across many hazard categories. However, the analysis also highlights areas where hazard prediction remains challenging, underscoring the importance of risk awareness throughout the design process. Collectively, this work aims to bridge the gap between historical recall data and future product safety, presenting a scalable, data-driven approach to safer engineering design.

[NLP-90] Enhancing Knowledge Graph Completion with Entity Neighborhood and Relation Context

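[Quick read]: This paper addresses two main challenges in knowledge graph completion (KGC): traditional structure-based methods are computationally demanding and scale poorly, while text-based methods ease these problems but fail to fully exploit contextual information, in particular the context of the relation. It proposes KGC-ERC, a framework that enriches the input of generative language models with both entity and relation context, and introduces a sampling strategy that selects relevant context within the input token limit to make better use of that context and improve model performance. Experiments show that KGC-ERC outperforms or matches state-of-the-art baselines in predictive performance and scalability.

Link: https://arxiv.org/abs/2503.23205
Authors: Jianfang Chen, Kai Zhang, Aoran Gan, Shiwei Tong, Shuanghong Shen, Qi Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: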

Abstract:Knowledge Graph Completion (KGC) aims to infer missing information in Knowledge Graphs (KGs) to address their inherent incompleteness. Traditional structure-based KGC methods, while effective, face significant computational demands and scalability challenges due to the need for dense embedding learning and scoring all entities in the KG for each prediction. Recent text-based approaches using language models like T5 and BERT have mitigated these issues by converting KG triples into text for reasoning. However, they often fail to fully utilize contextual information, focusing mainly on the neighborhood of the entity and neglecting the context of the relation. To address this issue, we propose KGC-ERC, a framework that integrates both types of context to enrich the input of generative language models and enhance their reasoning capabilities. Additionally, we introduce a sampling strategy to effectively select relevant context within input token constraints, which optimizes the utilization of contextual information and potentially improves model performance. Experiments on the Wikidata5M, Wiki27K, and FB15K-237-N datasets show that KGC-ERC outperforms or matches state-of-the-art baselines in predictive performance and scalability.
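
One way to picture combining entity-neighborhood and relation context under a token budget is sketched below. The prompt layout, the alternating order, the whitespace token count, and the name `build_input` are all assumptions made for illustration. The abstract does not describe KGC-ERC's actual sampling strategy, so this greedy fill merely stands in for it.

```python
def build_input(head, relation, neighbor_triples, relation_examples,
                max_tokens=32):
    """Build a text query plus as much context as fits in the budget."""
    parts = [f"query: {head} | {relation}"]
    used = len(parts[0].split())

    # Interleave the two context types so both are represented early.
    candidates = []
    for i in range(max(len(neighbor_triples), len(relation_examples))):
        if i < len(neighbor_triples):
            h, r, t = neighbor_triples[i]
            candidates.append(f"neighbor: {h} | {r} | {t}")
        if i < len(relation_examples):
            h, r, t = relation_examples[i]
            candidates.append(f"relation: {h} | {r} | {t}")

    for text in candidates:
        cost = len(text.split())  # crude whitespace token count
        if used + cost > max_tokens:
            break
        parts.append(text)
        used += cost
    return " ; ".join(parts)

neighbors = [("Paris", "located in", "France")]
examples = [("Berlin", "capital of", "Germany")]

full_prompt = build_input("Paris", "capital of", neighbors, examples)
tight_prompt = build_input("Paris", "capital of", neighbors, examples,
                           max_tokens=12)
print(full_prompt)
print(tight_prompt)
```

With the default budget both kinds of context fit. With `max_tokens=12` only the neighbor triple does, which shows why the choice of what to sample first matters.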

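[NLP-91] The Challenge of Achieving Attributability in Multilingual Table-to-Text Generation with Question-Answer Blueprints

[Quick read]: This paper addresses the challenges of multilingual table-to-text natural language generation (NLG) for low-resource languages, in particular making model outputs more attributable to the source table. It explores whether intermediate planning with question-answer (QA) blueprints improves attributability in the multilingual setting. To this end, the author extends TaTA, a multilingual table-to-text dataset that includes African languages, with QA blueprints, and fine-tunes sequence-to-sequence language models with and without them. QA blueprints improve performance for models fine-tuned and evaluated only on English examples but bring no gains in the multilingual setting, mainly because the blueprints are machine-translated inaccurately and the models do not rely closely on the blueprints they generate.

Link: https://arxiv.org/abs/2503.23204
Authors: Aden Haussmann
Affiliations: The University of Edinburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: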

Abstract:Multilingual Natural Language Generation (NLG) is challenging due to the lack of training data for low-resource languages. However, some low-resource languages have up to tens of millions of speakers globally, making it important to improve NLG tools for them. Table-to-Text NLG is an excellent measure of models’ reasoning abilities but is very challenging in the multilingual setting. System outputs are often not attributable, or faithful, to the data in the source table. Intermediate planning techniques like Question-Answer (QA) blueprints have been shown to improve attributability on summarisation tasks. This work explores whether QA blueprints make multilingual Table-to-Text outputs more attributable to the input tables. This paper extends the challenging multilingual Table-to-Text dataset, TaTA, which includes African languages, with QA blueprints. Sequence-to-sequence language models are then finetuned on this dataset, with and without blueprints. Results show that QA blueprints improve performance for models finetuned and evaluated only on English examples, but do not demonstrate gains in the multilingual setting. This is due to inaccuracies in machine translating the blueprints from English into target languages when generating the training data, and models failing to rely closely on the blueprints they generate. An in-depth analysis is conducted on why this is challenging.

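[NLP-92] TRA: Better Length Generalisation with Threshold Relative Attention

[Quick read]: This paper addresses the limited length generalisation of Transformers, which perform poorly even on basic tasks. It attributes this limitation to two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position: even when the dot product between a key and a query marks the key as irrelevant, learned positional biases may unintentionally up-weight it, which is especially dangerous when distances fall outside the training distribution. The proposed remedy combines (a) selective sparsity, which removes irrelevant keys from the attention softmax entirely, and (b) contextualised relative distance, which measures distance only between the query and the keys that matter. Refactoring the attention mechanism with these two mitigations substantially improves the length generalisation of decoder-only Transformers.

Link: https://arxiv.org/abs/2503.23174
Authors: Mattia Opper, Roland Fernandez, Paul Smolensky, Jianfeng Gao
Affiliations: University of Edinburgh; Microsoft Research; Johns Hopkins University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: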

Abstract:Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position: even if the dot product between a key and query is highly negative (i.e. an irrelevant key), learned positional biases may unintentionally up-weight such information, which is dangerous when distances become out of distribution. Put together, these two failure cases lead to compounding generalisation difficulties. We test whether they can be mitigated through the combination of a) selective sparsity, which completely removes irrelevant keys from the attention softmax, and b) contextualised relative distance, in which distance is only considered between the query and the keys that matter. We show how refactoring the attention mechanism with these two mitigations in place can substantially improve the generalisation capabilities of decoder-only transformers.
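
The two mitigations can be sketched for a single query in plain Python. This is a toy reading of the abstract, not the paper's formulation: the fixed `threshold`, the linear distance penalty with `slope`, and the assumption that the query sits after all keys are illustrative choices.

```python
import math

def threshold_relative_attention(query, keys, threshold=0.0, slope=0.5):
    """Attention over keys with thresholding and distance among kept keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]

    # (a) Selective sparsity: keys below the threshold leave the softmax.
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    weights = [0.0] * len(keys)
    if not kept:
        return weights

    # (b) Contextualised relative distance: count distance over kept keys
    # only, so the nearest kept key is at distance 0, the next at 1, etc.
    distance = {idx: rank for rank, idx in enumerate(reversed(kept))}

    logits = [scores[i] - slope * distance[i] for i in kept]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    for i, e in zip(kept, exps):
        weights[i] = e / total
    return weights

query = [1.0, 0.0]
keys = [[1.0, 0.0], [-1.0, 0.0], [-2.0, 1.0], [1.0, 0.0]]
weights = threshold_relative_attention(query, keys)
print(weights)
```

Keys 1 and 2 receive exactly zero weight, and key 0 is treated as one step away from the query rather than three, because the two discarded keys no longer count toward its distance.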

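[NLP-93] The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling

[Quick read]: This paper addresses the complex and understudied interplay between phonetic realization and semantics, specifically in pitch realization for the tones of Mandarin disyllabic words. It studies disyllabic words from a corpus of spontaneous Taiwan Mandarin speech, covering all 20 possible two-tone combinations. The approach combines Generalized Additive Mixed Models (GAMs), used to analyze the factors that shape f0 contours, with contextualized embeddings extracted using the GPT-2 large language model, which capture token-specific meaning in context. The study finds that word and sense predict pitch contours more strongly than tone pattern does, indicating that meaning in context and phonetic realization are far more entangled than standard linguistic theory predicts.

Link: https://arxiv.org/abs/2503.23163
Authors: Yuxin Lu, Yu-Ying Chuang, R. Harald Baayen
Affiliations: Quantitative Linguistics, Eberhard Karls Universität Tübingen, Germany; Department of Taiwan Culture, Languages and Literature, National Taiwan Normal University, Taipei, Taiwan
Categories: Computation and Language (cs.CL)
Comments: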

Abstract:A growing body of literature has demonstrated that semantics can co-determine fine phonetic detail. However, the complex interplay between phonetic realization and semantics remains understudied, particularly in pitch realization. The current study investigates the tonal realization of Mandarin disyllabic words with all 20 possible combinations of two tones, as found in a corpus of Taiwan Mandarin spontaneous speech. We made use of Generalized Additive Mixed Models (GAMs) to model f0 contours as a function of a series of predictors, including gender, tonal context, tone pattern, speech rate, word position, bigram probability, speaker and word. In the GAM analysis, word and sense emerged as crucial predictors of f0 contours, with effect sizes that exceed those of tone pattern. For each word token in our dataset, we then obtained a contextualized embedding by applying the GPT-2 large language model to the context of that token in the corpus. We show that the pitch contours of word tokens can be predicted to a considerable extent from these contextualized embeddings, which approximate token-specific meanings in contexts of use. The results of our corpus study show that meaning in context and phonetic realization are far more entangled than standard linguistic theory predicts.

[NLP-94] CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

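[Quick read]: This paper addresses inductive program synthesis with large language models (LLMs), and in particular the lack of realistic interactive feedback in existing evaluations. Conventional protocols rely on static example sets and held-out tests, give no feedback when a synthesized function is wrong, and fail to reflect real-world scenarios such as reverse engineering. The paper proposes CodeARC (Code Abstraction and Reasoning Challenge), an evaluation framework in which agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. The key is this interactive setting, which lets agents make function calls and self-correct based on feedback. On a benchmark of 1114 general-purpose inductive program synthesis tasks, o3-mini achieves the highest success rate (52.7%), and fine-tuning LLaMA-3.1-8B-Instruct yields up to a 31% relative performance gain. CodeARC thus provides a more realistic and more challenging testbed for evaluating LLM-based program synthesis and inductive reasoning.

Link: https://arxiv.org/abs/2503.23145
Authors: Anjiang Wei, Tarun Suresh, Jiannan Cao, Naveen Kannan, Yuheng Wu, Kai Yan, Thiago S. F. X. Teixeira, Ke Wang, Alex Aiken
Affiliations: Stanford University; University of Illinois Urbana-Champaign; MIT; Intel; Visa Research
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: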

Abstract:Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning.
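
The interaction loop described in the abstract can be sketched as follows. This is a toy stand-in, not the CodeARC harness: `toy_propose` replaces the LLM agent with a lookup over two hard-coded hypotheses, and the oracle compares outputs on a fixed list of probe inputs. All names are hypothetical.

```python
def differential_test(hidden, candidate, inputs):
    """Return an input on which the two functions disagree, or None."""
    for x in inputs:
        if hidden(x) != candidate(x):
            return x
    return None

def run_episode(hidden, propose, probe_inputs, budget=5):
    """Query, synthesize, test, and refine until the oracle finds no mismatch."""
    observations = [(x, hidden(x)) for x in probe_inputs[:2]]
    for attempt in range(1, budget + 1):
        candidate = propose(observations)
        counterexample = differential_test(hidden, candidate, probe_inputs)
        if counterexample is None:
            return candidate, attempt
        # Feed the counterexample back so the next proposal can use it.
        observations.append((counterexample, hidden(counterexample)))
    return None, budget

# Stand-in for an LLM agent: first hypothesis consistent with all observations.
HYPOTHESES = [lambda x: 2 * x, lambda x: x * x]

def toy_propose(observations):
    for f in HYPOTHESES:
        if all(f(x) == y for x, y in observations):
            return f
    return HYPOTHESES[-1]

solution, attempts = run_episode(lambda x: x * x, toy_propose,
                                 probe_inputs=[0, 2, 3])
print(attempts, solution(5))
```

The first proposal (`2 * x`) matches the two initial observations but fails on input 3. The counterexample is added to the observations and the second proposal passes, which is the self-correction behaviour the benchmark is designed to measure.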

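[NLP-95] When YES Meets BUT: Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

[Quick read]: This paper addresses the significant difficulty large vision-language models (VLMs) have in understanding humor built on complex, contradictory narratives, especially where comparative reasoning is required. It introduces YesBut (V2), a new benchmark of 1,262 comic images from diverse multilingual and multicultural contexts, with comprehensive annotations covering several aspects of narrative understanding. A systematic evaluation on four complementary tasks, ranging from surface content comprehension to deep narrative reasoning and emphasizing comparative reasoning between contradictory elements, shows that even the most advanced models fall well short of human performance, with common failures in visual perception, key element identification, comparative analysis, and hallucination. The paper also explores text-based training strategies and social knowledge augmentation to improve model performance, pointing toward context-aware models capable of deeper narrative understanding.

Link: https://arxiv.org/abs/2503.23137
Authors: Tuo Liang, Zhe Hu, Jing Li, Hao Zhang, Yiren Lu, Yunlai Zhou, Yiran Qiao, Disheng Liu, Jeirui Peng, Jing Ma, Yu Yin
Affiliations: Computer and Data Sciences Department, Case Western Reserve University; Department of Computing, The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: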

Abstract:Understanding humor, particularly when it involves complex, contradictory narratives that require comparative reasoning, remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI’s ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradictions. We introduce YesBut (V2), a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs through four complementary tasks spanning from surface content comprehension to deep narrative reasoning, with particular emphasis on comparative reasoning between contradictory elements. Our extensive experiments reveal that even the most advanced models significantly underperform compared to humans, with common failures in visual perception, key element identification, comparative analysis, and hallucinations. We further investigate text-based training strategies and social knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs’ understanding of cultural and creative expressions but also provide pathways toward developing context-aware models capable of deeper narrative understanding through comparative reasoning.

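[NLP-96] Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery

[Quick read]: This paper evaluates the dialogue capabilities of DeepSeek-V3 in robotic surgery scenarios, on tasks including Single Phrase QA, Visual QA, and Detailed Description. The evaluation uses publicly available datasets such as EndoVis18 and CholecT50, together with their corresponding dialogue data. With specific prompts, DeepSeek-V3 performs well at recognizing surgical instruments and tissue, but it shows significant limitations in spatial position analysis and struggles to understand surgical actions accurately. Under general prompts, it cannot effectively analyze global surgical concepts or provide detailed insight into surgical scenes. The paper concludes that DeepSeek-V3 needs fine-tuning on surgery-specific datasets before it is ready for vision-language tasks in surgical contexts.

Link: https://arxiv.org/abs/2503.23130
Authors: Boyi Ma, Yanguang Zhao, Jie Wang, Guankun Wang, Kun Yuan, Tong Chen, Long Bai, Hongliang Ren
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: Technical Report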

Abstract:DeepSeek-V3, a recently emerging Large Language Model (LLM), demonstrates outstanding performance in general scene understanding, question-answering (QA), and text generation tasks, owing to its efficient training paradigm and strong reasoning capabilities. In this study, we investigate the dialogue capabilities of DeepSeek-V3 in robotic surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and Detailed Description. The Single Phrase QA tasks further include sub-tasks such as surgical instrument recognition, action understanding, and spatial position analysis. We conduct extensive evaluations using publicly available datasets, including EndoVis18 and CholecT50, along with their corresponding dialogue data. Our comprehensive evaluation results indicate that, when provided with specific prompts, DeepSeek-V3 performs well in surgical instrument and tissue recognition tasks. However, DeepSeek-V3 exhibits significant limitations in spatial position analysis and struggles to understand surgical actions accurately. Additionally, our findings reveal that, under general prompts, DeepSeek-V3 lacks the ability to effectively analyze global surgical concepts and fails to provide detailed insights into surgical scenarios. Based on our observations, we argue that DeepSeek-V3 is not ready for vision-language tasks in surgical contexts without fine-tuning on surgery-specific datasets.

[NLP-97] A large-scale image-text dataset benchmark for farmland segmentation

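【Quick read】: This paper addresses the limitations of the traditional deep learning paradigm in representing the spatial relationships between farmland elements and their surroundings, and in modeling the dynamic temporal evolution and spatial heterogeneity of farmland. To meet these challenges, it proposes a language-driven learning paradigm that uses language as a structured knowledge carrier to explicitly express the spatiotemporal characteristics of farmland. To fill the gap in benchmark datasets for farmland remote sensing imagery, it introduces language-based descriptions of farmland and develops FarmSeg-VL, the first fine-grained image-text dataset built for farmland across space and time. The key to the solution is a semi-automatic annotation method that accurately assigns a caption to each image, ensuring data quality and semantic richness while making dataset construction more efficient. A performance analysis shows the potential of FarmSeg-VL as a standard benchmark for farmland segmentation.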

Link: https://arxiv.org/abs/2503.23106
Authors: Chao Tao, Dandan Zhong, Weiliang Mu, Zhuofei Du, Haiyang Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes:

Abstract:The traditional deep learning paradigm that solely relies on labeled data has limitations in representing the spatial relationships between farmland elements and the surrounding [...] struggles to effectively model the dynamic temporal evolution and spatial heterogeneity of farmland. Language, as a structured knowledge carrier, can explicitly express the spatiotemporal characteristics of farmland, such as its shape, distribution, and surrounding environmental [...], a language-driven learning paradigm can effectively alleviate the challenges posed by the spatiotemporal heterogeneity of [...], in the field of remote sensing imagery of farmland, there is currently no comprehensive benchmark dataset to support this research [...]. To fill this gap, we introduced language-based descriptions of farmland and developed the FarmSeg-VL dataset, the first fine-grained image-text dataset designed for spatiotemporal farmland [...], this article proposed a semi-automatic annotation method that can accurately assign a caption to each image, ensuring high data quality and semantic richness while improving the efficiency of dataset [...], FarmSeg-VL exhibits significant spatiotemporal [...]. In terms of the temporal dimension, it covers all four [...]. In terms of the spatial dimension, it covers eight typical agricultural regions across [...]. In addition, in terms of captions, FarmSeg-VL covers rich spatiotemporal characteristics of farmland, including its inherent properties, phenological characteristics, spatial distribution, topographic and geomorphic features, and the distribution of surrounding [...], we present a performance analysis of VLMs and of deep learning models that rely solely on labels, trained on FarmSeg-VL, demonstrating its potential as a standard benchmark for farmland segmentation.

[NLP-98] Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models

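【Quick read】: This paper addresses the excessive memory usage and communication overhead that conventional Mixture of Experts (MoE) architectures incur during training and inference as the number of expert modules grows. It proposes Mixture of Latent Experts (MoLE), a new parameterization that maps specific experts into a shared latent space. All expert operations are decomposed into two parts: a shared projection down to a lower-dimensional latent space, followed by expert-specific transformations with far fewer parameters. This factorization substantially reduces parameter count and compute. The paper also builds a rigorous mathematical framework for converting pre-trained MoE models into the MoLE architecture and develops a systematic two-phase algorithm for the conversion. Theoretical analysis indicates that MoLE improves computational efficiency along several dimensions while preserving representational capacity, and experiments confirm that it performs comparably to standard MoE with substantially lower resource consumption.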

Link: https://arxiv.org/abs/2503.23100
Authors: Zehua Liu, Han Wu, Ruifeng She, Xiaojin Fu, Xiongwei Han, Tao Zhong, Mingxuan Yuan
Affiliations: Huawei Noah’s Ark Lab
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:

Abstract:Mixture of Experts (MoE) has emerged as a pivotal architectural paradigm for efficient scaling of Large Language Models (LLMs), operating through selective activation of parameter subsets for each input token. Nevertheless, conventional MoE architectures encounter substantial challenges, including excessive memory utilization and communication overhead during training and inference, primarily attributable to the proliferation of expert modules. In this paper, we introduce Mixture of Latent Experts (MoLE), a novel parameterization methodology that facilitates the mapping of specific experts into a shared latent space. Specifically, all expert operations are systematically decomposed into two principal components: a shared projection into a lower-dimensional latent space, followed by expert-specific transformations with significantly reduced parametric complexity. This factorized approach substantially diminishes parameter count and computational requirements. Beyond the pretraining implementation of the MoLE architecture, we also establish a rigorous mathematical framework for transforming pre-trained MoE models into the MoLE architecture, characterizing the sufficient conditions for optimal factorization and developing a systematic two-phase algorithm for this conversion process. Our comprehensive theoretical analysis demonstrates that MoLE significantly enhances computational efficiency across multiple dimensions while preserving model representational capacity. Empirical evaluations corroborate our theoretical findings, confirming that MoLE achieves performance comparable to standard MoE implementations while substantially reducing resource requirements.
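
The factorization described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each expert is a single linear map, and the names `shared_down`, `expert_up`, and `d_latent` are hypothetical. Each dense expert matrix (d_out x d_in) is replaced by a shared down-projection (d_latent x d_in) followed by a small expert-specific matrix (d_out x d_latent).

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def standard_moe_params(num_experts, d_in, d_out):
    """Parameter count when every expert owns a full d_out x d_in matrix."""
    return num_experts * d_in * d_out


def latent_moe_params(num_experts, d_in, d_out, d_latent):
    """One shared down-projection plus one small matrix per expert."""
    return d_latent * d_in + num_experts * d_out * d_latent


class LatentExperts:
    """Hypothetical factorized experts: y = B_i (A x), with A shared."""

    def __init__(self, num_experts, d_in, d_out, d_latent, seed=0):
        rng = random.Random(seed)
        self.shared_down = random_matrix(d_latent, d_in, rng)  # A, shared
        self.expert_up = [
            random_matrix(d_out, d_latent, rng)  # B_i, expert specific
            for _ in range(num_experts)
        ]

    def forward(self, x, expert_index):
        latent = matvec(self.shared_down, x)  # project into the latent space
        return matvec(self.expert_up[expert_index], latent)


experts = LatentExperts(num_experts=8, d_in=16, d_out=16, d_latent=4)
y = experts.forward([1.0] * 16, expert_index=3)
dense_count = standard_moe_params(8, 16, 16)
latent_count = latent_moe_params(8, 16, 16, 4)
print(len(y), dense_count, latent_count)
```

With these toy sizes the factorized layout stores 576 weights instead of 2048. The saving grows with the number of experts, because only the small per-expert matrices are replicated.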

[NLP-99] Memory-Aware and Uncertainty-Guided Retrieval for Multi-Hop Question Answering

【Quick read】: This paper tackles two key problems that generative models face in multi-hop question answering (QA): (1) retrieval steps that are fixed or too frequent, and (2) ineffective use of previously retrieved knowledge. It proposes MIND (Memory-Informed and INteractive Dynamic RAG), a framework built on prompt-based entity extraction, dynamic retrieval triggering driven by token-level entropy and attention signals, and memory-aware filtering that stores high-confidence facts across reasoning steps to support consistent multi-hop generation.

Link: https://arxiv.org/abs/2503.23095
Authors: Yuelyu Ji, Rui Meng, Zhuochun Li, Daqing He
Affiliations: University of Pittsburgh; Salesforce Research
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Multi-hop question answering (QA) requires models to retrieve and reason over multiple pieces of evidence. While Retrieval-Augmented Generation (RAG) has made progress in this area, existing methods often suffer from two key limitations: (1) fixed or overly frequent retrieval steps, and (2) ineffective use of previously retrieved knowledge. We propose MIND (Memory-Informed and INteractive Dynamic RAG), a framework that addresses these challenges through: (i) prompt-based entity extraction to identify reasoning-relevant elements, (ii) dynamic retrieval triggering based on token-level entropy and attention signals, and (iii) memory-aware filtering, which stores high-confidence facts across reasoning steps to enable consistent multi-hop generation.
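
To make the idea of entropy-based retrieval triggering concrete, here is a small sketch. It covers only the entropy signal (the abstract also mentions attention signals, which are omitted), and the function names and the threshold value are assumptions chosen for illustration, not details from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_retrieve(step_distributions, threshold):
    """Trigger retrieval if any generated token was too uncertain.

    step_distributions: one probability list per generated token.
    threshold: hypothetical cut-off; a real system would tune it.
    """
    return any(token_entropy(p) > threshold for p in step_distributions)


confident_step = [[0.97, 0.01, 0.01, 0.01]]  # model is nearly sure
uncertain_step = [[0.25, 0.25, 0.25, 0.25]]  # model is guessing

print(should_retrieve(confident_step, threshold=1.0))
print(should_retrieve(uncertain_step, threshold=1.0))
```

A uniform distribution over four tokens has entropy ln 4 (about 1.39 nats), so it crosses the threshold of 1.0, while the peaked distribution stays well below it and no retrieval call is made.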

[NLP-100] Parsing Through Boundaries in Chinese Word Segmentation ACL2025

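【Quick read】: This paper examines the relationship between Chinese word segmentation and syntactic parsing, in particular how segmentation schemes affect dependency structures. It analyzes multiple word boundary schemes, each reflecting different linguistic and computational assumptions, to show how they shape the syntactic structure of Chinese. It also introduces an interactive web-based visualization tool for comparing parsing outcomes across segmentation methods in detail, giving a clearer view of how segmentation strategy and parsing interact.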

Link: https://arxiv.org/abs/2503.23091
Authors: Yige Chen, Zelong Li, Changbing Yang, Cindy Zhang, Amandisa Cady, Ai Ka Lee, Zejiao Zeng, Haihua Pan, Jungyeul Park
Affiliations: The Chinese University of Hong Kong; University College London; The University of British Columbia
Categories: Computation and Language (cs.CL)
Notes: Submitted to ACL2025 System Demonstration

Abstract:Chinese word segmentation is a foundational task in natural language processing (NLP), with far-reaching effects on syntactic analysis. Unlike alphabetic languages like English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous. This study highlights the intricate relationship between word segmentation and syntactic parsing, providing a clearer understanding of how different segmentation strategies shape dependency structures in Chinese. Focusing on the Chinese GSD treebank, we analyze multiple word boundary schemes, each reflecting distinct linguistic and computational assumptions, and examine how they influence the resulting syntactic structures. To support detailed comparison, we introduce an interactive web-based visualization tool that displays parsing outcomes across segmentation methods.

[NLP-101] UNITYAI-GUARD: Pioneering Toxicity Detection Across Low-Resource Indian Languages

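【Quick read】: This paper addresses binary toxicity classification for low-resource Indian languages. Existing systems mainly serve high-resource languages; UnityAI-Guard fills this gap by developing state-of-the-art models that cover diverse Brahmic/Indic scripts. Using a dataset of 888k training instances and 35k manually verified test instances, it reaches an average F1-score of 84.23% across seven languages.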

Link: https://arxiv.org/abs/2503.23088
Authors: Himanshu Beniwal, Reddybathuni Venkat, Rohit Kumar, Birudugadda Srivibhav, Daksh Jain, Pavan Doddi, Eshwar Dhande, Adithya Ananth, Kuldeep, Heer Kubadia, Pratham Sharda, Mayank Singh
Affiliations: Indian Institute of Technology Gandhinagar; Indian Institute of Technology Goa; Indian Institute of Technology Tirupati
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:This work introduces UnityAI-Guard, a framework for binary toxicity classification targeting low-resource Indian languages. While existing systems predominantly cater to high-resource languages, UnityAI-Guard addresses this critical gap by developing state-of-the-art models for identifying toxic content across diverse Brahmic/Indic scripts. Our approach achieves an impressive average F1-score of 84.23% across seven languages, leveraging a dataset of 888k training instances and 35k manually verified test instances. By advancing multilingual content moderation for linguistically diverse regions, UnityAI-Guard also provides public API access to foster broader adoption and application.

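[NLP-102] The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction

【Quick read】: This paper asks when large language models (LLMs) switch between reasoning and memorization during text generation. Although LLMs score well on many reasoning benchmarks, they sometimes fail to generalize to unseen questions, possibly because they rely too heavily on memorized training examples. The paper identifies a set of linear features in the residual stream that govern the balance between genuine reasoning and memory recall. These features distinguish reasoning tasks from memory-intensive ones, and manipulating them causally changes model performance on reasoning tasks. Intervening on these features also helps the model activate the most relevant problem-solving capabilities during answer generation. The study offers new insight into the reasoning and memory mechanisms of LLMs and a basis for more robust and interpretable generative AI systems.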

Link: https://arxiv.org/abs/2503.23084
Authors: Yihuai Hong, Dian Zhou, Meng Cao, Lei Yu, Zhijing Jin
Affiliations: New York University; University of Illinois at Urbana-Champaign; McGill University; Max Planck Institute for Intelligent Systems, Tübingen, Germany; University of Toronto; Vector Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Large language models (LLMs) excel on a variety of reasoning benchmarks, but previous studies suggest they sometimes struggle to generalize to unseen questions, potentially due to over-reliance on memorized training examples. However, the precise conditions under which LLMs switch between reasoning and memorization during text generation remain unclear. In this work, we provide a mechanistic understanding of LLMs’ reasoning-memorization dynamics by identifying a set of linear features in the model’s residual stream that govern the balance between genuine reasoning and memory recall. These features not only distinguish reasoning tasks from memory-intensive ones but can also be manipulated to causally influence model performance on reasoning tasks. Additionally, we show that intervening in these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation. Our findings offer new insights into the underlying mechanisms of reasoning and memory in LLMs and pave the way for the development of more robust and interpretable generative AI systems.
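
For readers unfamiliar with linear features in the residual stream, the sketch below shows one common way such a direction can be obtained and used: take the difference between the mean activation on two groups of prompts, then add a multiple of that direction to an activation. The abstract does not say how the authors extract their features, so the difference-of-means recipe, the toy three-dimensional vectors, and all function names here are assumptions.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def unit_direction(reasoning_acts, memory_acts):
    """Normalized difference between the two class means."""
    mu_reasoning = mean_vector(reasoning_acts)
    mu_memory = mean_vector(memory_acts)
    diff = [a - b for a, b in zip(mu_reasoning, mu_memory)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project(activation, direction):
    """How far an activation lies along the direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, strength):
    """Intervene by shifting the activation along the direction."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Toy stand-ins for residual-stream activations (hypothetical data).
reasoning_acts = [[2.0, 0.0, 1.0], [2.0, 0.0, -1.0]]
memory_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]

direction = unit_direction(reasoning_acts, memory_acts)
steered = steer(memory_acts[0], direction, strength=1.5)
print(direction, project(memory_acts[0], direction), project(steered, direction))
```

In this toy setup the two groups differ only in the first coordinate, so the recovered direction is the first axis, and steering moves a memory-like activation toward the reasoning side without touching the other coordinates.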

[NLP-103] Efficient Adaptation For Remote Sensing Visual Grounding

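【Quick read】: This paper addresses the poor performance of existing foundation models on Visual Grounding (VG) in remote sensing (RS), which stems from domain-specific challenges. The solution is Parameter Efficient Fine-Tuning (PEFT): the authors evaluate LoRA placement across different modules of Grounding DINO, and use BitFit and adapters to fine-tune the OFA foundation model pre-trained on general-purpose VG datasets. The approach matches or surpasses current state-of-the-art (SOTA) models while greatly reducing computational cost. The study shows the potential of PEFT for efficient and precise multi-modal analysis in RS, and offers a practical, cost-effective alternative to full model training.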

Link: https://arxiv.org/abs/2503.23083
Authors: Hasan Moughnieh, Mohamad Chalhoub, Hasan Nasrallah, Cristiano Nattero, Paolo Campanella, Ali J. Ghandour
Affiliations: National Center for Remote Sensing, CNRS; Lebanese University; WASDI; Dudelange, Luxembourg
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Foundation models have revolutionized artificial intelligence (AI), offering remarkable capabilities across multi-modal domains. Their ability to precisely locate objects in complex aerial and satellite images, using rich contextual information and detailed object descriptions, is essential for remote sensing (RS). These models can associate textual descriptions with object positions through the Visual Grounding (VG) task, but due to domain-specific challenges, their direct application to RS produces sub-optimal results. To address this, we applied Parameter Efficient Fine Tuning (PEFT) techniques to adapt these models for RS-specific VG tasks. Specifically, we evaluated LoRA placement across different modules in Grounding DINO and used BitFit and adapters to fine-tune the OFA foundation model pre-trained on general-purpose VG datasets. This approach achieved performance comparable to or surpassing current State Of The Art (SOTA) models while significantly reducing computational costs. This study highlights the potential of PEFT techniques to advance efficient and precise multi-modal analysis in RS, offering a practical and cost-effective alternative to full model training.
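
As background on why LoRA is parameter efficient, the following sketch implements the standard LoRA update for a single linear layer: the frozen weight W is left untouched and a low-rank correction scaled by alpha / rank is added, so the output is W x + (alpha / rank) * B (A x). It is a generic illustration and not the configuration used in the paper. The class name `LoRALinear` is hypothetical, and `lora_a` is filled with a constant only to keep the example deterministic (real implementations initialize it randomly).

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update."""

    def __init__(self, weight, rank, alpha):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight W (d_out x d_in)
        # A: rank x d_in. Constant here for determinism (assumption).
        self.lora_a = [[0.01] * d_in for _ in range(rank)]
        # B: d_out x rank, zero-initialized so training starts from W.
        self.lora_b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        low_rank = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, low_rank)]

    def trainable_params(self):
        """Only A and B are trained; W stays frozen."""
        return sum(len(r) for r in self.lora_a) + sum(len(r) for r in self.lora_b)


identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(identity, rank=2, alpha=4.0)
x = [1.0] * 8

before = layer.forward(x)   # B is zero, so this equals W x
layer.lora_b[0][0] = 1.0    # pretend one training step changed B
after = layer.forward(x)
print(layer.trainable_params(), before[0], after[0])
```

Here the layer trains 32 numbers instead of the 64 in the full matrix. The gap widens quickly for realistic layer sizes, since the trainable count grows with rank * (d_in + d_out) rather than d_in * d_out.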

[NLP-104] EventWeave: A Dynamic Framework for Capturing Core and Supporting Events in Dialogue Systems

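【Quick read】: This paper addresses incomplete context tracking in multi-turn dialogue systems built on large language models (LLMs), which often fail to track events adequately. As a result, dialogue systems lose coherence, miss subtle shifts in user intent, and produce disjointed responses. The paper proposes EventWeave, an event-centric framework. Its key idea is a dynamic event graph that identifies and updates core and supporting events and represents how they interact: core events shape the main idea, while supporting events supply critical context throughout the dialogue. Using this graph, EventWeave helps the model focus on the most relevant events when generating a reply and avoids repeatedly revisiting the whole dialogue history, improving response quality and event relevance.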

Link: https://arxiv.org/abs/2503.23078
Authors: Zhengyi Zhao, Shubo Zhang, Yiming Du, Bin Liang, Baojun Wang, Zhongyang Li, Binyang Li, Kam-Fai Wong
Affiliations: The Chinese University of Hong Kong; University of International Relations; Huawei Noah’s Ark Lab; Huawei Technologies, Co., Ltd.
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Existing large language models (LLMs) have shown remarkable progress in dialogue systems. However, many approaches still overlook the fundamental role of events throughout multi-turn interactions, leading to incomplete context tracking. Without tracking these events, dialogue systems often lose coherence and miss subtle shifts in user intent, causing disjointed responses. To bridge this gap, we present EventWeave, an event-centric framework that identifies and updates both core and supporting events as the conversation unfolds. Specifically, we organize these events into a dynamic event graph, which represents the interplay between core events that shape the primary idea and supporting events that provide critical context during the whole dialogue. By leveraging this dynamic graph, EventWeave helps models focus on the most relevant events when generating responses, thus avoiding repeated visits of the entire dialogue history. Experimental results on two benchmark datasets show that EventWeave improves response quality and event relevance without fine-tuning.

[NLP-105] Efficient Inference for Large Reasoning Models: A Survey

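【Quick read】: This paper addresses the inference inefficiency of Large Reasoning Models (LRMs) on complex tasks, which shows up as high token consumption, high memory usage, and long inference time. LRMs markedly improve the reasoning ability of Large Language Models (LLMs) by learning to reason, but their deliberative reasoning process creates an efficiency bottleneck. The survey therefore focuses on efficient inference methods that reduce token usage while preserving reasoning quality.

It groups recent methods into two main categories: (a) explicit compact Chain-of-Thought (CoT), which reduces tokens while keeping the explicit reasoning structure; and (b) implicit latent CoT, which encodes reasoning steps in hidden representations instead of explicit tokens. The paper analyzes the strengths and weaknesses of these methods in terms of performance and efficiency, and discusses open challenges such as human-centric controllable reasoning, the trade-off between interpretability and efficiency, safety of efficient reasoning, and broader applications. It also highlights key insights for improving LRM inference efficiency through model merging, new architectures, and agent routers.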

Link: https://arxiv.org/abs/2503.23077
Authors: Yue Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, Bryan Hooi
Affiliations: National University of Singapore; University of Chinese Academy of Sciences; Beijing Jiaotong University; Moonshot
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason, exhibiting promising performance in complex task-solving. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. Thus, this survey provides a review of efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving the reasoning quality. First, we introduce a taxonomy to group the recent methods into two main categories: (a) explicit compact Chain-of-Thought (CoT), which reduces tokens while keeping the explicit reasoning structure, and (b) implicit latent CoT, which encodes reasoning steps within hidden representations instead of explicit tokens. Meanwhile, we discuss their strengths and weaknesses. Then, we conduct empirical analyses on existing methods from performance and efficiency aspects. Besides, we present open challenges in this field, including human-centric controllable reasoning, trade-off between interpretability and efficiency of reasoning, ensuring safety of efficient reasoning, and broader applications of efficient reasoning. In addition, we highlight key insights for enhancing LRMs’ inference efficiency via techniques such as model merging, new architectures, and agent routers. We hope this work serves as a valuable guide, helping researchers overcome challenges in this vibrant field.

[NLP-106] A Training-free LLM Framework with Interaction between Contextually Related Subtasks in Solving Complex Tasks

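【Quick read】: This paper addresses the information loss that contextually related subtasks can suffer when a complex task is decomposed, which may lead to redundant operations or execution failures. It proposes a training-free framework with an interaction mechanism that lets a subtask send requests to completed subtasks, either to query specific information or to trigger certain actions. The key components are: (1) a subtask trajectory memory that allows a completed subtask to be resumed when it receives an interaction request; and (2) a new action during execution that generates a concise description of a subtask's execution process and outcome, helping later subtasks decide on interaction targets and requests. On WebShop and HotpotQA, the framework outperforms state-of-the-art training-free baselines.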

Link: https://arxiv.org/abs/2503.23053
Authors: Hongjia Liu, Jinlong Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Large language models (LLMs) have shown remarkable capabilities in solving complex tasks. Recent work has explored decomposing such tasks into subtasks with independent contexts. However, some contextually related subtasks may encounter information loss during execution, leading to redundant operations or execution failures. To address this issue, we propose a training-free framework with an interaction mechanism, which enables a subtask to query specific information or trigger certain actions in completed subtasks by sending requests. To implement interaction, we introduce a subtask trajectory memory to enable resumption of completed subtasks upon receiving interaction requests. Additionally, we propose a new action during execution, which generates a concise and precise description of the execution process and outcomes of a subtask, to assist subsequent subtasks in determining interaction targets and requests. We evaluate our framework on the interactive decision-making task WebShop and the multi-hop question answering task HotpotQA, with GPT-3.5 and GPT-4, and comparison results show that our framework outperforms the state-of-the-art training-free baselines.

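[NLP-107] Agentic Large Language Models, a survey

【Quick read】: This survey reviews agentic LLMs, that is, large language models that (1) reason, (2) act, and (3) interact, and organizes the literature into three categories: reasoning, reflection, and retrieval; action models, robots, and tools; and multi-agent systems. A central observation is that the categories benefit one another: retrieval enables tool use, reflection improves multi-agent collaboration, and reasoning benefits all categories. The paper also suggests that agentic LLMs may help with the problem of LLMs running out of training data, because inference-time behavior generates new training states. It notes the risk of LLM assistants taking action in the real world, while arguing that agentic LLMs have important applications in medical diagnosis, logistics, and financial market analysis, and can augment the process of scientific research itself.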

Link: https://arxiv.org/abs/2503.23037
Authors: Aske Plaat, Max van Duijn, Niki van Stein, Mike Preuss, Peter van der Putten, Kees Joost Batenburg
Affiliations: Leiden University, Netherlands
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes:

Abstract:There is great interest in agentic LLMs, large language models that act as agents. We review the growing body of work in this area and provide a research agenda. Agentic LLMs are LLMs that (1) reason, (2) act, and (3) interact. We organize the literature according to these three categories. The research in the first category focuses on reasoning, reflection, and retrieval, aiming to improve decision making; the second category focuses on action models, robots, and tools, aiming for agents that act as useful assistants; the third category focuses on multi-agent systems, aiming for collaborative task solving and simulating interaction to study emergent social behavior. We find that works mutually benefit from results in other categories: retrieval enables tool use, reflection improves multi-agent collaboration, and reasoning benefits all categories. We discuss applications of agentic LLMs and provide an agenda for further research. Important applications are in medical diagnosis, logistics and financial market analysis. Meanwhile, self-reflective agents playing roles and interacting with one another augment the process of scientific research itself. Further, agentic LLMs may provide a solution for the problem of LLMs running out of training data: inference-time behavior generates new training states, such that LLMs can keep learning without needing ever larger datasets. We note that there is risk associated with LLM assistants taking action in the real world, while agentic LLMs are also likely to benefit society.

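[NLP-108] A Retrieval-Augmented Knowledge Mining Method with Deep Thinking LLMs for Biomedical Research and Clinical Support

【Quick read】: This paper addresses two problems in biomedicine: knowledge graph construction is limited by complex terminology, data heterogeneity, and rapid knowledge evolution, and large language models (LLMs) have limitations in retrieval and reasoning that make cross-document associations and reasoning pathways hard to uncover. The proposed solution is an LLM-based pipeline that builds a biomedical knowledge graph (BioStrataKG) from large-scale articles, together with a cross-document question-answering dataset (BioCDQA) for evaluating latent knowledge retrieval and multi-hop reasoning. The paper also introduces Integrated and Progressive Retrieval-Augmented Reasoning (IP-RAR), which maximizes information recall through Integrated Reasoning-based Retrieval and refines knowledge through Progressive Reasoning-based Generation, using self-reflection to achieve deep thinking and precise contextual understanding. Experiments show that IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods.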

Link: https://arxiv.org/abs/2503.23029
Authors: Yichun Feng, Jiawei Wang, Ruikun He, Lu Zhou, Yixue Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Knowledge graphs and large language models (LLMs) are key tools for biomedical knowledge integration and reasoning, facilitating structured organization of scientific articles and discovery of complex semantic relationships. However, current methods face challenges: knowledge graph construction is limited by complex terminology, data heterogeneity, and rapid knowledge evolution, while LLMs show limitations in retrieval and reasoning, making it difficult to uncover cross-document associations and reasoning pathways. To address these issues, we propose a pipeline that uses LLMs to construct a biomedical knowledge graph (BioStrataKG) from large-scale articles and builds a cross-document question-answering dataset (BioCDQA) to evaluate latent knowledge retrieval and multi-hop reasoning. We then introduce Integrated and Progressive Retrieval-Augmented Reasoning (IP-RAR) to enhance retrieval accuracy and knowledge reasoning. IP-RAR maximizes information recall through Integrated Reasoning-based Retrieval and refines knowledge via Progressive Reasoning-based Generation, using self-reflection to achieve deep thinking and precise contextual understanding. Experiments show that IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods. This framework helps doctors efficiently integrate treatment evidence for personalized medication plans and enables researchers to analyze advancements and research gaps, accelerating scientific discovery and decision-making.

[NLP-109] S2MoE: Robust Sparse Mixture of Experts via Stochastic Learning

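【Quick read】: This paper addresses the training difficulty that representation collapse causes in Sparse Mixture of Experts (SMoE). Existing methods mitigate the problem by improving the router, but they have two key limitations: (1) expert embeddings are much smaller than the model dimension, which aggravates representation collapse; and (2) routing each input to the Top-K experts can lead them to learn overly similar features. The paper proposes S2MoE (Robust Sparse Mixture of Experts via Stochastic Learning), a mixture of experts that learns from both deterministic and non-deterministic inputs via Learning under Uncertainty. Experiments show that S2MoE performs comparably to other routing methods while reducing computational inference cost by 28%.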

Link: https://arxiv.org/abs/2503.23007
Authors: Giang Do, Hung Le, Truyen Tran
Affiliations: Applied Artificial Intelligence Institute (A2I2), Deakin University
Categories: Computation and Language (cs.CL)
Notes: 4 pages

Abstract:Sparse Mixture of Experts (SMoE) enables efficient training of large language models by routing input tokens to a select number of experts. However, training SMoE remains challenging due to the issue of representation collapse. Recent studies have focused on improving the router to mitigate this problem, but existing approaches face two key limitations: (1) expert embeddings are significantly smaller than the model’s dimension, contributing to representation collapse, and (2) routing each input to the Top-K experts can cause them to learn overly similar features. In this work, we propose a novel approach called Robust Sparse Mixture of Experts via Stochastic Learning (S2MoE), which is a mixture of experts designed to learn from both deterministic and non-deterministic inputs via Learning under Uncertainty. Extensive experiments across various tasks demonstrate that S2MoE achieves performance comparable to other routing methods while reducing computational inference costs by 28%.

[NLP-110] Sparse Mixture of Experts as Unified Competitive Learning

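【Quick read】: This paper addresses the limited generalization of current Sparse Mixture of Experts (SMoE) models, which struggle on tasks such as the Massive Text Embedding Benchmark (MTEB). Existing SMoEs fall into two categories, Token Choice and Expert Choice: the former may focus too much on irrelevant experts, and the latter risks discarding important tokens, and both effects can hurt performance. The paper proposes Unified Competitive Learning SMoE (USMoE), a framework based on competitive learning that is designed to improve existing SMoEs in both settings, with and without training. It either improves performance across a range of tasks or reduces inference cost while remaining competitive.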

Link: https://arxiv.org/abs/2503.22996
Authors: Giang Do, Hung Le, Truyen Tran
Affiliations: Applied Artificial Intelligence Institute (A2I2), Deakin University
Categories: Computation and Language (cs.CL)
Notes: 18 pages

Abstract:Sparse Mixture of Experts (SMoE) improves the efficiency of large language model training by directing input tokens to a subset of experts. Despite its success in generation tasks, its generalization ability remains an open question. In this paper, we demonstrate that current SMoEs, which fall into two categories, (1) Token Choice and (2) Expert Choice, struggle with tasks such as the Massive Text Embedding Benchmark (MTEB). By analyzing their mechanism through the lens of competitive learning, our study finds that the Token Choice approach may overly focus on irrelevant experts, while the Expert Choice approach risks discarding important tokens, potentially affecting performance. Motivated by this analysis, we propose Unified Competitive Learning SMoE (USMoE), a novel and efficient framework designed to improve the performance of existing SMoEs in both scenarios: with and without training. Extensive experiments across various tasks show that USMoE achieves up to a 10% improvement over traditional approaches or reduces computational inference costs by 14% while maintaining strong performance.
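
The difference between the two routing families named in the abstract can be seen on a toy score matrix. The sketch below is a generic illustration of Token Choice and Expert Choice routing, not the USMoE method itself. The router scores, the function names, and the capacity value are made up for the example.

```python
def token_choice(scores, k):
    """Each token keeps its k highest-scoring experts.

    scores[t][e] is the router score of token t for expert e.
    Returns, per token, the sorted list of chosen expert indices.
    """
    assignment = []
    for token_scores in scores:
        ranked = sorted(range(len(token_scores)),
                        key=lambda e: token_scores[e], reverse=True)
        assignment.append(sorted(ranked[:k]))
    return assignment


def expert_choice(scores, capacity):
    """Each expert keeps its `capacity` highest-scoring tokens.

    Returns, per expert, the sorted list of chosen token indices.
    """
    num_experts = len(scores[0])
    assignment = []
    for e in range(num_experts):
        ranked = sorted(range(len(scores)),
                        key=lambda t: scores[t][e], reverse=True)
        assignment.append(sorted(ranked[:capacity]))
    return assignment


# Hypothetical router scores: 3 tokens x 2 experts.
scores = [
    [0.90, 0.10],  # token 0
    [0.80, 0.20],  # token 1
    [0.30, 0.25],  # token 2
]

per_token = token_choice(scores, k=1)
per_expert = expert_choice(scores, capacity=1)
routed = {t for tokens in per_expert for t in tokens}
dropped = set(range(len(scores))) - routed
print(per_token, per_expert, dropped)
```

In this example Token Choice sends every token to expert 0, while Expert Choice balances the load but leaves token 1 unprocessed, which is the token-dropping risk the abstract describes.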

[NLP-111] FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research

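【Quick read】: This paper addresses the growing difficulty of reliable human oversight as AI models tackle more complex problems whose solutions are hard to verify. Approaches to scaling AI supervision include debate, critique, and prover-verifier games, but evaluating them requires datasets that are rarely available. The paper introduces FindTheFlaws, a group of five diverse datasets spanning medicine, mathematics, science, coding, and the Lojban language. The datasets contain expert-verified correct long-form solutions and flawed solutions annotated with specific errors, which supports evaluating the critiquing ability of frontier models and running scalable oversight experiments. For some task/dataset combinations, expert baselines exceed even top model performance, making those combinations more useful for scalable oversight experiments.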

Link: https://arxiv.org/abs/2503.22989
Authors: Gabriel Recchia, Chatrik Singh Mangat, Issac Li, Gayatri Krishnakumar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 43 pages, 3 figures. For associated repository, see this https URL

Abstract:As AI models tackle increasingly complex problems, ensuring reliable human oversight becomes more challenging due to the difficulty of verifying solutions. Approaches to scaling AI supervision include debate, in which two agents engage in structured dialogue to help a judge evaluate claims; critique, in which models identify potential flaws in proposed solutions; and prover-verifier games, in which a capable ‘prover’ model generates solutions that must be verifiable by a less capable ‘verifier’. Evaluations of the scalability of these and similar approaches to difficult problems benefit from datasets that include (1) long-form expert-verified correct solutions and (2) long-form flawed solutions with annotations highlighting specific errors, but few are available. To address this gap, we present FindTheFlaws, a group of five diverse datasets spanning medicine, mathematics, science, coding, and the Lojban language. Each dataset contains questions and long-form solutions with expert annotations validating their correctness or identifying specific error(s) in the reasoning. We evaluate frontier models’ critiquing capabilities and observe a range of performance that can be leveraged for scalable oversight experiments: models performing more poorly on particular datasets can serve as judges/verifiers for more capable models. Additionally, for some task/dataset combinations, expert baselines exceed even top model performance, making them more beneficial for scalable oversight experiments.

[NLP-112] FReM: A Flexible Reasoning Mechanism for Balancing Quick and Slow Thinking in Long-Context Question Answering

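【Quick read】: This paper addresses the limitations of the reasoning modes used in long-context question-answering (LCQA) systems. Slow reasoning tends to explore every possible reasoning path, which leads to overthinking and wasted time; quick reasoning usually relies on pattern matching rather than truly understanding the query logic, so it lacks proper semantic understanding. The paper proposes FReM (Flexible Reasoning Mechanism), which adjusts reasoning depth to the complexity of each question. FReM uses synthetic reference QA examples to provide an explicit chain of thought, so that simple queries are handled efficiently while more complex ones receive deeper reasoning. This helps quick-thinking models move beyond superficial pattern matching and narrows the reasoning space of slow-thinking models, improving reasoning accuracy and scalability, particularly on complex multi-hop questions.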

链接: https://arxiv.org/abs/2503.22985
作者: Zhengyi Zhao,Shubo Zhang,Zezhong Wang,Bin Liang,Binyang Li,Kam-Fai Wong
机构: The Chinese University of Hong Kong (香港中文大学); University of International Relations (国际关系学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Long-context question-answering (LCQA) systems have greatly benefited from the powerful reasoning capabilities of large language models (LLMs), which can be categorized into slow and quick reasoning modes. However, both modes have their limitations. Slow thinking generally leans to explore every possible reasoning path, which leads to heavy overthinking and wastes time. Quick thinking usually relies on pattern matching rather than truly understanding the query logic, which misses proper understanding. To address these issues, we propose FReM: Flexible Reasoning Mechanism, a method that adjusts reasoning depth according to the complexity of each question. Specifically, FReM leverages synthetic reference QA examples to provide an explicit chain of thought, enabling efficient handling of simple queries while allowing deeper reasoning for more complex ones. By doing so, FReM helps quick-thinking models move beyond superficial pattern matching and narrows the reasoning space for slow-thinking models to avoid unnecessary exploration. Experiments on seven QA datasets show that FReM improves reasoning accuracy and scalability, particularly for complex multihop questions, indicating its potential to advance LCQA methodologies.

[NLP-113] XL-Instruct: Synthetic Data for Cross-Lingual Open-Ended Generation

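【Quick Read】: This paper tackles cross-lingual open-ended generation, an important but understudied problem: generating a response in a target language different from the language of the user's query. To evaluate the cross-lingual generation ability of large language models (LLMs), the paper introduces the XL-AlpacaEval benchmark and proposes XL-Instruct, a high-quality synthetic data generation method. The key result is that fine-tuning on only 8K instructions generated by XL-Instruct significantly improves model performance, raising the win rate against GPT-4o-Mini from 7.4% to 21.5% and improving several fine-grained quality metrics. Models fine-tuned on XL-Instruct also show strong zero-shot transfer to English-only and multilingual generation tasks. Given these consistent gains, the authors strongly recommend including XL-Instruct in the post-training pipeline of future multilingual LLMs. To facilitate further research, the XL-Instruct and XL-AlpacaEval datasets will be released publicly and freely.

Link: https://arxiv.org/abs/2503.22973
Authors: Vivek Iyer, Ricardo Rei, Pinzhen Chen, Alexandra Birch
Affiliations: School of Informatics, University of Edinburgh, UK; Unbabel, Lisbon, Portugal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: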

Abstract:Cross-lingual open-ended generation – i.e. generating responses in a desired language different from that of the user’s query – is an important yet understudied problem. We introduce XL-AlpacaEval, a new benchmark for evaluating cross-lingual generation capabilities in Large Language Models (LLMs), and propose XL-Instruct, a high-quality synthetic data generation method. Fine-tuning with just 8K XL-Instruct-generated instructions significantly improves model performance, increasing the win rate against GPT-4o-Mini from 7.4% to 21.5%, and improving on several fine-grained quality metrics. Additionally, models fine-tuned on XL-Instruct exhibit strong zero-shot transfer to both English-only and multilingual generation tasks. Given its consistent gains across the board, we strongly recommend incorporating XL-Instruct in the post-training pipeline of future multilingual LLMs. To facilitate further research, we will publicly and freely release the XL-Instruct and XL-AlpacaEval datasets, which constitute two of the few cross-lingual resources currently available in the literature.

[NLP-114] HRET: A Self-Evolving LLM Evaluation Toolkit for Korean

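【Quick Read】: This paper addresses the lack of a standardized evaluation framework for Korean large language models (LLMs), which has led to inconsistent results and limited comparability. It proposes HRET (Haerae Evaluation Toolkit), an open-source, self-evolving evaluation framework designed specifically for Korean LLMs. Its key contribution is to unify diverse evaluation methods (logit-based scoring, exact match, language-inconsistency penalization, and LLM-as-a-Judge assessment) in a modular, registry-based architecture that integrates major benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends (vLLM, HuggingFace, OpenAI-compatible endpoints), together with automated pipelines for continuous evolution. The result is a solid foundation for reproducible, fair, and transparent Korean NLP research.

Link: https://arxiv.org/abs/2503.22968
Authors: Hanwool Lee, Soo Yong Kim, Dasol Choi, SangWon Baek, Seunghyeok Hong, Ilgyun Jeong, Inseon Hwang, Naeun Lee, Guijin Son
Affiliations: HAERAE; Shinhan Securities; Yonsei University; Catius; Hankuk University of Foreign Studies; Korea University; OneLineAI
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: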

Abstract:Recent advancements in Korean large language models (LLMs) have spurred numerous benchmarks and evaluation methodologies, yet the lack of a standardized evaluation framework has led to inconsistent results and limited comparability. To address this, we introduce HRET (Haerae Evaluation Toolkit), an open-source, self-evolving evaluation framework tailored specifically for Korean LLMs. HRET unifies diverse evaluation methods, including logit-based scoring, exact-match, language-inconsistency penalization, and LLM-as-a-Judge assessments. Its modular, registry-based architecture integrates major benchmarks (HAE-RAE Bench, KMMLU, KUDGE, HRM8K) and multiple inference backends (vLLM, HuggingFace, OpenAI-compatible endpoints). With automated pipelines for continuous evolution, HRET provides a robust foundation for reproducible, fair, and transparent Korean NLP research.
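
To make the "modular, registry-based architecture" concrete, here is a minimal sketch of the general registry pattern. It is not HRET's actual API: the names `EVALUATORS`, `register`, `exact_match`, and `evaluate` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a scorer name to a scoring function.
# This illustrates the pattern only; it is not HRET's real interface.
EVALUATORS: Dict[str, Callable[[List[str], List[str]], float]] = {}


def register(name: str):
    """Decorator that files a scoring function under `name`."""
    def wrap(fn):
        EVALUATORS[name] = fn
        return fn
    return wrap


@register("exact_match")
def exact_match(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def evaluate(method: str, predictions: List[str], references: List[str]) -> float:
    """Look the scorer up by name so callers never import it directly."""
    return EVALUATORS[method](predictions, references)
```

With this layout, adding a new scoring method means writing one decorated function; code that calls `evaluate("exact_match", ...)` stays unchanged.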

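[NLP-115] Can LLMs Support Medical Knowledge Imputation? An Evaluation-Based Perspective

【Quick Read】: This paper addresses the incompleteness of medical knowledge graphs (KGs) caused by knowledge gaps and structural limitations of medical coding systems, especially in treatment mapping, where associations between diseases and their potential treatments are missing or inconsistent. The approach explores using large language models (LLMs) to impute the missing treatment relationships. Although LLMs show promise for knowledge augmentation, applying them to medical knowledge imputation carries significant risks, including factual inaccuracies, hallucinated associations, and instability between and within models. Through systematic benchmark comparisons, the paper evaluates the reliability of LLM-driven treatment mapping and exposes critical limitations: inconsistencies with clinical guidelines and potential risks to patient safety. The study stresses the importance of critically evaluating LLM applications and advocates hybrid approaches to improve the quality of treatment mappings in medical KGs.

Link: https://arxiv.org/abs/2503.22954
Authors: Xinyu Yao, Aditya Sannabhadti, Holly Wiberg, Karmel S. Shehadeh, Rema Padman
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 10 pages, 3 figures, AMIA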

Abstract:Medical knowledge graphs (KGs) are essential for clinical decision support and biomedical research, yet they often exhibit incompleteness due to knowledge gaps and structural limitations in medical coding systems. This issue is particularly evident in treatment mapping, where coding systems such as ICD, Mondo, and ATC lack comprehensive coverage, resulting in missing or inconsistent associations between diseases and their potential treatments. To address this issue, we have explored the use of Large Language Models (LLMs) for imputing missing treatment relationships. Although LLMs offer promising capabilities in knowledge augmentation, their application in medical knowledge imputation presents significant risks, including factual inaccuracies, hallucinated associations, and instability between and within LLMs. In this study, we systematically evaluate LLM-driven treatment mapping, assessing its reliability through benchmark comparisons. Our findings highlight critical limitations, including inconsistencies with established clinical guidelines and potential risks to patient safety. This study serves as a cautionary guide for researchers and practitioners, underscoring the importance of critical evaluation and hybrid approaches when leveraging LLMs to enhance treatment mappings on medical knowledge graphs.

[NLP-116] SUV: Scalable Large Language Model Copyright Compliance with Regularized Selective Unlearning

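【Quick Read】: This paper addresses the problem that large language models (LLMs) may unintentionally memorize copyrighted content while learning from massive datasets, an issue that has already prompted several lawsuits. It proposes SUV (Selective Unlearning for Verbatim data), a selective unlearning framework whose goal is to prevent the LLM from memorizing copyrighted content while preserving as much of the model's overall utility as possible.

The key to the solution is to construct a dataset that captures instances of copyright infringement by the target LLM, and then to use Direct Preference Optimization (DPO) to replace verbatim copyrighted content with plausible and coherent alternatives. Because DPO may hurt the model's performance on unrelated tasks, the paper adds gradient projection and Fisher information regularization to mitigate the degradation. Experiments show that SUV significantly reduces verbatim memorization with minimal impact on tasks unrelated to copyright, supporting the method's effectiveness and scalability.

Link: https://arxiv.org/abs/2503.22948
Authors: Tianyang Xu, Xiaoze Liu, Feijie Wu, Xiaoqian Wang, Jing Gao
Affiliations: Purdue University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: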

Abstract:Large Language Models (LLMs) have transformed natural language processing by learning from massive datasets, yet this rapid progress has also drawn legal scrutiny, as the ability to unintentionally generate copyrighted content has already prompted several prominent lawsuits. In this work, we introduce SUV (Selective Unlearning for Verbatim data), a selective unlearning framework designed to prevent LLM from memorizing copyrighted content while preserving its overall utility. In detail, the proposed method constructs a dataset that captures instances of copyrighted infringement cases by the targeted LLM. With the dataset, we unlearn the content from the LLM by means of Direct Preference Optimization (DPO), which replaces the verbatim copyrighted content with plausible and coherent alternatives. Since DPO may hinder the LLM’s performance in other unrelated tasks, we integrate gradient projection and Fisher information regularization to mitigate the degradation. We validate our approach using a large-scale dataset of 500 famous books (predominantly copyrighted works) and demonstrate that SUV significantly reduces verbatim memorization with negligible impact on the performance on unrelated tasks. Extensive experiments on both our dataset and public benchmarks confirm the scalability and efficacy of our approach, offering a promising solution for mitigating copyright risks in real-world LLM applications.
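
For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair. Mapping it onto SUV is an assumption on our part: we take the "chosen" response to be the plausible alternative continuation and the "rejected" response to be the verbatim copyrighted continuation. The function name and `beta` default are illustrative, and the gradient projection and Fisher regularization terms mentioned in the abstract are not shown.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and the reference model agree, the loss is log 2; it falls as the policy shifts probability from the rejected continuation toward the chosen one.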

[NLP-117] Resona: Improving Context Copying in Linear Recurrence Models with Retrieval

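【Quick Read】: This paper addresses the significant gap in in-context learning between linear recurrent models and Transformer models in large language model (LLM) research. The key to the solution is Resona, a simple and scalable framework that augments linear recurrent models with retrieval, letting them integrate information retrieved from the input context and thereby adapt to diverse task requirements. Experiments show that linear recurrent models augmented with Resona achieve significant gains on a variety of synthetic and real-world natural language tasks, highlighting its potential as a general-purpose method for improving the in-context learning and language modeling abilities of linear recurrent LLMs.

Link: https://arxiv.org/abs/2503.22913
Authors: Xinyu Wang, Linrui Ma, Jerry Huang, Peng Lu, Prasanna Parthasarathi, Xiao-Wen Chang, Boxing Chen, Yufei Cui
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: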

Abstract:Recent shifts in the space of large language model (LLM) research have shown an increasing focus on novel architectures to compete with prototypical Transformer-based models that have long dominated this space. Linear recurrent models have proven to be a viable competitor due to their computational efficiency. However, such models still demonstrate a sizable gap compared to Transformers in terms of in-context learning among other tasks that require recalling information from a context. In this work, we introduce Resona, a simple and scalable framework for augmenting linear recurrent models with retrieval. Resona augments models with the ability to integrate retrieved information from the provided input context, enabling tailored behavior to diverse task requirements. Experiments on a variety of linear recurrent models demonstrate that Resona-augmented models observe significant performance gains on a variety of synthetic as well as real-world natural language tasks, highlighting its ability to act as a general purpose method to improve the in-context learning and language modeling abilities of linear recurrent LLMs.

[NLP-118] Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

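【Quick Read】: This paper addresses the difficulty of scaling up State Space Models (SSMs) on cloud services or resource-constrained devices, which stems mainly from their storage and compute requirements. To that end it proposes Quamba2, a quantization framework that shrinks model size with low bit-width data formats and benefits from hardware acceleration. Because SSMs are prone to quantization-induced errors, recent work has focused on optimizing one particular model or bit-width configuration for efficiency without sacrificing performance. Different scenarios, however, call for different bit-width configurations (for example, W4A8 to speed up large-batch decoding and W4A16 to speed up single-user generation from short prompts), so Quamba2 supports multiple configurations to meet the needs of different platforms.

The key to Quamba2 is its offline quantization approach. Building on the channel-order-preserving and activation-persistence properties of SSMs, it quantizes the input x to 8 bits by sorting and clustering, and combines this with per-state-group quantization for the input-dependent parameters B and C. To keep the SSM output compute-invariant, the weights are rearranged offline to match the clustering sequence. Experiments show that Quamba2-8B outperforms other state-of-the-art SSM quantization methods, delivering a 1.3× speed-up in the pre-filling stage and a 3× speed-up in the generation stage, with 4× memory reduction and only a 1.6% average accuracy drop. Evaluation on MMLU further demonstrates the framework's generalizability and robustness.

Link: https://arxiv.org/abs/2503.22879
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF)
Comments: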

Abstract:State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering for input x, combined with a per-state-group quantization for input-dependent parameters B and C. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms several state-of-the-art SSM quantization methods and delivers 1.3× and 3× speed-ups in the pre-filling and generation stages, respectively, while offering 4× memory reduction with only a 1.6% average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: this https URL.
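
The "sorting and clustering" idea can be illustrated with a small sketch of grouped symmetric 8-bit quantization. This is our own simplified reading, not the Quamba2 implementation: the contiguous split after sorting, the per-group scale `group_max / 127`, and all function names are assumptions made for illustration.

```python
from typing import List, Tuple


def group_channels(channel_absmax: List[float], num_groups: int) -> List[List[int]]:
    """Sort channel indices by calibrated |max| and split them into contiguous groups."""
    order = sorted(range(len(channel_absmax)), key=lambda c: channel_absmax[c])
    size = -(-len(order) // num_groups)  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]


def quantize_int8(values: List[float], scale: float) -> List[int]:
    """Symmetric 8-bit quantization: round(v / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def quantize_grouped(x: List[float], channel_absmax: List[float],
                     num_groups: int) -> Tuple[List[int], List[float]]:
    """Quantize one activation vector using one scale per channel group."""
    q = [0] * len(x)
    scales = [0.0] * len(x)
    for group in group_channels(channel_absmax, num_groups):
        group_max = max(channel_absmax[c] for c in group)
        scale = group_max / 127 if group_max > 0 else 1.0
        for c, qv in zip(group, quantize_int8([x[c] for c in group], scale)):
            q[c], scales[c] = qv, scale
    return q, scales
```

Grouping channels of similar magnitude keeps small-valued channels from sharing a scale with outlier channels, which is the intuition behind clustering before quantizing. In a real kernel the weights would also be permuted to match the group order, as the abstract notes.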

[NLP-119] Understanding Inequality of LLM Fact-Checking over Geographic Regions with Agent and Retrieval models

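【Quick Read】: This paper addresses the uneven performance of large language models (LLMs) on fact-checking across geographic regions. By evaluating the factual accuracy of open and private models across regions and scenarios, it shows that statements from the Global North perform substantially better than statements from the Global South in every experimental setup, and that the gap is wider in the more realistic scenario of a Wikipedia-based agent system. This suggests that general-purpose knowledge bases struggle to capture region-specific nuances. The paper's main message is the need for better dataset balancing and stronger retrieval mechanisms to improve LLM fact-checking in geographically diverse contexts.

Link: https://arxiv.org/abs/2503.22877
Authors: Bruno Coelho, Shujaat Mirza, Yuyuan Cui, Christina Pöpper, Damon McCoy
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: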

Abstract:Fact-checking is a potentially useful application of Large Language Models (LLMs) to combat the growing dissemination of disinformation. However, the performance of LLMs varies across geographic regions. In this paper, we evaluate the factual accuracy of open and private models across a diverse set of regions and scenarios. Using a dataset containing 600 fact-checked statements balanced across six global regions we examine three experimental setups of fact-checking a statement: (1) when just the statement is available, (2) when an LLM-based agent with Wikipedia access is utilized, and (3) as a best case scenario when a Retrieval-Augmented Generation (RAG) system provided with the official fact check is employed. Our findings reveal that regardless of the scenario and LLM used, including GPT-4, Claude Sonnet, and LLaMA, statements from the Global North perform substantially better than those from the Global South. Furthermore, this gap is broadened for the more realistic case of a Wikipedia agent-based system, highlighting that overly general knowledge bases have a limited ability to address region-specific nuances. These results underscore the urgent need for better dataset balancing and robust retrieval strategies to enhance LLM fact-checking capabilities, particularly in geographically diverse contexts.

[NLP-120] Generating Synthetic Oracle Datasets to Analyze Noise Impact: A Study on Building Function Classification Using Tweets

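【Quick Read】: This paper addresses how model performance in tweet-based building function classification (BFC) is affected by label noise, introduced by geographic-heuristic collection and labeling via external databases, and by sentence-level feature noise such as irrelevant or uninformative tweets. Its key contribution is a method that uses a large language model (LLM) to generate a synthetic benchmark dataset containing only tweets that are correctly labeled and semantically relevant to their associated buildings. This oracle dataset makes it possible to systematically analyze noise effects that are hard to isolate in real-world data. Experiments show that feature noise in real tweets significantly degrades the contextual learning capacity of mBERT, whereas the clean synthetic dataset lets mBERT learn effectively and outperform a Naive Bayes classifier by a large margin. This indicates that, for this task, addressing feature noise matters more than increasing model complexity. The synthetic dataset offers a new experimental environment for future noise-injection studies and is publicly available on GitHub.

Link: https://arxiv.org/abs/2503.22856
Authors: Shanshan Bai, Anna Kruspe, Xiaoxiang Zhu
Affiliations: Technical University of Munich; Munich University of Applied Sciences; Munich Center for Machine Learning
Categories: Computation and Language (cs.CL)
Comments: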

Abstract:Tweets provide valuable semantic context for earth observation tasks and serve as a complementary modality to remote sensing imagery. In building function classification (BFC), tweets are often collected using geographic heuristics and labeled via external databases, an inherently weakly supervised process that introduces both label noise and sentence level feature noise (e.g., irrelevant or uninformative tweets). While label noise has been widely studied, the impact of sentence level feature noise remains underexplored, largely due to the lack of clean benchmark datasets for controlled analysis. In this work, we propose a method for generating a synthetic oracle dataset using LLM, designed to contain only tweets that are both correctly labeled and semantically relevant to their associated buildings. This oracle dataset enables systematic investigation of noise impacts that are otherwise difficult to isolate in real-world data. To assess its utility, we compare model performance using Naive Bayes and mBERT classifiers under three configurations: real vs. synthetic training data, and cross-domain generalization. Results show that noise in real tweets significantly degrades the contextual learning capacity of mBERT, reducing its performance to that of a simple keyword-based model. In contrast, the clean synthetic dataset allows mBERT to learn effectively, outperforming Naive Bayes by a large margin. These findings highlight that addressing feature noise is more critical than model complexity in this task. Our synthetic dataset offers a novel experimental environment for future noise injection studies and is publicly available on GitHub.

[NLP-121] L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution

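【Quick Read】: This paper addresses whether models stay consistent and accurate when applying simple rules step by step in complex reasoning tasks, that is, the evaluation and improvement of "level-0" reasoning. To evaluate this capability systematically, it proposes L0-Bench, a language model benchmark for testing procedural correctness, which emphasizes generating error-free reasoning processes and complements existing benchmarks that mainly focus on outcome correctness. The key idea is to build controllable, scalable test programs from synthetic Python functions and to use L0-Bench to analyze model behavior along several dimensions (input context length, number of solutions for majority voting, and number of inference steps). The analysis reveals how models degrade over multi-step reasoning and shows the advantage of larger and reasoning-enhanced models, pointing to directions for improving "level-0" reasoning.

Link: https://arxiv.org/abs/2503.22832
Authors: Simeng Sun, Cheng-Ping Hsieh, Faisal Ladhak, Erik Arakelyan, Santiago Akle Serano, Boris Ginsburg
Affiliations: NVIDIA
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL)
Comments: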

Abstract:Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps, a foundational capability which we term “level-0” reasoning. To systematically evaluate this capability, we introduce L0-Bench, a language model benchmark for testing procedural correctness – the ability to generate correct reasoning processes, complementing existing benchmarks that primarily focus on outcome correctness. Given synthetic Python functions with simple operations, L0-Bench grades models on their ability to generate step-by-step, error-free execution traces. The synthetic nature of L0-Bench enables systematic and scalable generation of test programs along various axes (e.g., number of trace steps). We evaluate a diverse array of recent closed-source and open-weight models on a baseline test set. All models exhibit degradation as the number of target trace steps increases, while larger models and reasoning-enhanced models better maintain correctness over multiple steps. Additionally, we use L0-Bench to explore test-time scaling along three dimensions: input context length, number of solutions for majority voting, and inference steps. Our results suggest substantial room to improve “level-0” reasoning and potential directions to build more reliable reasoning systems.
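
As a rough illustration of grading procedural correctness, the sketch below compares a predicted execution trace against a reference step by step, and picks a consensus trace by majority voting over several samples. The trace format (one string per step) and both function names are hypothetical; the benchmark's actual trace format and scoring rules are not given in this digest.

```python
from collections import Counter
from typing import List, Optional, Sequence


def first_error_step(reference: Sequence[str], predicted: Sequence[str]) -> Optional[int]:
    """Index of the first step where the traces differ, or None if they match exactly."""
    for i, (ref, pred) in enumerate(zip(reference, predicted)):
        if ref != pred:
            return i
    if len(reference) != len(predicted):
        # One trace is a strict prefix of the other: the error is the first missing/extra step.
        return min(len(reference), len(predicted))
    return None


def majority_vote(traces: List[Sequence[str]]) -> List[str]:
    """Return the most frequently sampled trace (ties go to the one seen first)."""
    counts = Counter(tuple(t) for t in traces)
    return list(counts.most_common(1)[0][0])
```

A trace counts as procedurally correct only when `first_error_step` returns `None`, so a single wrong intermediate step fails it even if the final value happens to be right.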

[NLP-122] Learning to Reason for Long-Form Story Generation

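【Quick Read】: This paper addresses generating high-quality long-form stories with large language models (LLMs), and in particular how to make generation more general and automated when labeled datasets and precise quality metrics are lacking. Traditional approaches rely on hand-designed prompting techniques, which are highly task-specific and inefficient. The paper therefore proposes a general story-generation task, Next-Chapter Prediction, and a reward formulation, Verified Rewards via Completion Likelihood Improvement, that uses an unlabeled book dataset as the learning signal, so that the model learns to reason over a condensed summary of the story and produce a detailed plan for the next chapter. The key is combining reinforcement learning (RL) with verifiable rewards, which enables automated reasoning and generation without large amounts of labeled data, significantly improves generation quality, and shows a stronger effect in the science fiction and fantasy genres.

Link: https://arxiv.org/abs/2503.22828
Authors: Alexander Gurung, Mirella Lapata
Affiliations: University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: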

Abstract:Generating high-quality stories spanning thousands of tokens requires competency across a variety of skills, from tracking plot and character arcs to keeping a consistent and engaging style. Due to the difficulty of sourcing labeled datasets and precise quality measurements, most work using large language models (LLMs) for long-form story generation uses combinations of hand-designed prompting techniques to elicit author-like behavior. This is a manual process that is highly dependent on the specific story-generation task. Motivated by the recent success of applying RL with Verifiable Rewards to domains like math and coding, we propose a general story-generation task (Next-Chapter Prediction) and a reward formulation (Verified Rewards via Completion Likelihood Improvement) that allows us to use an unlabeled book dataset as a learning signal for reasoning. We learn to reason over a story’s condensed information and generate a detailed plan for the next chapter. Our reasoning is evaluated via the chapters it helps a story-generator create, and compared against non-trained and supervised finetuning (SFT) baselines. Pairwise human judgments reveal the chapters our learned reasoning produces are preferred across almost all metrics, and the effect is more pronounced in Scifi and Fantasy genres.
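
The digest does not give the exact reward formula, so the following is only one plausible reading of "completion likelihood improvement": reward a plan by how much it raises the log-likelihood of the true next chapter under a story generator. The function name, its arguments, and the per-token normalization are all assumptions.

```python
def likelihood_improvement_reward(logp_with_plan: float,
                                  logp_without_plan: float,
                                  num_tokens: int) -> float:
    """Per-token gain in log-likelihood of the true next chapter when the plan is supplied.

    logp_with_plan:    log p(chapter | story context, plan), summed over chapter tokens
    logp_without_plan: log p(chapter | story context), summed over chapter tokens
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return (logp_with_plan - logp_without_plan) / num_tokens
```

Under this reading, a plan earns a positive reward only if it makes the real continuation more likely, which is what lets an unlabeled book serve as the supervision signal.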

[NLP-123] Boosting Large Language Models with Mask Fine-Tuning

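【Quick Read】: This paper asks whether the convention of always keeping the model intact in mainstream large language model (LLM) fine-tuning protocols is actually necessary for performance. It shows that properly breaking the model's integrity can, surprisingly, improve performance. The key to the solution is a new paradigm called Mask Fine-Tuning (MFT), which learns a set of binary masks supervised by the typical LLM fine-tuning objective. MFT yields consistent gains across domains and backbone models, and the paper provides a detailed hyperparameter analysis that sheds light on its behavior. Because MFT is applied to a fully trained model, it also extends mask learning beyond its conventional role in network pruning.

Link: https://arxiv.org/abs/2503.22764
Authors: Mingyuan Zhang, Yue Bai, Huan Wang, Yizhou Wang, Qihua Dong, Yun Fu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: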

Abstract:The model is usually kept integral in the mainstream large language model (LLM) fine-tuning protocols. No works have questioned whether maintaining the integrity of the model is indispensable for performance. In this work, we introduce Mask Fine-Tuning (MFT), a brand-new LLM fine-tuning paradigm to show that properly breaking the integrity of the model can surprisingly lead to improved performance. Specifically, MFT learns a set of binary masks supervised by the typical LLM fine-tuning objective. Extensive experiments show that MFT gains a consistent performance boost across various domains and backbones (e.g., 1.95%/1.88% average gain in coding with LLaMA2-7B/3.1-8B). Detailed procedures are provided to study the proposed MFT from different hyperparameter perspectives for better insight. In particular, MFT naturally updates the current LLM training protocol by deploying it on a complete well-trained model. This study extends the functionality of mask learning from its conventional network pruning context for model compression to a more general scope.
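
To show what "learning a binary mask over a trained model" means at inference time, here is a forward-pass-only sketch. It assumes each weight has a real-valued score that is thresholded into a 0/1 mask; that parameterization, the threshold, and the function names are our assumptions, and the training procedure for the scores is not shown.

```python
from typing import List

Matrix = List[List[float]]


def binarize(scores: Matrix, threshold: float = 0.0) -> Matrix:
    """Turn per-weight scores into a 0/1 mask: keep a weight when its score exceeds the threshold."""
    return [[1.0 if s > threshold else 0.0 for s in row] for row in scores]


def masked_linear(x: List[float], weights: Matrix, mask: Matrix) -> List[float]:
    """Compute y = (W * M) x, where W stays frozen and M is the binary mask."""
    return [sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
            for w_row, m_row in zip(weights, mask)]
```

The pretrained weights are never updated here; only which of them participate changes, which is the sense in which the model's "integrity" is broken.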

[NLP-124] Susceptibility of Large Language Models to User-Driven Factors in Medical Queries

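【Quick Read】: This paper studies how user-driven factors (misinformation framing, source authority, model persona, and omission of key clinical details) affect the diagnostic accuracy and reliability of large language models (LLMs) in healthcare. It runs two experiments: a perturbation test that introduces misleading external opinions with varying assertiveness, and an ablation test that removes specific categories of patient information. The main takeaway is that users of LLMs should supply well-structured prompts and complete clinical context, avoid framing misinformation authoritatively, and provide full clinical details, especially for complex cases.

Link: https://arxiv.org/abs/2503.22746
Authors: Kyung Ho Lim, Ujin Kang, Xiang Li, Jin Sung Kim, Young-Chul Jung, Sangjoon Park, Byung-Hoon Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: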

Abstract:Large language models (LLMs) are increasingly used in healthcare, but their reliability is heavily influenced by user-driven factors such as question phrasing and the completeness of clinical information. In this study, we examined how misinformation framing, source authority, model persona, and omission of key clinical details affect the diagnostic accuracy and reliability of LLM outputs. We conducted two experiments: one introducing misleading external opinions with varying assertiveness (perturbation test), and another removing specific categories of patient information (ablation test). Using public datasets (MedQA and Medbullets), we evaluated proprietary models (GPT-4o, Claude 3.5 Sonnet, Claude 3.5 Haiku, Gemini 1.5 Pro, Gemini 1.5 Flash) and open-source models (LLaMA 3 8B, LLaMA 3 Med42 8B, DeepSeek R1 8B). All models were vulnerable to user-driven misinformation, with proprietary models especially affected by definitive and authoritative language. Assertive tone had the greatest negative impact on accuracy. In the ablation test, omitting physical exam findings and lab results caused the most significant performance drop. Although proprietary models had higher baseline accuracy, their performance declined sharply under misinformation. These results highlight the need for well-structured prompts and complete clinical context. Users should avoid authoritative framing of misinformation and provide full clinical details, especially for complex cases.

[NLP-125] Adaptive Integrated Layered Attention (AILA)

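【Quick Read】: This paper addresses adaptive feature reuse across network layers and proposes a neural network architecture named Adaptive Integrated Layered Attention (AILA). Its key innovation is to combine dense skip connections with different mechanisms for flexibly reusing features across layers. Two variants are designed: AILA-Architecture 1 uses simple linear layers as the connection mechanism between layers, while AILA-Architecture 2 adds an attention mechanism to focus selectively on the outputs of previous layers. Evaluated on price forecasting, image recognition, and sentiment analysis, AILA matches strong baselines (LSTMs, Transformers, and ResNets) at a fraction of the training and inference time, which supports the effectiveness of adaptive inter-layer connections for reusing relevant features at multiple depths.

Link: https://arxiv.org/abs/2503.22742
Authors: William Claster, Suhas KM, Dhairya Gundechia
Affiliations: Northeastern University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)
Comments: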

Abstract:We propose Adaptive Integrated Layered Attention (AILA), a neural network architecture that combines dense skip connections with different mechanisms for adaptive feature reuse across network layers. We evaluate AILA on three challenging tasks: price forecasting for various commodities and indices (S&P 500, Gold, US dollar Futures, Coffee, Wheat), image recognition using the CIFAR-10 dataset, and sentiment analysis on the IMDB movie review dataset. In all cases, AILA matches strong deep learning baselines (LSTMs, Transformers, and ResNets), achieving it at a fraction of the training and inference time. Notably, we implement and test two versions of the model - AILA-Architecture 1, which uses simple linear layers as the connection mechanism between layers, and AILA-Architecture 2, which implements an attention mechanism to selectively focus on outputs from previous layers. Both architectures are applied in a single-task learning setting, with each model trained separately for individual tasks. Results confirm that AILA’s adaptive inter-layer connections yield robust gains by flexibly reusing pertinent features at multiple network depths. The AILA approach thus presents an extension to existing architectures, improving long-range sequence modeling, image recognition with optimised computational speed, and SOTA classification performance in practice.
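
The idea behind AILA-Architecture 2, attending over the outputs of earlier layers, can be sketched with plain scaled dot-product attention. This is a generic illustration rather than the authors' design: the query construction, the scaling by the square root of the dimension, and the function names are assumptions.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_layers(query: List[float], layer_outputs: List[List[float]]) -> List[float]:
    """Weighted sum of earlier layers' outputs, weighted by scaled dot-product scores."""
    scale = math.sqrt(len(query))
    scores = [sum(q * h for q, h in zip(query, out)) / scale for out in layer_outputs]
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, layer_outputs)) for d in range(dim)]
```

Replacing the softmax weights with fixed learned coefficients would give something closer in spirit to the linear connection mechanism described for Architecture 1.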

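[NLP-126] Training in translation tools and technologies: Findings of the EMT survey 2023

【Quick Read】: This paper surveys and analyzes trends in the computerized tools and technologies taught in postgraduate translation training programmes, focusing on innovations in translation technology and how they are integrated into curricula. Drawing on the third iteration of a survey of such programmes, it examines how widely machine translation, post-editing, quality evaluation, and generative tools are included in curricula and how teaching has changed. It also considers the effect of the Covid-19 pandemic on course delivery, in particular the shift to cloud-based software and students' personal devices.

The key finding is how quickly translation education responds to technological innovation, adapting course content and delivery to industry needs. This includes consolidating the teaching of core tools, embedding professional contexts and technology workflows, and paying more attention to file management, data security, and legal and ethical issues. Through these measures, translation education can better train professionals with practical skills.

Link: https://arxiv.org/abs/2503.22735
Authors: Andrew Rothwell, Joss Moorkens, Tomas Svoboda
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: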

Abstract:This article reports on the third iteration of a survey of computerized tools and technologies taught as part of postgraduate translation training programmes. While the survey was carried out under the aegis of the EMT Network, more than half of responses are from outside that network. The results show the responsiveness of programmes to innovations in translation technology, with increased compulsory inclusion of machine translation, post-editing, and quality evaluation, and a rapid response to the release of generative tools. The flexibility required during the Covid-19 pandemic has also led to some lasting changes to programmes. While the range of tools being taught has continued to expand, programmes seem to be consolidating their core offering around cloud-based software with cost-free academic access. There has also been an increase in the embedding of professional contexts and workflows associated with translation technology. Generic file management and data security skills have increased in perceived importance, and legal and ethical issues related to translation data have also become more prominent. In terms of course delivery the shift away from conventional labs identified in EMT2017 has accelerated markedly, no doubt partly driven by the pandemic, accompanied by a dramatic expansion in the use of students’ personal devices.

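[NLP-127] Reasoning Beyond Limits: Advances and Open Problems for LLMs

【Quick Read】: This paper systematically analyzes breakthroughs in the generative reasoning abilities of large language models (LLMs) and examines how various training methods improve their ability to solve complex problems. Its core is a comprehensive review of the top 27 LLMs released between 2023 and 2025 (such as Mistral AI Small 3 24B and DeepSeek-R1) and a detailed overview of training strategies: general training approaches, mixture-of-experts (MoE) and architectural innovations, retrieval-augmented generation (RAG), chain-of-thought and self-improvement techniques, and test-time compute scaling, distillation, and reinforcement learning (RL). The paper argues that these methods substantially improve multi-step reasoning, and it discusses key open challenges: improving multi-step reasoning without human supervision, overcoming limitations in chained tasks, balancing structured prompts with flexibility, and enhancing long-context retrieval and external tool integration.

Link: https://arxiv.org/abs/2503.22732
Authors: Mohamed Amine Ferrag, Norbert Tihanyi, Merouane Debbah
Affiliations: Guelma University (Algeria); Technology Innovation Institute (UAE); Eötvös Loránd University (Hungary); Khalifa University of Science and Technology (UAE)
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 41 pages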

Abstract:Recent generative reasoning breakthroughs have transformed how large language models (LLMs) tackle complex problems by dynamically retrieving and refining information while generating coherent, multi-step thought processes. Techniques such as inference-time scaling, reinforcement learning, supervised fine-tuning, and distillation have been successfully applied to models like DeepSeek-R1, OpenAI’s o1 and o3, GPT-4o, Qwen-32B, and various Llama variants, resulting in enhanced reasoning capabilities. In this paper, we provide a comprehensive analysis of the top 27 LLM models released between 2023 and 2025 (including models such as Mistral AI Small 3 24B, DeepSeek-R1, Search-o1, QwQ-32B, and phi-4). Then, we present an extensive overview of training methodologies that spans general training approaches, mixture-of-experts (MoE) and architectural innovations, retrieval-augmented generation (RAG), chain-of-thought and self-improvement techniques, as well as test-time compute scaling, distillation, and reinforcement learning (RL) methods. Finally, we discuss the key challenges in advancing LLM capabilities, including improving multi-step reasoning without human supervision, overcoming limitations in chained tasks, balancing structured prompts with flexibility, and enhancing long-context retrieval and external tool integration.

[NLP-128] A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

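[Quick read]: This paper addresses the bottleneck of obtaining high-quality, diverse, large-scale biomedical data, which is the foundation of modern AI systems. Its key contribution is Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To ease access to a dataset of this size, the authors also provide scalable streaming and search APIs, enabling seamless integration with AI systems.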

Link: https://arxiv.org/abs/2503.22727
Authors: Alejandro Lozano,Min Woo Sun,James Burgess,Jeffrey J. Nirschl,Christopher Polzak,Yuhui Zhang,Liangyu Chen,Jeffrey Gu,Ivan Lopez,Josiah Aklilu,Anita Rau,Austin Wolfgang Katzer,Collin Chiu,Orr Zohar,Xiaohan Wang,Alfred Seunghoon Song,Chiang Chia-Chun,Robert Tibshirani,Serena Yeung-Levy
Affiliations: Stanford University; Mayo Clinic
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data - the foundation for modern AI systems - is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To overcome the challenges of accessing our large-scale dataset, we provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.

[NLP-129] InfoBid: A Simulation Framework for Studying Information Disclosure in Auctions with Large Language Model-based Agents AAAI2025

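[Quick read]: This paper addresses the trade-off that publishers face in information disclosure within online advertising systems: disclosing more information can improve efficiency by enabling optimal allocation of ad impressions, but it may reduce revenue potential by lowering uncertainty among competing advertisers. Understanding this trade-off is constrained by limited access to real-world data, which has led researchers and practitioners to turn to simulation frameworks. The key solution is InfoBid, a flexible simulation framework that uses large language model (LLM) agents to study the effects of information disclosure strategies in multi-agent auction settings. Simulations of second-price auctions run with GPT-4o show how signaling influences strategic behavior and auction outcomes, with results consistent with both economic and social learning theories. The main novelty is integrating LLM-based agents into a simulation framework to explore information asymmetry and signaling strategies, particularly in auctions.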

Link: https://arxiv.org/abs/2503.22726
Authors: Yue Yin
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); General Economics (econ.GN)
Comments: AAAI 2025 Workshop: Economics of Modern ML: Markets, Incentives, and Generative AI

Abstract:In online advertising systems, publishers often face a trade-off in information disclosure strategies: while disclosing more information can enhance efficiency by enabling optimal allocation of ad impressions, it may lose revenue potential by decreasing uncertainty among competing advertisers. Similar to other challenges in market design, understanding this trade-off is constrained by limited access to real-world data, leading researchers and practitioners to turn to simulation frameworks. The recent emergence of large language models (LLMs) offers a novel approach to simulations, providing human-like reasoning and adaptability without necessarily relying on explicit assumptions about agent behavior modeling. Despite their potential, existing frameworks have yet to integrate LLM-based agents for studying information asymmetry and signaling strategies, particularly in the context of auctions. To address this gap, we introduce InfoBid, a flexible simulation framework that leverages LLM agents to examine the effects of information disclosure strategies in multi-agent auction settings. Using GPT-4o, we implemented simulations of second-price auctions with diverse information schemas. The results reveal key insights into how signaling influences strategic behavior and auction outcomes, which align with both economic and social learning theories. Through InfoBid, we hope to foster the use of LLMs as proxies for human economic and social agents in empirical studies, enhancing our understanding of their capabilities and limitations. This work bridges the gap between theoretical market designs and practical applications, advancing research in market simulations, information design, and agent-based reasoning while offering a valuable tool for exploring the dynamics of digital economies.
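For readers unfamiliar with the auction format named in the abstract, the rule of a sealed-bid second-price auction can be sketched as below. This is only an illustration of the standard mechanism, not code from InfoBid; the function name, bidder names, and bid values are hypothetical.

```python
from typing import Dict, Tuple


def second_price_auction(bids: Dict[str, float]) -> Tuple[str, float]:
    """Return (winner, price): the highest bidder wins and pays the second-highest bid."""
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bidders")
    # Rank bidders from highest to lowest bid; sorted() is stable, so ties
    # are resolved in favour of the bidder that was inserted first.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


# Hypothetical bids from three advertisers for one ad impression.
bids = {"advertiser_a": 3.0, "advertiser_b": 5.0, "advertiser_c": 4.0}
winner, price = second_price_auction(bids)
print(winner, price)  # advertiser_b 4.0
```

Because the winner's payment does not depend on their own bid, what bidders know about the item and about each other is what shapes their behavior, which is why this format is a natural testbed for disclosure strategies.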

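[NLP-130] TRIDIS: A Comprehensive Medieval and Early Modern Corpus for HTR and NER

[Quick read]: This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts intended to stimulate joint, robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage. Its core contributions are the aggregation of multiple openly licensed legacy collections, unified semi-diplomatic transcription rules (expansion, normalization, punctuation), and a strategy for building challenging out-of-domain test splits driven by outlier detection in a joint embedding space. The paper also reports preliminary baseline experiments with TrOCR and MiniCPM2.5 that compare random and outlier-based test partitions.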

Link: https://arxiv.org/abs/2503.22714
Authors: Sergio Torres Aguilar
Affiliations: University of Luxembourg
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures, 2 tables

Abstract:This paper introduces TRIDIS (Tria Digita Scribunt), an open-source corpus of medieval and early modern manuscripts. TRIDIS aggregates multiple legacy collections (all published under open licenses) and incorporates large metadata descriptions. While prior publications referenced some portions of this corpus, here we provide a unified overview with a stronger focus on its constitution. We describe (i) the narrative, chronological, and editorial background of each major sub-corpus, (ii) its semi-diplomatic transcription rules (expansion, normalization, punctuation), (iii) a strategy for challenging out-of-domain test splits driven by outlier detection in a joint embedding space, and (iv) preliminary baseline experiments using TrOCR and MiniCPM2.5 comparing random and outlier-based test partitions. Overall, TRIDIS is designed to stimulate joint robust Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) research across medieval and early modern textual heritage.
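One simple way to picture an outlier-based test partition is sketched below. The abstract does not say which outlier detector TRIDIS uses, so the distance-to-centroid rule, the function name, and the toy embeddings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def outlier_test_split(
    embeddings: Sequence[Sequence[float]], test_size: int
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices), sending the farthest points to the test set."""
    if not 0 < test_size < len(embeddings):
        raise ValueError("test_size must be between 1 and len(embeddings) - 1")
    dim = len(embeddings[0])
    # Centroid of all embeddings, computed coordinate by coordinate.
    centroid = [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]
    # Euclidean distance of every sample from the centroid (assumed outlier score).
    distances = [math.dist(vec, centroid) for vec in embeddings]
    # Indices ordered from most to least outlying.
    ranked = sorted(range(len(embeddings)), key=lambda i: distances[i], reverse=True)
    test_indices = sorted(ranked[:test_size])
    train_indices = sorted(ranked[test_size:])
    return train_indices, test_indices


# Toy 2-D embeddings: three clustered samples and one far-away sample.
toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
train_idx, test_idx = outlier_test_split(toy, test_size=1)
print(train_idx, test_idx)  # [0, 1, 2] [3]
```

Compared with a random split, a partition of this kind deliberately places atypical samples in the test set, so reported scores reflect out-of-domain behavior.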

[NLP-131] CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation

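[Quick read]: This paper targets two key limitations of current autonomous scientific discovery (ASD) systems: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated only through conference-style paper review, with limited evaluation of the code. To address this, it proposes CodeScientist, a system that frames ideation and experiment construction as a genetic search carried out jointly over combinations of research articles and codeblocks defining common actions in a domain (such as prompting a language model). Using this paradigm, the system ran hundreds of automated experiments in the domain of agents and virtual environments and returned 19 discoveries, 6 of which were judged at least minimally sound and incrementally novel after a multi-faceted evaluation (external review, code review, and replication attempts), suggesting a qualitative shift from benchmark optimization to broader discoveries.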

Link: https://arxiv.org/abs/2503.22708
Authors: Peter Jansen,Oyvind Tafjord,Marissa Radensky,Pao Siangliulue,Tom Hope,Bhavana Dalvi Mishra,Bodhisattwa Prasad Majumder,Daniel S. Weld,Peter Clark
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 98 Pages (13 pages: main paper body; 85 pages: appendix)

Abstract:Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code. In this work we introduce CodeScientist, a novel ASD system that frames ideation and experiment construction as a form of genetic search jointly over combinations of research articles and codeblocks defining common actions in a domain (like prompting a language model). We use this paradigm to conduct hundreds of automated experiments on machine-generated ideas broadly in the domain of agents and virtual environments, with the system returning 19 discoveries, 6 of which were judged as being both at least minimally sound and incrementally novel after a multi-faceted evaluation beyond that typically conducted in prior work, including external (conference-style) review, code review, and replication attempts. Moreover, the discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discoveries.

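[NLP-132] Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models?

[Quick read]: This paper addresses the multi-dimensional challenge of deploying on-device language models (ODLMs) on resource-constrained edge devices, which requires balancing computational efficiency, memory footprint, power consumption, and linguistic capability across heterogeneous tasks. It studies the trade-off between domain-specific optimization and cross-domain robustness and proposes a new architecture, the Generalized Edge Model (GEM), that aims to balance specialization and generalization.

The key component is the Sparse Cross-Attention Router (SCAR), which dynamically allocates computation to a variable number of computing resources while maintaining performance across eight domains (healthcare, law, finance, STEM, commonsense, conversational AI, multilingual, and domain-adaptive tasks). Evaluated on 47 selected benchmarks, GEM reaches a cross-domain F1 of 0.89 at under 100 ms latency on Raspberry Pi 4, Pixel 6, iPhone 13, and custom neural processing units (NPUs). Compared with GPT-4 Lite, GEM improves general-task performance by 7% while matching domain-specific performance. The paper also proposes three new metrics, the Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR), which show a strong correlation between model compression intensity and brittleness.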

Link: https://arxiv.org/abs/2503.22698
Authors: Basab Jha,Firoj Paudel
Affiliations: Vedas College, Tribhuwan University; Madan Bhandari Memorial College, Tribhuwan University
Categories: Computation and Language (cs.CL)
Comments: 14 Pages, 5 figures

Abstract:The application of on-device language models (ODLMs) on resource-constrained edge devices is a multi-dimensional problem that strikes a fine balance between computational effectiveness, memory, power usage, and linguistic capacity across heterogeneous tasks. This holistic study conducts a thorough investigation of the trade-offs between domain-specific optimization and cross-domain robustness, culminating in the proposal of the Generalized Edge Model (GEM), a new architecture that aims to balance specialization and generalization in a harmonious manner. With a rigorous experimental approach testing 47 well-chosen benchmarks in eight domains–healthcare, law, finance, STEM, commonsense, conversational AI, multilingual, and domain-adaptive tasks–we show that conventional optimization techniques decrease target task perplexity by 18-25% but result in a precipitous decline in general-task performance with F1 scores decreasing by 12-29%, as reported by Liu et al. GEM employs a Sparse Cross-Attention Router (SCAR) to dynamically allocate computation to a variable number of computing resources with a cross-domain F1 accuracy of 0.89 on less than 100ms latency across Raspberry Pi 4, Pixel 6, iPhone 13, and bespoke custom neural processing units (NPUs). Compared to GPT-4 Lite, GEM enhances the general-task level by 7% with respect and parity in domain-specific performance. We propose three new measurement tools–Domain Specialization Index (DSI), Generalization Gap (GG), and Cross-Domain Transfer Ratio (CDTR)–which show strong correlation between model compression intensity and brittleness.

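[NLP-133] LLMs as Debate Partners: Utilizing Genetic Algorithms and Adversarial Search for Adaptive Arguments

[Quick read]: This paper addresses the limitations of conventional large language models (LLMs) in strategic planning, in particular their lack of adaptability and game-playing ability in debate settings. It proposes DebateBrawl, an AI debate platform that combines LLMs with Genetic Algorithms (GA) and Adversarial Search (AS). By incorporating evolutionary optimization and game-theoretic techniques, DebateBrawl generates coherent, contextually relevant arguments while adapting its strategy in real time. Experiments show balanced outcomes between the AI and human participants, high factual accuracy, and positive user feedback, supporting its use as an educational tool and its potential to improve public discourse. The paper also stresses responsible development and deployment, including robust fact-checking mechanisms and transparent decision-making.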

Link: https://arxiv.org/abs/2412.06229
Authors: Prakash Aryan
Affiliations: Birla Institute of Technology and Science, Pilani - Dubai Campus
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:This paper introduces DebateBrawl, an innovative AI-powered debate platform that integrates Large Language Models (LLMs), Genetic Algorithms (GA), and Adversarial Search (AS) to create an adaptive and engaging debating experience. DebateBrawl addresses the limitations of traditional LLMs in strategic planning by incorporating evolutionary optimization and game-theoretic techniques. The system demonstrates remarkable performance in generating coherent, contextually relevant arguments while adapting its strategy in real-time. Experimental results involving 23 debates show balanced outcomes between AI and human participants, with the AI system achieving an average score of 2.72 compared to the human average of 2.67 out of 10. User feedback indicates significant improvements in debating skills and a highly satisfactory learning experience, with 85% of users reporting improved debating abilities and 78% finding the AI opponent appropriately challenging. The system’s ability to maintain high factual accuracy (92% compared to 78% in human-only debates) while generating diverse arguments addresses critical concerns in AI-assisted discourse. DebateBrawl not only serves as an effective educational tool but also contributes to the broader goal of improving public discourse through AI-assisted argumentation. The paper discusses the ethical implications of AI in persuasive contexts and outlines the measures implemented to ensure responsible development and deployment of the system, including robust fact-checking mechanisms and transparency in decision-making processes.

[NLP-134] Enhancing nonnative speech perception and production through an AI-powered application

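[Quick read]: This paper addresses a gap in existing research, which has focused mainly on the comprehensibility and intelligibility of foreign language pronunciation while largely neglecting the improvement of individual speech sounds in perception and production. It examines how training with an AI-powered mobile application affects nonnative sound perception and production. The study uses Speakometer, an application that incorporates recording tasks featuring English vowels along with pronunciation feedback and practice, and applies a pretest, intervention, and posttest design to measure changes in discrimination accuracy and production of the target contrast. Both improved significantly, although participants did not reach native-like competence; the findings support AI-powered applications as tools for personalized, interactive pronunciation training.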

Link: https://arxiv.org/abs/2503.22705
Authors: Georgios P. Georgiou
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While research on using Artificial Intelligence (AI) through various applications to enhance foreign language pronunciation is expanding, it has primarily focused on aspects such as comprehensibility and intelligibility, largely neglecting the improvement of individual speech sounds in both perception and production. This study seeks to address this gap by examining the impact of training with an AI-powered mobile application on nonnative sound perception and production. Participants completed a pretest assessing their ability to discriminate the second language English heed-hid contrast and produce these vowels in sentence contexts. The intervention involved training with the Speakometer mobile application, which incorporated recording tasks featuring the English vowels, along with pronunciation feedback and practice. The posttest mirrored the pretest to measure changes in performance. The results revealed significant improvements in both discrimination accuracy and production of the target contrast following the intervention. However, participants did not achieve native-like competence. These findings highlight the effectiveness of AI-powered applications in facilitating speech acquisition and support their potential use for personalized, interactive pronunciation training beyond the classroom.

[NLP-135] Bridging Language Models and Financial Analysis

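[Quick read]: This survey addresses the limited practical adoption of large language models (LLMs) in finance. Although LLMs have advanced rapidly in natural language processing and can analyze complex financial data spanning text, numerical tables, and visual charts, adoption has been slow because the finance industry prioritizes cautious integration and long-term validation. The paper reviews recent developments in LLM research and their applicability to the financial sector, highlighting the distinctive capabilities of several novel LLM methodologies, to give researchers and practitioners direction on promising avenues for LLM applications in finance.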

Link: https://arxiv.org/abs/2503.22693
Authors: Alejandro Lopez-Lira,Jihoon Kwon,Sangwoon Yoon,Jy-yong Sohn,Chanyeol Choi
Affiliations: University of Florida; LinqAlpha; Yonsei University
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 28 pages

Abstract:The rapid advancements in Large Language Models (LLMs) have unlocked transformative possibilities in natural language processing, particularly within the financial sector. Financial data is often embedded in intricate relationships across textual content, numerical tables, and visual charts, posing challenges that traditional methods struggle to address effectively. However, the emergence of LLMs offers new pathways for processing and analyzing this multifaceted data with increased efficiency and insight. Despite the fast pace of innovation in LLM research, there remains a significant gap in their practical adoption within the finance industry, where cautious integration and long-term validation are prioritized. This disparity has led to a slower implementation of emerging LLM techniques, despite their immense potential in financial applications. As a result, many of the latest advancements in LLM technology remain underexplored or not fully utilized in this domain. This survey seeks to bridge this gap by providing a comprehensive overview of recent developments in LLM research and examining their applicability to the financial sector. Building on previous survey literature, we highlight several novel LLM methodologies, exploring their distinctive capabilities and their potential relevance to financial data analysis. By synthesizing insights from a broad range of studies, this paper aims to serve as a valuable resource for researchers and practitioners, offering direction on promising research avenues and outlining future opportunities for advancing LLM applications in finance.

[NLP-136] Enhancing Aviation Communication Transcription: Fine-Tuning Distil-Whisper with LoRA

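[Quick read]: This paper addresses the computational cost of adapting automatic speech recognition models to aviation communication transcription. OpenAI's Whisper is a leading automatic speech recognition model, but fine-tuning it directly for this task is not computationally efficient. The paper therefore applies Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, to distil-Whisper, a more computationally efficient version of Whisper. Fine-tuned on the Air Traffic Control Corpus from the Linguistic Data Consortium, the model achieves an average word error rate (WER) of 3.86% across five folds.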

Link: https://arxiv.org/abs/2503.22692
Authors: Shokoufeh Mirzaei,Jesse Arzate,Yukti Vijay
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 4 Figures, 4 Tables, Under review by Journal of Aerospace Information Systems

Abstract:Transcription of aviation communications has several applications, from assisting air traffic controllers in identifying the accuracy of read-back errors to search and rescue operations. Recent advances in artificial intelligence have provided unprecedented opportunities for improving aviation communication transcription tasks. OpenAI’s Whisper is one of the leading automatic speech recognition models. However, fine-tuning Whisper for aviation communication transcription is not computationally efficient. Thus, this paper aims to use a Parameter-Efficient Fine-tuning method called Low-Rank Adaptation to fine-tune a more computationally efficient version of Whisper, distil-Whisper. To perform the fine-tuning, we used the Air Traffic Control Corpus dataset from the Linguistic Data Consortium, which contains approximately 70 hours of controller and pilot transmissions near three major airports in the US. The objective was to reduce the word error rate to enhance accuracy in the transcription of aviation communication. First, starting with an initial set of hyperparameters for LoRA (Alpha = 64 and Rank = 32), we performed a grid search. We applied a 5-fold cross-validation to find the best combination of distil-Whisper hyperparameters. Then, we fine-tuned the model for LoRA hyperparameters, achieving an impressive average word error rate of 3.86% across five folds. This result highlights the model’s potential for use in the cockpit.
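The word error rate reported in the abstract is conventionally the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. A minimal sketch of that standard definition follows; the function name and the sample read-back are made up for illustration and do not come from the paper or the corpus.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical read-back in which the recognizer dropped the word "and".
reference = "climb and maintain flight level three five zero"
hypothesis = "climb maintain flight level three five zero"
print(word_error_rate(reference, hypothesis))  # 0.125
```

On this scale, a 3.86% WER corresponds to roughly one word error per 26 reference words.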

Computer Vision

[CV-0] Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

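[Quick read]: This paper addresses the limited generalization of 4D reconstruction in dynamic scenes, which stems from the limited scale and diversity of available 4D datasets. Conventional methods work around this by fine-tuning static-scene 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depth. The paper instead proposes Easi3R, a training-free method whose key idea is attention adaptation at inference time, with no from-scratch pre-training or network fine-tuning. The authors find that the attention layers in DUSt3R implicitly encode rich information about camera and object motion; by disentangling these attention maps, Easi3R achieves accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Experiments show that this lightweight method significantly outperforms previous state-of-the-art methods that are trained or fine-tuned on extensive dynamic datasets.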

Link: https://arxiv.org/abs/2503.24391
Authors: Xingyu Chen,Yue Chen,Yuliang Xiu,Andreas Geiger,Anpei Chen
Affiliations: Westlake University; Max Planck Institute for Intelligent Systems; University of Tübingen, Tübingen AI Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Page: this https URL Code: this https URL

Abstract:Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets. Our code is publicly available for research purpose at this https URL

[CV-1] SU-YOLO: Spiking Neural Network for Efficient Underwater Object Detection

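[Quick read]: This paper addresses the difficulty of achieving both high accuracy and low power consumption in underwater object detection, given the complex optical environment and the limited resources of underwater equipment. It proposes Spiking Underwater YOLO (SU-YOLO), a Spiking Neural Network (SNN) model. Building on the lightweight, energy-efficient properties of SNNs, it introduces a spike-based underwater image denoising method that relies solely on integer addition, improving feature-map quality with minimal computational overhead. It also introduces Separated Batch Normalization (SeBN), which normalizes feature maps independently across multiple time steps and is designed to integrate with residual structures so as to capture the temporal dynamics of SNNs more effectively. In addition, redesigned spiking residual blocks integrate the Cross Stage Partial Network (CSPNet) with the YOLO architecture to mitigate spike degradation and strengthen feature extraction. On the URPC2019 dataset, SU-YOLO achieves 78.8% mAP with 6.97M parameters and 2.98 mJ energy consumption.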

Link: https://arxiv.org/abs/2503.24389
Authors: Chenyang Li,Wenxuan Liu,Guoqiang Gong,Xiaobo Ding,Xian Zhong
Affiliations: College of Computer and Information Technology, China Three Gorges University; Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology; State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Underwater object detection is critical for oceanic research and industrial safety inspections. However, the complex optical environment and the limited resources of underwater equipment pose significant challenges to achieving high accuracy and low power consumption. To address these issues, we propose Spiking Underwater YOLO (SU-YOLO), a Spiking Neural Network (SNN) model. Leveraging the lightweight and energy-efficient properties of SNNs, SU-YOLO incorporates a novel spike-based underwater image denoising method based solely on integer addition, which enhances the quality of feature maps with minimal computational overhead. In addition, we introduce Separated Batch Normalization (SeBN), a technique that normalizes feature maps independently across multiple time steps and is optimized for integration with residual structures to capture the temporal dynamics of SNNs more effectively. The redesigned spiking residual blocks integrate the Cross Stage Partial Network (CSPNet) with the YOLO architecture to mitigate spike degradation and enhance the model’s feature extraction capabilities. Experimental results on URPC2019 underwater dataset demonstrate that SU-YOLO achieves mAP of 78.8% with 6.97M parameters and an energy consumption of 2.98 mJ, surpassing mainstream SNN models in both detection accuracy and computational efficiency. These results underscore the potential of SNNs for engineering applications. The code is available in this https URL.
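The abstract describes SeBN only as normalizing feature maps independently across time steps. A minimal sketch of that idea for a single channel is shown below; it assumes plain per-time-step batch statistics and omits any learnable scale and shift or residual integration, which the abstract does not specify. The function name and input values are hypothetical.

```python
import math
from typing import List


def separated_batch_norm(x: List[List[float]], eps: float = 1e-5) -> List[List[float]]:
    """Normalize x[t][n] (time step t, sample n) using statistics of each time step alone."""
    normalized = []
    for step in x:
        # Mean and (biased) variance are computed over the batch at this time step only,
        # so statistics are never shared between time steps.
        mean = sum(step) / len(step)
        var = sum((v - mean) ** 2 for v in step) / len(step)
        normalized.append([(v - mean) / math.sqrt(var + eps) for v in step])
    return normalized


# Two time steps with very different activation scales.
activations = [[0.0, 2.0], [10.0, 30.0]]
for step in separated_batch_norm(activations):
    print([round(v, 3) for v in step])
```

With shared statistics, the second time step would dominate the mean and variance; normalizing each step separately brings both steps to zero mean and roughly unit variance.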

[CV-2] Consistent Subject Generation via Contrastive Instantiated Concepts

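[Quick read]: This paper addresses the problem of keeping a subject consistent across multiple creations, a limitation that hinders the use of text-to-image generative models for long content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. The paper proposes Contrastive Concept Instantiation (CoCoIns). Its framework consists of a generative model and a mapping network that transforms input latent codes into pseudo-words associated with particular instances of concepts, so that subjects stay consistent across independent creations. The network is trained with a contrastive learning approach to differentiate combinations of prompts and latent codes, which lets users generate consistent subjects by reusing the same latent code while retaining high flexibility.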

Link: https://arxiv.org/abs/2503.24387
Authors: Lee Hsin-Ying,Kelvin C.K. Chan,Ming-Hsuan Yang
Affiliations: University of California, Merced; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:While text-to-image generative models can synthesize diverse and faithful contents, subject variation across multiple creations limits the application in long content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network, which transforms input latent codes into pseudo-words associated with certain instances of concepts. Users can generate consistent subjects with the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to differentiate the combination of prompts and latent codes. Extensive evaluations of human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining higher flexibility. We also demonstrate the potential of extending CoCoIns to multiple subjects and other object categories.

[CV-3] Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views CVPR2025

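[Quick read]: This paper addresses high-quality 3D reconstruction and novel view synthesis for unbounded 360° scenes from unposed and extremely sparse views. Neural rendering performs well with dense input views and accurate poses but remains challenging in the sparse, unposed setting. The paper proposes a neural rendering framework built on a layered Gaussian-based representation, which models the scene with distinct spatial layers to resolve the spatial ambiguity inherent in unbounded scenes. A dense stereo reconstruction model recovers coarse geometry, and a layer-specific bootstrap optimization then refines noise and fills occluded regions in the reconstruction. The paper further proposes an iterative fusion of reconstruction and generation, together with an uncertainty-aware training approach, to facilitate mutual conditioning and enhancement between the two processes. Experiments show that the method outperforms existing state-of-the-art methods in rendering quality and surface reconstruction accuracy.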

Link: https://arxiv.org/abs/2503.24382
Authors: Chong Bao,Xiyu Zhang,Zehao Yu,Jiale Shi,Guofeng Zhang,Songyou Peng,Zhaopeng Cui
Affiliations: State Key Lab of CAD&CG, Zhejiang University; ETH Zürich; University of Tübingen, Tübingen AI Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025. Project Page: this https URL

Abstract:Neural rendering has demonstrated remarkable success in high-quality 3D neural reconstruction and novel view synthesis with dense input views and accurate poses. However, applying it to extremely sparse, unposed views in unbounded 360° scenes remains a challenging problem. In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation to effectively model the scene with distinct spatial layers. By employing a dense stereo reconstruction model to recover coarse geometry, we introduce a layer-specific bootstrap optimization to refine the noise and fill occluded regions in the reconstruction. Furthermore, we propose an iterative fusion of reconstruction and generation alongside an uncertainty-aware training approach to facilitate mutual conditioning and enhancement between these two processes. Comprehensive experiments show that our approach outperforms existing state-of-the-art methods in terms of rendering quality and surface reconstruction accuracy. Project page: this https URL

[CV-4] UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving

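[Quick read]: This paper addresses occupancy forecasting (predicting future occupancy from historical information) and current-frame occupancy prediction from camera images. It proposes UniOcc, a unified benchmark that combines data from multiple real-world datasets (nuScenes, Waymo) and high-fidelity driving simulators (CARLA, OpenCOOD), provides 2D/3D occupancy labels with per-voxel flow annotations, and supports cooperative autonomous driving. For evaluation, UniOcc introduces novel metrics that do not depend on ground-truth occupancy, enabling robust assessment of additional aspects of occupancy quality. Extensive experiments on state-of-the-art models show that large-scale, diverse training data and explicit flow information significantly improve occupancy prediction and forecasting performance.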

Link: https://arxiv.org/abs/2503.24381
Authors: Yuping Wang,Xiangyu Huang,Xiaokang Sun,Mingxuan Yan,Shuo Xing,Zhengzhong Tu,Jiachen Li
Affiliations: University of California, Riverside; University of Wisconsin, Madison; Texas A&M University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: 14 pages; Dataset: this https URL Code: this https URL

Abstract:We introduce UniOcc, a comprehensive, unified benchmark for occupancy forecasting (i.e., predicting future occupancies based on historical information) and current-frame occupancy prediction from camera images. UniOcc unifies data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), which provides 2D/3D occupancy labels with per-voxel flow annotations and support for cooperative autonomous driving. In terms of evaluation, unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel metrics that do not depend on ground-truth occupancy, enabling robust assessment of additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance.

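[CV-5] Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

[Quick read]: This paper addresses the bottleneck of accurately interpreting user intent in video generation. The key idea is to decouple the various condition interpretation steps from the video synthesis step. Using modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions that give backbone video generators better guidance. The paper also introduces Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations show significant improvements in controllability and video quality across various aspects of existing video generation models.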

Link: https://arxiv.org/abs/2503.24379
Authors: Shengqiong Wu,Weicai Ye,Jiahao Wang,Quande Liu,Xintao Wang,Pengfei Wan,Di Zhang,Kun Gai,Shuicheng Yan,Hao Fei,Tat-Seng Chua
Affiliations: Kuaishou Technology; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL

Abstract:To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs–text, images, videos, and specialized cues such as region, motion, and camera poses–into dense, structured captions that offer backbone video generators with better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models. Project Page: this https URL

[CV-6] ERUPT: Efficient Rendering with Unposed Patch Transformer CVPR2025

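[Quick Read]: This paper addresses novel view synthesis from small collections of unposed RGB images. It proposes ERUPT (Efficient Rendering with Unposed Patch Transformer), an end-to-end scene reconstruction model. The key innovation is patch-based querying in place of conventional pixel-based querying, which substantially reduces the compute needed to render a target view. ERUPT also uses a learned latent camera pose, so it can be trained on scene datasets with sparse or inaccurate ground-truth poses. The design makes the model efficient in both training and inference, rendering at up to 600 fps on commercial hardware while producing high-quality novel views. Unlike NeRF and Gaussian Splatting, ERUPT can render novel views of arbitrary scenes from as few as five unposed input images, reduces labeled data requirements by about 95%, and cuts computational requirements by an order of magnitude.

Link: https://arxiv.org/abs/2503.24374
Authors: Maxim V. Shugaev,Vincent Chen,Maxim Karrenbach,Kyle Ashley,Bridget Kennedy,Naresh P. Cuntoor
Affiliations: BlueHalo; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025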

Abstract:This work addresses the problem of novel view synthesis in diverse scenes from small collections of RGB images. We propose ERUPT (Efficient Rendering with Unposed Patch Transformer), a state-of-the-art scene reconstruction model capable of efficient scene rendering using unposed imagery. We introduce patch-based querying, in contrast to existing pixel-based queries, to reduce the compute required to render a target view. This makes our model highly efficient both during training and at inference, capable of rendering at 600 fps on commercial hardware. Notably, our model is designed to use a learned latent camera pose, which allows for training using unposed targets in datasets with sparse or inaccurate ground truth camera pose. We show that our approach can generalize on large real-world data and introduce a new benchmark dataset (MSVS-1M) for latent view synthesis using street-view imagery collected from Mapillary. In contrast to NeRF and Gaussian Splatting, which require dense imagery and precise metadata, ERUPT can render novel views of arbitrary scenes with as few as five unposed input images. ERUPT achieves better rendered image quality than current state-of-the-art methods for unposed image synthesis tasks, reduces labeled data requirements by ~95%, and decreases computational requirements by an order of magnitude, providing efficient novel view synthesis for diverse real-world scenes.
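
To make the efficiency argument for patch-based querying concrete, the sketch below counts how many decoder queries one rendered view needs under each scheme. The image size and patch size are illustrative assumptions, not values reported in the paper, and `query_counts` is a hypothetical helper.

```python
def query_counts(height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (pixel_queries, patch_queries) for one rendered view."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    pixel_queries = height * width                        # one query per pixel
    patch_queries = (height // patch) * (width // patch)  # one query per patch
    return pixel_queries, patch_queries


# Assumed example: a 256x256 target view decoded in 8x8 patches.
pixels, patches = query_counts(256, 256, 8)
reduction = pixels // patches  # each patch query stands in for patch*patch pixel queries
```

Under these assumed sizes the number of queries drops by a factor of `patch * patch`; the actual speedup depends on how much work each query costs in the real model.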

[CV-7] Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

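[Quick Read]: This paper addresses two limitations of existing ultrasound image segmentation: methods adapt poorly to new tasks and depend on costly manual annotation, while real-time methods generally fall short of state-of-the-art accuracy. It proposes an adaptive framework built on hierarchical vision foundation models. The key is to extract multi-scale features with Hiera, interleave them with DINOv2 representations to enrich visual expressiveness, and decode the enriched features into precise and robust segmentations. Experiments show that the method outperforms state-of-the-art approaches on multiple datasets and is especially strong under limited supervision with only 1% and 10% of the data. It runs at about 77 FPS on a single GPU, which supports real-time clinical use.

Link: https://arxiv.org/abs/2503.24368
Authors: Xiaoran Zhang,Eric Z. Chen,Lin Zhao,Xiao Chen,Yikang Liu,Boris Maihe,James S. Duncan,Terrence Chen,Shanhui Sun
Affiliations: United Imaging Intelligence; Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: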

Abstract:We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20% on average in the 1% and 10% data settings. Our method achieves ~77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.

[CV-8] StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting

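[Quick Read]: This paper addresses two problems in 3D Gaussian splatting (3DGS): rendering artifacts caused by depth-sorted rendering, and the difficulty of controlling render cost and visual fidelity for a fixed representation. The key innovation is to combine 3DGS with stochastic rasterization, using an unbiased Monte Carlo estimator of the volume rendering equation. This removes the need for depth sorting, allows accurate 3D blending of overlapping Gaussians, and lets the number of Monte Carlo samples trade computation time against quality. The approach eliminates the approximations built into sorted rendering and, at reasonable visual quality, renders more than four times faster than sorted rasterization.

Link: https://arxiv.org/abs/2503.24366
Authors: Shakiba Kheradmand,Delio Vicini,George Kopanas,Dmitry Lagun,Kwang Moo Yi,Mark Matthews,Andrea Tagliasacchi
Affiliations: Google DeepMind; University of British Columbia; Google; Runway ML; Simon Fraser University; University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: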

Abstract:3D Gaussian splatting (3DGS) is a popular radiance field method, with many application-specific extensions. Most variants rely on the same core algorithm: depth-sorting of Gaussian splats then rasterizing in primitive order. This ensures correct alpha compositing, but can cause rendering artifacts due to built-in approximations. Moreover, for a fixed representation, sorted rendering offers little control over render cost and visual fidelity. For example, and counter-intuitively, rendering a lower-resolution image is not necessarily faster. In this work, we address the above limitations by combining 3D Gaussian splatting with stochastic rasterization. Concretely, we leverage an unbiased Monte Carlo estimator of the volume rendering equation. This removes the need for sorting, and allows for accurate 3D blending of overlapping Gaussians. The number of Monte Carlo samples further imbues 3DGS with a way to trade off computation time and quality. We implement our method using OpenGL shaders, enabling efficient rendering on modern GPU hardware. At a reasonable visual quality, our method renders more than four times faster than sorted rasterization.
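
The idea that sorting can be replaced by an unbiased Monte Carlo estimate can be illustrated on a single pixel. The sketch below is a generic stochastic-transparency example, not the authors' OpenGL implementation: each splat is a hypothetical `(depth, alpha, value)` tuple with a grayscale value, each sample keeps a splat with probability `alpha`, and a plain depth test picks the nearest survivor. Averaged over samples, the result converges to sorted alpha compositing.

```python
import random


def sorted_composite(splats, background=0.0):
    """Reference: front-to-back alpha compositing over depth-sorted splats."""
    color, transmittance = 0.0, 1.0
    for _depth, alpha, value in sorted(splats):
        color += transmittance * alpha * value
        transmittance *= 1.0 - alpha
    return color + transmittance * background


def stochastic_composite(splats, n_samples, rng, background=0.0):
    """Monte Carlo estimate: no sorting, only a per-sample nearest-depth test."""
    total = 0.0
    for _ in range(n_samples):
        best_depth, best_value = float("inf"), background
        for depth, alpha, value in splats:  # splats may arrive in any order
            # Keep this splat with probability alpha, then apply the depth test.
            if rng.random() < alpha and depth < best_depth:
                best_depth, best_value = depth, value
        total += best_value
    return total / n_samples


# Assumed toy pixel covered by three overlapping splats, given unsorted.
splats = [(2.0, 0.5, 1.0), (1.0, 0.25, 0.0), (3.0, 0.8, 0.5)]
reference = sorted_composite(splats)
estimate = stochastic_composite(splats, 20000, random.Random(0))
```

Raising `n_samples` lowers the noise of `estimate` at higher cost, which mirrors the computation/quality trade-off the abstract describes.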

[CV-9] InstructRestore: Region-Customized Image Restoration with Human Instructions

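[Quick Read]: This paper addresses the inability of existing diffusion-prior-based image restoration methods to perform region-customized restoration according to user instructions. The key is a new framework, InstructRestore. The authors build a data generation engine that produces training triplets, each consisting of a high-quality image, a target region description, and the corresponding region mask, and use it to construct a dataset of 536,945 triplets for training and evaluation. They then study how to integrate low-quality image features under the ControlNet architecture to adjust the degree of detail enhancement, and develop a ControlNet-like model that identifies the target region and assigns different integration scales to the target and its surroundings, yielding region-customized restoration aligned with user instructions. Experiments confirm the method's effectiveness in scenarios such as images with bokeh effects and local enhancement.

Link: https://arxiv.org/abs/2503.24357
Authors: Shuaizheng Liu,Jianqi Ma,Lingchen Sun,Xiangtao Kong,Lei Zhang
Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: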

Abstract:Despite the significant progress in diffusion prior-based image restoration, most existing methods apply uniform processing to the entire image, lacking the capability to perform region-customized image restoration according to user instructions. In this work, we propose a new framework, namely InstructRestore, to perform region-adjustable image restoration following human instructions. To achieve this, we first develop a data generation engine to produce training triplets, each consisting of a high-quality image, the target region description, and the corresponding region mask. With this engine and careful data screening, we construct a comprehensive dataset comprising 536,945 triplets to support the training and evaluation of this task. We then examine how to integrate the low-quality image features under the ControlNet architecture to adjust the degree of image details enhancement. Consequently, we develop a ControlNet-like model to identify the target region and allocate different integration scales to the target and surrounding regions, enabling region-customized image restoration that aligns with user instructions. Experimental results demonstrate that our proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement. Our work advances the investigation of interactive image restoration and enhancement techniques. Data, code, and models will be found at this https URL.
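
One way to picture "different integration scales for the target and surrounding regions" is a per-pixel scale map built from the region mask. The sketch below is only an assumption about how such a map could weight a control-branch feature; the function names and the additive form are hypothetical and are not taken from the paper.

```python
def blend_scales(mask, target_scale, surround_scale):
    """Per-pixel scale map: target_scale inside the mask, surround_scale outside."""
    return [[m * target_scale + (1.0 - m) * surround_scale for m in row]
            for row in mask]


def apply_control(base, control, scale_map):
    """Add the control-branch feature to the base feature, weighted per pixel."""
    return [[b + s * c for b, c, s in zip(b_row, c_row, s_row)]
            for b_row, c_row, s_row in zip(base, control, scale_map)]


# Assumed 2x2 example: the top-left pixel is the target, 0.5 marks a soft edge.
mask = [[1.0, 0.0], [0.5, 0.0]]
scale_map = blend_scales(mask, target_scale=1.0, surround_scale=0.2)
features = apply_control([[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], scale_map)
```

A soft mask gives a smooth transition between the two scales, which is one plausible way to avoid visible seams at the region boundary.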

[CV-10] PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks

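[Quick Read]: This paper addresses the challenges posed by the complexity and variability of high-resolution pathology images, along with the dependence of existing pathology foundation models on large-scale datasets, storage capacity, and computational resources, and the need to establish their clinical applicability and generalizability. It presents PathOrchestra, a versatile pathology foundation model trained with self-supervised learning on 300K pathology slides from 20 tissue and organ types. The model was rigorously evaluated on 112 clinical tasks, including digital slide preprocessing, pan-cancer classification, and lesion identification, and performs strongly on many of them. Notably, it is the first model to generate structured reports for high-incidence colorectal cancer and diagnostically complex lymphoma. The results demonstrate the feasibility and efficacy of validating a large-scale self-supervised pathology foundation model on clinical-grade tasks and point toward more efficient, higher-quality medical services through clinical integration.

Link: https://arxiv.org/abs/2503.24345
Authors: Fang Yan,Jianfeng Wu,Jiawen Li,Wei Wang,Jiaxuan Lu,Wen Chen,Zizhao Gao,Jianan Li,Hong Yan,Jiabo Ma,Minda Chen,Yang Lu,Qing Chen,Yizhi Wang,Xitong Ling,Xuenian Wang,Zihan Wang,Qiang Huang,Shengyi Hua,Mianxin Liu,Lei Ma,Tian Shen,Xiaofan Zhang,Yonghong He,Hao Chen,Shaoting Zhang,Zhe Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: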

Abstract:The complexity and variability inherent in high-resolution pathological images present significant challenges in computational pathology. While pathology foundation models leveraging AI have catalyzed transformative advancements, their development demands large-scale datasets, considerable storage capacity, and substantial computational resources. Furthermore, ensuring their clinical applicability and generalizability requires rigorous validation across a broad spectrum of clinical tasks. Here, we present PathOrchestra, a versatile pathology foundation model trained via self-supervised learning on a dataset comprising 300K pathological slides from 20 tissue and organ types across multiple centers. The model was rigorously evaluated on 112 clinical tasks using a combination of 61 private and 51 public datasets. These tasks encompass digital slide preprocessing, pan-cancer classification, lesion identification, multi-cancer subtype classification, biomarker assessment, gene expression prediction, and the generation of structured reports. PathOrchestra demonstrated exceptional performance across 27,755 WSIs and 9,415,729 ROIs, achieving over 0.950 accuracy in 47 tasks, including pan-cancer classification across various organs, lymphoma subtype diagnosis, and bladder cancer screening. Notably, it is the first model to generate structured reports for high-incidence colorectal cancer and diagnostically complex lymphoma, areas that are infrequently addressed by foundational models but hold immense clinical potential. Overall, PathOrchestra exemplifies the feasibility and efficacy of a large-scale, self-supervised pathology foundation model, validated across a broad range of clinical-grade tasks. Its high accuracy and reduced reliance on extensive data annotation underline its potential for clinical integration, offering a pathway toward more efficient and high-quality medical services.

[CV-11] Self-Supervised Pretraining for Aerial Road Extraction

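[Quick Read]: This paper tackles the reliance of deep neural networks for aerial image segmentation on large amounts of labeled data; high-quality, precisely annotated aerial datasets are scarce and costly, which limits model performance. It proposes self-supervised pretraining based on inpainting: the model learns to reconstruct missing regions of aerial images and thereby captures their inherent structure before being fine-tuned for road extraction. The approach improves generalization and robustness to domain shift, and works regardless of model architecture or dataset. The key is using the inpainting task to steer the model toward general-purpose features, which significantly improves segmentation accuracy in low-data regimes and offers a scalable solution for aerial image analysis.

Link: https://arxiv.org/abs/2503.24326
Authors: Rupert Polley,Sai Vignesh Abishek Deenadayalan,J. Marius Zöllner
Affiliations: FZI Research Center for Information Technology; Karlsruhe Institute of Technology (KIT)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: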

Abstract:Deep neural networks for aerial image segmentation require large amounts of labeled data, but high-quality aerial datasets with precise annotations are scarce and costly to produce. To address this limitation, we propose a self-supervised pretraining method that improves segmentation performance while reducing reliance on labeled data. Our approach uses inpainting-based pretraining, where the model learns to reconstruct missing regions in aerial images, capturing their inherent structure before being fine-tuned for road extraction. This method improves generalization, enhances robustness to domain shifts, and is invariant to model architecture and dataset choice. Experiments show that our pretraining significantly boosts segmentation accuracy, especially in low-data regimes, making it a scalable solution for aerial image analysis.
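
The inpainting pretext task can be sketched as two steps: hide random blocks of an image, then score a reconstruction on the hidden pixels. The sketch below is a minimal illustration under assumptions; the block size, mask ratio, zero fill, and the choice to compute an L1 loss on masked pixels only are all hypothetical and may differ from the paper's setup.

```python
import random


def mask_patches(image, patch, mask_ratio, rng):
    """Zero out randomly chosen patch x patch blocks; return (masked image, mask)."""
    h, w = len(image), len(image[0])
    blocks = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    hidden = rng.sample(blocks, int(len(blocks) * mask_ratio))
    mask = [[0] * w for _ in range(h)]
    for r0, c0 in hidden:
        for r in range(r0, min(r0 + patch, h)):
            for c in range(c0, min(c0 + patch, w)):
                mask[r][c] = 1
    masked = [[0.0 if mask[r][c] else image[r][c] for c in range(w)]
              for r in range(h)]
    return masked, mask


def masked_l1(prediction, target, mask):
    """Mean absolute error over the masked (hidden) pixels only."""
    errors = [abs(p - t)
              for p_row, t_row, m_row in zip(prediction, target, mask)
              for p, t, m in zip(p_row, t_row, m_row) if m]
    return sum(errors) / len(errors) if errors else 0.0


# Assumed toy input: a 4x4 single-channel image with half of its 2x2 blocks hidden.
image = [[1.0] * 4 for _ in range(4)]
masked, mask = mask_patches(image, patch=2, mask_ratio=0.5, rng=random.Random(0))
loss_before_training = masked_l1(masked, image, mask)  # masked input as a trivial "prediction"
```

In a real pipeline a network would take `masked` as input and its output would replace the trivial prediction; no labels are needed because the original image serves as the target.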

[CV-12] Can Test-Time Scaling Improve World Foundation Model?

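[Quick Read]: This paper addresses the scaling bottleneck of world foundation models (WFMs): pretraining requires substantial computational resources, and post-training is constrained by available data, so enlarging or retraining the model is costly and slow. It proposes SWIFT, a test-time scaling framework whose key is to combine an extensible WFM evaluation toolkit with process-level inference strategies (fast tokenization, probability-based Top-K pruning, and efficient beam search), improving WFM inference in a compute-optimal way without retraining or increasing model size. Experiments show that test-time scaling laws hold for WFMs and that SWIFT is a scalable and effective solution.

Link: https://arxiv.org/abs/2503.24320
Authors: Wenyan Cong,Hanqing Zhu,Peihao Wang,Bangya Liu,Dejia Xu,Kevin Wang,David Z. Pan,Yan Wang,Zhiwen Fan,Zhangyang Wang
Affiliations: UT Austin; UW–Madison; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: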

Abstract:World foundation models, which simulate the physical world by predicting future states from current observations and inputs, have become central to many applications in physical intelligence, including autonomous driving and robotics. However, these models require substantial computational resources for pretraining and are further constrained by available data during post-training. As such, scaling computation at test time emerges as both a critical and practical alternative to traditional model enlargement or re-training. In this work, we introduce SWIFT, a test-time scaling framework tailored for WFMs. SWIFT integrates our extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search. Empirical results on the COSMOS model demonstrate that test-time scaling exists even in a compute-optimal way. Our findings reveal that test-time scaling laws hold for WFMs and that SWIFT provides a scalable and effective pathway for improving WFM inference without retraining or increasing model size. The code is available at this https URL.
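
Top-K pruning and beam search are standard decoding tools, and the sketch below shows how they combine in a generic setting. It is not the SWIFT implementation: `toy_model` is a hypothetical stand-in for a world model's next-token distribution, and the token names and probabilities are made up for illustration.

```python
import math


def beam_search(next_log_probs, start, steps, beam_width, top_k):
    """Generic beam search that expands only each beam's top_k continuations."""
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            options = sorted(next_log_probs(seq).items(),
                             key=lambda kv: kv[1], reverse=True)
            for token, logp in options[:top_k]:  # probability-based top-K pruning
                candidates.append((score + logp, seq + [token]))
        # Keep only the beam_width highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams


def toy_model(seq):
    """Hypothetical next-token log-probabilities that depend on the last token."""
    table = {
        "a": {"a": 0.1, "b": 0.6, "c": 0.3},
        "b": {"a": 0.7, "b": 0.2, "c": 0.1},
        "c": {"a": 0.2, "b": 0.2, "c": 0.6},
    }
    return {tok: math.log(p) for tok, p in table[seq[-1]].items()}


beams = beam_search(toy_model, start="a", steps=2, beam_width=2, top_k=2)
best_score, best_sequence = beams[0]
```

Increasing `beam_width` or `top_k` spends more test-time compute to explore more candidates, which is the knob a test-time scaling study would vary.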

[CV-13] Point Tracking in Surgery–The 2024 Surgical Tattoos in Infrared (STIR) Challenge

[Quick Read]: This paper addresses the understanding of tissue motion during surgery, which is crucial for downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Because these tasks need labeled data to quantify and train algorithms, the paper introduces a point tracking challenge to which participants submit algorithms for quantitative evaluation. The key is evaluating submissions on the Surgical Tattoos in Infrared (STIR) dataset with two quantitative components: accuracy and efficiency. The accuracy component tests algorithms on in vivo and ex vivo sequences, and the efficiency component measures inference latency. The challenge was held as part of MICCAI EndoVis 2024 and aims to move the field toward more accurate and efficient algorithms for surgery.

Link: https://arxiv.org/abs/2503.24306
Authors: Adam Schmidt,Mert Asim Karaoglu,Soham Sinha,Mingang Jang,Ho-Gun Ha,Kyungmin Jung,Kyeongmo Gu,Ihsan Ullah,Hyunki Lee,Jonáš Šerých,Michal Neoral,Jiří Matas,Rulin Zhou,Wenlong He,An Wang,Hongliang Ren,Bruno Silva,Sandro Queirós,Estêvão Lima,João L. Vilaça,Shunsuke Kikuchi,Atsushi Kouno,Hiroki Matsuzaki,Tongtong Li,Yulu Chen,Ling Li,Xiang Ma,Xiaojian Li,Mona Sheikh Zeinoddin,Xu Wang,Zafer Tandogdu,Greg Shaw,Evangelos Mazomenos,Danail Stoyanov,Yuxin Chen,Zijian Wu,Alexander Ladikos,Simon DiMaio,Septimiu E. Salcudean,Omid Mohareri
Affiliations: Intuitive Surgical Inc., USA; ImFusion GmbH, Germany; Technical University of Munich, Germany; NVIDIA, USA; Division of Intelligent Robotics, Daegu Gyeongbuk Institute of Science and Technology (DGIST), South Korea; CMP Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic; Shenzhen Research Institute, China; The Chinese University of Hong Kong, China; Shenzhen University, China; Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho, Portugal; ICVS/3B’s - PT Government Associate Laboratory, Braga/Guimarães, Portugal; 2Ai – School of Technology, IPCA, Barcelos, Portugal; LASI – Associate Laboratory of Intelligent Systems, Guimarães, Portugal; Jmees Inc, Japan; University of California - Los Angeles (UCLA), USA; School of Management, Hefei University of Technology, China; Key Laboratory of Process Optimization and Intelligent Decision-making, Ministry of Education, Hefei, China; Hawkes Institute, University College London, UK; Dept of Urology, University College London Hospitals, UK; University of British Columbia, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding tissue motion in surgery is crucial to enable applications in downstream tasks such as segmentation, 3D reconstruction, virtual tissue landmarking, autonomous probe-based scanning, and subtask autonomy. Labeled data are essential to enabling algorithms in these downstream tasks since they allow us to quantify and train algorithms. This paper introduces a point tracking challenge to address this, wherein participants can submit their algorithms for quantification. The submitted algorithms are evaluated using a dataset named surgical tattoos in infrared (STIR), with the challenge aptly named the STIR Challenge 2024. The STIR Challenge 2024 comprises two quantitative components: accuracy and efficiency. The accuracy component tests the accuracy of algorithms on in vivo and ex vivo sequences. The efficiency component tests the latency of algorithm inference. The challenge was conducted as a part of MICCAI EndoVis 2024. In this challenge, we had 8 total teams, with 4 teams submitting before and 4 submitting after challenge day. This paper details the STIR Challenge 2024, which serves to move the field towards more accurate and efficient algorithms for spatial understanding in surgery. In this paper we summarize the design, submissions, and results from the challenge. The challenge dataset is available here: this https URL, and the code for baseline models and metric calculation is available here: this https URL

[CV-14] Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions

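[Quick Read]: This paper addresses the difficulty of distinguishing nearly symmetric actions (visually similar actions that unfold in opposite temporal order, such as opening versus closing a bottle) when pretrained image models are used for video action recognition. Existing attention-based image-to-video probing mechanisms for image-pretrained models such as DinoV2 and CLIP are permutation-invariant in their temporal modeling, so they produce the same prediction regardless of frame order. The paper proposes Self-attentive Temporal Embedding Probing (STEP), which strengthens self-attentive probing with three modifications: (1) a learnable frame-wise positional encoding that explicitly encodes temporal order; (2) a single global CLS token for sequence coherence; and (3) a simplified attention mechanism for better parameter efficiency. STEP outperforms existing probing mechanisms by 3-15% on four activity recognition benchmarks with only 1/3 of the learnable parameters, and is especially strong at distinguishing nearly symmetric actions.

Link: https://arxiv.org/abs/2503.24298
Authors: Thinesh Thiyakesan Ponbagavathi,Alina Roitberg
Affiliations: Institute for Artificial Intelligence, University of Stuttgart, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: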

Abstract:We study parameter-efficient image-to-video probing for the unaddressed challenge of recognizing nearly symmetric actions: visually similar actions that unfold in opposite temporal order (e.g., opening vs. closing a bottle). Existing probing mechanisms for image-pretrained models, such as DinoV2 and CLIP, rely on attention mechanisms for temporal modeling but are inherently permutation-invariant, leading to identical predictions regardless of frame order. To address this, we introduce Self-attentive Temporal Embedding Probing (STEP), a simple yet effective approach designed to enforce temporal sensitivity in parameter-efficient image-to-video transfer. STEP enhances self-attentive probing with three key modifications: (1) a learnable frame-wise positional encoding, explicitly encoding temporal order; (2) a single global CLS token, for sequence coherence; and (3) a simplified attention mechanism to improve parameter efficiency. STEP outperforms existing image-to-video probing mechanisms by 3-15% across four activity recognition benchmarks with only 1/3 of the learnable parameters. On two datasets, it surpasses all published methods, including fully fine-tuned models. STEP shows a distinct advantage in recognizing nearly symmetric actions, surpassing other probing mechanisms by 9-19% and parameter-heavier PEFT-based transfer methods by 5-15%. Code and models will be made publicly available.
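
The permutation-invariance problem is easy to reproduce with a toy attention pool. The sketch below is an illustration, not the STEP architecture: a single query attends over per-frame feature vectors, and the fixed `positions` values are hypothetical stand-ins for a learned frame-wise positional encoding. Without positions, reversing the frames leaves the pooled feature unchanged; with positions, it changes.

```python
import math


def attention_pool(query, frames):
    """Single-query dot-product attention pooling over per-frame feature vectors."""
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in frames]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames)) / total
            for d in range(dim)]


def add_positions(frames, positions):
    """Add the positional vector for index t to the feature of frame t."""
    return [[f + p for f, p in zip(frame, pos)]
            for frame, pos in zip(frames, positions)]


query = [0.5, -0.2]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # e.g. "opening"
reversed_frames = frames[::-1]                     # e.g. "closing"
positions = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]   # assumed values, not learned

plain_gap = max(abs(a - b) for a, b in zip(
    attention_pool(query, frames), attention_pool(query, reversed_frames)))
positional_gap = max(abs(a - b) for a, b in zip(
    attention_pool(query, add_positions(frames, positions)),
    attention_pool(query, add_positions(reversed_frames, positions))))
```

Here `plain_gap` is zero up to floating-point rounding while `positional_gap` is clearly non-zero, so only the positional variant gives a downstream classifier any signal about frame order.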

[CV-15] Style Quantization for Data-Efficient GAN Training

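[Quick Read]: This paper addresses the difficulty GANs have in exploring and exploiting a sparse input latent space when training data is limited. In such a space, images generated from adjacent latent variables can differ markedly in realism, which leads to suboptimal consistency regularization (CR). The paper proposes SQ-GAN, whose key is a style space quantization scheme that transforms the sparse, continuous input latent space into a compact, structured discrete proxy space in which each element can correspond to a specific real data point, improving CR. Rather than quantizing directly, the input latent variables are first mapped into a less entangled "style" space and then quantized with a learnable codebook, so that each quantized code controls distinct factors of variation. The optimal transport distance is additionally optimized to align the codebook codes with training-data features extracted by a foundation model, embedding external knowledge in the codebook and building a semantically rich vocabulary that describes the training set. Experiments show significant gains in discriminator robustness and generation quality.

Link: https://arxiv.org/abs/2503.24282
Authors: Jian Wang,Xin Lan,Jizhe Zhou,Yuxin Tian,Jiancheng Lv
Affiliations: College of Computer Science, Sichuan University; Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: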

Abstract:Under limited data settings, GANs often struggle to navigate and effectively exploit the input latent space. Consequently, images generated from adjacent variables in a sparse input latent space may exhibit significant discrepancies in realism, leading to suboptimal consistency regularization (CR) outcomes. To address this, we propose SQ-GAN, a novel approach that enhances CR by introducing a style space quantization scheme. This method transforms the sparse, continuous input latent space into a compact, structured discrete proxy space, allowing each element to correspond to a specific real data point, thereby improving CR performance. Instead of direct quantization, we first map the input latent variables into a less entangled "style" space and apply quantization using a learnable codebook. This enables each quantized code to control distinct factors of variation. Additionally, we optimize the optimal transport distance to align the codebook codes with features extracted from the training data by a foundation model, embedding external knowledge into the codebook and establishing a semantically rich vocabulary that properly describes the training dataset. Extensive experiments demonstrate significant improvements in both discriminator robustness and generation quality with our method.
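
Quantizing a style vector with a codebook usually means replacing it with its nearest codebook entry. The sketch below shows that lookup step only, as commonly done in vector quantization; it is an assumption that SQ-GAN uses a Euclidean nearest-neighbor rule, and codebook learning and the optimal-transport alignment are omitted. The codebook values are made up for illustration.

```python
def quantize(style, codebook):
    """Return (index, code) of the codebook entry nearest to the style vector."""
    def squared_distance(code):
        return sum((s - c) ** 2 for s, c in zip(style, code))

    index = min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))
    return index, codebook[index]


# Assumed 2-D style space with three codes; a real codebook would be learned.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
index, code = quantize([0.9, 1.2], codebook)
```

Nearby style vectors snap to the same code, which is what turns the sparse continuous space into a compact discrete one.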

[CV-16] Learning Velocity and Acceleration: Self-Supervised Motion Consistency for Pedestrian Trajectory Prediction

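[Quick Read]: This paper addresses the difficulty of capturing abnormal behaviors in pedestrian trajectory prediction under long-tailed data distributions. Conventional methods rely on supervised learning that directly optimizes the discrepancy between predicted trajectories and ground-truth labels, which amplifies the limitations of long-tailed data. The paper proposes a self-supervised trajectory prediction framework whose key is to model position, velocity, and acceleration explicitly. Velocity and acceleration information enhances position prediction through feature injection and a self-supervised motion consistency mechanism, and a hierarchical injection strategy (velocity features into the position stream, acceleration features into the velocity stream) enables joint prediction of all three quantities. A physics-grounded motion consistency evaluation strategy compares predicted motion trends with historical dynamics, selects the most plausible one, and uses it to guide trajectory generation. The method achieves state-of-the-art results on the ETH-UCY and Stanford Drone datasets.

Link: https://arxiv.org/abs/2503.24272
Authors: Yizhou Huang,Yihua Cheng,Kezhi Wang
Affiliations: Brunel University of London; University of Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: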

Abstract:Understanding human motion is crucial for accurate pedestrian trajectory prediction. Conventional methods typically rely on supervised learning, where ground-truth labels are directly optimized against predicted trajectories. This amplifies the limitations caused by long-tailed data distributions, making it difficult for the model to capture abnormal behaviors. In this work, we propose a self-supervised pedestrian trajectory prediction framework that explicitly models position, velocity, and acceleration. We leverage velocity and acceleration information to enhance position prediction through feature injection and a self-supervised motion consistency mechanism. Our model hierarchically injects velocity features into the position stream. Acceleration features are injected into the velocity stream. This enables the model to predict position, velocity, and acceleration jointly. From the predicted position, we compute corresponding pseudo velocity and acceleration, allowing the model to learn from data-generated pseudo labels and thus achieve self-supervised learning. We further design a motion consistency evaluation strategy grounded in physical principles; it selects the most reasonable predicted motion trend by comparing it with historical dynamics and uses this trend to guide and constrain trajectory generation. We conduct experiments on the ETH-UCY and Stanford Drone datasets, demonstrating that our method achieves state-of-the-art performance on both datasets.
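
Computing pseudo velocity and acceleration from predicted positions amounts to taking finite differences. The sketch below assumes first-order forward differences with a uniform time step `dt`; the paper may use a different discretization, and the function names are hypothetical.

```python
def finite_difference(points, dt):
    """Forward difference between consecutive points, divided by the time step."""
    return [tuple((b - a) / dt for a, b in zip(p, q))
            for p, q in zip(points, points[1:])]


def pseudo_labels(positions, dt):
    """Derive pseudo velocity and acceleration from a predicted trajectory."""
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    return velocity, acceleration


# Assumed 2-D trajectory sampled once per time unit.
positions = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
velocity, acceleration = pseudo_labels(positions, dt=1.0)
```

These derived quantities can then be compared with the model's directly predicted velocity and acceleration, giving a consistency signal that needs no extra annotation.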

[CV-17] Visual Acoustic Fields

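[Quick Read]: This paper addresses how to generate realistic hitting sounds from visual signals and how to localize where an object was hit. The key is a framework called Visual Acoustic Fields, which uses 3D Gaussian Splatting (3DGS) to connect hitting sounds and visual signals in 3D space. It has two core modules. The sound generation module uses a conditional diffusion model that takes multiscale features rendered from a feature-augmented 3DGS and generates realistic hitting sounds. The sound localization module queries the 3D scene represented by the feature-augmented 3DGS to localize the hitting position from a sound. To support the framework, the paper introduces a new pipeline for collecting scene-level visual-sound sample pairs that aligns captured images, impact locations, and the corresponding sounds. To the authors' knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Experiments confirm that the framework generates plausible impact sounds and accurately localizes impact sources.

Link: https://arxiv.org/abs/2503.24270
Authors: Yuelei Li,Hyunjin Kim,Fangneng Zhan,Ri-Zhao Qiu,Mazeyu Ji,Xiaojun Shan,Xueyan Zou,Paul Liang,Hanspeter Pfister,Xiaolong Wang
Affiliations: UC San Diego; Harvard University; MIT
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: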

Abstract:Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at this https URL.

[CV-18] FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics

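[Quick Read]: This paper addresses the challenges that generative AI poses for synthetic image detection, in particular that existing detectors are limited to classification and offer little explanation of image authenticity. The key is FakeScope, an expert large multimodal model (LMM) designed for AI-generated image forensics. FakeScope identifies synthetic images with high accuracy and also provides rich, interpretable, query-driven forensic insights. To this end, the authors construct FakeChain, a dataset of linguistic authenticity reasoning grounded in visual trace evidence, and build on it to present FakeInstruct, described as the largest multimodal instruction tuning dataset, with 2 million visual instructions for strengthening forensic awareness in LMMs. Although trained only on qualitative hard labels, FakeScope shows notable zero-shot quantitative detection ability through a proposed token-based probability estimation strategy. It achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios, with strong generalization and in-the-wild applicability.

Link: https://arxiv.org/abs/2503.24267
Authors: Yixuan Li,Yu Tian,Yipo Huang,Wei Lu,Shiqi Wang,Weisi Lin,Anderson Rocha
Affiliations: College of Computing, City University of Hong Kong; School of Computer Science and Engineering, Ministry of Education Key Laboratory of Information Technology, Guangdong Province Key Laboratory of Information Security Technology, Sun Yat-Sen University; College of Computing and Data Science, Nanyang Technological University; Artificial Intelligence Lab. (Recod.ai), University of Campinas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: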

Abstract:The rapid and unrestrained advancement of generative artificial intelligence (AI) presents a double-edged sword: while enabling unprecedented creativity, it also facilitates the generation of highly convincing deceptive content, undermining societal trust. As image generation techniques become increasingly sophisticated, detecting synthetic images is no longer just a binary task: it necessitates interpretable, context-aware methodologies that enhance trustworthiness and transparency. However, existing detection models primarily focus on classification, offering limited explanatory insights into image authenticity. In this work, we propose FakeScope, an expert large multimodal model (LMM) tailored for AI-generated image forensics, which not only identifies AI-synthetic images with high accuracy but also provides rich, interpretable, and query-driven forensic insights. We first construct the FakeChain dataset, which contains linguistic authenticity reasoning based on visual trace evidence, developed through a novel human-machine collaborative framework. Building upon it, we further present FakeInstruct, the largest multimodal instruction tuning dataset containing 2 million visual instructions tailored to enhance forensic awareness in LMMs. FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios. It can distinguish synthetic images with high accuracy while offering coherent and insightful explanations, free-form discussions on fine-grained forgery attributes, and actionable enhancement strategies. Notably, despite being trained exclusively on qualitative hard labels, FakeScope demonstrates remarkable zero-shot quantitative capability on detection, enabled by our proposed token-based probability estimation strategy. Furthermore, FakeScope exhibits strong generalization and in-the-wild ability, ensuring its applicability in real-world scenarios.
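
The abstract does not spell out the token-based probability estimation strategy. One plausible reading, sketched below purely as an assumption, is to read the model's logits for two answer tokens and renormalize them with a softmax, which turns a hard "real"/"fake" answer into a continuous score. The token names and the `logits` dictionary are hypothetical.

```python
import math


def fake_probability(logits, real_token="real", fake_token="fake"):
    """Softmax restricted to the two answer tokens; returns P(fake)."""
    peak = max(logits[real_token], logits[fake_token])  # for numerical stability
    real = math.exp(logits[real_token] - peak)
    fake = math.exp(logits[fake_token] - peak)
    return fake / (real + fake)


# Assumed logits at the answer position; other vocabulary tokens are ignored.
score = fake_probability({"real": 0.0, "fake": math.log(3.0), "the": 5.0})
```

A score of this kind can be thresholded or ranked, which is what a quantitative detection metric needs even when training used only hard labels.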

[CV-19] Beyond a Single Mode: GAN Ensembles for Diverse Medical Data Generation

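[Quick Read]: This paper addresses the trilemma of high fidelity, diversity, and efficiency that generative AI faces when synthesizing medical imaging data. The key solution is to build GAN ensembles that overcome the limitations of individual GANs, namely mode collapse and insufficient coverage of the real data distribution. Specifically, the study solves a multi-objective optimization problem that balances the fidelity and diversity of synthetic images in order to select a GAN ensemble tailored to medical data. The selection ensures that each model makes a unique contribution with minimal redundancy, and it improves the quality and utility of synthetic medical images, which in turn benefits downstream tasks such as diagnostic modelling.

Link: https://arxiv.org/abs/2503.24258
Authors: Lorenzo Tronchin,Tommy Löfstedt,Paolo Soda,Valerio Guarrasi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: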

Abstract:The advancement of generative AI, particularly in medical imaging, confronts the trilemma of ensuring high fidelity, diversity, and efficiency in synthetic data generation. While Generative Adversarial Networks (GANs) have shown promise across various applications, they still face challenges like mode collapse and insufficient coverage of real data distributions. This work explores the use of GAN ensembles to overcome these limitations, specifically in the context of medical imaging. By solving a multi-objective optimisation problem that balances fidelity and diversity, we propose a method for selecting an optimal ensemble of GANs tailored for medical data. The selected ensemble is capable of generating diverse synthetic medical images that are representative of true data distributions and computationally efficient. Each model in the ensemble brings a unique contribution, ensuring minimal redundancy. We conducted a comprehensive evaluation using three distinct medical datasets, testing 22 different GAN architectures with various loss functions and regularisation techniques. By sampling models at different training epochs, we crafted 110 unique configurations. The results highlight the capability of GAN ensembles to enhance the quality and utility of synthetic medical images, thereby improving the efficacy of downstream tasks such as diagnostic modelling.
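
A common first step in any two-objective selection problem is to discard candidates that are worse on both objectives. The sketch below filters hypothetical GAN checkpoints down to their Pareto front on (fidelity, diversity); it is a generic illustration, not the paper's ensemble selection procedure, and the scores are made up with higher meaning better.

```python
def pareto_front(candidates):
    """Keep candidates not dominated on (fidelity, diversity); higher is better."""
    front = []
    for name, fidelity, diversity in candidates:
        dominated = any(
            f >= fidelity and d >= diversity and (f > fidelity or d > diversity)
            for _, f, d in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Assumed scores for four checkpoints: (name, fidelity, diversity).
candidates = [("A", 0.9, 0.2), ("B", 0.6, 0.7), ("C", 0.5, 0.6), ("D", 0.3, 0.9)]
selected = pareto_front(candidates)
```

Checkpoint C is dropped because B beats it on both scores; the remaining models each offer a different fidelity/diversity trade-off, which is the kind of complementarity an ensemble benefits from.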

[CV-20] Pre-training with 3D Synthetic Data: Learning 3D Point Cloud Instance Segmentation from 3D Synthetic Scenes

【速读】：该论文旨在解决3D点云实例分割任务中因数据标注成本高昂而导致的数据稀缺问题。传统方法需要在大规模三维空间中为每个点分配类别并提供详细标注，这带来了巨大的挑战。为应对这一问题，论文提出了一种基于生成模型的预训练方法，利用3D合成数据来训练3D点云实例分割模型。其关键在于使用生成模型（如Point-E）直接生成高质量的3D点云数据，并将其插入到场景中，从而实现无需人工标注的高效预训练。实验结果表明，该方法相比基线方法显著提升了性能，验证了3D生成模型在3D点云实例分割任务中的有效性。

Link: https://arxiv.org/abs/2503.24229
Authors: Daichi Otsuka, Shinichi Mae, Ryosuke Yamada, Hirokatsu Kataoka
Affiliations: TICO-AIST Cooperative Research Laboratory for Advanced Logistics (ALlab); National Institute of Advanced Industrial Science and Technology (AIST); Visual Geometry Group, University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: In recent years, the research community has witnessed growing use of 3D point cloud data owing to its high applicability in various real-world applications. The 3D point cloud modality makes it possible to account for actual size and spatial understanding. The applied fields include mechanical control of robots, vehicles, or other real-world systems. Along this line, we would like to improve 3D point cloud instance segmentation, which has emerged as a particularly promising approach for these applications. However, the creation of 3D point cloud datasets entails enormous costs compared to 2D image datasets. To train a model of 3D point cloud instance segmentation, it is necessary not only to assign categories but also to provide detailed annotations for each point in the large-scale 3D space. Meanwhile, the recent increase in proposals for generative models in the 3D domain has spurred proposals for using a generative model to create 3D point cloud data. In this work, we propose pre-training with 3D synthetic data to train a 3D point cloud instance segmentation model, based on a generative model for 3D scenes represented by point cloud data. We directly generate 3D point cloud data with Point-E and insert the generated data into a 3D scene. Although more accurate 3D generation models exist as of 2025, even Point-E, an early 3D generative model, can effectively support pre-training with 3D synthetic data. In the experimental section, we compare our pre-training method with baseline methods and observe improved performance, demonstrating the efficacy of 3D generative models for 3D point cloud instance segmentation.
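
The reason inserted objects need no manual annotation is that the instance label of every inserted point is known at insertion time. A minimal sketch of that idea follows; the function name, the translation-only placement, and the toy coordinates are assumptions for illustration, and the object lists merely stand in for the outputs of a generative model such as Point-E.

```python
def insert_objects(scene, objects, offsets):
    """Place generated objects into a scene and label every point.

    scene:   list of (x, y, z) background points, given instance id 0
    objects: list of point lists (stand-ins for generated point clouds)
    offsets: one (dx, dy, dz) translation per object (hypothetical placement)
    """
    points = list(scene)
    instance_ids = [0] * len(scene)
    for obj_id, (obj, (dx, dy, dz)) in enumerate(zip(objects, offsets), start=1):
        for x, y, z in obj:
            points.append((x + dx, y + dy, z + dz))
            instance_ids.append(obj_id)  # the label is known without annotation
    return points, instance_ids


scene = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
chair = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.5)]
lamp = [(0.0, 0.0, 0.0)]

points, ids = insert_objects(scene, [chair, lamp], [(2.0, 2.0, 0.0), (5.0, 1.0, 0.0)])
print(ids)
```

A realistic pipeline would also handle rotation, scale, and collisions with existing geometry, none of which are covered here.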

[CV-21] DiET-GS: Diffusion Prior and Event Stream-Assisted Motion Deblurring 3D Gaussian Splatting CVPR2025

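Quick read: This paper addresses a long-standing problem in computer vision: reconstructing sharp 3D representations from blurry multi-view images. Recent methods use event-based cameras, which offer high dynamic range and microsecond temporal resolution, to improve novel view synthesis from motion-blurred inputs, but they often produce sub-optimal results, either restoring inaccurate colors or losing fine-grained details. The paper presents DiET-GS, a diffusion prior and event stream-assisted motion deblurring method for 3D Gaussian Splatting (3DGS). The framework combines blur-free event streams with a diffusion prior in a two-stage training strategy. Specifically, it constrains 3DGS with the event double integral to obtain both accurate color and well-defined details, and it adds a simple technique that uses the diffusion prior to further enhance edge details. Qualitative and quantitative results on synthetic and real-world data show significantly better novel-view quality than existing baselines. The abstract also points to a project page.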

Link: https://arxiv.org/abs/2503.24210
Authors: Seungjun Lee, Gim Hee Lee
Affiliations: Department of Computer Science, National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: CVPR 2025. Project Page: this https URL

Abstract: Reconstructing sharp 3D representations from blurry multi-view images is a long-standing problem in computer vision. Recent works attempt to enhance high-quality novel view synthesis from the motion blur by leveraging event-based cameras, benefiting from high dynamic range and microsecond temporal resolution. However, they often reach sub-optimal visual quality in either restoring inaccurate color or losing fine-grained details. In this paper, we present DiET-GS, a diffusion prior and event stream-assisted motion deblurring 3DGS. Our framework effectively leverages both blur-free event streams and diffusion prior in a two-stage training strategy. Specifically, we introduce a novel framework to constrain 3DGS with event double integral, achieving both accurate color and well-defined details. Additionally, we propose a simple technique to leverage diffusion prior to further enhance the edge details. Qualitative and quantitative results on both synthetic and real-world data demonstrate that our DiET-GS is capable of producing significantly better quality of novel views compared to the existing baselines. Our project page is this https URL
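
For background, the event double integral (EDI) model from the event-camera deblurring literature relates a blurry frame to a latent sharp frame and the events recorded during the exposure. The relation below is the commonly used EDI form rather than a formula quoted from this abstract. The symbols are assumptions: B is the blurry image, L(f) the latent sharp image at reference time f, T the exposure time, c the event contrast threshold, and e(s) the event stream. How DiET-GS turns this relation into a constraint on 3DGS is not specified in this listing.

```latex
B = \frac{L(f)}{T} \int_{f - T/2}^{f + T/2} \exp\bigl(c\,E(t)\bigr)\,dt,
\qquad
E(t) = \int_{f}^{t} e(s)\,ds
```

Read this way, a rendered sharp view combined with the measured events predicts what the blurry input should look like, which is one plausible source of a training signal.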

[CV-22] CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization

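Quick read: CLIP (Contrastive Language-Image Pretraining) generalizes strongly on cross-modal tasks, but the theory behind that generalization remains unclear. To fill this gap, the paper proposes the Cross-modal Information Bottleneck (CIB) framework, which interprets CLIP's contrastive learning objective as an implicit Information Bottleneck (IB) optimization: the model maximizes information shared across modalities while discarding modality-specific redundancy, preserving the essential semantic alignment. Building on this view, the paper introduces Cross-modal Information Bottleneck Regularization (CIBR), which enforces the IB principle explicitly during training by adding a penalty term that discourages modality-specific redundancy, thereby strengthening the semantic alignment between image and text features. On vision-language benchmarks, including zero-shot classification on seven image datasets and text-image retrieval on MSCOCO and Flickr30K, CIBR yields consistent gains over standard CLIP. The authors describe this as the first theoretical account of CLIP's generalization through the IB lens.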

Link: https://arxiv.org/abs/2503.24182
Authors: Yingrui Ji, Xi Xiao, Gaofei Chen, Hao Xu, Chenrui Ma, Lijing Zhu, Aokun Liang, Jiansheng Chen
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success in cross-modal tasks such as zero-shot image classification and text-image retrieval by effectively aligning visual and textual representations. However, the theoretical foundations underlying CLIP’s strong generalization remain unclear. In this work, we address this gap by proposing the Cross-modal Information Bottleneck (CIB) framework. CIB offers a principled interpretation of CLIP’s contrastive learning objective as an implicit Information Bottleneck optimization. Under this view, the model maximizes shared cross-modal information while discarding modality-specific redundancies, thereby preserving essential semantic alignment across modalities. Building on this insight, we introduce a Cross-modal Information Bottleneck Regularization (CIBR) method that explicitly enforces these IB principles during training. CIBR introduces a penalty term to discourage modality-specific redundancy, thereby enhancing semantic alignment between image and text features. We validate CIBR on extensive vision-language benchmarks, including zero-shot classification across seven diverse image datasets and text-image retrieval on MSCOCO and Flickr30K. The results show consistent performance gains over standard CLIP. These findings provide the first theoretical understanding of CLIP’s generalization through the IB lens. They also demonstrate practical improvements, offering guidance for future cross-modal representation learning.
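
To make the ingredients concrete, the block below writes out the generic Information Bottleneck Lagrangian and the symmetric contrastive loss used by CLIP, followed by the general shape a regularized objective could take. The first two are standard forms. The last line is only a schematic: R stands in for the paper's penalty on modality-specific redundancy, whose exact definition is not given in the abstract, and the weight \lambda is an assumed hyperparameter.

```latex
% Generic IB Lagrangian for input X, target Y, and representation Z
\min_{p(z \mid x)} \; I(X; Z) - \beta\, I(Z; Y)

% Symmetric contrastive (InfoNCE) loss over a batch of N image-text pairs,
% with s_{ij} the similarity of image i and text j and \tau a temperature
\mathcal{L}_{\mathrm{CLIP}} = -\frac{1}{2N} \sum_{i=1}^{N}
  \left[ \log \frac{e^{s_{ii}/\tau}}{\sum_{j=1}^{N} e^{s_{ij}/\tau}}
       + \log \frac{e^{s_{ii}/\tau}}{\sum_{j=1}^{N} e^{s_{ji}/\tau}} \right]

% Schematic regularized objective; R is a placeholder, not the paper's definition
\mathcal{L}_{\mathrm{CIBR}} = \mathcal{L}_{\mathrm{CLIP}} + \lambda\, R
```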

[CV-23] Navi-plus: Managing Ambiguous GUI Navigation Tasks with Follow-up

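Quick read: Users often omit key information when describing a task, which degrades agent performance in the current agent paradigm because it does not support immediate user intervention. To address this, the paper introduces the Self-Correction GUI Navigation task, which adds interactive information completion to GUI agents. The authors build the Navi-plus dataset of GUI follow-up question-answer pairs and propose a Dual-Stream Trajectory Evaluation method to benchmark the new capability. Results show that agents able to ask GUI follow-up questions can fully recover their performance on ambiguous user tasks.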

Link: https://arxiv.org/abs/2503.24180
Authors: Ziming Cheng, Zhiyuan Huang, Junting Pan, Zhaohui Hou, Mingjie Zhan
Affiliations: Beijing University of Posts and Telecommunications; SenseTime Research; MMLab, CUHK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract: Graphical user interfaces (GUI) automation agents are emerging as powerful tools, enabling humans to accomplish increasingly complex tasks on smart devices. However, users often inadvertently omit key information when conveying tasks, which hinders agent performance in the current agent paradigm that does not support immediate user intervention. To address this issue, we introduce a Self-Correction GUI Navigation task that incorporates interactive information completion capabilities within GUI agents. We developed the Navi-plus dataset with GUI follow-up question-answer pairs, alongside a Dual-Stream Trajectory Evaluation method to benchmark this new capability. Our results show that agents equipped with the ability to ask GUI follow-up questions can fully recover their performance when faced with ambiguous user tasks.

[CV-24] Foundation Models For Seismic Data Processing: An Extensive Review

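Quick read: This paper examines foundation models for seismic data processing on three tasks: demultiple, interpolation, and denoising. It evaluates how model characteristics such as pre-training technique and neural network architecture affect performance and efficiency. Rather than proposing a single seismic foundation model, it critically examines existing natural-image foundation models and suggests promising candidates for future exploration.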

Link: https://arxiv.org/abs/2503.24166
Authors: Fabian Fuchs, Mario Ruben Fernandez, Norman Ettrich, Janis Keuper
Affiliations: Fraunhofer-Institut für Techno- und Wirtschaftsmathematik; DWS, University of Mannheim; IMLA, Offenburg University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Seismic processing plays a crucial role in transforming raw data into high-quality subsurface images, pivotal for various geoscience applications. Despite its importance, traditional seismic processing techniques face challenges such as noisy and damaged data and the reliance on manual, time-consuming workflows. The emergence of deep learning approaches has introduced effective and user-friendly alternatives, yet many of these deep learning approaches rely on synthetic datasets and specialized neural networks. Recently, foundation models have gained traction in the seismic domain, due to their success in natural imaging. This paper investigates the application of foundation models in seismic processing on three tasks: demultiple, interpolation, and denoising. It evaluates the impact of different model characteristics, such as pre-training technique and neural network architecture, on performance and efficiency. Rather than proposing a single seismic foundation model, this paper critically examines various natural image foundation models and suggests some promising candidates for future exploration.

[CV-25] A Comparative Study of Scanpath Models in Graph-Based Visualization

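Quick read: Optimizing interface design for Information Visualization (InfoVis) systems requires understanding how users allocate visual attention. Because collecting eye-tracking (ET) data is costly, raises privacy concerns, and scales poorly, the paper studies computational models that generate synthetic scanpaths as an alternative. The authors ran an ET experiment in which 40 participants analyzed graphs while answering questions of varying complexity in a digital forensics context, then compared the human scanpaths with synthetic ones from models such as DeepGaze, UMSS, and Gazeformer. The study evaluates the accuracy of these models and how question complexity and the number of nodes affect their performance.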

Link: https://arxiv.org/abs/2503.24160
Authors: Angela Lopez-Cardona, Parvin Emami, Sebastian Idesis, Saravanakumar Duraisamy, Luis A. Leiva, Ioannis Arapakis
Affiliations: Telefónica Scientific Research; Universitat Politècnica de Catalunya; University of Luxembourg
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Information Visualization (InfoVis) systems utilize visual representations to enhance data interpretation. Understanding how visual attention is allocated is essential for optimizing interface design. However, collecting Eye-tracking (ET) data presents challenges related to cost, privacy, and scalability. Computational models provide alternatives for predicting gaze patterns, thereby advancing InfoVis research. In our study, we conducted an ET experiment with 40 participants who analyzed graphs while responding to questions of varying complexity within the context of digital forensics. We compared human scanpaths with synthetic ones generated by models such as DeepGaze, UMSS, and Gazeformer. Our research evaluates the accuracy of these models and examines how question complexity and number of nodes influence performance. This work contributes to the development of predictive modeling in visual analytics, offering insights that can enhance the design and effectiveness of InfoVis systems.

[CV-26] PixelCAM: Pixel Class Activation Mapping for Histology Image Classification and ROI Localization

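Quick read: This paper addresses limitations of weakly supervised object localization (WSOL) methods on histology images. Single-step methods are prone to under- or over-activation and suffer from asynchronous convergence between classification and localization; two-step methods are tied to a frozen classifier, which limits localization capacity; and both struggle on out-of-distribution (OOD) datasets. The paper proposes a multi-task approach that trains classification and localization simultaneously in the pixel-feature space of a shared image encoder, which addresses asynchronous convergence. Its core component, PixelCAM, is a cost-effective foreground/background pixel-wise classifier in that pixel-feature space. PixelCAM is trained on pixel pseudo-labels collected from a pretrained WSOL model, jointly with the image-level classifier using standard gradient descent. The pixel classifier supports accurate delineation of foreground and background regions and can be integrated into CNN- and transformer-based architectures without modification.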

Link: https://arxiv.org/abs/2503.24135
Authors: Alexis Guichemerre, Soufiane Belharbi, Mohammadhadi Shateri, Luke McCaffrey, Eric Granger
Affiliations: LIVIA, ILLS, Dept. of Systems Engineering, ETS Montreal, Canada; Goodman Cancer Research Centre, Dept. of Oncology, McGill University, Canada
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 20 figures, Medical Imaging with Deep Learning (MIDL 2025)

Abstract:Weakly supervised object localization (WSOL) methods allow training models to classify images and localize ROIs. WSOL only requires low-cost image-class annotations yet provides a visually interpretable classifier, which is important in histology image analysis. Standard WSOL methods rely on class activation mapping (CAM) methods to produce spatial localization maps according to a single- or two-step strategy. While both strategies have made significant progress, they still face several limitations with histology images. Single-step methods can easily result in under- or over-activation due to the limited visual ROI saliency in histology images and the limited localization cues. They also face the well-known issue of asynchronous convergence between classification and localization tasks. The two-step approach is sub-optimal because it is tied to a frozen classifier, limiting the capacity for localization. Moreover, these methods also struggle when applied to out-of-distribution (OOD) datasets. In this paper, a multi-task approach for WSOL is introduced for simultaneous training of both tasks to address the asynchronous convergence problem. In particular, localization is performed in the pixel-feature space of an image encoder that is shared with classification. This allows learning discriminant features and accurate delineation of foreground/background regions to support ROI localization and image classification. We propose PixelCAM, a cost-effective foreground/background pixel-wise classifier in the pixel-feature space that allows for spatial object localization. PixelCAM is trained using pixel pseudo-labels collected from a pretrained WSOL model. Both image and pixel-wise classifiers are trained simultaneously using standard gradient descent. In addition, our pixel classifier can easily be integrated into CNN- and transformer-based architectures without any modifications.

[CV-27] It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data CVPR2025 ATC

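Quick read: This paper asks whether vision and language embeddings can be matched in a fully unsupervised, or "blind", fashion, that is, without parallel data. The motivation is the platonic representation hypothesis: as model and dataset sizes grow, vision and language embeddings become more homogeneous, and pairwise distances within each modality become more similar. The paper presents a first feasibility study of whether existing foundation models conform well enough to support such matching.

The solution has two parts. First, unsupervised matching is formulated as a quadratic assignment problem (QAP), and the authors introduce a novel heuristic that outperforms previous solvers, along with a technique for finding optimal matching problems, for which a non-trivial match is very likely. Second, an extensive study across a range of vision and language models on four datasets shows that, for many problem instances, vision and language representations can indeed be matched without supervision. This opens the possibility of embedding semantic knowledge into other modalities virtually annotation-free; as a proof of concept, the authors showcase an unsupervised classifier that achieves non-trivial accuracy without any image-text annotation.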

Link: https://arxiv.org/abs/2503.24129
Authors: Dominik Schnaus, Nikita Araslanov, Daniel Cremers
Affiliations: TU Munich; Munich Center for Machine Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025, Project page: this https URL

Abstract:The platonic representation hypothesis suggests that vision and language embeddings become more homogeneous as model and dataset sizes increase. In particular, pairwise distances within each modality become more similar. This suggests that as foundation models mature, it may become possible to match vision and language embeddings in a fully unsupervised fashion, i.e. without parallel data. We present the first feasibility study, and investigate conformity of existing vision and language foundation models in the context of unsupervised, or “blind”, matching. First, we formulate unsupervised matching as a quadratic assignment problem and introduce a novel heuristic that outperforms previous solvers. We also develop a technique to find optimal matching problems, for which a non-trivial match is very likely. Second, we conduct an extensive study deploying a range of vision and language models on four datasets. Our analysis reveals that for many problem instances, vision and language representations can be indeed matched without supervision. This finding opens up the exciting possibility of embedding semantic knowledge into other modalities virtually annotation-free. As a proof of concept, we showcase an unsupervised classifier, which achieves non-trivial classification accuracy without any image-text annotation.
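
The quadratic assignment formulation can be made concrete on a toy instance. The sketch below matches two sets of items using only their within-set pairwise distances, by brute force over all permutations. It is not the heuristic introduced in the paper, which the abstract does not describe; the function names and the distance values are made up, and brute force is only feasible for a handful of items.

```python
from itertools import permutations


def matching_cost(dist_a, dist_b, perm):
    """QAP-style cost: squared mismatch of pairwise distances under a matching.

    perm[i] is the index in set B assigned to item i of set A.
    """
    n = len(perm)
    return sum(
        (dist_a[i][j] - dist_b[perm[i]][perm[j]]) ** 2
        for i in range(n)
        for j in range(n)
    )


def blind_match(dist_a, dist_b):
    """Exhaustively search for the permutation with the lowest cost."""
    n = len(dist_a)
    return min(permutations(range(n)), key=lambda p: matching_cost(dist_a, dist_b, p))


# Toy pairwise distances within modality A (for example, image embeddings).
dist_a = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]

# Modality B holds the same geometry in shuffled order: item i of A
# corresponds to item true_perm[i] of B. No cross-modal pairs are given.
true_perm = (2, 0, 3, 1)
dist_b = [[0] * 4 for _ in range(4)]
for i in range(4):
    for j in range(4):
        dist_b[true_perm[i]][true_perm[j]] = dist_a[i][j]

recovered = blind_match(dist_a, dist_b)
print(recovered)
```

Because every off-diagonal distance in the toy matrix is distinct, only the true correspondence reaches zero cost. Real embeddings from different models agree only approximately, which is why the choice of solver and of problem instance matters.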

[CV-28] IMPACT: A Generic Semantic Loss for Multimodal Medical Image Registration

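Quick read: This paper addresses multimodal medical image registration, where precise alignment of anatomical structures supports diagnosis, treatment planning, image-guided treatment, and longitudinal monitoring. It introduces IMPACT (Image Metric with Pretrained model-Agnostic Comparison for Transmodality registration), a generic semantic similarity metric designed to plug into registration frameworks such as Elastix and Voxelmorph. IMPACT compares deep features extracted from medical images by large-scale pretrained segmentation models (TotalSegmentator), with no task-specific training, and it can also integrate the Segment Anything Model (SAM) and other large segmentation networks. On five challenging registration tasks involving thoracic CT/CBCT and pelvic MR/CT datasets, IMPACT significantly improves Target Registration Error and Dice Similarity Coefficient over baseline methods, and qualitative analyses confirm greater robustness to noise, artifacts, and modality variations.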

Link: https://arxiv.org/abs/2503.24121
Authors: Valentin Boussot, Cédric Hémon, Jean-Claude Nunes, Jason Downling, Simon Rouzé, Caroline Lafond, Anaïs Barateau, Jean-Louis Dillenseger
Affiliations: Univ. Rennes, CLCC Eugene Marquis, INSERM, LTSI - UMR 1099, F-35000 Rennes, France; CSIRO Australian e-Health Research Centre, Herston, Queensland, Australia; CHU Rennes, Department of Cardio-Thoracic and Vascular Surgery, F-35000 Rennes, France
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). This is a preprint version and has not been peer-reviewed

Abstract:Image registration is fundamental in medical imaging, enabling precise alignment of anatomical structures for diagnosis, treatment planning, image-guided treatment or longitudinal monitoring. This work introduces IMPACT (Image Metric with Pretrained model-Agnostic Comparison for Transmodality registration), a generic semantic similarity metric designed for seamless integration into diverse image registration frameworks (such as Elastix and Voxelmorph). It compares deep learning-based features extracted from medical images without requiring task-specific training, ensuring broad applicability across various modalities. By leveraging the features of the large-scale pretrained TotalSegmentator models and the ability to integrate Segment Anything Model (SAM) and other large-scale segmentation networks, this approach offers significant advantages. It provides robust, scalable, and efficient solutions for multimodal image registration. The IMPACT loss was evaluated on five challenging registration tasks involving thoracic CT/CBCT, and pelvic MR/CT datasets. Quantitative metrics, such as Target Registration Error and Dice Similarity Coefficient, demonstrated significant improvements in anatomical alignment compared to baseline methods. Qualitative analyses further confirmed the increased robustness of the proposed metric in the face of noise, artifacts, and modality variations. IMPACT’s versatility and efficiency make it a valuable tool for advancing registration performance in clinical and research applications, addressing critical challenges in multimodal medical imaging.
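
The two quantitative metrics named in the abstract have simple standard definitions, sketched below on toy data. The representation of masks as sets of voxel indices and of landmarks as coordinate tuples is an assumption made for brevity, as are the function names; the values are made up and unrelated to the paper's results.

```python
import math


def dice_coefficient(mask_a, mask_b):
    """Dice Similarity Coefficient between two masks given as voxel-index sets."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks are treated as a perfect match
    return 2 * len(a & b) / (len(a) + len(b))


def target_registration_error(fixed_landmarks, warped_landmarks):
    """Mean Euclidean distance between corresponding landmarks."""
    distances = [math.dist(p, q) for p, q in zip(fixed_landmarks, warped_landmarks)]
    return sum(distances) / len(distances)


# Toy segmentation masks after registration (voxel indices).
organ_fixed = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
organ_warped = {(0, 0, 1), (0, 1, 0), (1, 1, 1)}
dice = dice_coefficient(organ_fixed, organ_warped)

# Toy landmark pairs in millimetres.
fixed = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
warped = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
tre = target_registration_error(fixed, warped)

print(dice, tre)
```

Higher Dice and lower TRE both indicate better anatomical alignment, so the two metrics are usually reported together.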

[CV-29] PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis

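Quick read: This paper addresses joint polyp detection, segmentation, classification, and unsupervised tracking in colonoscopy videos. Many existing methods require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. The paper introduces PolypSegTrack, a foundation model with a novel conditional mask loss that allows flexible training across datasets annotated with either pixel-level segmentation masks or bounding boxes, which removes the need for task-specific fine-tuning. Its unsupervised tracking module associates polyp instances across frames using object queries, without heuristics, and its vision foundation model backbone is pre-trained without supervision on natural images, so no domain-specific pre-training is needed. Experiments on multiple polyp benchmarks show that the method significantly outperforms state-of-the-art approaches in detection, segmentation, classification, and tracking.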

Link: https://arxiv.org/abs/2503.24108
Authors: Anwesa Choudhuri, Zhongpai Gao, Meng Zheng, Benjamin Planche, Terrence Chen, Ziyan Wu
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Early detection, accurate segmentation, classification and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos either require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss, enabling flexible training across datasets with either pixel-level segmentation masks or bounding box annotations, allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We leverage a robust vision foundation model backbone that is pre-trained unsupervisedly on natural images, thereby removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.

[CV-30] DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description

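Quick read: This paper targets coherent long-term visual storytelling in audio description for long-form video. Existing methods rely solely on frame-level embeddings, which describe object-based content well but lack contextual information across scenes. The paper introduces DANTE-AD, a video description model with a dual-vision Transformer-based architecture that sequentially fuses frame-level and scene-level embeddings to improve long-term contextual understanding, and proposes a sequential cross-attention method that provides contextual grounding for fine-grained audio description generation.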

Link: https://arxiv.org/abs/2503.24096
Authors: Adrienne Deganutti, Simon Hadfield, Andrew Gilbert
Affiliations: University of Surrey, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Audio Description is a narrated commentary designed to aid vision-impaired audiences in perceiving key visual elements in a video. While short-form video understanding has advanced rapidly, a solution for maintaining coherent long-term visual storytelling remains unresolved. Existing methods rely solely on frame-level embeddings, effectively describing object-based content but lacking contextual information across scenes. We introduce DANTE-AD, an enhanced video description model leveraging a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses both frame and scene level embeddings to improve long-term contextual understanding. We propose a novel, state-of-the-art method for sequential cross-attention to achieve contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods across traditional NLP metrics and LLM-based evaluations.

[CV-31] 4D mmWave Radar in Adverse Environments for Autonomous Driving: A Survey

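Quick read: Adverse environments such as rain, snow, and fog can significantly degrade the performance of LiDAR and cameras in autonomous driving perception. This survey reviews research on 4D millimeter-wave (mmWave) radar under adverse environments, a sensor that provides 3D sensing plus velocity measurements and remains robust in challenging conditions. It gives an overview of existing 4D mmWave radar datasets covering diverse weather and lighting scenarios, analyzes methods and models by adverse condition, and discusses open challenges and future directions.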

Link: https://arxiv.org/abs/2503.24091
Authors: Xiangyuan Peng, Miao Tang, Huawei Sun, Lorenzo Servadei, Robert Wille
Affiliations: Technical University of Munich; Infineon Technologies AG; China University of Geosciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:Autonomous driving systems require accurate and reliable perception. However, adverse environments, such as rain, snow, and fog, can significantly degrade the performance of LiDAR and cameras. In contrast, 4D millimeter-wave (mmWave) radar not only provides 3D sensing and additional velocity measurements but also maintains robustness in challenging conditions, making it increasingly valuable for autonomous driving. Recently, research on 4D mmWave radar under adverse environments has been growing, but a comprehensive survey is still lacking. To bridge this gap, this survey comprehensively reviews the current research on 4D mmWave radar under adverse environments. First, we present an overview of existing 4D mmWave radar datasets encompassing diverse weather and lighting scenarios. Next, we analyze methods and models according to different adverse conditions. Finally, the challenges faced in current studies and potential future directions are discussed for advancing 4D mmWave radar applications in harsh environments. To the best of our knowledge, this is the first survey specifically focusing on 4D mmWave radar in adverse environments for autonomous driving.

[CV-32] A Plasticity-Aware Method for Continual Self-Supervised Learning in Remote Sensing

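Quick read: Existing continual self-supervised learning (CSSL) methods prevent catastrophic forgetting mainly through regularization strategies that retain knowledge of previous tasks. This reduces learning plasticity, the model's ability to adapt to the data of new tasks, and can degrade performance. The paper proposes a CSSL method built on a knowledge distillation strategy with an integrated decoupling mechanism. The mechanism first divides the feature dimensions into task-common and task-specific parts; the task-common features are then forced to be correlated to ensure memory stability, while the task-specific features are forced to be de-correlated to facilitate learning new features. Compared with CaSSLe, a widely used CSSL framework, the method shows improvements of up to 1.12% in average accuracy and 2.33% in intransigence in a task-incremental scenario, and 1.24% in average accuracy and 2.01% in intransigence in a class-incremental scenario.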

Link: https://arxiv.org/abs/2503.24088
Authors: Lars Möllenbrok, Behnood Rasti, Begüm Demir
Affiliations: Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IEEE International Geoscience and Remote Sensing Symposium 2025

Abstract: Continual self-supervised learning (CSSL) methods have gained increasing attention in remote sensing (RS) due to their capability to learn new tasks sequentially from continuous streams of unlabeled data. Existing CSSL methods, while learning new tasks, focus on preventing catastrophic forgetting. To this end, most of them use regularization strategies to retain knowledge of previous tasks. This reduces the model’s ability to adapt to the data of new tasks (i.e., learning plasticity), which can degrade performance. To address this problem, in this paper, we propose a novel CSSL method that aims to learn tasks sequentially, while achieving high learning plasticity. To this end, the proposed method uses a knowledge distillation strategy with an integrated decoupling mechanism. The decoupling is achieved by first dividing the feature dimensions into task-common and task-specific parts. Then, the task-common features are forced to be correlated to ensure memory stability while the task-specific features are forced to be de-correlated facilitating the learning of new features. Experimental results show the effectiveness of the proposed method compared to CaSSLe, which is a widely used CSSL framework, with improvements of up to 1.12% in average accuracy and 2.33% in intransigence in a task-incremental scenario, and 1.24% in average accuracy and 2.01% in intransigence in a class-incremental scenario.
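
The decoupling idea, correlating task-common dimensions with the previous model while de-correlating task-specific ones, can be sketched as a loss on per-dimension correlations. Everything below is an assumption made for illustration: the squared-error form of the two terms, the use of Pearson correlation between a frozen old model and the new model, the split by a simple index `num_common`, and the toy features. The abstract does not give the paper's actual loss.

```python
import math


def standardize(column):
    """Zero-mean, unit-variance version of a list of values."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(var) or 1.0  # guard against constant features
    return [(v - mean) / std for v in column]


def feature_correlations(old_feats, new_feats):
    """Per-dimension Pearson correlation between old- and new-model features."""
    n = len(old_feats)
    dims = len(old_feats[0])
    corrs = []
    for d in range(dims):
        a = standardize([row[d] for row in old_feats])
        b = standardize([row[d] for row in new_feats])
        corrs.append(sum(x * y for x, y in zip(a, b)) / n)
    return corrs


def decoupled_loss(old_feats, new_feats, num_common):
    """Hypothetical loss: first num_common dims should correlate, the rest should not."""
    corrs = feature_correlations(old_feats, new_feats)
    common_term = sum((1.0 - c) ** 2 for c in corrs[:num_common])
    specific_term = sum(c ** 2 for c in corrs[num_common:])
    return common_term + specific_term


# Toy batch of four samples with two feature dimensions each.
old = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
new = [[2.0, 1.0], [4.0, 1.0], [6.0, -1.0], [8.0, -1.0]]

loss_decoupled = decoupled_loss(old, new, num_common=1)
loss_all_common = decoupled_loss(old, new, num_common=2)
print(loss_decoupled, loss_all_common)
```

In the toy batch the first dimension of the new model tracks the old one exactly and the second is uncorrelated with it, so treating only the first dimension as task-common gives a loss near zero, while forcing both to be common is penalized.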

[CV-33] From Colors to Classes: Emergence of Concepts in Vision Transformers

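Quick read: This paper asks how to systematically understand which concepts Vision Transformers (ViTs) encode at each layer and how complex they are. Convolutional neural networks (CNNs) are known to extract features of increasing complexity across layers, but ViTs lack the same inductive biases, and how they process information layer by layer remains understudied. The paper presents a novel layer-wise analysis of the concepts encoded in state-of-the-art ViTs using neuron labeling. It finds that ViTs encode concepts of increasing complexity throughout the network: early layers mainly encode basic features such as colors and textures, while later layers represent more specific classes, including objects and animals. The number of concepts represented in each layer also rises as complexity increases. Different pretraining strategies affect the quantity and category of encoded concepts, and fine-tuning on a specific downstream task generally reduces the number of encoded concepts and shifts them toward more relevant categories.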

Link: https://arxiv.org/abs/2503.24071
Authors: Teresa Dorszewski, Lenka Tětková, Robert Jenssen, Lars Kai Hansen, Kristoffer Knutsen Wickstrøm
Affiliations: Department of Applied Mathematics and Computer Science, Technical University of Denmark; Department of Physics and Technology, UiT The Arctic University of Norway; Pioneer Centre for AI, University of Copenhagen, Denmark; Norwegian Computing Center, Oslo, Norway
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint. Accepted at The 3rd World Conference on eXplainable Artificial Intelligence

Abstract:Vision Transformers (ViTs) are increasingly utilized in various computer vision tasks due to their powerful representation capabilities. However, it remains understudied how ViTs process information layer by layer. Numerous studies have shown that convolutional neural networks (CNNs) extract features of increasing complexity throughout their layers, which is crucial for tasks like domain adaptation and transfer learning. ViTs, lacking the same inductive biases as CNNs, can potentially learn global dependencies from the first layers due to their attention mechanisms. Given the increasing importance of ViTs in computer vision, there is a need to improve the layer-wise understanding of ViTs. In this work, we present a novel, layer-wise analysis of concepts encoded in state-of-the-art ViTs using neuron labeling. Our findings reveal that ViTs encode concepts with increasing complexity throughout the network. Early layers primarily encode basic features such as colors and textures, while later layers represent more specific classes, including objects and animals. As the complexity of encoded concepts increases, the number of concepts represented in each layer also rises, reflecting a more diverse and specific set of features. Additionally, different pretraining strategies influence the quantity and category of encoded concepts, with finetuning to specific downstream tasks generally reducing the number of encoded concepts and shifting the concepts to more relevant categories.

[CV-34] COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

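Quick read: This paper addresses the tension between model complexity and performance in Vision-and-Language Navigation (VLN). Many current transformer-based approaches add components such as external knowledge bases or map information, which boosts performance but also enlarges the model and raises computational cost. To get both high performance and low cost, the paper proposes COSMO (COmbination of Selective MemOrization), an architecture that integrates state-space modules with transformer modules and includes two VLN-customized selective state space modules: Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS enables comprehensive inter-modal interaction within a single scan, while CS3 adapts the selective state space module to a dual-stream architecture to better capture cross-modal interactions. On three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, COSMO achieves competitive navigation performance with significantly lower computational cost.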

Link: https://arxiv.org/abs/2503.24065
Authors: Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Zike Yan, Qi Wu, Zhihua Wei, Jing Liu
Affiliations: School of Computer Science and Technology, Tongji University; Institute of Automation, Chinese Academy of Sciences; Australian Institute for Machine Learning, The University of Adelaide; Institute for AI Industry Research (AIR), Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Vision-and-Language Navigation (VLN) tasks have gained prominence within artificial intelligence research due to their potential application in fields like home assistants. Many contemporary VLN approaches, while based on transformer architectures, have increasingly incorporated additional components such as external knowledge bases or map information to enhance performance. These additions, while boosting performance, also lead to larger models and increased computational costs. In this paper, to achieve both high performance and low computational costs, we propose a novel architecture with the COmbination of Selective MemOrization (COSMO). Specifically, COSMO integrates state-space modules and transformer modules, and incorporates two VLN-customized selective state space modules: the Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS facilitates comprehensive inter-modal interactions within a single scan, while the CS3 module adapts the selective state space module into a dual-stream architecture, thereby enhancing the acquisition of cross-modal interactions. Experimental validations on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, not only demonstrate competitive navigation performance of our model but also show a significant reduction in computational costs.

[CV-35] AMMSM: Adaptive Motion Magnification and Sparse Mamba for Micro-Expression Recognition ICME2025

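[Quick Read]: This paper addresses the difficulty of downstream recognition of micro-expressions, which stems from their short duration and subtle signals. It proposes a multi-task learning framework named Adaptive Motion Magnification and Sparse Mamba (AMMSM). The key is to combine self-supervised subtle motion magnification, which improves the accuracy of micro-expression capture, with a sparse spatial selection Mamba architecture that couples sparse activation with the Visual Mamba model to model key motion regions and their valuable representations more effectively. The paper also uses evolutionary search to optimize the magnification factor and the sparsity ratios of spatial selection, followed by fine-tuning to further improve performance. Experiments show that AMMSM achieves state-of-the-art accuracy and robustness on two standard datasets.

Link: https://arxiv.org/abs/2503.24057
Authors: Xuxiong Liu,Tengteng Dong,Fei Wang,Weijie Feng,Xiao Sun
Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME 2025
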
Abstract:Micro-expressions are typically regarded as unconscious manifestations of a person’s genuine emotions. However, their short duration and subtle signals pose significant challenges for downstream recognition. We propose a multi-task learning framework named the Adaptive Motion Magnification and Sparse Mamba (AMMSM) to address this. This framework aims to enhance the accurate capture of micro-expressions through self-supervised subtle motion magnification, while the sparse spatial selection Mamba architecture combines sparse activation with the advanced Visual Mamba model to model key motion regions and their valuable representations more effectively. Additionally, we employ evolutionary search to optimize the magnification factor and the sparsity ratios of spatial selection, followed by fine-tuning to improve performance further. Extensive experiments on two standard datasets demonstrate that the proposed AMMSM achieves state-of-the-art (SOTA) accuracy and robustness.
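
The role of the magnification factor can be seen in the simplest linear form of motion magnification, where the change between two frames is scaled before being added back. AMMSM learns its magnification in a self-supervised way, so the sketch below is only a stand-in under that simplifying assumption; `magnify_motion` and its clipping range are hypothetical.

```python
def magnify_motion(frame_a, frame_b, alpha, lo=0.0, hi=255.0):
    """Amplify the per-pixel change from frame_a to frame_b by factor alpha.

    Frames are lists of rows of pixel intensities. alpha = 1 reproduces
    frame_b; larger values exaggerate the motion. Results are clipped.
    """
    out = []
    for row_a, row_b in zip(frame_a, frame_b):
        out.append([
            min(hi, max(lo, a + alpha * (b - a)))
            for a, b in zip(row_a, row_b)
        ])
    return out

print(magnify_motion([[10.0, 10.0]], [[11.0, 10.0]], alpha=5.0))  # [[15.0, 10.0]]
```

In a setup like the one the abstract describes, `alpha` would be one of the quantities tuned by the evolutionary search rather than fixed by hand.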

[CV-36] BBoxCut: A Targeted Data Augmentation Technique for Enhancing Wheat Head Detection Under Occlusions

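[Quick Read]: This paper addresses the challenges of automatically identifying and measuring key characteristics of wheat heads, in particular the drop in detection performance under complex field conditions such as leaf occlusion, overlapping neighboring wheat heads, varying lighting, and motion blur. It proposes a data augmentation technique named BBoxCut, whose key idea is to use random localized masking to simulate occlusions caused by leaves and neighboring wheat heads. The method raises the mean average precision (mAP) of three state-of-the-art object detectors (Faster R-CNN, FCOS, and DETR) on wheat head detection by 2.76, 3.26, and 1.9 points, respectively, and improves results both quantitatively and qualitatively, most visibly in scenes with occluded wheat heads.

Link: https://arxiv.org/abs/2503.24032
Authors: Yasashwini Sai Gowri P,Karthik Seemakurthy,Andrews Agyemang Opoku,Sita Devi Bharatula
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
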
Abstract:Wheat plays a critical role in global food security, making it one of the most extensively studied crops. Accurate identification and measurement of key characteristics of wheat heads are essential for breeders to select varieties for cross-breeding, with the goal of developing nutrient-dense, resilient, and sustainable cultivars. Traditionally, these measurements are performed manually, which is both time-consuming and inefficient. Advances in digital technologies have paved the way for automating this process. However, field conditions pose significant challenges, such as occlusions of leaves, overlapping wheat heads, varying lighting conditions, and motion blur. In this paper, we propose a novel data augmentation technique, BBoxCut, which uses random localized masking to simulate occlusions caused by leaves and neighboring wheat heads. We evaluated our approach using three state-of-the-art object detectors and observed mean average precision (mAP) gains of 2.76, 3.26, and 1.9 for Faster R-CNN, FCOS, and DETR, respectively. Our augmentation technique led to significant improvements both qualitatively and quantitatively. In particular, the improvements were particularly evident in scenarios involving occluded wheat heads, demonstrating the robustness of our method in challenging field conditions.
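
Random localized masking inside bounding boxes can be sketched in a few lines. The abstract does not give BBoxCut's exact masking policy, so the choices below (one rectangle per box, a cap `max_frac` on its relative size, a constant fill value) are assumptions, and `bbox_cut` is a hypothetical name.

```python
import random

def bbox_cut(image, boxes, max_frac=0.5, fill=0, rng=None):
    """Return a copy of `image` with one random rectangle masked inside each box.

    image: list of rows (each a list of pixel values)
    boxes: list of (x1, y1, x2, y2) with x2/y2 exclusive, inside the image
    max_frac: assumed upper bound on the masked fraction of each box side
    """
    rng = rng or random.Random()
    out = [row[:] for row in image]          # leave the input untouched
    for x1, y1, x2, y2 in boxes:
        bw, bh = x2 - x1, y2 - y1
        if bw <= 0 or bh <= 0:
            continue
        mw = max(1, int(bw * rng.uniform(0.0, max_frac)))
        mh = max(1, int(bh * rng.uniform(0.0, max_frac)))
        mx = rng.randint(x1, x2 - mw)        # mask stays inside the box
        my = rng.randint(y1, y2 - mh)
        for y in range(my, my + mh):
            for x in range(mx, mx + mw):
                out[y][x] = fill
    return out

image = [[1] * 8 for _ in range(8)]
augmented = bbox_cut(image, [(2, 2, 6, 6)], rng=random.Random(0))
```

Because the mask is confined to the annotated box, the box labels stay valid while part of the object is hidden, which is the occlusion effect the augmentation is meant to imitate.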

[CV-37] HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

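[Quick Read]: This paper tackles human-motion video generation, which is difficult mainly because of the inherent complexity of learning human body movements. Existing approaches typically drive generation with poses extracted from existing videos and therefore lack flexibility. The paper proposes HumanDreamer, a framework whose key idea is to decouple human-motion video generation into two stages: it first generates diverse poses from text prompts (with MotionDiT) and then uses these poses to generate high-quality human-motion videos. The paper also introduces a new LAMA loss and builds the MotionVid dataset. Together these contributions improve Fréchet Inception Distance (FID) by 62.4% and top-1, top-2, and top-3 R-precision by 41.8%, 26.3%, and 18.3%, respectively, improving both Text-to-Pose control accuracy and generation quality.

Link: https://arxiv.org/abs/2503.24026
Authors: Boyuan Wang,Xiaofeng Wang,Chaojun Ni,Guosheng Zhao,Zhiqin Yang,Zheng Zhu,Muyang Zhang,Yukun Zhou,Xinze Chen,Guan Huang,Lihong Liu,Xingang Wang
Affiliations: GigaAI; Institute of Automation, Chinese Academy of Sciences; Peking University; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
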
Abstract:Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on the dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. Besides, a novel LAMA loss is introduced, which together contribute to a significant improvement in FID by 62.4%, along with respective enhancements in R-precision for top1, top2, and top3 by 41.8%, 26.3%, and 18.3%, thereby advancing both the Text-to-Pose control accuracy and FID metrics. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-3D motion lifting.
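
R-precision, reported above for top-1 to top-3, is a retrieval metric: for each text, candidate pose sequences are ranked by similarity and a hit is counted if the matching sequence is among the top k. The sketch below assumes a precomputed similarity matrix whose diagonal holds the matching pairs; the paper's feature extractor and candidate pool size are not given in the abstract, and `r_precision` is a hypothetical name.

```python
def r_precision(similarity, k):
    """Fraction of texts whose matching candidate is ranked within the top k.

    similarity[i][j] is the score between text i and candidate j; the
    ground-truth candidate for text i is assumed to be j = i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.2, 0.9, 0.3],
]
print(r_precision(sim, 1), r_precision(sim, 2))
```

In this toy matrix only the first text retrieves its match at rank 1, while all three do so within the top 2, so the metric rises with k as expected.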

[CV-38] Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification

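[Quick Read]: This paper addresses two key problems in crossmodal knowledge distillation (KD), where a multimodal teacher transfers knowledge to a unimodal student. First, high-level class labels rarely capture the deeper semantic structure of real-world visual data, and using them directly as inputs can cause label leakage, which limits distillation performance. Second, noisy or overly precise text embeddings can reduce distillation efficiency. The paper proposes a multi-teacher crossmodal KD framework that integrates CLIP image embeddings with learnable WordNet-relaxed text embeddings under a hierarchical loss. Instead of using exact class names, it uses semantically richer WordNet expansions, which introduce more diverse textual cues, mitigate label leakage, and improve student performance. The key is the WordNet-relaxed prompts, which encourage the student to rely more on visual features than on textual shortcuts while still incorporating the newly introduced textual cues. The method achieves state-of-the-art or second-best results on six public datasets.

Link: https://arxiv.org/abs/2503.24017
Authors: Chenqi Guo,Mengshuo Rong,Qianli Feng,Rongfan Feng,Yinglong Ma
Affiliations: North China Electric Power University; Amazon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
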
Abstract:Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher’s modalities include the student’s, additional complementary information can be exploited to improve knowledge transfer. In supervised image classification, image datasets typically include class labels that represent high-level concepts, suggesting a natural avenue to incorporate textual cues for crossmodal KD. However, these labels rarely capture the deeper semantic structures in real-world visuals and can lead to label leakage if used directly as inputs, ultimately limiting KD performance. To address these issues, we propose a multi-teacher crossmodal KD framework that integrates CLIP image embeddings with learnable WordNet-relaxed text embeddings under a hierarchical loss. By avoiding direct use of exact class names and instead using semantically richer WordNet expansions, we mitigate label leakage and introduce more diverse textual cues. Experiments show that this strategy significantly boosts student performance, whereas noisy or overly precise text embeddings hinder distillation efficiency. Interpretability analyses confirm that WordNet-relaxed prompts encourage heavier reliance on visual features over textual shortcuts, while still effectively incorporating the newly introduced textual cues. Our method achieves state-of-the-art or second-best results on six public datasets, demonstrating its effectiveness in advancing crossmodal KD.
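
The abstract does not spell out the hierarchical loss, so the sketch below shows only the standard temperature-scaled distillation term (KL divergence between softened teacher and student distributions), averaged over several teachers. Treating the image branch and the text branch as two teachers that each output class logits is an assumption, as are the function names.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Mean KL(teacher || student) over teachers, scaled by temperature**2."""
    p_student = softmax(student_logits, temperature)
    loss = 0.0
    for teacher_logits in teacher_logits_list:
        p_teacher = softmax(teacher_logits, temperature)
        loss += sum(
            pt * math.log(pt / ps)
            for pt, ps in zip(p_teacher, p_student)
            if pt > 0.0
        )
    return (temperature ** 2) * loss / len(teacher_logits_list)

# Two hypothetical teachers (e.g. an image branch and a text branch).
print(kd_loss([1.0, 2.0, 3.0], [[1.0, 2.0, 3.5], [0.5, 2.0, 3.0]]))
```

The loss is zero when the student matches every teacher exactly and grows as their softened distributions diverge.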

[CV-39] Optimization of Layer Skipping and Frequency Scaling for Convolutional Neural Networks under Latency Constraint ECCV

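[Quick Read]: This paper addresses the high energy consumption of deploying Convolutional Neural Networks (CNNs) on resource-limited equipment such as mobile devices and autonomous vehicles. Its key solution combines Proportional Layer Skipping (PLS) and Frequency Scaling (FS). PLS reduces computational complexity by selectively skipping network layers, while FS adjusts the processor frequency to optimize energy use under latency constraints. Experiments on ResNet-152 with CIFAR-10 show significant reductions in computational demand and energy consumption with minimal accuracy loss. The work offers a practical approach to real-time processing in resource-limited settings and insight into balancing computational efficiency against model performance.

Link: https://arxiv.org/abs/2503.24014
Authors: Minh David Thao Chan,Ruoyu Zhao,Yukuan Jia,Ruiqing Mao,Sheng Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures, Accepted in Proc. Eur. Conf. Comput. Vis. (ECCV) Workshops. Milan, Italy: Springer, September 2024
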
Abstract:The energy consumption of Convolutional Neural Networks (CNNs) is a critical factor in deploying deep learning models on resource-limited equipment such as mobile devices and autonomous vehicles. We propose an approach involving Proportional Layer Skipping (PLS) and Frequency Scaling (FS). Layer skipping reduces computational complexity by selectively bypassing network layers, whereas frequency scaling adjusts the frequency of the processor to optimize energy use under latency constraints. Experiments of PLS and FS on ResNet-152 with the CIFAR-10 dataset demonstrated significant reductions in computational demands and energy consumption with minimal accuracy loss. This study offers practical solutions for improving real-time processing in resource-limited settings and provides insights into balancing computational efficiency and model performance.
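
The trade-off between layer skipping and frequency scaling can be made concrete with a small calculation. The sketch below assumes a textbook model that is not taken from the paper: latency equals executed cycles divided by clock frequency, and dynamic energy scales with cycles times frequency squared (supply voltage assumed proportional to frequency). The function name `pick_frequency`, the cycle counts, and the frequency list are hypothetical.

```python
def pick_frequency(total_cycles, skip_ratio, freqs_hz, deadline_s):
    """Pick the lowest clock frequency that still meets the latency deadline.

    Returns (frequency, latency_seconds, relative_energy), or None if no
    available frequency is fast enough. All modeling choices are assumptions.
    """
    cycles = total_cycles * (1.0 - skip_ratio)    # skipping layers removes work
    for f in sorted(freqs_hz):
        latency = cycles / f
        if latency <= deadline_s:
            return f, latency, cycles * f ** 2    # energy in arbitrary units
    return None

freqs = [0.5e9, 1.0e9, 2.0e9]
full = pick_frequency(2e9, 0.0, freqs, deadline_s=1.5)
skipped = pick_frequency(2e9, 0.5, freqs, deadline_s=1.5)
print(full, skipped)
```

Under these assumptions, skipping half of the work lets the clock drop from 2 GHz to 1 GHz within the same deadline, and the modeled energy falls by roughly a factor of eight. Real savings depend on the hardware's voltage-frequency curve and on the accuracy cost of the skipped layers.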

[CV-40] Learning 3D-Gaussian Simulators from RGB Videos

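[Quick Read]: This paper addresses learning physics simulation from video data, in particular how to maintain spatial and temporal consistency without relying on strong inductive biases or ground-truth 3D information, so that the approach scales and generalizes. The key solution is 3DGSim, a 3D physics simulator that learns object dynamics end-to-end from multi-view RGB videos. It encodes images into a 3D Gaussian particle representation, propagates dynamics with a transformer, and renders frames with 3D Gaussian splatting. By jointly training inverse rendering with the dynamics transformer, using a temporal encoding and merging layer, it embeds physical properties into point-wise latent vectors without enforcing explicit connectivity constraints. This lets the model capture diverse physical behaviors, from rigid to elastic and cloth-like interactions, along with realistic lighting effects, and generalize to unseen multi-body interactions and novel scene edits.

Link: https://arxiv.org/abs/2503.24009
Authors: Mikel Zhobro,Andreas René Geist,Georg Martius
Affiliations: University of Tübingen; Max Planck Institute for Intelligent Systems, Tübingen
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:
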
Abstract:Learning physics simulations from video data requires maintaining spatial and temporal consistency, a challenge often addressed with strong inductive biases or ground-truth 3D information – limiting scalability and generalization. We introduce 3DGSim, a 3D physics simulator that learns object dynamics end-to-end from multi-view RGB videos. It encodes images into a 3D Gaussian particle representation, propagates dynamics via a transformer, and renders frames using 3D Gaussian splatting. By jointly training inverse rendering with a dynamics transformer using a temporal encoding and merging layer, 3DGSim embeds physical properties into point-wise latent vectors without enforcing explicit connectivity constraints. This enables the model to capture diverse physical behaviors, from rigid to elastic and cloth-like interactions, along with realistic lighting effects that also generalize to unseen multi-body interactions and novel scene edits.

[CV-41] H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding

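[Quick Read]: This paper addresses the significant limitations of existing video understanding benchmarks in coverage, task diversity, and scene adaptability, which hinder accurate assessment of the comprehensive video understanding capabilities of multimodal models. Its key solution is the Hierarchical and Holistic Video Understanding (H2VU) benchmark. The benchmark covers video durations from 3-second clips to 1.5-hour recordings, adds modules beyond traditional perception and reasoning tasks, such as countercommonsense comprehension and trajectory state tracking, and expands first-person streaming video data, in order to evaluate Multimodal Large Language Models (MLLMs) on both general video and online streaming video understanding.

Link: https://arxiv.org/abs/2503.24008
Authors: Qi Wu,Quanlong Zheng,Yanhao Zhang,Junlin Xie,Jinguo Luo,Kuo Wang,Peng Liu,Qingsong Xie,Ru Zhen,Haonan Lu,Zhenyu Yang
Affiliations: OPPO AI Center
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
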
Abstract:With the rapid development of multimodal models, the demand for assessing video understanding capabilities has been steadily increasing. However, existing benchmarks for evaluating video understanding exhibit significant limitations in coverage, task diversity, and scene adaptability. These shortcomings hinder the accurate assessment of models’ comprehensive video understanding capabilities. To tackle this challenge, we propose a hierarchical and holistic video understanding (H2VU) benchmark designed to evaluate both general video and online streaming video comprehension. This benchmark contributes three key features: (1) Extended video duration: Spanning videos from brief 3-second clips to comprehensive 1.5-hour recordings, thereby bridging the temporal gaps found in current benchmarks. (2) Comprehensive assessment tasks: Beyond traditional perceptual and reasoning tasks, we have introduced modules for countercommonsense comprehension and trajectory state tracking. These additions test the models’ deep understanding capabilities beyond mere prior knowledge. (3) Enriched video data: To keep pace with the rapid evolution of current AI agents, we have expanded first-person streaming video datasets. This expansion allows for the exploration of multimodal models’ performance in understanding streaming videos from a first-person perspective. Extensive results from H2VU reveal that existing multimodal large language models (MLLMs) possess substantial potential for improvement in our newly proposed evaluation tasks. We expect that H2VU will facilitate advancements in video understanding research by offering a comprehensive and in-depth analysis of MLLMs.

[CV-42] DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model

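[Quick Read]: This paper addresses depth completion for autonomous driving, that is, generating dense depth maps from sparse depth maps and RGB images. Most existing methods rely on a spatial propagation network to iteratively refine an initial dense depth map. The key innovation is DenseFormer, which brings the diffusion model into depth completion: using the diffusion model's denoising mechanism, it progressively refines an initial random depth distribution into a dense depth map. The solution has two core modules. The feature extraction module combines a feature pyramid structure with multi-layer deformable attention to extract and integrate information from sparse depth maps and RGB images, which serves as the guiding condition for the diffusion process. The depth refinement module applies multi-step iterative refinement to the dense depth produced by the diffusion process, using image features enriched with multi-scale information together with the sparse depth input to further improve accuracy. Experiments show that DenseFormer outperforms classical depth completion methods on the KITTI outdoor scene dataset.

Link: https://arxiv.org/abs/2503.23993
Authors: Ming Yuan,Sichao Wang,Chuang Zhang,Lei He,Qing Xu,Jianqiang Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
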
Abstract:The depth completion task is a critical problem in autonomous driving, involving the generation of dense depth maps from sparse depth maps and RGB images. Most existing methods employ a spatial propagation network to iteratively refine the depth map after obtaining an initial dense depth. In this paper, we propose DenseFormer, a novel method that integrates the diffusion model into the depth completion task. By incorporating the denoising mechanism of the diffusion model, DenseFormer generates the dense depth map by progressively refining an initial random depth distribution through multiple iterations. We propose a feature extraction module that leverages a feature pyramid structure, along with multi-layer deformable attention, to effectively extract and integrate features from sparse depth maps and RGB images, which serve as the guiding condition for the diffusion process. Additionally, this paper presents a depth refinement module that applies multi-step iterative refinement across various ranges to the dense depth results generated by the diffusion process. The module utilizes image features enriched with multi-scale information and sparse depth input to further enhance the accuracy of the predicted depth map. Extensive experiments on the KITTI outdoor scene dataset demonstrate that DenseFormer outperforms classical depth completion methods.

[CV-43] SALT: A Flexible Semi-Automatic Labeling Tool for General LiDAR Point Clouds with Cross-Scene Adaptability and 4D Consistency

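[Quick Read]: This paper addresses the low efficiency of annotating LiDAR point cloud data and proposes SALT, a flexible Semi-Automatic Labeling Tool with cross-scene adaptability and 4D consistency. The key solutions are: (1) a novel zero-shot learning paradigm, termed data alignment, which transforms LiDAR data into pseudo-images aligned with the training distribution of vision foundation models, enabling automatic pre-segmentation; and (2) a 4D-consistent prompting strategy and a 4D non-maximum suppression module that enhance SAM2 and ensure high-quality, temporally consistent pre-segmentation. SALT surpasses the latest zero-shot methods by 18.4% PQ on SemanticKITTI and reaches nearly 40-50% of human annotator performance on newly collected low-resolution LiDAR data and on combined data from three LiDAR types, which significantly boosts annotation efficiency.

Link: https://arxiv.org/abs/2503.23980
Authors: Yanbo Wang,Yongtao Chen,Chuan Cao,Tianchen Deng,Wentao Zhao,Jingchuan Wang,Weidong Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
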
Abstract:We propose a flexible Semi-Automatic Labeling Tool (SALT) for general LiDAR point clouds with cross-scene adaptability and 4D consistency. Unlike recent approaches that rely on camera distillation, SALT operates directly on raw LiDAR data, automatically generating pre-segmentation results. To achieve this, we propose a novel zero-shot learning paradigm, termed data alignment, which transforms LiDAR data into pseudo-images by aligning with the training distribution of vision foundation models. Additionally, we design a 4D-consistent prompting strategy and 4D non-maximum suppression module to enhance SAM2, ensuring high-quality, temporally consistent presegmentation. SALT surpasses the latest zero-shot methods by 18.4% PQ on SemanticKITTI and achieves nearly 40-50% of human annotator performance on our newly collected low-resolution LiDAR data and on combined data from three LiDAR types, significantly boosting annotation efficiency. We anticipate that SALT’s open-sourcing will catalyze substantial expansion of current LiDAR datasets and lay the groundwork for the future development of LiDAR foundation models. Code is available at this https URL.
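
One common way to turn a LiDAR sweep into an image-like array is spherical (range-image) projection. The abstract does not say which projection SALT's data alignment uses, so the following is a generic sketch under that assumption; the function name, image size, and field-of-view values are hypothetical.

```python
import math

def to_range_image(points, height=4, width=8, fov_up_deg=15.0, fov_down_deg=-15.0):
    """Project (x, y, z) points onto a height x width grid of ranges.

    Columns index azimuth, rows index elevation. Empty cells hold 0.0 and,
    when several points fall into one cell, the nearest range is kept.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue
        yaw = math.atan2(y, x)                           # azimuth in [-pi, pi]
        pitch = math.asin(z / r)                         # elevation
        u = 0.5 * (1.0 - yaw / math.pi) * width          # column coordinate
        v = (1.0 - (pitch - fov_down) / fov) * height    # row coordinate
        col = min(width - 1, max(0, int(u)))
        row = min(height - 1, max(0, int(v)))
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r
    return image

pseudo_image = to_range_image([(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.5)])
```

A dense 2D array like this can then be normalized and fed to an image model; how SALT matches the statistics of the foundation model's training images is specific to the paper and not reproduced here.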

[CV-44] Video-based Traffic Light Recognition by Rockchip RV1126 for Autonomous Driving

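[Quick Read]: This paper addresses real-time traffic light recognition in complex urban environments, in particular the weakness of existing single-frame methods under occlusion and adverse lighting. The key solution is ViTLR, a video-based end-to-end neural network that processes multiple consecutive frames for robust traffic light detection and state classification. Its architecture uses a transformer-like design with convolutional self-attention modules and is optimized specifically for the Rockchip RV1126 embedded platform, where it runs in real time at 25 FPS on the NPU. Experiments show that ViTLR is more robust than existing single-frame methods in temporal stability, across varying target distances, and under challenging environmental conditions.

Link: https://arxiv.org/abs/2503.23965
Authors: Miao Fan,Xuxu Kong,Shengtong Xu,Haoyi Xiong,Xiangzeng Liu
Affiliations: NavInfo Co. Ltd.; Autohome Inc.; Baidu Inc.; Xidian University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by IEEE IV’25
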
Abstract:Real-time traffic light recognition is fundamental for autonomous driving safety and navigation in urban environments. While existing approaches rely on single-frame analysis from onboard cameras, they struggle with complex scenarios involving occlusions and adverse lighting conditions. We present ViTLR, a novel video-based end-to-end neural network that processes multiple consecutive frames to achieve robust traffic light detection and state classification. The architecture leverages a transformer-like design with convolutional self-attention modules, which is optimized specifically for deployment on the Rockchip RV1126 embedded platform. Extensive evaluations on two real-world datasets demonstrate that ViTLR achieves state-of-the-art performance while maintaining real-time processing capabilities (25 FPS) on RV1126’s NPU. The system shows superior robustness across temporal stability, varying target distances, and challenging environmental conditions compared to existing single-frame approaches. We have successfully integrated ViTLR into an ego-lane traffic light recognition system using HD maps for autonomous driving applications. The complete implementation, including source code and datasets, is made publicly available to facilitate further research in this domain.

[CV-45] A Benchmark for Vision-Centric HD Mapping by V2I Systems

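[Quick Read]: This paper addresses the lack of real-world data for research on online high-definition (HD) map vectorization in Vehicle-Infrastructure Cooperative Autonomous Driving (VICAD). It releases a real-world dataset of collaborative camera frames from vehicles and roadside infrastructure with human annotations of HD map elements, and proposes V2I-HD, a vision-centric end-to-end neural framework for vehicle-infrastructure cooperation that constructs vectorized maps. To reduce computation cost and allow deployment on autonomous vehicles, it introduces a directionally decoupled self-attention mechanism. Experiments show that V2I-HD performs well in real-time inference speed and in map construction quality in complex scenes while remaining stable and robust. The key to the solution is thus the combination of vehicle-infrastructure cooperative data with a new network architecture whose efficiency is improved through the decoupled self-attention mechanism.

Link: https://arxiv.org/abs/2503.23963
Authors: Miao Fan,Shanshan Yu,Shengtong Xu,Kun Jiang,Haoyi Xiong,Xiangzeng Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted by IEEE IV’25
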
Abstract:Autonomous driving faces safety challenges due to a lack of global perspective and the semantic information of vectorized high-definition (HD) maps. Information from roadside cameras can greatly expand the map perception range through vehicle-to-infrastructure (V2I) communications. However, there is still no dataset from the real world available for the study on map vectorization onboard under the scenario of vehicle-infrastructure cooperation. To prosper the research on online HD mapping for Vehicle-Infrastructure Cooperative Autonomous Driving (VICAD), we release a real-world dataset, which contains collaborative camera frames from both vehicles and roadside infrastructures, and provides human annotations of HD map elements. We also present an end-to-end neural framework (i.e., V2I-HD) leveraging vision-centric V2I systems to construct vectorized maps. To reduce computation costs and further deploy V2I-HD on autonomous vehicles, we introduce a directionally decoupled self-attention mechanism to V2I-HD. Extensive experiments show that V2I-HD has superior performance in real-time inference speed, as tested by our real-world dataset. Abundant qualitative results also demonstrate stable and robust map construction quality with low cost in complex and various driving scenes. As a benchmark, both source codes and the dataset have been released at OneDrive for the purpose of further study.

[CV-46] Local Information Matters: Inference Acceleration For Grounded Conversation Generation Models Through Adaptive Local-Aware Token Pruning

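[Quick Read]: This paper addresses the failure of existing visual token pruning methods (such as FastV and PyramidDrop) to preserve local visual features in Grounded Conversation Generation (GCG), which causes substantial performance drops in pixel-level grounding. It proposes Adaptive Local-Aware Token Pruning (ALTP), a framework that accelerates GCG models by prioritizing local object information. ALTP has two key components: Detail Density Capture (DDC), which uses superpixel segmentation to retain detail-rich tokens in object-centric regions, and Dynamic Density Formation (DDF), which dynamically allocates tokens according to information density so that semantically rich areas are retained at higher fidelity. Experiments show that ALTP significantly outperforms existing methods on both GLaMM and OMG-LLaVA.

Link: https://arxiv.org/abs/2503.23959
Authors: Bizhe Bai,Jianjian Cao,Yadan Luo,Tao Che
Affiliations: Fudan University; Shanghai Innovation Institute; The University of Queensland
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress
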
Abstract:Grounded Conversation Generation (GCG) is an emerging vision-language task that requires models to generate natural language responses seamlessly intertwined with corresponding object segmentation masks. Recent models, such as GLaMM and OMG-LLaVA, achieve pixel-level grounding but incur significant computational costs due to processing a large number of visual tokens. Existing token pruning methods, like FastV and PyramidDrop, fail to preserve the local visual features critical for accurate grounding, leading to substantial performance drops in GCG tasks. To address this, we propose Adaptive Local-Aware Token Pruning (ALTP), a simple yet effective framework that accelerates GCG models by prioritizing local object information. ALTP introduces two key components: (1) Detail Density Capture (DDC), which uses superpixel segmentation to retain tokens in object-centric regions, preserving fine-grained details, and (2) Dynamic Density Formation (DDF), which dynamically allocates tokens based on information density, ensuring higher retention in semantically rich areas. Extensive experiments on the GranDf dataset demonstrate that ALTP significantly outperforms existing token pruning methods, such as FastV and PyramidDrop, on both GLaMM and OMG-LLaVA models. Notably, when applied to GLaMM, ALTP achieves a 90% reduction in visual tokens with a 4.9% improvement in AP50 and a 5.0% improvement in Recall compared to PyramidDrop. Similarly, on OMG-LLaVA, ALTP improves AP by 2.1% and mIOU by 3.0% at a 90% token reduction compared with PDrop.
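
The idea of allocating a token budget by information density, and then keeping the strongest tokens within each region, can be sketched as follows. How ALTP measures density and scores tokens is not given in the abstract, so `densities`, `token_scores`, and both function names are hypothetical, and the superpixel step of DDC is not implemented.

```python
def allocate_budget(densities, total_keep):
    """Split `total_keep` across regions in proportion to their density."""
    total = sum(densities)
    raw = [d * total_keep / total for d in densities]
    budget = [int(r) for r in raw]
    leftover = total_keep - sum(budget)
    # Hand remaining slots to the regions with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budget[i], reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget

def prune_tokens(token_regions, token_scores, densities, total_keep):
    """Keep the top-scoring tokens of each region, up to that region's budget."""
    budget = allocate_budget(densities, total_keep)
    kept = []
    for region, quota in enumerate(budget):
        members = [i for i, r in enumerate(token_regions) if r == region]
        members.sort(key=lambda i: token_scores[i], reverse=True)
        kept.extend(members[:quota])
    return sorted(kept)

regions = [0, 0, 0, 1, 1, 2]            # region id of each visual token
scores = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2]
print(prune_tokens(regions, scores, densities=[5.0, 3.0, 2.0], total_keep=4))  # [0, 2, 4, 5]
```

Compared with a single global top-k, a per-region budget guarantees that every region with non-trivial density keeps at least some tokens, which is the kind of local preservation the abstract argues grounding needs.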

[CV-47] A Multi-Stage Auto-Context Deep Learning Framework for Tissue and Nuclei Segmentation and Classification in HE-Stained Histological Images of Advanced Melanoma

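[Quick Read]: This paper addresses the potentially suboptimal practice, in the analysis of melanoma histological images, of processing tissue and nuclei information separately. Conventional methods usually treat tissue-based and nuclei-based analysis as independent tasks. The paper instead proposes a multi-stage deep learning approach that combines tissue and nuclei information in a unified framework based on the auto-context concept to perform segmentation and classification of both tissue and nuclei. With pre-training and post-processing, the approach ranked second in Track 1 and first in Track 2 of the PUMA challenge, which supports its effectiveness.

Link: https://arxiv.org/abs/2503.23958
Authors: Nima Torbati,Anastasia Meshcheryakova,Diana Mechtcheriakova,Amirreza Mahbod
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages
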
Abstract:Melanoma is the most lethal form of skin cancer, with an increasing incidence rate worldwide. Analyzing histological images of melanoma by localizing and classifying tissues and cell nuclei is considered the gold standard method for diagnosis and treatment options for patients. While many computerized approaches have been proposed for automatic analysis, most perform tissue-based analysis and nuclei (cell)-based analysis as separate tasks, which might be suboptimal. In this work, using the PUMA challenge dataset, we proposed a novel multi-stage deep learning approach by combining tissue and nuclei information in a unified framework based on the auto-context concept to perform segmentation and classification in histological images of melanoma. Through pre-training and further post-processing, our approach achieved second and first place rankings in the PUMA challenge, with average micro Dice tissue score and summed nuclei F1-score of 73.40% for Track 1 and 63.48% for Track 2, respectively. Our implementation for training and testing is available at: this https URL

[CV-48] AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference

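[Quick Read]: This paper addresses the substantial computational overhead that Large Visual Language Models (LVLMs) incur at inference time when they process many visual tokens and generate long-context outputs, and in particular the resulting surge in key-value (KV) cache demand. It proposes AirCache, a KV cache compression method for accelerating LVLM inference. The approach starts from a systematic study of the correlations between visual and textual tokens in LVLM attention, which finds considerable redundancy among cached visual tokens and shows that strategically removing them preserves model performance while significantly accelerating context generation. Building on this, the paper introduces an elite observation window for assessing the importance of visual components in the KV cache, aiming at stable inter-modal relevancy modeling with enhanced multi-perspective consistency. It also develops an adaptive layer-wise budget allocation strategy that exploits the strength and skewness of the token importance distribution and is more efficient than uniform allocation. Evaluations across multiple LVLMs and benchmarks show that AirCache achieves performance comparable to the full cache while retaining only 10% of the visual KV cache, reducing decoding latency by 29% to 66%, and that its advantage over existing methods grows as the cache retention rate decreases.

Link: https://arxiv.org/abs/2503.23956
Authors: Kai Huang,Hao Zou,Bochen Wang,Ye Xi,Zhen Xie,Hao Wang
Affiliations: Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
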
Abstract:Recent advancements in Large Visual Language Models (LVLMs) have gained significant attention due to their remarkable reasoning capabilities and proficiency in generalization. However, processing a large number of visual tokens and generating long-context outputs impose substantial computational overhead, leading to excessive demands for key-value (KV) cache. To address this critical bottleneck, we propose AirCache, a novel KV cache compression method aimed at accelerating LVLMs inference. This work systematically investigates the correlations between visual and textual tokens within the attention mechanisms of LVLMs. Our empirical analysis reveals considerable redundancy in cached visual tokens, wherein strategically eliminating these tokens preserves model performance while significantly accelerating context generation. Inspired by these findings, we introduce an elite observation window for assessing the importance of visual components in the KV cache, focusing on stable inter-modal relevancy modeling with enhanced multi-perspective consistency. Additionally, we develop an adaptive layer-wise budget allocation strategy that capitalizes on the strength and skewness of token importance distribution, showcasing superior efficiency compared to uniform allocation. Comprehensive evaluations across multiple LVLMs and benchmarks demonstrate that our method achieves comparable performance to the full cache while retaining only 10% of visual KV cache, thereby reducing decoding latency by 29% to 66% across various batch size and prompt length of inputs. Notably, as cache retention rates decrease, our method exhibits increasing performance advantages over existing approaches.
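
Attention-based selection of visual KV entries can be illustrated with a toy example. The scoring rule below (mean attention received from a window of text queries) and the names `select_visual_kv`, `visual_idx`, and `window_idx` are assumptions; AirCache's elite observation window and its layer-wise budget allocation are more involved and are not reproduced.

```python
def select_visual_kv(attn, visual_idx, window_idx, keep_ratio=0.1):
    """Return the visual token positions whose KV entries would be kept.

    attn[q][k] is the attention weight from query position q to key position k.
    A visual token's importance is the mean attention it receives from the
    observation-window queries (an assumed scoring rule).
    """
    scores = {
        k: sum(attn[q][k] for q in window_idx) / len(window_idx)
        for k in visual_idx
    }
    n_keep = max(1, int(len(visual_idx) * keep_ratio))
    ranked = sorted(visual_idx, key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:n_keep])

# Positions 0-3 are visual tokens; positions 4-5 are text queries in the window.
attn = {
    4: [0.10, 0.50, 0.10, 0.10, 0.20, 0.00],
    5: [0.05, 0.60, 0.20, 0.05, 0.05, 0.05],
}
print(select_visual_kv(attn, [0, 1, 2, 3], [4, 5], keep_ratio=0.5))  # [1, 2]
```

Only the keys and values at the returned positions would stay in the cache for later decoding steps, which is where the memory and latency savings come from.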

[CV-49] JointTuner: Appearance-Motion Adaptive Joint Training for Customized Video Generation

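[Quick Read]: This paper addresses two main problems in existing text-to-video customization methods: concept interference caused by feature domain mismatch, and appearance contamination caused by the entanglement of motion and appearance in reference video reconstruction. It proposes JointTuner, an adaptive joint training framework. First, it develops Adaptive LoRA, which incorporates a context-aware gating mechanism, and integrates the gated LoRA components into the spatial and temporal Transformers of the diffusion model so that appearance and motion are optimized simultaneously, eliminating concept interference. Second, it introduces an Appearance-independent Temporal Loss, which decouples motion patterns from intrinsic appearance in the reference video through an appearance-agnostic noise prediction task. The key innovation is to add frame-wise offset noise to the ground-truth Gaussian noise, perturbing its distribution and thereby disrupting frame-associated spatial attributes while preserving temporal coherence. The paper also builds a benchmark of 90 appearance-motion customized combinations and 10 multi-type automatic metrics across four dimensions for a more comprehensive evaluation. Experiments show that the method outperforms current advanced approaches.

Link: https://arxiv.org/abs/2503.23951
Authors: Fangda Chen,Shanshan Zhao,Chuanfu Xu,Long Lan
Affiliations: College of Computer Science and Technology, National University of Defense Technology; Alibaba International Digital Commerce
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
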
Abstract:Recent text-to-video advancements have enabled coherent video synthesis from prompts and expanded to fine-grained control over appearance and motion. However, existing methods either suffer from concept interference due to feature domain mismatch caused by naive decoupled optimizations or exhibit appearance contamination induced by spatial feature leakage resulting from the entanglement of motion and appearance in reference video reconstructions. In this paper, we propose JointTuner, a novel adaptive joint training framework, to alleviate these issues. Specifically, we develop Adaptive LoRA, which incorporates a context-aware gating mechanism, and integrate the gated LoRA components into the spatial and temporal Transformers within the diffusion model. These components enable simultaneous optimization of appearance and motion, eliminating concept interference. In addition, we introduce the Appearance-independent Temporal Loss, which decouples motion patterns from intrinsic appearance in reference video reconstructions through an appearance-agnostic noise prediction task. The key innovation lies in adding frame-wise offset noise to the ground-truth Gaussian noise, perturbing its distribution, thereby disrupting spatial attributes associated with frames while preserving temporal coherence. Furthermore, we construct a benchmark comprising 90 appearance-motion customized combinations and 10 multi-type automatic metrics across four dimensions, facilitating a more comprehensive evaluation for this customization task. Extensive experiments demonstrate the superior performance of our method compared to current advanced approaches.
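
Adding frame-wise offset noise to Gaussian noise can be sketched as below. The abstract does not state whether the offset is one value per frame or per channel, nor how its strength is set, so the scalar-per-frame offset, the `strength` parameter, and the function name are assumptions.

```python
import random

def add_framewise_offset(noise, strength, rng):
    """Shift every value of a frame by one random offset drawn for that frame.

    noise: list of frames, each a flat list of Gaussian noise values.
    Returns (perturbed_noise, offsets); the input is left unchanged.
    """
    offsets = [strength * rng.gauss(0.0, 1.0) for _ in noise]
    perturbed = [
        [value + offset for value in frame]
        for frame, offset in zip(noise, offsets)
    ]
    return perturbed, offsets

rng = random.Random(0)
base = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]  # 3 frames
perturbed, offsets = add_framewise_offset(base, strength=0.1, rng=rng)
```

Within a frame every value moves by the same amount, so the per-frame mean of the target noise changes while the differences between values in that frame are untouched. How this interacts with the diffusion training objective is described in the paper, not here.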

[CV-50] AMB-FHE: Adaptive Multi-biometric Fusion with Fully Homomorphic Encryption

【速读】：本文旨在解决多模态生物特征系统在保障高安全性的同时提升用户体验的问题。传统多模态生物特征系统虽然提供了更高的安全性，但可能因需要呈现多种生物特征而降低系统的易用性，且未必在所有场景下都是必要的。为此，论文提出了一种简单但灵活的方法——自适应多模态融合与全同态加密（Adaptive Multi-biometric Fusion with Fully Homomorphic Encryption, AMB-FHE），以增强同态加密下多模态参考模板的隐私保护，并实现在运行时根据安全需求进行适配。方案的关键在于结合多种模态的模板联合加密，同时保持系统的灵活性和认证性能。通过使用深度神经网络从CASIA虹膜数据集和MCYT指纹数据集提取特征，验证了AMB-FHE方法的有效性，展示了其在提高隐私保护的同时支持动态调整安全级别的能力。

Link: https://arxiv.org/abs/2503.23949
Authors: Florian Bayer, Christian Rathgeb
Affiliations: Hochschule Darmstadt
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Biometric systems strive to balance security and usability. The use of multi-biometric systems combining multiple biometric modalities is usually recommended for high-security applications. However, the presentation of multiple biometric modalities can impair the user-friendliness of the overall system and might not be necessary in all cases. In this work, we present a simple but flexible approach to increase the privacy protection of homomorphically encrypted multi-biometric reference templates while enabling adaptation to security requirements at run-time: An adaptive multi-biometric fusion with fully homomorphic encryption (AMB-FHE). AMB-FHE is benchmarked against a bimodal biometric database consisting of the CASIA iris and MCYT fingerprint datasets using deep neural networks for feature extraction. Our contribution is easy to implement and increases the flexibility of biometric authentication while offering increased privacy protection through joint encryption of templates from multiple modalities.

[CV-51] Spectral-Adaptive Modulation Networks for Visual Perception

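[Quick read]: This paper studies how 2D convolution and self-attention differ in spectral behavior and how that difference affects vision model performance. Specifically, it asks why 2D convolution is more effective at high-pass filtering and why larger kernels favor shape bias, much like self-attention. Using graph spectral analysis within a unified framework, the paper shows that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Building on this, it proposes the Spectral-Adaptive Modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. SPANetV2 is developed on top of SPAM as a new vision backbone. Extensive experiments show that SPANetV2 outperforms state-of-the-art models on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.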

Link: https://arxiv.org/abs/2503.23947
Authors: Guhnoo Yun, Juhan Yoo, Kijung Kim, Jeongho Lee, Paul Hongsuck Seo, Dong Hwan Kim
Affiliations: Korea University (KU); Korea Institute of Science and Technology (KIST); Dong-A University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance. However, theoretical analyses remain limited in explaining why 2D convolution is more effective in high-pass filtering than self-attention and why larger kernels favor shape bias, akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a spectral-adaptive modulation (SPAM) mixer, which processes visual features in a spectral-adaptive manner using multi-scale convolutional kernels and a spectral re-scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.
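
The link between window size and spectral behavior can be made concrete with a small sketch. It is not the paper's graph spectral analysis: it assumes a 1D averaging operator with uniform, nonnegative weights over an odd-sized window (the function name `box_filter_response` is made up here) and evaluates its frequency response in closed form. A learned convolution can have signed weights, so this only illustrates how a wider averaging window suppresses high frequencies.

```python
import math

def box_filter_response(kernel_size, omega):
    """Magnitude response of a centered uniform 1D kernel at frequency omega.

    kernel_size is assumed to be odd; omega is in radians per sample.
    """
    half = kernel_size // 2
    total = sum(math.cos(omega * k) for k in range(-half, half + 1))
    return abs(total) / kernel_size

for size in (3, 7):
    low = box_filter_response(size, 0.0)        # DC component
    high = box_filter_response(size, math.pi)   # Nyquist frequency
    print(f"kernel {size}: |H(0)|={low:.3f}, |H(pi)|={high:.3f}")
```

Under these assumptions the response at the Nyquist frequency is 1/3 for a window of 3 and 1/7 for a window of 7, while the DC response stays at 1, so the wider window behaves as a stronger low-pass filter.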

[CV-52] Exploring Reliable PPG Authentication on Smartwatches in Daily Scenarios

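[Quick read]: This paper addresses the limited reliability of photoplethysmography (PPG) sensor-based authentication, which is caused by motion artifacts from physical activity and by physiological variability over time. It proposes MTL-RAPID, an efficient and reliable PPG authentication model. The key is a multitask joint training strategy that simultaneously assesses signal quality and verifies user identity; jointly optimizing the two tasks yields a structure with stronger performance and fewer parameters, outperforming models trained on each task separately.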

Link: https://arxiv.org/abs/2503.23930
Authors: Jiankai Tang, Jiacheng Liu, Renling Tong, Kai Zhu, Zhe Li, Xin Yi, Junliang Xing, Yuanchun Shi, Yuntao Wang
Affiliations: Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University; Ant Group; Intelligent Computing and Application Laboratory of Qinghai Province, Qinghai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Photoplethysmography (PPG) sensors, widely deployed in smartwatches, offer a simple and non-invasive authentication approach for daily use. However, PPG authentication faces reliability issues due to motion artifacts from physical activity and physiological variability over time. To address these challenges, we propose MTL-RAPID, an efficient and reliable PPG authentication model that employs a multitask joint training strategy, simultaneously assessing signal quality and verifying user identity. The joint optimization of these two tasks in MTL-RAPID results in a structure that outperforms models trained on individual tasks separately, achieving stronger performance with fewer parameters. In our comprehensive user studies regarding motion artifacts (N = 30), time variations (N = 32), and user preferences (N = 16), MTL-RAPID achieves a best AUC of 99.2% and an EER of 3.5%, outperforming existing baselines. We open-source our PPG authentication dataset along with the MTL-RAPID model on GitHub to facilitate future research.
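
The abstract reports an equal error rate (EER). As a reminder of what that metric measures, here is a minimal sketch that sweeps a decision threshold over made-up similarity scores and returns the error rate at the point where the false acceptance rate and false rejection rate are closest. The scores and the function name are hypothetical and unrelated to the paper's data or evaluation code.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping thresholds over all observed scores.

    A sample is accepted when its score is >= threshold.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)
        frr = sum(s < threshold for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

genuine_scores = [0.9, 0.8, 0.7, 0.4]    # made-up scores for the true user
impostor_scores = [0.1, 0.2, 0.3, 0.6]   # made-up scores for other users
eer = equal_error_rate(genuine_scores, impostor_scores)
```

With these toy scores a threshold of 0.6 rejects one of four genuine attempts and accepts one of four impostor attempts, so the sketch returns an EER of 0.25.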

[CV-53] CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching

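[Quick read]: This paper targets the high computational redundancy, insufficient feature distinctiveness, and limited subpixel accuracy of existing semi-dense image matching methods. It proposes CoMatch, whose core components are: (1) a covisibility-guided token condenser that adaptively aggregates tokens according to dynamically estimated covisibility scores, improving computational efficiency and the representational capacity of the aggregated tokens; (2) a covisibility-assisted attention mechanism that selectively suppresses irrelevant message propagation from non-covisible regions, giving robust and compact attention to relevant features; and (3) a fine correlation module that refines matching candidates in both the source and target views to subpixel level, improving localization accuracy. Together these yield higher accuracy, efficiency, and generalizability.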

Link: https://arxiv.org/abs/2503.23925
Authors: Zizhuo Li, Yifan Lu, Linfeng Tang, Shihua Zhang, Jiayi Ma
Affiliations: Wuhan University, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This prospective study proposes CoMatch, a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. Firstly, observing that modeling context interaction over the entire coarse feature map elicits highly redundant computation due to the neighboring representation similarity of tokens, a covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their covisibility scores that are dynamically estimated, thereby ensuring computational efficiency while improving the representational capacity of aggregated tokens simultaneously. Secondly, considering that feature interaction with massive non-covisible areas is distracting, which may degrade feature distinctiveness, a covisibility-assisted attention mechanism is deployed to selectively suppress irrelevant message broadcast from non-covisible reduced tokens, resulting in robust and compact attention to relevant rather than all ones. Thirdly, we find that at the fine-level stage, current methods adjust only the target view’s keypoints to subpixel level, while those in the source view remain restricted at the coarse level and thus not informative enough, detrimental to keypoint location-sensitive usages. A simple yet potent fine correlation module is developed to refine the matching candidates in both source and target views to subpixel level, attaining attractive performance improvement. Thorough experimentation across an array of public benchmarks affirms CoMatch’s promising accuracy, efficiency, and generalizability.
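
The idea of aggregating tokens according to covisibility scores can be sketched as a weighted pooling step. This is a simplified illustration, not CoMatch's learned condenser: it assumes fixed 1D windows, takes the covisibility scores as given inputs rather than estimating them, and uses a made-up function name.

```python
def condense_tokens(tokens, scores, window=2):
    """Aggregate each window of tokens into one, weighted by covisibility.

    tokens: list of feature vectors (lists of floats).
    scores: one nonnegative covisibility score per token.
    """
    condensed = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        weights = scores[start:start + window]
        total = sum(weights) or 1.0  # avoid division by zero
        dim = len(group[0])
        condensed.append([
            sum(w * tok[d] for w, tok in zip(weights, group)) / total
            for d in range(dim)
        ])
    return condensed

features = [[1, 0], [0, 1], [2, 2], [4, 4]]   # made-up coarse tokens
covisibility = [1, 0, 0.5, 0.5]               # made-up scores
reduced = condense_tokens(features, covisibility)
```

In the first window the non-covisible token has weight zero, so the condensed token equals the covisible one; in the second window both tokens contribute equally.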

[CV-54] FineCausal: A Causal-Based Framework for Interpretable Fine-Grained Action Quality Assessment

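[Quick read]: This paper addresses the limited reliability and interpretability of existing deep learning methods for Action Quality Assessment (AQA), which behave as black boxes and are vulnerable to spurious correlations. It proposes FineCausal, a framework built on a dual-module strategy: a Graph Attention Network (GAT)-based causal intervention module and a temporal causal attention module. The former disentangles human-centric foreground cues from background confounders, and the latter captures fine-grained temporal dependencies across action stages. This design achieves state-of-the-art performance on the FineDiving-HM dataset and provides transparent, interpretable feedback on which features drive the assessment. FineCausal does, however, require extensive expert knowledge to define causal structures and depends on high-quality annotations, which the authors identify as challenges for future research.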

Link: https://arxiv.org/abs/2503.23911
Authors: Ruisheng Han, Kanglei Zhou, Amir Atapour-Abarghouei, Xiaohui Liang, Hubert P. H. Shum
Affiliations: Durham University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Action quality assessment (AQA) is critical for evaluating athletic performance, informing training strategies, and ensuring safety in competitive sports. However, existing deep learning approaches often operate as black boxes and are vulnerable to spurious correlations, limiting both their reliability and interpretability. In this paper, we introduce FineCausal, a novel causal-based framework that achieves state-of-the-art performance on the FineDiving-HM dataset. Our approach leverages a Graph Attention Network-based causal intervention module to disentangle human-centric foreground cues from background confounders, and incorporates a temporal causal attention module to capture fine-grained temporal dependencies across action stages. This dual-module strategy enables FineCausal to generate detailed spatio-temporal representations that not only achieve state-of-the-art scoring performance but also provide transparent, interpretable feedback on which features drive the assessment. Despite its strong performance, FineCausal requires extensive expert knowledge to define causal structures and depends on high-quality annotations, challenges that we discuss and address as future research directions. Code is available at this https URL.

[CV-55] HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment

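[Quick read]: This paper addresses the research gap in Human Image Aesthetic Assessment (HIAA), which remains scarcely explored despite its wide use in social media, AI workflows, and related domains. It proposes a holistic implementation framework tailored for HIAA and builds HumanBeauty, the first dataset purpose-built for HIAA, comprising 108k high-quality human images with manual annotations. Based on this dataset, the paper proposes the HumanAesExpert model, which adds an Expert head that incorporates human knowledge of aesthetic sub-dimensions alongside the Language Modeling and Regression heads, targeting both overall and fine-grained HIAA. A MetaVoter then aggregates the scores from the three heads to balance their capabilities and improve assessment precision. Experiments show that the model performs significantly better than existing state-of-the-art methods.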

Link: https://arxiv.org/abs/2503.23907
Authors: Zhichao Liao, Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, Pingfa Feng
Affiliations: Tsinghua University; Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Image Aesthetic Assessment (IAA) is a long-standing and challenging research task. However, its subset, Human Image Aesthetic Assessment (HIAA), has been scarcely explored, even though HIAA is widely used in social media, AI workflows, and related domains. To bridge this research gap, our work pioneers a holistic implementation framework tailored for HIAA. Specifically, we introduce HumanBeauty, the first dataset purpose-built for HIAA, which comprises 108k high-quality human images with manual annotations. To achieve comprehensive and fine-grained HIAA, 50K human images are manually collected through a rigorous curation process and annotated leveraging our trailblazing 12-dimensional aesthetic standard, while the remaining 58K with overall aesthetic labels are systematically filtered from public datasets. Based on the HumanBeauty database, we propose HumanAesExpert, a powerful Vision Language Model for aesthetic evaluation of human images. We innovatively design an Expert head to incorporate human knowledge of aesthetic sub-dimensions while jointly utilizing the Language Modeling (LM) and Regression head. This approach empowers our model to achieve superior proficiency in both overall and fine-grained HIAA. Furthermore, we introduce a MetaVoter, which aggregates scores from all three heads, to effectively balance the capabilities of each head, thereby realizing improved assessment precision. Extensive experiments demonstrate that our HumanAesExpert models deliver significantly better performance in HIAA than other state-of-the-art models. Our datasets, models, and codes are publicly released to advance the HIAA community. Project webpage: this https URL

[CV-56] Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

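[Quick read]: This paper addresses two key problems that multimodal large language models (MLLMs) face with outcome-supervised (ORM) methods such as GRPO: low data utilization and text-bias. Low data utilization means that no positive reward is obtained on difficult samples, so they cannot be used to update the model; text-bias means that the model relies excessively on the text condition during reasoning and ignores the image condition. The paper proposes Hint-GRPO, which improves data utilization by adaptively providing hints, and text-bias calibration, which mitigates text-bias at test time by calibrating the token prediction logits with the image condition. These methods substantially improve MLLM performance on complex multimodal reasoning tasks.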

Link: https://arxiv.org/abs/2503.23905
Authors: Qihan Huang, Long Chan, Jinlong Liu, Wanggui He, Hao Jiang, Mingli Song, Jingyuan Chen, Chang Yao, Jie Song
Affiliations: Zhejiang University; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:MLLM reasoning has drawn widespread research interest for its excellent problem-solving capability. Current reasoning methods fall into two types: PRM, which supervises the intermediate reasoning steps, and ORM, which supervises the final results. Recently, DeepSeek-R1 has challenged the traditional view that PRM outperforms ORM by demonstrating strong generalization performance with an ORM method (i.e., GRPO). However, current GRPO algorithms for MLLMs still struggle to handle challenging and complex multimodal reasoning tasks (e.g., mathematical reasoning). In this work, we reveal two problems that impede the performance of GRPO on the MLLM: low data utilization and text-bias. Low data utilization means that GRPO cannot obtain positive rewards to update the MLLM on difficult samples, and text-bias is a phenomenon in which the MLLM bypasses the image condition and relies solely on the text condition for generation after GRPO training. To tackle these problems, this work proposes Hint-GRPO, which improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration, which mitigates text-bias by calibrating the token prediction logits with the image condition at test time. Experimental results on three base MLLMs across eleven datasets demonstrate that our proposed methods advance the reasoning capability of the original MLLM by a large margin, exhibiting superior performance to existing MLLM reasoning methods. Our code is available at this https URL.
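
The low data utilization problem follows directly from how GRPO scores a group of sampled responses. The sketch below assumes the commonly described group-normalized advantage (reward minus the group mean, divided by the group standard deviation); the abstract itself does not state the formula, and the function name and `eps` value are illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hard sample: every sampled response is wrong, so every reward is 0.
hard_sample = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
# Easier sample: some responses are correct.
mixed_sample = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all rewards in a group are zero, every advantage is zero and the sample contributes no learning signal. Providing a hint that lets at least one response succeed restores a nonzero spread of advantages, which is the intuition behind adapting hints to sample difficulty.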

[CV-57] Training-Free Text-Guided Image Editing with Visual Autoregressive Model

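[Quick read]: This paper addresses two problems in text-guided image editing: error propagation caused by inaccurate inversion, and global modifications caused by the entanglement between text prompts and image features. It proposes a framework based on VAR (Visual AutoRegressive modeling). The key is a caching mechanism that stores the token indices and probability distributions of the original image, capturing the relationship between the source prompt and the image. On top of this cache, an adaptive fine-grained masking strategy dynamically identifies and constrains modifications to relevant regions to avoid unintended changes, and token reassembling further refines the edit, improving precision, diversity, and fidelity. The method needs no explicit inversion step and is training-free; it processes a high-resolution (1K) image in as little as 1.2 seconds and achieves performance comparable to, or better than, existing diffusion-based and rectified flow-based methods.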

Link: https://arxiv.org/abs/2503.23897
Authors: Yufei Wang, Lanqing Guo, Zhihao Li, Jiaxing Huang, Pichao Wang, Bihan Wen, Jian Wang
Affiliations: Snap Research; Nanyang Technological University; UT Austin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-guided image editing is an essential task that enables users to modify images through natural language descriptions. Recent advances in diffusion models and rectified flows have significantly improved editing quality, primarily relying on inversion techniques to extract structured noise from input images. However, inaccuracies in inversion can propagate errors, leading to unintended modifications and compromising fidelity. Moreover, even with perfect inversion, the entanglement between textual prompts and image features often results in global changes when only local edits are intended. To address these challenges, we propose a novel text-guided image editing framework based on VAR (Visual AutoRegressive modeling), which eliminates the need for explicit inversion while ensuring precise and controlled modifications. Our method introduces a caching mechanism that stores token indices and probability distributions from the original image, capturing the relationship between the source prompt and the image. Using this cache, we design an adaptive fine-grained masking strategy that dynamically identifies and constrains modifications to relevant regions, preventing unintended changes. A token reassembling approach further refines the editing process, enhancing diversity, fidelity, and control. Our framework operates in a training-free manner and achieves high-fidelity editing with faster inference speeds, processing a 1K resolution image in as fast as 1.2 seconds. Extensive experiments demonstrate that our method achieves performance comparable to, or even surpassing, existing diffusion- and rectified flow-based approaches in both quantitative metrics and visual quality. The code will be released.

[CV-58] DiffScale: Continuous Downscaling and Bias Correction of Subseasonal Wind Speed Forecasts using Diffusion Models

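[Quick read]: This paper addresses insufficient wind speed forecast accuracy in the renewable energy sector, specifically by improving the skill of subseasonal-to-seasonal (S2S) weather forecasts of wind speed. It proposes DiffScale, a diffusion model that uses classifier-free guidance to spatially super-resolve S2S forecasts of surface wind speed. The core of the approach is to use weather priors as guidance for the generative process and to take a conditional-probability view that directly estimates the density of the target S2S forecasts at different spatial resolutions and lead times, without auto-regression or sequence prediction, which yields an efficient and flexible model. DiffScale can also downscale by arbitrary scaling factors, generalizing across grid resolutions and lead times without retraining while correcting model errors. Experiments show a significant improvement in prediction quality over the baselines up to week 3.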

Link: https://arxiv.org/abs/2503.23893
Authors: Maximilian Springenberg, Noelia Otero, Yuxin Xue, Jackie Ma
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 28 pages, 18 figures, preprint under review

Abstract:Renewable resources are strongly dependent on local and large-scale weather situations. Skillful subseasonal to seasonal (S2S) forecasts, beyond two weeks and up to two months, can offer significant socioeconomic advantages to the energy sector. This study aims to enhance wind speed predictions using a diffusion model with classifier-free guidance to downscale S2S forecasts of surface wind speed. We propose DiffScale, a diffusion model that super-resolves spatial information for continuous downscaling factors and lead times. Leveraging weather priors as guidance for the generative process of diffusion models, we adopt the perspective of conditional probabilities on sampling super-resolved S2S forecasts. We aim to directly estimate the density associated with the target S2S forecasts at different spatial resolutions and lead times without auto-regression or sequence prediction, resulting in an efficient and flexible model. Synthetic experiments were designed to super-resolve wind speed S2S forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) from a coarse resolution to the finer resolution of ERA5 reanalysis data, which serves as a high-resolution target. The innovative aspect of DiffScale lies in its flexibility to downscale by arbitrary scaling factors, enabling it to generalize across various grid resolutions and lead times, without retraining the model, while correcting model errors, making it a versatile tool for improving S2S wind speed forecasts. We achieve a significant improvement in prediction quality, outperforming baselines up to week 3.
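
Classifier-free guidance, which the abstract names as the conditioning mechanism, is usually applied by blending a conditional and an unconditional noise prediction at each sampling step. The sketch below shows that standard blending rule on plain lists; the values, the function name, and the guidance scale are made up, and how DiffScale encodes its weather priors is not specified here.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions element-wise."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(eps_uncond, eps_cond)
    ]

eps_u = [0.2, -0.1, 0.4]   # made-up prediction without the weather prior
eps_c = [0.5, 0.1, 0.0]    # made-up prediction conditioned on the prior
guided = classifier_free_guidance(eps_u, eps_c, guidance_scale=2.0)
```

A scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional one, and larger values push the sample further toward the conditioning signal.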

[CV-59] MuseFace: Text-driven Face Editing via Diffusion-based Mask Generation Approach

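[Quick read]: This paper addresses the inability of existing text-driven face editing methods to provide diversity, controllability, and flexibility at the same time. It proposes MuseFace, a text-driven face editing framework. The key is the combination of a Text-to-Mask diffusion model and a semantic-aware face editing model: the Text-to-Mask diffusion model supplies diversity and flexibility, and the semantic-aware face editing model ensures controllability. With this design MuseFace generates fine-grained semantic masks directly from text and performs precise face editing, significantly improving the controllability and flexibility of face editing models while maintaining high-fidelity performance.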

Link: https://arxiv.org/abs/2503.23888
Authors: Xin Zhang, Siting Huang, Xiangyang Luo, Yifan Xie, Weijiang Yu, Heng Chang, Fei Ma, Fei Yu
Affiliations: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Xi’an Jiaotong University, Xi’an, China; Sun Yat-sen University, Guangzhou, China; Tsinghua University, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 figures, IEEE International Conference on Multimedia and Expo 2025

Abstract:Face editing modifies the appearance of a face and plays a key role in the customization and enhancement of personal images. Although much work has achieved remarkable success in text-driven face editing, existing methods still face significant challenges, as none of them simultaneously fulfills the characteristics of diversity, controllability and flexibility. To address this challenge, we propose MuseFace, a text-driven face editing framework, which relies solely on a text prompt to enable face editing. Specifically, MuseFace integrates a Text-to-Mask diffusion model and a semantic-aware face editing model, capable of directly generating fine-grained semantic masks from text and performing face editing. The Text-to-Mask diffusion model provides diversity and flexibility to the framework, while the semantic-aware face editing model ensures controllability of the framework. Our framework can create fine-grained semantic masks, making precise face editing possible, and significantly enhancing the controllability and flexibility of face editing models. Extensive experiments demonstrate that MuseFace achieves superior high-fidelity performance.

[CV-60] GLane3D : Detecting Lanes with Graph of 3D Keypoints CVPR2025

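[Quick read]: This paper addresses the challenge of achieving strong generalization in 3D lane detection. Lane structures vary widely around the world, and traditional top-down methods struggle with lanes that have previously unseen attributes. To overcome this limitation, the paper detects lane keypoints and then predicts sequential connections between them to construct complete 3D lanes. Because each keypoint is essential for lane continuity, multiple proposals are generated per keypoint by allowing adjacent grids to predict the same keypoint through an offset mechanism. PointNMS then eliminates overlapping keypoint proposals, which reduces redundancy in the estimated BEV graph and lowers the computational overhead of connection estimation. The model outperforms existing methods on both the Apollo and OpenLane datasets and shows higher F1 scores and stronger generalization in cross-dataset evaluation.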

Link: https://arxiv.org/abs/2503.23882
Authors: Halil İbrahim Öztürk, Muhammet Esat Kalfaoğlu, Ozsel Kilinc
Affiliations: Togg/Trutek AI Team
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025

Abstract:Accurate and efficient lane detection in 3D space is essential for autonomous driving systems, where robust generalization is the foremost requirement for 3D lane detection algorithms. Considering the extensive variation in lane structures worldwide, achieving high generalization capacity is particularly challenging, as algorithms must accurately identify a wide variety of lane patterns worldwide. Traditional top-down approaches rely heavily on learning lane characteristics from training datasets, often struggling with lanes exhibiting previously unseen attributes. To address this generalization limitation, we propose a method that detects keypoints of lanes and subsequently predicts sequential connections between them to construct complete 3D lanes. Each key point is essential for maintaining lane continuity, and we predict multiple proposals per keypoint by allowing adjacent grids to predict the same keypoint using an offset mechanism. PointNMS is employed to eliminate overlapping proposal keypoints, reducing redundancy in the estimated BEV graph and minimizing computational overhead from connection estimations. Our model surpasses previous state-of-the-art methods on both the Apollo and OpenLane datasets, demonstrating superior F1 scores and a strong generalization capacity when models trained on OpenLane are evaluated on the Apollo dataset, compared to prior approaches.
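
The role of PointNMS, removing overlapping keypoint proposals before connections are estimated, can be sketched as a greedy distance-based suppression. This is an assumed, generic form of point NMS in 2D BEV coordinates. The paper's exact criterion is not given in the abstract, and the function name, radius, and data are hypothetical.

```python
import math

def point_nms(points, scores, radius):
    """Greedy NMS over 2D keypoint proposals.

    Visits proposals from highest to lowest score and keeps one only if
    it is at least `radius` away from every proposal already kept.
    Returns the indices of the kept proposals.
    """
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(math.dist(points[i], points[j]) >= radius for j in kept):
            kept.append(i)
    return kept

proposals = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]  # made-up BEV points
confidences = [0.9, 0.8, 0.6, 0.7]
kept_indices = point_nms(proposals, confidences, radius=0.5)
```

Here two clusters of near-duplicate proposals collapse to their highest-scoring members, so only two nodes remain for the later connection step instead of four.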

[CV-61] ExScene: Free-View 3D Scene Reconstruction with Gaussian Splatting from a Single Image ICME2025

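[Quick read]: This paper addresses the reconstruction of high-quality, immersive panoramic 3D scenes from a single-view image. Because single-view input provides only partial priors, existing methods are usually limited to low-consistency 3D scenes with narrow fields of view and generalize poorly to immersive scenes. The paper proposes ExScene, a two-stage pipeline. A novel multimodal diffusion model first generates a high-fidelity, globally consistent panoramic image; geometric information from panoramic depth estimation is then combined with the panoramic image to train an initial 3D Gaussian Splatting (3DGS) model. Next, a GS refinement technique based on 2D stable video diffusion priors adds camera trajectory consistency and color-geometric priors to the denoising process to improve color and spatial consistency across image sequences. The refined sequences are used to fine-tune the initial 3DGS model, improving reconstruction quality. Experiments show that ExScene achieves consistent and immersive scene reconstruction from single-view input alone, significantly surpassing state-of-the-art baselines.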

Link: https://arxiv.org/abs/2503.23881
Authors: Tianyi Gong, Boyan Li, Yifei Zhong, Fangxin Wang
Affiliations: Shenzhen Future Network of Intelligence Institute; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICME 2025

Abstract:The increasing demand for augmented and virtual reality applications has highlighted the importance of crafting immersive 3D scenes from a simple single-view image. However, because single-view input provides only partial priors, existing methods are often limited to reconstructing low-consistency 3D scenes with narrow fields of view. These limitations make them less capable of generalizing to immersive scenes. To address this problem, we propose ExScene, a two-stage pipeline to reconstruct an immersive 3D scene from any given single-view image. ExScene designs a novel multimodal diffusion model to generate a high-fidelity and globally consistent panoramic image. We then develop a panoramic depth estimation approach to calculate geometric information from the panorama, and we combine this geometric information with the high-fidelity panoramic image to train an initial 3D Gaussian Splatting (3DGS) model. Following this, we introduce a GS refinement technique with 2D stable video diffusion priors. We add camera trajectory consistency and color-geometric priors into the denoising process of diffusion to improve color and spatial consistency across image sequences. These refined sequences are then used to fine-tune the initial 3DGS model, leading to better reconstruction quality. Experimental results demonstrate that our ExScene achieves consistent and immersive scene reconstruction using only single-view input, significantly surpassing state-of-the-art baselines.
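
Turning a panorama plus estimated depth into geometry typically means back-projecting each pixel along its viewing ray. The sketch below assumes an equirectangular panorama and a particular axis convention (+z forward at the image center, +y up). Neither is stated in the abstract, and the function name is made up, so treat it as one plausible way such points could be obtained to initialize a 3DGS model.

```python
import math

def panorama_pixel_to_point(u, v, depth, width, height):
    """Back-project an equirectangular pixel with depth to a 3D point.

    Assumed convention: longitude spans [-pi, pi) across the width,
    latitude spans [-pi/2, pi/2] from bottom to top of the image.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = depth * math.cos(lat) * math.sin(lon)
    y = depth * math.sin(lat)
    z = depth * math.cos(lat) * math.cos(lon)
    return (x, y, z)

# The center pixel of a 1024x512 panorama looks straight ahead.
center = panorama_pixel_to_point(u=512, v=256, depth=2.0, width=1024, height=512)
```

Under this convention the center pixel at depth 2 maps to a point 2 units along the forward axis, and the returned point always lies at distance `depth` from the camera.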

[CV-62] ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos ICRA2025

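[Quick read]: This paper asks how to distill skill policies for robotic manipulation from large pre-recorded human video datasets, without relying on robot-specific demonstrations or additional exploration. The key is ZeroMimic, a system that exploits recent advances in semantic and geometric visual understanding of human videos, together with modern grasp affordance detectors and imitation policy classes, to generate immediately deployable image goal-conditioned skill policies. The policies cover several common categories of manipulation tasks (opening, closing, pouring, pick-and-place, cutting, and stirring) and act on diverse objects across diverse unseen task setups.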

Link: https://arxiv.org/abs/2503.23877
Authors: Junyao Shi, Zhuolun Zhao, Tianyou Wang, Ian Pedroza, Amy Luo, Jie Wang, Jason Ma, Dinesh Jayaraman
Affiliations: University of Pennsylvania
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICRA 2025. Project website: this https URL

Abstract:Many recent advances in robotic manipulation have come through imitation learning, yet these rely largely on mimicking a particularly hard-to-acquire form of demonstrations: those collected on the same robot in the same room with the same objects as the trained policy must handle at test time. In contrast, large pre-recorded human video datasets demonstrating manipulation skills in-the-wild already exist, which contain valuable information for robots. Is it possible to distill a repository of useful robotic skill policies out of such data without any additional requirements on robot-specific demonstrations or exploration? We present the first such system ZeroMimic, that generates immediately deployable image goal-conditioned skill policies for several common categories of manipulation tasks (opening, closing, pouring, pickplace, cutting, and stirring) each capable of acting upon diverse objects and across diverse unseen task setups. ZeroMimic is carefully designed to exploit recent advances in semantic and geometric visual understanding of human videos, together with modern grasp affordance detectors and imitation policy classes. After training ZeroMimic on the popular EpicKitchens dataset of ego-centric human videos, we evaluate its out-of-the-box performance in varied real-world and simulated kitchen settings with two different robot embodiments, demonstrating its impressive abilities to handle these varied tasks. To enable plug-and-play reuse of ZeroMimic policies on other task setups and robots, we release software and policy checkpoints of our skill policies.

[CV-63] Learned Image Compression and Restoration for Digital Pathology

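[Quick read]: This paper addresses the challenges that the ultra-high resolution and large file sizes of digital pathology images pose for storage, transmission, and real-time visualization. It proposes CLERIC, a deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC combines a learnable lifting scheme with advanced convolutional techniques: a lifting-scheme transform in the analysis stage decomposes images into low- and high-frequency components, enabling more structured latent representations, and parallel encoders with Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform to reconstruct the image, preserving fine-grained tissue structures with high fidelity. Experiments show that CLERIC achieves better rate-distortion performance than state-of-the-art learned image compression models, significantly reducing storage requirements while maintaining high diagnostic image quality.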

Link: https://arxiv.org/abs/2503.23862
Authors: SeonYeong Lee, EonSeung Seong, DongEon Lee, SiYeoul Lee, Yubin Cho, Chunsu Park, Seonho Kim, MinKyoung Seo, YoungSin Ko, MinWoo Kim
Affiliations: Department of Information Convergence Engineering, Pusan National University, Yangsan, Korea; Seegene Medical Foundation, Seoul, Korea; School of Biomedical Convergence Engineering, Pusan National University, Yangsan, Korea
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Digital pathology images play a crucial role in medical diagnostics, but their ultra-high resolution and large file sizes pose significant challenges for storage, transmission, and real-time visualization. To address these issues, we propose CLERIC, a novel deep learning-based image compression framework designed specifically for whole slide images (WSIs). CLERIC integrates a learnable lifting scheme and advanced convolutional techniques to enhance compression efficiency while preserving critical pathological details. Our framework employs a lifting-scheme transform in the analysis stage to decompose images into low- and high-frequency components, enabling more structured latent representations. These components are processed through parallel encoders incorporating Deformable Residual Blocks (DRB) and Recurrent Residual Blocks (R2B) to improve feature extraction and spatial adaptability. The synthesis stage applies an inverse lifting transform for effective image reconstruction, ensuring high-fidelity restoration of fine-grained tissue structures. We evaluate CLERIC on a digital pathology image dataset and compare its performance against state-of-the-art learned image compression (LIC) models. Experimental results demonstrate that CLERIC achieves superior rate-distortion (RD) performance, significantly reducing storage requirements while maintaining high diagnostic image quality. Our study highlights the potential of deep learning-based compression in digital pathology, facilitating efficient data management and long-term storage while ensuring seamless integration into clinical workflows and AI-assisted diagnostic systems. Code and models are available at: this https URL.
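
A lifting scheme decomposes a signal into low- and high-frequency parts through split, predict, and update steps, and it is invertible by construction. The sketch below uses the classic fixed integer Haar predict and update operators as a stand-in; CLERIC's lifting operators are learnable, so this shows only the structure of the transform, not the paper's implementation. Function names and the sample row are made up.

```python
def lifting_forward(signal):
    """One level of integer Haar lifting: split, predict, update."""
    even, odd = signal[0::2], signal[1::2]
    detail = [o - e for e, o in zip(even, odd)]            # predict step
    approx = [e + d // 2 for e, d in zip(even, detail)]    # update step
    return approx, detail

def lifting_inverse(approx, detail):
    """Undo the update and predict steps, then interleave the samples."""
    even = [a - d // 2 for a, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    signal = []
    for e, o in zip(even, odd):
        signal.extend([e, o])
    return signal

row = [12, 14, 200, 202, 90, 10, 7, 7]   # made-up pixel row, even length
low, high = lifting_forward(row)
restored = lifting_inverse(low, high)
```

The inverse reuses exactly the same predict and update operators with the signs flipped, which is why replacing them with learned functions still leaves the transform invertible.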

[CV-64] FlexiMo: A Flexible Remote Sensing Foundation Model

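[Quick read]: This paper addresses the fact that existing remote sensing foundation models are constrained by fixed spatial resolutions and patch sizes and therefore cannot fully exploit the heterogeneous spatial characteristics of satellite imagery. It proposes FlexiMo, a remote sensing foundation model that can adapt to arbitrary spatial resolutions. Its core is a spatial resolution-aware module that uses a parameter-free alignment embedding mechanism to dynamically recalibrate patch embeddings according to the input image's resolution and dimensions, preserving critical token characteristics, ensuring multi-scale feature fidelity, and enabling efficient feature extraction without changing the underlying network architecture. FlexiMo also includes a lightweight channel adaptation module that uses prior spectral information from sensors, allowing the model to process images with varying numbers of channels while retaining the data's physical properties. These designs significantly improve generalization and robustness across a range of downstream tasks.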

Link: https://arxiv.org/abs/2503.23844
Authors: Xuyang Li, Chenyu Li, Pedram Ghamisi, Danfeng Hong
Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences; School of Mathematics and Statistics, Southeast University; Helmholtz-Zentrum Dresden-Rossendorf (HZDR); Lancaster Environment Centre, Lancaster University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid expansion of multi-source satellite imagery drives innovation in Earth observation, opening unprecedented opportunities for Remote Sensing Foundation Models to harness diverse data. However, many existing models remain constrained by fixed spatial resolutions and patch sizes, limiting their ability to fully exploit the heterogeneous spatial characteristics inherent in satellite imagery. To address these challenges, we propose FlexiMo, a flexible remote sensing foundation model that endows the pre-trained model with the flexibility to adapt to arbitrary spatial resolutions. Central to FlexiMo is a spatial resolution-aware module that employs a parameter-free alignment embedding mechanism to dynamically recalibrate patch embeddings based on the input image’s resolution and dimensions. This design not only preserves critical token characteristics and ensures multi-scale feature fidelity but also enables efficient feature extraction without requiring modifications to the underlying network architecture. In addition, FlexiMo incorporates a lightweight channel adaptation module that leverages prior spectral information from sensors. This mechanism allows the model to process images with varying numbers of channels while maintaining the data’s intrinsic physical properties. Extensive experiments on diverse multimodal, multi-resolution, and multi-scale datasets demonstrate that FlexiMo significantly enhances model generalization and robustness. In particular, our method achieves outstanding performance across a range of downstream tasks, including scene classification, land cover classification, urban building segmentation, and cloud detection. By enabling parameter-efficient and physically consistent adaptation, FlexiMo paves the way for more adaptable and effective foundation models in real-world remote sensing applications.

[CV-65] Conformal uncertainty quantification to evaluate predictive fairness of foundation AI model for skin lesion classes across patient demographics

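[Quick read]: This paper addresses the lack of transparency in deep-learning-based diagnostic AI systems for medical imaging, particularly large self-supervised pre-trained foundation models, whose embeddings are produced by a process that is not interpretable and are therefore hard to trust in clinical use. The key solution is to apply **conformal analysis** to quantify the predictive uncertainty of a Vision Transformer (ViT)-based foundation model across patient demographics (sex, age, ethnicity), providing a coverage guarantee for the model's outputs together with an uncertainty score for each individual. The study also examines how a dynamic F1-score-based sampling strategy used during training mitigates class imbalance, and evaluates its effect on uncertainty quantification (UQ). The approach strengthens fairness evaluation of the model and improves the trustworthiness and robustness of clinical AI systems.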

Link: https://arxiv.org/abs/2503.23819
Authors: Swarnava Bhattacharyya,Umapada Pal,Tapabrata Chakraborti
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning-based diagnostic AI systems that operate on medical images are starting to match the performance of human experts. However, these data-hungry, complex systems are inherently black boxes and are therefore slow to be adopted in high-risk applications like healthcare. This lack of transparency is exacerbated in recent large foundation models, which are trained in a self-supervised manner on millions of data points to provide robust generalisation across a range of downstream tasks, but whose embeddings are generated through a process that is not interpretable and hence not easily trusted in clinical applications. To address this timely issue, we deploy conformal analysis to quantify the predictive uncertainty of a vision transformer (ViT) based foundation model across patient demographics with respect to sex, age and ethnicity for the task of skin lesion classification using several public benchmark datasets. The significant advantage of this method is that conformal analysis is method independent: it not only provides a coverage guarantee at population level but also provides an uncertainty score for each individual. We used a model-agnostic dynamic F1-score-based sampling during model training, which helped to stabilize the class imbalance, and we investigate the effects on uncertainty quantification (UQ) with and without this bias mitigation step. Thus we show how this can be used as a fairness metric to evaluate the robustness of the feature embeddings of the foundation model (Google DermFoundation) and thus advance the trustworthiness and fairness of clinical AI.
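
The coverage guarantee and per-individual uncertainty score mentioned in this abstract can be illustrated with the standard split conformal recipe for classification. This is only a minimal sketch: the abstract does not state which nonconformity score or calibration scheme the authors use, so the score `1 - p(true class)`, the toy calibration values, and the function names below are all assumptions made for illustration.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    # Rank ceil((n + 1) * (1 - alpha)), capped at n, as in split conformal prediction.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """Keep every class whose nonconformity score 1 - p does not exceed qhat."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

# Hypothetical calibration data: probability the model gave to the true class.
cal_true_class_probs = [0.9, 0.4, 0.95, 0.3, 0.7, 0.85, 0.5, 0.75, 0.6]
cal_scores = [1.0 - p for p in cal_true_class_probs]
qhat = conformal_quantile(cal_scores, alpha=0.25)

# The size of the prediction set acts as a per-sample uncertainty score.
confident = prediction_set([0.9, 0.07, 0.03], qhat)
ambiguous = prediction_set([0.5, 0.45, 0.05], qhat)
```

With these toy numbers the first sample gets a single-class set and the second a two-class set. Comparing average set sizes across demographic subgroups is one way such a score could serve as a fairness indicator.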

[CV-66] Bridge the Gap Between Visual and Linguistic Comprehension for Generalized Zero-shot Semantic Segmentation

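[Quick read]: This paper tackles a key challenge in generalized zero-shot semantic segmentation (GZS3): using semantic representations (such as word vectors) to segment not only seen classes but also regions of novel classes absent from the training data. Existing methods associate classes and transfer knowledge from seen to unseen classes through a single semantic representation, which is both insufficient and incompatible with human cognition. Inspired by the observation that humans use part and state information to comprehend seen objects and imagine unseen classes, the paper decouples each class into detailed descriptions covering object parts and states. The core of the solution is the Decoupled Vision-Language Matching (DeVLMatch) framework, which comprises a spatial-part matching module (SPMatch) and a channel-state matching module (CSMatch). SPMatch comprehends objects through spatial part information from both the visual and linguistic perspectives and performs graph matching to bridge the gap; CSMatch matches object states from the linguistic perspective to compatible channel information from the visual perspective. Decoupling and matching objects across visual and linguistic comprehension makes the relationship between seen and unseen classes explicit at the fine-grained part and state levels, which facilitates knowledge transfer from seen to unseen classes. The framework surpasses previous methods on standard benchmarks including PASCAL VOC, COCO-Stuff, and CATARACTS.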

Link: https://arxiv.org/abs/2503.23806
Authors: Xiaoqing Guo,Wuyang Li,Yixuan Yuan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generalized zero-shot semantic segmentation (GZS3) aims to achieve the human-level capability of segmenting not only seen classes but also novel class regions unseen in the training data through introducing the bridge of semantic representations, e.g., word vectors. While effective, the way of utilizing one semantic representation to associate the corresponding class and to enable the knowledge transfer from seen to unseen classes is insufficient as well as incompatible with human cognition. Inspired by the observation that humans often use some 'part' and 'state' information to comprehend the seen objects and imagine unseen classes, we decouple each class into detailed descriptions, including object parts and states. Based on the decoupling formulation, we propose a Decoupled Vision-Language Matching (DeVLMatch) framework, composed of spatial-part (SPMatch) and channel-state (CSMatch) matching modules, for GZS3. In SPMatch, we comprehend objects with spatial part information from both visual and linguistic perspectives and perform graph matching to bridge the gap. In CSMatch, states of objects from the linguistic perspective are matched to compatible channel information from the visual perspective. By decoupling and matching objects across visual and linguistic comprehension, we can explicitly introspect the relationship between seen and unseen classes in fine-grained object part and state levels, thereby facilitating the knowledge transfer from seen to unseen classes in visual space. The proposed DeVLMatch framework surpasses the previous GZS3 methods on standard benchmarks, including PASCAL VOC, COCO-Stuff, and CATARACTS, demonstrating its effectiveness.

[CV-67] On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices

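[Quick read]: This paper addresses the efficiency and quality challenges of diffusion-based on-device text-to-video generation on computation- and memory-limited mobile devices. It proposes On-device Sora, a solution that requires no model retraining. Its key techniques are: 1) Linear Proportional Leap (LPL), which reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach; 2) Temporal Dimension Token Merging (TDTM), which lowers the cost of intensive token processing in attention layers by merging consecutive tokens along the temporal dimension; and 3) Concurrent Inference with Dynamic Loading (CI-DL), which dynamically partitions large models into smaller blocks and loads them into memory for concurrent inference, addressing limited device memory. Together these techniques enable efficient, high-quality video generation on resource-constrained devices such as smartphones.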

Link: https://arxiv.org/abs/2503.23796
Authors: Bosung Kim,Kyuhwan Lee,Isu Jeong,Jungmin Cheon,Yeojin Lee,Seulki Lee
Affiliations: Ulsan National Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, the proposed On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device, comparable to those produced by high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available at a GitHub repository(this https URL).
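
The idea behind Temporal Dimension Token Merging, merging consecutive tokens along the time axis so that attention layers process fewer tokens, can be sketched as follows. The abstract does not say how tokens are merged or restored, so averaging adjacent frame pairs, the nested-list layout `frames[t][n][d]`, and the function name are assumptions for illustration only.

```python
def merge_temporal_tokens(frames):
    """Average each pair of consecutive frames token by token.

    frames[t][n][d] holds feature d of token n in frame t.
    An odd trailing frame is kept unchanged.
    """
    merged = []
    for t in range(0, len(frames) - 1, 2):
        first, second = frames[t], frames[t + 1]
        merged.append([
            [(x + y) / 2 for x, y in zip(tok_a, tok_b)]
            for tok_a, tok_b in zip(first, second)
        ])
    if len(frames) % 2 == 1:
        merged.append(frames[-1])
    return merged

# Three frames, one token each, two features per token (toy values).
frames = [[[0.0, 2.0]], [[2.0, 4.0]], [[5.0, 5.0]]]
merged = merge_temporal_tokens(frames)
```

Halving the temporal length roughly halves the number of tokens the attention layers see, which is where the computational saving would come from in this simplified picture.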

[CV-68] Pan-LUT: Efficient Pan-sharpening via Learnable Look-Up Tables

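[Quick read]: This paper addresses the excessive inference-time computational overhead of deep learning-based pan-sharpening algorithms, especially on high-resolution remote sensing images. Such methods hit a performance bottleneck when dedicated computing devices (such as GPUs or TPUs) are unavailable, which limits practical use. The paper proposes Pan-LUT, a framework built on learnable look-up tables (LUTs) that balances performance and computational efficiency. Its key components are the PAN-guided look-up table (PGLUT) for channel-wise spectral mapping, the spatial details look-up table (SDLUT) for capturing fine-grained spatial details, and the adaptive aggregation look-up table (AALUT) for adaptively learning local contexts. Pan-LUT contains fewer than 300K parameters and processes an 8K-resolution image in under 1 ms on a single NVIDIA GeForce RTX 2080 Ti GPU, and it surpasses state-of-the-art (SOTA) methods on full-resolution scenes under real-world conditions.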

Link: https://arxiv.org/abs/2503.23793
Authors: Zhongnan Cai,Yingying Wang,Yunlong Lin,Hui Zheng,Ge Meng,Zixu Lin,Jiaxin Xie,Junbin Lu,Yue Huang,Xinghao Ding
Affiliations: Xiamen University; University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures

Abstract:Recently, deep learning-based pan-sharpening algorithms have achieved notable advancements over traditional methods. However, many deep learning-based approaches incur substantial computational overhead during inference, especially with high-resolution images. This excessive computational demand limits the applicability of these methods in real-world scenarios, particularly in the absence of dedicated computing devices such as GPUs and TPUs. To address these challenges, we propose Pan-LUT, a novel learnable look-up table (LUT) framework for pan-sharpening that strikes a balance between performance and computational efficiency for high-resolution remote sensing images. To finely control the spectral transformation, we devise the PAN-guided look-up table (PGLUT) for channel-wise spectral mapping. To effectively capture fine-grained spatial details and adaptively learn local contexts, we introduce the spatial details look-up table (SDLUT) and adaptive aggregation look-up table (AALUT). Our proposed method contains fewer than 300K parameters and processes an 8K resolution image in under 1 ms using a single NVIDIA GeForce RTX 2080 Ti GPU, demonstrating significantly faster performance compared to other methods. Experiments reveal that Pan-LUT efficiently processes large remote sensing images in a lightweight manner, bridging the gap to real-world applications. Furthermore, our model surpasses SOTA methods in full-resolution scenes under real-world conditions, highlighting its effectiveness and efficiency.
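
Why a look-up table is cheap at inference time can be seen from a one-dimensional example: each pixel value is mapped by indexing a small table and interpolating between two entries, with no convolution involved. The sketch below shows only this generic mechanism. The PGLUT, SDLUT, and AALUT in the paper are learned and presumably higher-dimensional, so the table values and the function name here are purely illustrative assumptions.

```python
def apply_lut(lut, x):
    """Map a value x in [0, 1] through a 1D look-up table with linear interpolation."""
    x = min(max(x, 0.0), 1.0)           # clamp to the table's domain
    pos = x * (len(lut) - 1)            # fractional index into the table
    lo = int(pos)
    hi = min(lo + 1, len(lut) - 1)
    frac = pos - lo
    return lut[lo] * (1 - frac) + lut[hi] * frac

# A hypothetical 3-entry table that brightens mid-tones.
lut = [0.0, 0.8, 1.0]
mapped = [apply_lut(lut, v) for v in (0.0, 0.25, 0.5, 1.0)]
```

In a learnable LUT the table entries themselves are the trainable parameters, which is consistent with the small parameter count the abstract reports.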

[CV-69] MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation

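[Quick read]: This paper addresses the difficulty Segment Anything Models (SAMs) have with fine-grained detail segmentation in high-resolution class-agnostic segmentation (HRCS), which stems from limited ability to process high-resolution inputs directly, low-resolution mask predictions, and reliance on accurate manual prompts. It proposes MGD-SAM2, which integrates SAM2 with multi-view feature interaction between a global image and local patches to achieve more precise segmentation. MGD-SAM2 adds four modules: the Multi-view Perception Adapter (MPAdapter), the Multi-view Complementary Enhancement Module (MCEM), the Hierarchical Multi-view Interaction Module (HMIM), and the Detail Refinement Module (DRM). These improve the use of local texture and global semantic information and compensate for the fine-grained detail lost by direct upsampling. Experiments show superior performance and strong generalization on multiple high-resolution and normal-resolution datasets.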

Link: https://arxiv.org/abs/2503.23786
Authors: Haoran Shen,Peixian Zhuang,Jiahao Kou,Yuxin Zeng,Haoying Xu,Jiangyun Li
Affiliations: School of Automation and Electrical Engineering, University of Science and Technology Beijing; Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, University of Science and Technology Beijing; Shunde Graduate School of University of Science and Technology Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Segment Anything Models (SAMs), as vision foundation models, have demonstrated remarkable performance across various image analysis tasks. Despite their strong generalization capabilities, SAMs encounter challenges in fine-grained detail segmentation for high-resolution class-independent segmentation (HRCS), due to the limitations in the direct processing of high-resolution inputs and low-resolution mask predictions, and the reliance on accurate manual prompts. To address these limitations, we propose MGD-SAM2 which integrates SAM2 with multi-view feature interaction between a global image and local patches to achieve precise segmentation. MGD-SAM2 incorporates the pre-trained SAM2 with four novel modules: the Multi-view Perception Adapter (MPAdapter), the Multi-view Complementary Enhancement Module (MCEM), the Hierarchical Multi-view Interaction Module (HMIM), and the Detail Refinement Module (DRM). Specifically, we first introduce MPAdapter to adapt the SAM2 encoder for enhanced extraction of local details and global semantics in HRCS images. Then, MCEM and HMIM are proposed to further exploit local texture and global context by aggregating multi-view features within and across multi-scales. Finally, DRM is designed to generate gradually restored high-resolution mask predictions, compensating for the loss of fine-grained details resulting from directly upsampling the low-resolution prediction maps. Experimental results demonstrate the superior performance and strong generalization of our model on multiple high-resolution and normal-resolution datasets. Code will be available at this https URL.

[CV-70] Evaluation of (Un-)Supervised Machine Learning Methods for GNSS Interference Classification with Real-World Data Discrepancies

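[Quick read]: This paper addresses the challenges of applying machine learning (ML)-based methods to monitoring global navigation satellite system (GNSS) interference in real-world environments. Specifically, it considers how to build effective training datasets that contain realistic interference signals and reference labels, given that legal restrictions prohibit deliberately interfering with GNSS sources and so make such data hard to collect. It also examines how to combine data from different sources to improve model adaptability and robustness.

The key to the solution is a series of large-scale measurement campaigns, covering outdoor settings at two highway locations in Germany and in the Seetal Alps in Austria as well as large-scale controlled indoor environments, which yield diverse datasets. These datasets capture realistic GNSS interference together with complicating factors such as multipath effects that can degrade positioning accuracy. By evaluating the latest supervised learning methods and exploring pseudo-labeling for unsupervised learning, the paper shows how outlier detection, domain adaptation, and data augmentation can help models adapt to changes in data distribution, supporting more reliable and accurate vehicle localization.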

Link: https://arxiv.org/abs/2503.23775
Authors: Lucas Heublein,Nisha L. Raichur,Tobias Feigl,Tobias Brieger,Fin Heuer,Lennart Asbach,Alexander Rügamer,Felix Ott
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 34 pages, 25 figures

Abstract:The accuracy and reliability of vehicle localization on roads are crucial for applications such as self-driving cars, toll systems, and digital tachographs. To achieve accurate positioning, vehicles typically use global navigation satellite system (GNSS) receivers to validate their absolute positions. However, GNSS-based positioning can be compromised by interference signals, necessitating the identification, classification, determination of purpose, and localization of such interference to mitigate or eliminate it. Recent approaches based on machine learning (ML) have shown superior performance in monitoring interference. However, their feasibility in real-world applications and environments has yet to be assessed. Effective implementation of ML techniques requires training datasets that incorporate realistic interference signals, including real-world noise and potential multipath effects that may occur between transmitter, receiver, and satellite in the operational area. Additionally, these datasets require reference labels. Creating such datasets is often challenging due to legal restrictions, as causing interference to GNSS sources is strictly prohibited. Consequently, the performance of ML-based methods in practical applications remains unclear. To address this gap, we describe a series of large-scale measurement campaigns conducted in real-world settings at two highway locations in Germany and the Seetal Alps in Austria, and in large-scale controlled indoor environments. We evaluate the latest supervised ML-based methods to report on their performance in real-world settings and present the applicability of pseudo-labeling for unsupervised learning. We demonstrate the challenges of combining datasets due to data discrepancies and evaluate outlier detection, domain adaptation, and data augmentation techniques to present the models’ capabilities to adapt to changes in the datasets.

[CV-71] XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? CVPR2025

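[Quick read]: This paper addresses the limitations of existing benchmarks for evaluating the perception and reasoning capabilities of multimodal large language models (MLLMs) on high-resolution remote sensing (RS) scenes: image sizes notably smaller than real-world RS scenarios, limited annotation quality, and insufficient evaluation dimensions. It presents XLRS-Bench, a comprehensive benchmark for the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. Its key features are a larger average image size (8500 × 8500) and manual annotation of all samples, assisted by a novel semi-automatic captioner for ultra-high-resolution RS images. XLRS-Bench defines 16 sub-tasks that evaluate 10 kinds of perceptual capabilities and 6 kinds of reasoning capabilities, with emphasis on advanced cognitive processes that support real-world decision-making and capture spatiotemporal changes. The benchmark has been open-sourced to support research on more powerful MLLMs for remote sensing.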

Link: https://arxiv.org/abs/2503.23771
Authors: Fengxiang Wang,Hongzhen Wang,Mingshuo Chen,Di Wang,Yulin Wang,Zonghao Guo,Qiang Ma,Long Lan,Wenjing Yang,Jing Zhang,Zhiyuan Liu,Maosong Sun
Affiliations: College of Computer Science and Technology, National University of Defense Technology; Tsinghua University; Beijing University of Posts and Telecommunications; School of Computer Science, Wuhan University; Zhongguancun Academy; School of Artificial Intelligence, Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: It has been accepted by CVPR2025

Abstract:The astonishing breakthrough of multimodal large language models (MLLMs) has necessitated new benchmarks to quantitatively assess their capabilities, reveal their limitations, and indicate future research directions. However, this is challenging in the context of remote sensing (RS), since the imagery features ultra-high resolution that incorporates extremely complex semantic relationships. Existing benchmarks usually adopt notably smaller image sizes than real-world RS scenarios, suffer from limited annotation quality, and consider insufficient dimensions of evaluation. To address these issues, we present XLRS-Bench: a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. XLRS-Bench boasts the largest average image size (8500 × 8500) observed thus far, with all evaluation samples meticulously annotated manually, assisted by a novel semi-automatic captioner on ultra-high-resolution RS images. On top of the XLRS-Bench, 16 sub-tasks are defined to evaluate MLLMs’ 10 kinds of perceptual capabilities and 6 kinds of reasoning capabilities, with a primary emphasis on advanced cognitive processes that facilitate real-world decision-making and the capture of spatiotemporal changes. The results of both general and RS-focused MLLMs on XLRS-Bench indicate that further efforts are needed for real-world RS applications. We have open-sourced XLRS-Bench to support further research in developing more powerful MLLMs for remote sensing.

[CV-72] STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

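[Quick read]: This paper addresses the largely unexamined ability of multimodal large language models (MLLMs) to perform precise, quantitative spatial-temporal understanding in real-world applications. It introduces STI-Bench, a benchmark that evaluates this ability through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. STI-Bench covers a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios, and its experiments show that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.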

Link: https://arxiv.org/abs/2503.23765
Authors: Yun Li,Yiming Zhang,Tao Lin,XiangRui Liu,Wenxiao Cai,Zheng Liu,Bo Zhao
Affiliations: School of AI, Shanghai Jiao Tong University; China University of Geosciences; Nanyang Technological University; BAAI; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leading to uncertain prospects. To evaluate models’ Spatial-Temporal Intelligence, we introduce STI-Bench, a benchmark designed to evaluate MLLMs’ spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that the state-of-the-art MLLMs still struggle in real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.

[CV-73] WaveFormer: A 3D Transformer with Wavelet-Driven Feature Representation for Efficient Medical Image Segmentation

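[Quick read]: This paper addresses the limited performance of Transformer-based architectures in 3D medical image analysis, caused by substantial memory overhead and insufficient capture of fine-grained local features. It proposes WaveFormer, a novel 3D Transformer. Its key ideas are: (1) using the fundamental frequency-domain properties of features for contextual representation; and (2) a biologically motivated architecture inspired by the top-down mechanism of the human visual recognition system. By applying the discrete wavelet transform (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details, and it replaces heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the parameter count, which matters for resource-constrained real-world deployment. The model is also generic and easily adapted to diverse applications. Experiments on BraTS2023, FLARE2021, and KiTS2023 show performance on par with state-of-the-art methods at lower computational complexity.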

Link: https://arxiv.org/abs/2503.23764
Authors: Md Mahfuz Al Hasan,Mahdi Zaman,Abdul Jawad,Alberto Santamaria-Pang,Ho Hin Lee,Ivan Tarapov,Kyle See,Md Shah Imran,Antika Roy,Yaser Pourmohammadi Fallah,Navid Asadizanjani,Reza Forghani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Transformer-based architectures have advanced medical image analysis by effectively modeling long-range dependencies, yet they often struggle in 3D settings due to substantial memory overhead and insufficient capture of fine-grained local features. We address these limitations with WaveFormer, a novel 3D-transformer that: i) leverages the fundamental frequency-domain properties of features for contextual representation, and ii) is inspired by the top-down mechanism of the human visual recognition system, making it a biologically motivated architecture. By employing discrete wavelet transformations (DWT) at multiple scales, WaveFormer preserves both global context and high-frequency details while replacing heavy upsampling layers with efficient wavelet-based summarization and reconstruction. This significantly reduces the number of parameters, which is critical for real-world deployment where computational resources and training times are constrained. Furthermore, the model is generic and easily adaptable to diverse applications. Evaluations on BraTS2023, FLARE2021, and KiTS2023 demonstrate performance on par with state-of-the-art methods while offering substantially lower computational complexity.
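
The wavelet-based summarization and reconstruction described here rests on a basic property of the DWT: a signal splits into a half-length low-frequency summary and high-frequency details, and the two together reconstruct the input exactly without learned upsampling weights. The sketch below shows one level of a 1D Haar transform. The Haar wavelet and the 1D setting are assumptions chosen for brevity; the paper works on 3D volumes, and the abstract does not name the wavelet family.

```python
import math

def haar_dwt(signal):
    """One-level Haar DWT of an even-length signal: (approximation, detail)."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

signal = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_dwt(signal)          # half-length summary plus details
reconstructed = haar_idwt(approx, detail)  # parameter-free "upsampling"
```

The flat pair `(5.0, 5.0)` produces a zero detail coefficient, which illustrates how smooth regions are captured by the low-frequency band alone.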

[CV-74] StrokeFusion: Vector Sketch Generation via Joint Stroke-UDF Encoding and Latent Sequence Diffusion

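[Quick read]: This paper addresses two problems in sketch generation: models trained on raster formats tend to produce non-stroke artifacts, while models trained on vector formats typically lack a holistic understanding of sketches, which reduces recognizability. In addition, existing methods struggle to extract common features from similar elements (such as animal eyes) that appear at different positions. The paper proposes StrokeFusion, a two-stage framework. Its key component is a dual-modal sketch feature learning network that maps strokes into a high-quality latent space: it decomposes sketches into normalized strokes and jointly encodes stroke sequences with Unsigned Distance Function (UDF) maps. On top of this representation, a stroke-level latent diffusion model adjusts stroke position, scale, and trajectory simultaneously during generation, enabling high-fidelity sketch generation and supporting stroke interpolation editing.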

Link: https://arxiv.org/abs/2503.23752
Authors: Jin Zhou,Yi Zhou,Pengfei Xu,Hui Huang
Affiliations: Shenzhen University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the field of sketch generation, raster-format trained models often produce non-stroke artifacts, while vector-format trained models typically lack a holistic understanding of sketches, leading to compromised recognizability. Moreover, existing methods struggle to extract common features from similar elements (e.g., eyes of animals) appearing at varying positions across sketches. To address these challenges, we propose StrokeFusion, a two-stage framework for vector sketch generation. It contains a dual-modal sketch feature learning network that maps strokes into a high-quality latent space. This network decomposes sketches into normalized strokes and jointly encodes stroke sequences with Unsigned Distance Function (UDF) maps, representing sketches as sets of stroke feature vectors. Building upon this representation, our framework exploits a stroke-level latent diffusion model that simultaneously adjusts stroke position, scale, and trajectory during generation. This enables high-fidelity sketch generation while supporting stroke interpolation editing. Extensive experiments on the QuickDraw dataset demonstrate that our framework outperforms state-of-the-art techniques, validating its effectiveness in preserving structural integrity and semantic features. Code and models will be made publicly available upon publication.

[CV-75] Decoupled Distillation to Erase: A General Unlearning Method for Any Class-centric Tasks CVPR2025

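[Quick read]: This paper addresses machine unlearning for class-centric tasks: removing the knowledge of a specific class while fully retaining the knowledge of the remaining classes. A class of previous methods implicitly optimizes only the forgetting term and lacks supervision for the retention term, which disturbs the distribution of the pre-trained model and makes it hard to preserve knowledge of the remaining classes. The paper proposes DELETE (DEcoupLEd Distillation To Erase), a general and strong unlearning method. Its key idea is to decompose the unlearning loss, within a theoretical framework, into forgetting and retention terms, refine the retention term using “dark knowledge”, and combine this with mask distillation to optimize the forgetting and refined retention components together. A mask separates the forgetting logits from the retention logits, and without access to the remaining data or additional intervention the method achieves state-of-the-art performance on multiple benchmarks. It also applies to downstream tasks such as face recognition, backdoor defense, and semantic segmentation.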

Link: https://arxiv.org/abs/2503.23751
Authors: Yu Zhou,Dian Zheng,Qijie Mo,Renjie Lu,Kun-Yu Lin,Wei-Shi Zheng
Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; Peng Cheng Laboratory, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China; Guangdong Province Key Laboratory of Information Security Technology, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025, Equal contributions from first two authors

Abstract:In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class-centric tasks. To derive this, we first propose a theoretical framework to analyze the general form of unlearning loss and decompose it into forgetting and retention terms. Through the theoretical framework, we point out that a class of previous methods could be mainly formulated as a loss that implicitly optimizes the forgetting term while lacking supervision for the retention term, disturbing the distribution of pre-trained model and struggling to adequately preserve knowledge of the remaining classes. To address it, we refine the retention term using “dark knowledge” and propose a mask distillation unlearning method. By applying a mask to separate forgetting logits from retention logits, our approach optimizes both the forgetting and refined retention components simultaneously, retaining knowledge of the remaining classes while ensuring thorough forgetting of the target class. Without access to the remaining data or intervention (i.e., used in some works), we achieve state-of-the-art performance across various benchmarks. What’s more, DELETE is a general solution that can be applied to various downstream tasks, including face recognition, backdoor defense, and semantic segmentation with great performance.
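
One way to picture the masking step is shown below: the frozen model's logit for the class to forget is masked out, and the softmax over the remaining logits becomes a "dark knowledge" target that the unlearned model is distilled toward. This is only a hedged sketch of the general idea. The abstract does not give the actual DELETE loss, so the `-inf` mask, the KL objective, and all function names are assumptions rather than the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_teacher_target(teacher_logits, forget_class):
    """Mask the forgotten class and renormalise over the retained classes."""
    masked = [(-math.inf if c == forget_class else z)
              for c, z in enumerate(teacher_logits)]
    return softmax(masked)

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred), skipping zero-probability target entries."""
    return sum(t * math.log(t / max(p, eps)) for t, p in zip(target, pred) if t > 0)

teacher_logits = [2.0, 1.0, 0.0]     # hypothetical logits of the original model
target = masked_teacher_target(teacher_logits, forget_class=0)
loss_before = kl_divergence(target, softmax(teacher_logits))  # unlearning not yet applied
loss_after = kl_divergence(target, target)                    # student matches the target
```

The target puts zero mass on the forgotten class while keeping the relative ordering of the retained classes, so a single distillation objective covers both forgetting and retention in this simplified picture.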

[CV-76] Consistency-aware Self-Training for Iterative-based Stereo Matching CVPR2025

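[Quick read]: This paper addresses the weakness of iterative stereo matching methods on unlabeled real-world data: existing iterative methods rely heavily on labeled data and perform poorly without it. The paper proposes, for the first time, a consistency-aware self-training framework that exploits unlabeled real-world data in a teacher-student manner. The key component is a novel consistency-aware soft filtering module that evaluates the reliability of the pseudo-labels predicted by the teacher model. It consists of a multi-resolution prediction consistency filter and an iterative prediction consistency filter, which assess prediction fluctuations across resolutions and across iterative optimization, respectively. A consistency-aware soft-weighted loss then adjusts the pseudo-label weights accordingly, relieving the error accumulation and performance degradation caused by incorrect pseudo-labels. Experiments show that the method improves various iterative stereo matching approaches across scenarios and surpasses current state-of-the-art methods on several benchmark datasets.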

Link: https://arxiv.org/abs/2503.23747
Authors: Jingyi Zhou,Peng Ye,Haoyu Zhang,Jiakang Yuan,Rao Qiang,Liu YangChenXu,Wu Cailin,Feng Xu,Tao Chen
Affiliations: Fudan University; Xiaomi Inc; Shanghai AI Laboratory; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Iterative-based methods have become mainstream in stereo matching due to their high performance. However, these methods heavily rely on labeled data and face challenges with unlabeled real-world data. To this end, we propose a consistency-aware self-training framework for iterative-based stereo matching for the first time, leveraging real-world unlabeled data in a teacher-student manner. We first observe that regions with larger errors tend to exhibit more pronounced oscillation characteristics during model prediction. Based on this, we introduce a novel consistency-aware soft filtering module to evaluate the reliability of teacher-predicted pseudo-labels, which consists of a multi-resolution prediction consistency filter and an iterative prediction consistency filter to assess the prediction fluctuations of multiple resolutions and iterative optimization respectively. Further, we introduce a consistency-aware soft-weighted loss to adjust the weight of pseudo-labels accordingly, relieving the error accumulation and performance degradation problem due to incorrect pseudo-labels. Extensive experiments demonstrate that our method can improve the performance of various iterative-based stereo matching approaches in various scenarios. In particular, our method can achieve further enhancements over the current SOTA methods on several benchmark datasets.
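
The observation that unreliable regions oscillate more across iterations suggests a simple picture of soft weighting: measure how much a pixel's disparity fluctuates over the teacher's refinement steps and turn that into a weight between 0 and 1 for the pseudo-label. The sketch below is an assumption-laden illustration. The abstract does not specify the filter, so the standard-deviation measure, the exponential mapping, and the `temperature` parameter are all hypothetical.

```python
import math

def soft_weight(disparities, temperature=1.0):
    """Map the fluctuation of one pixel's iterative disparity predictions
    to a pseudo-label weight in (0, 1]; stable pixels get weights near 1."""
    mean = sum(disparities) / len(disparities)
    variance = sum((d - mean) ** 2 for d in disparities) / len(disparities)
    return math.exp(-math.sqrt(variance) / temperature)

# Hypothetical disparity estimates for two pixels over four iterations.
stable_pixel = [10.0, 10.1, 9.9, 10.0]
oscillating_pixel = [8.0, 12.0, 9.0, 13.0]
w_stable = soft_weight(stable_pixel)
w_oscillating = soft_weight(oscillating_pixel)
```

In a self-training loop such a weight would multiply the per-pixel loss between student prediction and teacher pseudo-label, so oscillating pixels contribute less instead of being discarded outright.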

[CV-77] Every Painting Awakened: A Training-free Framework for Painting-to-Animation Generation

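[Quick read]: This paper addresses image-to-video (I2V) generation for real-world static paintings: producing natural, consistent motion guided by text while preserving the details of the original artwork. Existing I2V methods are trained mainly on natural video datasets and struggle to generate dynamic outputs from static paintings, yielding either static results due to limited text-based motion interpretation or distorted dynamics due to inadequate alignment with real-world artistic styles. The paper proposes a training-free framework whose key idea is to introduce synthetic proxy images through two innovations: (1) dual-path score distillation, a dual-path architecture that distills motion priors from both real and synthetic data, preserving static details from the original painting while learning dynamic characteristics from synthetic frames; and (2) hybrid latent fusion, which integrates features extracted from the real painting and the synthetic proxy images via spherical linear interpolation in the latent space, ensuring smooth transitions and enhancing temporal consistency. Experiments show that the method significantly improves semantic alignment with text prompts while faithfully preserving the unique characteristics and integrity of the original paintings.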

Link: https://arxiv.org/abs/2503.23736
Authors: Lingyu Liu,Yaxiong Wang,Li Zhu,Zhedong Zheng
Affiliations: Xi'an Jiaotong University; University of Macau
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: The project is available at: this https URL

Abstract:We introduce a training-free framework specifically designed to bring real-world static paintings to life through image-to-video (I2V) synthesis, addressing the persistent challenge of aligning these motions with textual guidance while preserving fidelity to the original artworks. Existing I2V methods, primarily trained on natural video datasets, often struggle to generate dynamic outputs from static paintings. It remains challenging to generate motion while maintaining visual consistency with real-world paintings. This results in two distinct failure modes: either static outputs due to limited text-based motion interpretation or distorted dynamics caused by inadequate alignment with real-world artistic styles. We leverage the advanced text-image alignment capabilities of pre-trained image models to guide the animation process. Our approach introduces synthetic proxy images through two key innovations: (1) Dual-path score distillation: We employ a dual-path architecture to distill motion priors from both real and synthetic data, preserving static details from the original painting while learning dynamic characteristics from synthetic frames. (2) Hybrid latent fusion: We integrate hybrid features extracted from real paintings and synthetic proxy images via spherical linear interpolation in the latent space, ensuring smooth transitions and enhancing temporal consistency. Experimental evaluations confirm that our approach significantly improves semantic alignment with text prompts while faithfully preserving the unique characteristics and integrity of the original paintings. Crucially, by achieving enhanced dynamic effects without requiring any model training or learnable parameters, our framework enables plug-and-play integration with existing I2V methods, making it an ideal solution for animating real-world paintings. More animated examples can be found on our project website.
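
Spherical linear interpolation (slerp), which the hybrid latent fusion step relies on, moves between two vectors along the arc between them, not along the straight chord. The sketch below implements the textbook formula on plain Python lists. How the paper applies it to real and proxy latents (shapes, interpolation weights, schedule) is not stated in the abstract, so the toy vectors and the fallback to linear interpolation for nearly collinear inputs are illustrative choices.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between vectors a and b at fraction t."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or antipodal vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]

# Halfway between two orthogonal unit vectors (stand-ins for two latents).
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
mid_norm = math.sqrt(sum(v * v for v in mid))
```

Unlike plain averaging, the midpoint keeps unit norm, which is the usual reason slerp is preferred for Gaussian-like diffusion latents.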

[CV-78] Investigation of intelligent barbell squat coaching system based on computer vision and machine learning

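[Quick read]: This paper addresses the lack of real-time diagnosis and feedback in strength training, with the aim of helping users improve their squat technique and avoid injuries caused by incorrect movement. The key contribution is a barbell squat coaching system based on artificial intelligence (AI) and computer vision (CV) with a real-time mode and a replay mode. The real-time mode diagnoses issues and provides feedback immediately after each squat; the replay mode lets users review previous squats and read detailed comments. The study first identified four primary squat characteristics (body joint angles, dorsiflexion, the ratio of knee-to-hip movement, and barbell stability) and trained diagnosis models with three machine-learning architectures on the collected dataset. It also applied the SHapley Additive exPlanations (SHAP) method for feature selection, which improved issue-prediction accuracy and reduced computation time. The F1 scores for the six issues reached 86.86%, 69.01%, 77.42%, 90.74%, 95.83%, and 100%, and each squat diagnosis took less than 0.5 seconds. A further experiment showed that participants trained with the system improved their squat technique substantially.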

Link: https://arxiv.org/abs/2503.23731
Authors: Yinq-Rong Chern,Yuhao Lee,Hsiao-Ching Lin,Guan-Ting Chen,Ying-Hsien Chen,Fu-Sung Lin,Chih-Yao Chuang,Jenn-Jier James Lien,Chih-Hsien Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:Purpose: Research has revealed that strength training can reduce the incidence of chronic diseases and physical deterioration at any age. Therefore, having a movement diagnostic system is crucial for training alone. Hence, this study developed an artificial intelligence and computer vision-based barbell squat coaching system with a real-time mode that immediately diagnoses the issue and provides feedback after each squat. In addition, a replay mode allows users to examine their previous squats and check their comments. Initially, four primary characteristics of the barbell squat were identified: body joint angles, dorsiflexion, the ratio of knee-to-hip movement, and barbell stability. Methods: We collected 8,151 squats from 77 participants, categorizing them as good squats and six issues. Then, we trained the diagnosis models with three machine-learning architectures. Furthermore, this research applied the SHapley Additive exPlanations (SHAP) method to enhance the accuracy of issue prediction and reduce the computation time by feature selection. Results: The F1 score of the six issues reached 86.86%, 69.01%, 77.42%, 90.74%, 95.83%, and 100%. Each squat diagnosis took less than 0.5 seconds. Finally, this study examined the efficacy of the proposed system with two groups of participants trained with and without the system. Subsequently, participants trained with the system exhibited substantial improvements in their squat technique, as assessed both by the system itself and by a professional weightlifting coach. Conclusion: This is a comprehensive study that integrates artificial intelligence, computer vision and multivariable processing technologies, aimed at building a real-time, user-friendly barbell squat feedback and training system.
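
Body joint angles, the first of the four squat characteristics, are typically derived from pose keypoints. The sketch below computes the angle at a middle joint (for example the knee, from hip, knee, and ankle positions) using the dot product. The abstract does not describe the pose estimator or the exact feature definitions, so the 2D coordinates and the function name are hypothetical.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at point b formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cos_angle))

# Hypothetical 2D keypoints (x, y) for hip, knee, and ankle.
standing = joint_angle((0.0, 2.0), (0.0, 1.0), (0.0, 0.0))   # straight leg
deep_squat = joint_angle((1.0, 1.0), (0.0, 1.0), (0.0, 0.0)) # thigh horizontal
```

A per-frame sequence of such angles could then be summarised (minimum, range, timing) into features for the diagnosis models.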

[CV-79] Exploring Temporal Dynamics in Event-based Eye Tracker CVPR2025

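[Quick read]: This paper addresses the limited accuracy of eye tracking with frame-based image sensors when capturing rapid eye movements such as saccades and blinks, a limitation caused by their low temporal resolution. It proposes TDTracker, an effective eye-tracking framework whose key idea is to model temporal dynamics thoroughly from both implicit and explicit perspectives. Specifically, TDTracker uses 3D convolutional neural networks to capture implicit short-term temporal dynamics and a cascaded structure consisting of a Frequency-aware Module, GRU, and Mamba to extract explicit long-term temporal dynamics; eye coordinates are then regressed from a predicted heatmap. Experiments show that TDTracker achieves state-of-the-art performance on the SEET dataset and took third place in the CVPR 2025 event-based eye-tracking challenge.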

Link: https://arxiv.org/abs/2503.23725
Authors: Hongwei Ren, Xiaopeng Lin, Hongxiang Huang, Yue Zhou, Bojun Cheng
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025 Event-based Vision Workshop


Abstract:Eye-tracking is a vital technology for human-computer interaction, especially in wearable devices such as AR, VR, and XR. The realization of high-speed and high-precision eye-tracking using frame-based image sensors is constrained by their limited temporal resolution, which impairs the accurate capture of rapid ocular dynamics, such as saccades and blinks. Event cameras, inspired by biological vision systems, are capable of perceiving eye movements with extremely low power consumption and ultra-high temporal resolution. This makes them a promising solution for achieving high-speed, high-precision tracking with rich temporal dynamics. In this paper, we propose TDTracker, an effective eye-tracking framework that captures rapid eye movements by thoroughly modeling temporal dynamics from both implicit and explicit perspectives. TDTracker utilizes 3D convolutional neural networks to capture implicit short-term temporal dynamics and employs a cascaded structure consisting of a Frequency-aware Module, GRU, and Mamba to extract explicit long-term temporal dynamics. Ultimately, a prediction heatmap is used for eye coordinate regression. Experimental results demonstrate that TDTracker achieves state-of-the-art (SOTA) performance on the synthetic SEET dataset and secured Third place in the CVPR event-based eye-tracking challenge 2025. Our code is available at this https URL.
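The abstract says a prediction heatmap is used for eye coordinate regression but does not specify the readout. One common choice, assumed here purely for illustration, is a soft-argmax: the softmax-weighted mean position over the heatmap. The function name and the toy heatmap are hypothetical.

```python
import math

def soft_argmax(heatmap, temperature=1.0):
    """Return the softmax-weighted mean (x, y) position of a 2D heatmap."""
    peak = max(max(row) for row in heatmap)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp((v - peak) / temperature) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return x, y

# Toy 3x3 heatmap with a sharp peak at column 1, row 1.
heatmap = [[0.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]
eye_x, eye_y = soft_argmax(heatmap)
```

Unlike a hard argmax, this readout is differentiable and yields sub-pixel coordinates, which is why it is often paired with heatmap regression.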

[CV-80] LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification

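[Quick read]: This paper addresses Aerial-Ground Person Re-Identification (AG-ReID), i.e., retrieving persons across heterogeneous aerial and ground camera views. Existing methods usually rely on large-scale models to extract view-invariant features but overlook the semantic information in person attributes; they also tend to require full fine-tuning of large-scale models, which significantly increases training cost. To address these issues, the paper proposes a framework named LATex, whose key idea is to use prompt-tuning strategies to leverage attribute-based text knowledge. Specifically, it adopts the Contrastive Language-Image Pre-training (CLIP) model as the backbone and designs an Attribute-aware Image Encoder (AIE) to extract global semantic features and attribute-aware features. A Prompted Attribute Classifier Group (PACG) then generates person attribute predictions and obtains their encoded representations. Finally, a Coupled Prompt Template (CPT) transforms attribute tokens and view information into structured sentences, which CLIP's text encoder processes into more discriminative features. Together, these designs let LATex fully leverage attribute-based text knowledge to improve AG-ReID performance.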

Link: https://arxiv.org/abs/2503.23722
Authors: Xiang Hu, Yuhao Wang, Pingping Zhang, Huchuan Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Aerial-Ground person Re-IDentification (AG-ReID) aims to retrieve specific persons across heterogeneous cameras in different views. Previous methods usually adopt large-scale models, focusing on view-invariant features. However, they overlook the semantic information in person attributes. Additionally, existing training strategies often rely on full fine-tuning large-scale models, which significantly increases training costs. To address these issues, we propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge. More specifically, we first introduce the Contrastive Language-Image Pre-training (CLIP) model as the backbone, and propose an Attribute-aware Image Encoder (AIE) to extract global semantic features and attribute-aware features. Then, with these features, we propose a Prompted Attribute Classifier Group (PACG) to generate person attribute predictions and obtain the encoded representations of predicted attributes. Finally, we design a Coupled Prompt Template (CPT) to transform attribute tokens and view information into structured sentences. These sentences are processed by the text encoder of CLIP to generate more discriminative features. As a result, our framework can fully leverage attribute-based text knowledge to improve the AG-ReID. Extensive experiments on three AG-ReID benchmarks demonstrate the effectiveness of our proposed LATex. The source code will be available.

[CV-81] Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space

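[Quick read]: This paper addresses cloud removal (CR) in remote sensing image processing. Although diffusion models (DMs) have strong generative capabilities, applying them directly to CR is suboptimal: they generate cloudless images from random noise and ignore the useful information in the cloudy inputs. To overcome this limitation, the paper proposes EMRDM, a new method based on mean-reverting diffusion models (MRDMs) that reformulates the forward process and introduces an ordinary differential equation (ODE)-based backward process to establish a direct diffusion process between cloudy and cloudless images. The key is a modular framework with updatable modules and an elucidated design space, within which the key MRDM modules are redesigned to improve CR performance: the denoiser is restructured via a preconditioning technique, the training process is reorganized, and sampling is improved with deterministic and stochastic samplers. For multi-temporal CR, the paper further designs a network that denoises sequential images simultaneously. Experiments confirm EMRDM's superior performance on both mono-temporal and multi-temporal datasets.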

Link: https://arxiv.org/abs/2503.23717
Authors: Yi Liu, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang
Affiliations: Tongji University; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages, 12 figures


Abstract:Cloud removal (CR) remains a challenging task in remote sensing image processing. Although diffusion models (DM) exhibit strong generative capabilities, their direct applications to CR are suboptimal, as they generate cloudless images from random noise, ignoring inherent information in cloudy inputs. To overcome this drawback, we develop a new CR model EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a modular framework with updatable modules and an elucidated design space, based on a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. Leveraging our framework, we redesign key MRDM modules to boost CR performance, including restructuring the denoiser via a preconditioning technique, reorganizing the training process, and improving the sampling process by introducing deterministic and stochastic samplers. To achieve multi-temporal CR, we further develop a denoising network for simultaneously denoising sequential images. Experiments on mono-temporal and multi-temporal datasets demonstrate the superior performance of EMRDM. Our code is available at this https URL.
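To give a feel for what "mean-reverting" means, the sketch below simulates a generic mean-reverting (Ornstein-Uhlenbeck style) forward process with Euler-Maruyama steps, where the state drifts from a starting signal toward a target mean. This is an assumption chosen for illustration, not EMRDM's reformulated forward process; all parameter names and values are hypothetical.

```python
import math
import random

def mean_reverting_forward(x0, mu, theta=1.0, sigma=0.1, steps=100, t_end=5.0, seed=0):
    """Simulate dx = theta * (mu - x) dt + sigma dW with Euler-Maruyama."""
    rng = random.Random(seed)
    dt = t_end / steps
    x = list(x0)
    for _ in range(steps):
        # Drift pulls each component toward its mean; noise perturbs it.
        x = [
            xi + theta * (mi - xi) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for xi, mi in zip(x, mu)
        ]
    return x

# Toy "pixels": start at 0.0 and drift toward a mean of 1.0, with noise disabled.
noiseless = mean_reverting_forward([0.0, 0.0], [1.0, 1.0], sigma=0.0)
```

With `sigma=0.0` the trajectory simply decays toward `mu`; with noise enabled, the endpoint is a noisy version of `mu` rather than pure random noise, which is the property that lets such processes connect two images directly.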

[CV-82] HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation CVPR2025

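[Quick read]: This paper addresses the inability of current text-to-video (T2V) models to generate human-object interaction (HOI) scenes precisely, which stems mainly from the lack of large-scale video datasets with accurate captions. Its key contribution is HOIGen-1M, a large-scale dataset of over one million high-quality HOI videos. An efficient framework uses powerful multimodal large language models (MLLMs) to curate high-quality HOI videos automatically, and human annotators then clean the videos further. A novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy generates expressive, hallucination-free captions. To evaluate generated HOI videos, the paper also proposes two new metrics that assess video quality in a coarse-to-fine manner. Experiments show that current T2V models struggle to generate high-quality HOI videos and that HOIGen-1M is instrumental in improving HOI video generation.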

Link: https://arxiv.org/abs/2503.23715
Authors: Kun Liu, Qi Liu, Xinchen Liu, Jie Li, Yongdong Zhang, Jiebo Luo, Xiaodong He, Wu Liu
Affiliations: JD Explore Academy; UCAS; USTC; University of Rochester
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025


Abstract:Text-to-video (T2V) generation has made tremendous progress in generating complicated scenes based on texts. However, human-object interaction (HOI) often cannot be precisely generated by current T2V models due to the lack of large-scale videos with accurate captions for HOI. To address this issue, we introduce HOIGen-1M, the first large-scale dataset for HOI Generation, consisting of over one million high-quality videos collected from diverse sources. In particular, to guarantee the high quality of videos, we first design an efficient framework to automatically curate HOI videos using powerful multimodal large language models (MLLMs), and then the videos are further cleaned by human annotators. Moreover, to obtain accurate textual captions for HOI videos, we design a novel video description method based on a Mixture-of-Multimodal-Experts (MoME) strategy that not only generates expressive captions but also eliminates the hallucinations of individual MLLMs. Furthermore, due to the lack of an evaluation framework for generated HOI videos, we propose two new metrics to assess the quality of generated videos in a coarse-to-fine manner. Extensive experiments reveal that current T2V models struggle to generate high-quality HOI videos and confirm that our HOIGen-1M dataset is instrumental for improving HOI video generation. Project webpage is available at this https URL.

[CV-83] ElimPCL: Eliminating Noise Accumulation with Progressive Curriculum Labeling for Source-Free Domain Adaptation ICME2025

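[Quick read]: This paper addresses the accumulation of pseudo-label noise in Source-Free Domain Adaptation (SFDA). For hard samples affected by domain shift, conventional methods produce highly uncertain pseudo-labels; this noise is introduced even before adaptation, is reinforced through parameter updates, and eventually influences neighboring samples. To eliminate noise accumulation, the paper proposes Progressive Curriculum Labeling (ElimPCL), which iteratively filters trustworthy pseudo-labeled samples based on prototype consistency and excludes high-noise samples from training. It also designs a Dual MixUP technique in the feature space to enhance the separability of hard samples and thereby mitigate interference from noisy samples. Experiments show that ElimPCL improves on state-of-the-art methods by up to 3.4% on challenging tasks.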

Link: https://arxiv.org/abs/2503.23712
Authors: Jie Cheng, Hao Zheng, Meiguang Zheng, Lei Wang, Hao Wu, Jian Zhang
Affiliations: School of Computer Science and Engineering, Central South University, Changsha, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICME 2025 camera-ready


Abstract:Source-Free Domain Adaptation (SFDA) aims to train a target model without source data, and the key is to generate pseudo-labels using a pre-trained source model. However, we observe that the source model often produces highly uncertain pseudo-labels for hard samples, particularly those heavily affected by domain shifts, leading to these noisy pseudo-labels being introduced even before adaptation and further reinforced through parameter updates. Additionally, they continuously influence neighbor samples through propagation in the feature space. To eliminate the issue of noise accumulation, we propose a novel Progressive Curriculum Labeling (ElimPCL) method, which iteratively filters trustworthy pseudo-labeled samples based on prototype consistency to exclude high-noise samples from training. Furthermore, a Dual MixUP technique is designed in the feature space to enhance the separability of hard samples, thereby mitigating the interference of noisy samples on their neighbors. Extensive experiments validate the effectiveness of ElimPCL, achieving up to a 3.4% improvement on challenging tasks compared to state-of-the-art methods.
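The abstract describes filtering trustworthy pseudo-labeled samples by prototype consistency without giving the exact criterion. The sketch below assumes one simple reading: a sample is kept when its pseudo-label matches the class of its nearest prototype, where each prototype is the mean feature of a pseudo-class. The function names and the toy features are hypothetical.

```python
def class_prototypes(features, labels):
    """Mean feature vector per pseudo-label."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def filter_trustworthy(features, pseudo_labels):
    """Indices whose pseudo-label agrees with the nearest prototype."""
    protos = class_prototypes(features, pseudo_labels)
    keep = []
    for idx, (f, y) in enumerate(zip(features, pseudo_labels)):
        nearest = min(protos, key=lambda c: sq_dist(f, protos[c]))
        if nearest == y:
            keep.append(idx)
    return keep

# Toy 2D features; the last sample sits near class 0 but is pseudo-labeled 1.
features = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0), (0.2, 0.1)]
pseudo_labels = [0, 0, 1, 1, 1]
trusted = filter_trustworthy(features, pseudo_labels)
```

In a progressive scheme, a filter like this would be re-run as the model adapts, so that samples excluded early can be admitted once their labels become consistent.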

[CV-84] Expanding-and-Shrinking Binary Neural Networks

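[Quick read]: This paper addresses the substantial accuracy degradation of binary neural networks (BNNs) relative to real-valued networks on challenging tasks. It proposes an expanding-and-shrinking operation that enhances binary feature maps with a negligible increase in computational complexity, thereby strengthening their representation capacity. The key is to alleviate the constraint that binarizing weights and activations places on the values feature maps can express, while retaining the efficiency of BNNs.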

Link: https://arxiv.org/abs/2503.23709
Authors: Xulong Shi, Caiyi Sun, Zhi Qi, Liu Hao, Xiaodong Yang
Affiliations: QCraft; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:While binary neural networks (BNNs) offer significant benefits in terms of speed, memory and energy, they encounter substantial accuracy degradation in challenging tasks compared to their real-valued counterparts. Due to the binarization of weights and activations, the possible values of each entry in the feature maps generated by BNNs are strongly constrained. To tackle this limitation, we propose the expanding-and-shrinking operation, which enhances binary feature maps with negligible increase of computation complexity, thereby strengthening the representation capacity. Extensive experiments conducted on multiple benchmarks reveal that our approach generalizes well across diverse applications ranging from image classification, object detection to generative diffusion model, while also achieving remarkable improvement over various leading binarization algorithms based on different architectures including both CNNs and Transformers.
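To see why binarization constrains feature-map values, consider the standard XNOR-popcount formulation of a binary dot product: with n inputs in {-1, +1}, the result can only take the n + 1 values -n, -n + 2, ..., n. The sketch below shows that baseline arithmetic only; it does not implement the paper's expanding-and-shrinking operation, and mapping zero to +1 is an assumed convention.

```python
def binarize(values):
    """Sign binarization to {-1, +1}; zero maps to +1 (assumed convention)."""
    return [1 if v >= 0 else -1 for v in values]

def pack_bits(signs):
    """Pack a list of +/-1 signs into an int, one bit per element (+1 -> 1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")
    # Each agreement contributes +1, each disagreement -1.
    return 2 * agree - n

activations = binarize([0.3, -1.2, 0.0, 2.0])
weights = binarize([-0.5, -0.1, 0.7, 1.5])
result = binary_dot(pack_bits(activations), pack_bits(weights), n=4)
```

The bitwise form is what makes BNNs fast and compact, and the small set of reachable outputs is the expressiveness limit the abstract refers to.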

[CV-85] 3D Dental Model Segmentation with Geometrical Boundary Preserving

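[Quick read]: This paper addresses the low segmentation accuracy of 3D intraoral scan meshes at the junction between tooth crowns and gums, where existing downsampling methods fail to preserve geometric details. It proposes CrossTooth, a boundary-preserving segmentation method that combines selective downsampling of the 3D mesh, which retains more vertices in the tooth-gingiva area, with cross-modal discriminative boundary features extracted from multi-view rendered images, strengthening the geometric representation of the segmentation network. Experiments show that the method significantly improves segmentation accuracy.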

Link: https://arxiv.org/abs/2503.23702
Authors: Shufan Xi, Zexian Liu, Junlin Chang, Hongyu Wu, Xiaogang Wang, Aimin Hao
Affiliations: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; College of Computer and Information Science, Southwest University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025


Abstract:3D intraoral scan meshes are widely used in digital dentistry diagnosis, and segmenting them is a critical preliminary task. Numerous approaches have been devised for precise tooth segmentation. Currently, deep learning-based methods can segment crowns with high accuracy. However, the segmentation accuracy at the junction between the crown and the gum is still below average. Existing down-sampling methods are unable to effectively preserve the geometric details at the junction. To address these problems, we propose CrossTooth, a boundary-preserving segmentation method that combines 3D mesh selective downsampling to retain more vertices at the tooth-gingiva area, along with cross-modal discriminative boundary features extracted from multi-view rendered images, enhancing the geometric representation of the segmentation network. Using a point network as a backbone and incorporating image complementary features, CrossTooth significantly improves segmentation accuracy, as demonstrated by experiments on a public intraoral scan dataset.

[CV-86] Detail-aware multi-view stereo network for depth estimation

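[Quick read]: This paper addresses the poor performance of existing multi-view stereo methods in recovering depth at object boundaries and in detail regions. It proposes a detail-aware multi-view stereo network (DA-MVSNet) with a coarse-to-fine framework. The key components are: using the geometric depth cues hidden in the coarse stage to maintain the geometric structural relationships between object surfaces and to enhance the expressive capability of image features; an image synthesis loss that constrains the gradient flow for detail regions and further strengthens supervision of object boundaries and texture-rich areas; and an adaptive depth interval adjustment strategy that improves the accuracy of object reconstruction. Experiments show that the method achieves competitive results on the DTU and Tanks & Temples datasets.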

Link: https://arxiv.org/abs/2503.23684
Authors: Haitao Tian, Junyang Li, Chenxing Wang, Helong Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Multi-view stereo methods have achieved great success in depth estimation based on coarse-to-fine depth learning frameworks. However, existing methods perform poorly in recovering the depth of object boundaries and detail regions. To address these issues, we propose a detail-aware multi-view stereo network (DA-MVSNet) with a coarse-to-fine framework. The geometric depth clues hidden in the coarse stage are utilized to maintain the geometric structural relationships between object surfaces and enhance the expressive capability of image features. In addition, an image synthesis loss is employed to constrain the gradient flow for detailed regions and further strengthen the supervision of object boundaries and texture-rich areas. Finally, we propose an adaptive depth interval adjustment strategy to improve the accuracy of object reconstruction. Extensive experiments on the DTU and Tanks & Temples datasets demonstrate that our method achieves competitive results. The code is available at this https URL.

[CV-87] The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning

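[Quick read]: This paper addresses a limitation of existing zero-shot video captioning methods: they focus on a single key aspect of the scene, so the generated captions ignore the rest of the visual information. To produce more accurate and complete captions, the paper proposes a novel progressive multi-granularity textual prompting strategy. The key is to construct three distinct memory banks, covering noun phrases, scene graphs of noun phrases, and entire sentences, and to introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding specific topics. Experiments show improvements over the existing state of the art of 5.7%, 16.2%, and 3.4% on the main metric CIDEr on the MSR-VTT, MSVD, and VATEX benchmarks, respectively.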

Link: https://arxiv.org/abs/2503.23679
Authors: Mingkai Tian, Guorong Li, Yuankai Qi, Amin Beheshti, Javen Qinfeng Shi, Anton van den Hengel, Qingming Huang
Affiliations: School of Computer Science and Technology, University of Chinese Academy of Sciences; Macquarie University; Australian Institute for Machine Learning, The University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages


Abstract:Zero-shot video captioning requires that a model generate high-quality captions without human-annotated video-text pairs for training. State-of-the-art approaches to the problem leverage CLIP to extract visual-relevant textual prompts to guide language models in generating captions. These methods tend to focus on one key aspect of the scene and build a caption that ignores the rest of the visual input. To address this issue, and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics in question. Extensive experiments demonstrate the effectiveness of our method with 5.7%, 16.2%, and 3.4% improvements in terms of the main metric CIDEr on MSR-VTT, MSVD, and VATEX benchmarks compared to existing state-of-the-art.
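The memory banks described above are queried with visual features to pull out relevant text. A minimal sketch of that retrieval step, assuming plain cosine-similarity top-k lookup, follows. The three-dimensional vectors are toy stand-ins for CLIP embeddings, the bank contents are invented, and the paper's category-aware mechanism is not shown.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, bank, k=2):
    """Return the k bank texts whose embeddings are most similar to the query."""
    ranked = sorted(bank, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical noun-phrase memory bank of (text, embedding) pairs.
noun_bank = [
    ("a dog", [1.0, 0.0, 0.0]),
    ("a frisbee", [0.8, 0.6, 0.0]),
    ("a kitchen", [0.0, 0.0, 1.0]),
]
video_embedding = [0.9, 0.1, 0.0]
top_phrases = retrieve(video_embedding, noun_bank, k=2)
```

The retrieved phrases would then be formatted into a textual prompt for the language model; repeating the lookup against the scene-graph and sentence banks gives the coarser-to-finer prompts the abstract mentions.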

[CV-88] Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation CVPR

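[Quick read]: This paper addresses the inference of signed distance functions (SDFs) from sparse point clouds, a challenge in surface reconstruction. The core difficulty is that sparse point clouds lack the detailed geometric information needed to learn a continuous field. The key innovation is a dynamic deformation network that predicts SDFs in an end-to-end manner. The paper also introduces a bijective surface parameterization (BSP) that learns the global shape from local patches: it constructs a bijective mapping between sparse points and 3D local patches and integrates these patches into the global surface. In addition, grid deformation optimization (GDO) optimizes the deformation of grid points and further refines the parametric surfaces. Experiments show that the method significantly outperforms current state-of-the-art methods on synthetic and real scanned datasets.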

Link: https://arxiv.org/abs/2503.23670
Authors: Takeshi Noda, Chao Chen, Junsheng Zhou, Weiqi Zhang, Yu-Shen Liu, Zhizhong Han
Affiliations: School of Software, Tsinghua University, Beijing, China; Department of Computer Science, Wayne State University, Detroit, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Conference on Computer Vision and Pattern Recognition (CVPR) 2025. Project page: this https URL


Abstract:Inferring signed distance functions (SDFs) from sparse point clouds remains a challenge in surface reconstruction. The key lies in the lack of detailed geometric information in sparse point clouds, which is essential for learning a continuous field. To resolve this issue, we present a novel approach that learns a dynamic deformation network to predict SDFs in an end-to-end manner. To parameterize a continuous surface from sparse points, we propose a bijective surface parameterization (BSP) that learns the global shape from local patches. Specifically, we construct a bijective mapping for sparse points from the parametric domain to 3D local patches, integrating patches into the global surface. Meanwhile, we introduce grid deformation optimization (GDO) into the surface approximation to optimize the deformation of grid points and further refine the parametric surfaces. Experimental results on synthetic and real scanned datasets demonstrate that our method significantly outperforms the current state-of-the-art methods. Project page: this https URL

[CV-89] Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity

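[Quick read]: This paper addresses the unreliable single-character recognition of multimodal large language models (LLMs) in Optical Character Recognition (OCR), which results from their reliance on contextual cues. It evaluates the OCR performance of multimodal LLMs in a context-free setting, using single-character images of varying visual complexity. The key is an analysis of how image resolution (performance approaches conventional OCR methods at about 300 ppi but drops significantly below 150 ppi) and visual complexity affect recognition accuracy, and how this differs from a conventional OCR model, which establishes the conditions under which multimodal LLMs can be applied reliably to OCR tasks that require precise character-level accuracy.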

Link: https://arxiv.org/abs/2503.23667
Authors: Kotaro Inoue
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Due to their high versatility in tasks such as image captioning, document analysis, and automated content generation, multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields. In particular, they have been shown to surpass specialized models in Optical Character Recognition (OCR). Nevertheless, their performance under different image conditions remains insufficiently investigated, and individual character recognition is not guaranteed due to their reliance on contextual cues. In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities to determine the conditions for accurate recognition. Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi. Additionally, we observe a very weak correlation between visual complexity and misrecognitions, whereas a conventional OCR-specific model exhibits no correlation. These results suggest that image resolution and visual complexity may play an important role in the reliable application of multimodal LLMs to OCR tasks that require precise character-level accuracy.
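To make the 300 ppi and 150 ppi thresholds concrete, the sketch below converts a font size in points to a nominal glyph height in pixels, using the fact that one typographic point is 1/72 inch. The 12-point font size is an assumption chosen for illustration; the abstract does not state the character sizes used.

```python
def glyph_height_px(font_size_pt, ppi):
    """Nominal glyph height in pixels for a font size in points (1 pt = 1/72 in)."""
    return font_size_pt * ppi / 72.0

# Assumed 12 pt text rendered at the two resolutions named in the abstract.
height_300 = glyph_height_px(12, 300)
height_150 = glyph_height_px(12, 150)
```

Under this assumption, halving the resolution from 300 ppi to 150 ppi halves the glyph height from 50 to 25 pixels, so each character is drawn with a quarter of the pixels.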

[CV-90] LiM-Loc: Visual Localization with Dense and Accurate 3D Reference Maps Directly Corresponding 2D Keypoints to 3D LiDAR Point Clouds

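[Quick read]: This paper addresses the sparse and inaccurate 3D reference maps that image-only visual localization methods produce because of keypoint matching errors. To improve the accuracy of camera pose estimation, it proposes assigning 3D LiDAR point clouds directly to keypoints to generate dense and accurate 3D reference maps. The key is that the method avoids feature matching and achieves accurate 3D reconstruction for almost all keypoints. In addition, it uses the wide-area LiDAR point cloud to remove points that are not visible to the camera and to reduce 2D-3D correspondence errors, which further improves camera pose estimation over wide areas.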

Link: https://arxiv.org/abs/2503.23664
Authors: Masahiko Tsuji, Hitoshi Niigaki, Ryuichi Tanida
Affiliations: NTT Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures


Abstract:Visual localization is to estimate the 6-DOF camera pose of a query image in a 3D reference map. We extract keypoints from the reference image and generate a 3D reference map with 3D reconstruction of the keypoints in advance. We emphasize that the more keypoints in the 3D reference map and the smaller the error of the 3D positions of the keypoints, the higher the accuracy of the camera pose estimation. However, previous image-only methods require a huge number of images, and it is difficult to 3D-reconstruct keypoints without error due to inevitable mismatches and failures in feature matching. As a result, the 3D reference map is sparse and inaccurate. In contrast, accurate 3D reference maps can be generated by combining images and 3D sensors. Recently, 3D-LiDAR has been widely used around the world. LiDAR, which measures a large space with high density, has become inexpensive. In addition, accurately calibrated cameras are also widely used, so images that record the external parameters of the camera without errors can be easily obtained. In this paper, we propose a method to directly assign 3D LiDAR point clouds to keypoints to generate dense and accurate 3D reference maps. The proposed method avoids feature matching and achieves accurate 3D reconstruction for almost all keypoints. To estimate camera pose over a wide area, we use the wide-area LiDAR point cloud to remove points that are not visible to the camera and reduce 2D-3D correspondence errors. Using indoor and outdoor datasets, we apply the proposed method to several state-of-the-art local features and confirm that it improves the accuracy of camera pose estimation.
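The idea of assigning LiDAR points directly to 2D keypoints can be sketched with a pinhole camera model: project each 3D point into the image, then give each keypoint the closest-depth point that lands within a small pixel radius. This is a simplified illustration under stated assumptions, not the paper's algorithm: points are assumed to be already expressed in the camera frame, "nearest depth wins" stands in for a proper visibility test, and the intrinsics and radius are hypothetical.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to (u, v, depth)."""
    x, y, z = point
    if z <= 0:
        return None  # Behind the camera, so never visible.
    return (fx * x / z + cx, fy * y / z + cy, z)

def assign_points(keypoints, lidar_points, fx, fy, cx, cy, radius=2.0):
    """Map keypoint index -> nearest-depth LiDAR point within `radius` pixels."""
    projected = []
    for p in lidar_points:
        uvz = project(p, fx, fy, cx, cy)
        if uvz is not None:
            projected.append((uvz, p))
    assigned = {}
    for i, (ku, kv) in enumerate(keypoints):
        best = None
        for (u, v, z), p in projected:
            if (u - ku) ** 2 + (v - kv) ** 2 <= radius ** 2:
                # Prefer the closest surface as a crude occlusion check.
                if best is None or z < best[0]:
                    best = (z, p)
        if best is not None:
            assigned[i] = best[1]
    return assigned

lidar = [(0.0, 0.0, 2.0), (0.0, 0.0, 5.0), (1.0, 0.0, 4.0), (0.0, 0.0, -1.0)]
keypoints = [(50.0, 50.0), (75.0, 50.0), (10.0, 10.0)]
reference_map = assign_points(keypoints, lidar, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

Because each keypoint gets its 3D position from a direct measurement rather than from triangulating matched features, no feature matching between reference images is needed, which is the property the abstract emphasizes.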

[CV-91] DeepDubber-V1: Towards High Quality and Dialogue Narration Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance

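[Quick read]: This paper addresses shortcomings of movie dubbing technology in adapting to different dubbing styles, in handling dialogue, monologue, and narration, and in understanding subtle attributes such as the age and gender of speakers. It proposes a framework based on a multimodal large language model. The key is, first, to apply multimodal Chain-of-Thought (CoT) reasoning to visual inputs to understand dubbing styles and fine-grained attributes and, second, to generate high-quality dubbing with large speech generation models guided by multimodal conditions. The paper also builds a movie dubbing dataset with CoT annotations, and evaluation shows that the method outperforms state-of-the-art techniques on multiple datasets.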

Link: https://arxiv.org/abs/2503.23660
Authors: Junjie Zheng, Zihao Chen, Chaofan Ding, Xinhan Di
Affiliations: AI Lab, Giant Network; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures


Abstract:Current movie dubbing technology can generate the desired voice from a given speech prompt, ensuring good synchronization between speech and visuals while accurately conveying the intended emotions. However, in movie dubbing, key aspects such as adapting to different dubbing styles, handling dialogue, narration, and monologue effectively, and understanding subtle details like the age and gender of speakers, have not been well studied. To address this challenge, we propose a framework based on a multimodal large language model. First, it utilizes multimodal Chain-of-Thought (CoT) reasoning methods on visual inputs to understand dubbing styles and fine-grained attributes. Second, it generates high-quality dubbing through large speech generation models, guided by multimodal conditions. Additionally, we have developed a movie dubbing dataset with CoT annotations. The evaluation results demonstrate a performance improvement over state-of-the-art methods across multiple datasets. In particular, SPK-SIM and EMO-SIM increase from 82.48% to 89.74% and from 66.24% to 78.88%, respectively, for dubbing setting 2.0 on the V2C Animation dataset; LSE-D and MCD-SL decrease from 14.79 to 14.63 and from 5.24 to 4.74, respectively, for dubbing setting 2.0 on the Grid dataset; and SPK-SIM increases from 64.03 to 83.42 while WER decreases from 52.69% to 23.20% for the initial reasoning setting on the proposed CoT-Movie-Dubbing dataset, in comparison with state-of-the-art models.
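Among the reported metrics, word error rate (WER) has a standard definition: the word-level edit distance between the reference and the recognized transcript, divided by the number of reference words. The sketch below implements that textbook definition; it is not the paper's evaluation code, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Lower is better, so the reported drop from 52.69% to 23.20% means the generated speech is transcribed with fewer than half as many word errors.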

[CV-92] Introducing the Short-Time Fourier Kolmogorov Arnold Network: A Dynamic Graph CNN Approach for Tree Species Classification in 3D Point Clouds

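[Quick read]: This paper addresses the challenge of developing efficient, low-computation models for tree species classification from Terrestrial Laser Scanning (TLS) and Airborne Laser Scanning (ALS) data. Advanced deep learning models perform well on 3D point cloud classification, but their high complexity hinders the development of efficient, lightweight architectures. The paper proposes STFT-KAN, a novel Kolmogorov-Arnold network that integrates the Short-Time Fourier Transform (STFT) and can replace the standard linear layer with activation, reducing parameter count and computational complexity while maintaining performance. The key finding is that STFT-KAN, implemented within a lightweight DGCNN (liteDGCNN), outperforms existing KAN variants on TLS data, and that a hybrid architecture reduces the parameter count further while performing comparably to MLP-based models, showing that the approach stays competitive at lower complexity.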

Link: https://arxiv.org/abs/2503.23647
Authors: Said Ohamouddoua, Mohamed Ohamouddoub, Rafik Lasrib, Hanaa El Afiaa, Raddouane Chiheba, Abdellatif El Afiaa
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Accurate classification of tree species based on Terrestrial Laser Scanning (TLS) and Airborne Laser Scanning (ALS) is essential for biodiversity conservation. While advanced deep learning models for 3D point cloud classification have demonstrated strong performance in this domain, their high complexity often hinders the development of efficient, low-computation architectures. In this paper, we introduce STFT-KAN, a novel Kolmogorov-Arnold network that integrates the Short-Time Fourier Transform (STFT), which can replace the standard linear layer with activation. We implemented STFT-KAN within a lightweight version of DGCNN, called liteDGCNN, to classify tree species using the TLS data. Our experiments show that STFT-KAN outperforms existing KAN variants by effectively balancing model complexity and performance with parameter count reduction, achieving competitive results compared to MLP-based models. Additionally, we evaluated a hybrid architecture that combines MLP in edge convolution with STFT-KAN in other layers, achieving comparable performance to MLP models while reducing the parameter count by 50% and 75% compared to other KAN-based variants. Furthermore, we compared our model to leading 3D point cloud learning approaches, demonstrating that STFT-KAN delivers competitive results compared to the state-of-the-art method PointMLP lite with an 87% reduction in parameter count.

[CV-93] Uni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers HPCA’25

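[Quick read]: This paper addresses the insufficient speed of real-time neural rendering for immersive interaction, which stems from the lack of a universal algorithmic solution for different application scenarios and from existing devices or accelerators being dedicated to specific rendering pipelines. It proposes a unified neural rendering accelerator that supports a wide range of typical neural rendering pipelines, enabling real-time, on-device rendering across applications while maintaining efficiency and compatibility. The key is a reconfigurable hardware architecture that dynamically adjusts dataflow to match the rendering metric requirements of diverse applications, effectively supporting both typical and the latest hybrid rendering pipelines. Experiments confirm that the accelerator achieves real-time neural rendering across representative pipelines on edge devices, an important step toward the next generation of neural graphics applications.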

Link: https://arxiv.org/abs/2503.23644
Authors: Chaojian Li, Sixu Li, Linrui Jiang, Jingqun Zhang, Yingyan Celine Lin
Affiliations: Georgia Institute of Technology
Categories: Graphics (cs.GR); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by HPCA’25


Abstract:Recent advancements in neural rendering technologies and their supporting devices have paved the way for immersive 3D experiences, significantly transforming human interaction with intelligent devices across diverse applications. However, achieving the desired real-time rendering speeds for immersive interactions is still hindered by (1) the lack of a universal algorithmic solution for different application scenarios and (2) the dedication of existing devices or accelerators to merely specific rendering pipelines. To overcome this challenge, we have developed a unified neural rendering accelerator that caters to a wide array of typical neural rendering pipelines, enabling real-time and on-device rendering across different applications while maintaining both efficiency and compatibility. Our accelerator design is based on the insight that, although neural rendering pipelines vary and their algorithm designs are continually evolving, they typically share common operators, predominantly executing similar workloads. Building on this insight, we propose a reconfigurable hardware architecture that can dynamically adjust dataflow to align with specific rendering metric requirements for diverse applications, effectively supporting both typical and the latest hybrid rendering pipelines. Benchmarking experiments and ablation studies on both synthetic and real-world scenes demonstrate the effectiveness of the proposed accelerator. The proposed unified accelerator stands out as the first solution capable of achieving real-time neural rendering across varied representative pipelines on edge devices, potentially paving the way for the next generation of neural graphics applications.

[CV-94] Language-Guided Trajectory Traversal in Disentangled Stable Diffusion Latent Space for Factorized Medical Image Generation

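[Quick read]: This paper investigates how pre-trained vision-language foundation models, once fine-tuned on medical image datasets, can disentangle and control latent factors in medical image generation. Through extensive experiments, it shows that fine-tuned, language-guided Stable Diffusion inherently learns to factorize key attributes for image generation, such as the patient's anatomical structures or disease diagnostic features. It also proposes a framework that identifies, isolates, and manipulates these key attributes through trajectory traversal in the generative model's latent space, enabling precise control over medical image synthesis.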

Link: https://arxiv.org/abs/2503.23623
Authors: Zahra TehraniNasab, Amar Kumar, Tal Arbel
Affiliations: McGill University; MILA-Quebec AI Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages


Abstract:Text-to-image diffusion models have demonstrated a remarkable ability to generate photorealistic images from natural language prompts. These high-resolution, language-guided synthesized images are essential for the explainability of disease or exploring causal relationships. However, their potential for disentangling and controlling latent factors of variation in specialized domains like medical imaging remains under-explored. In this work, we present the first investigation of the power of pre-trained vision-language foundation models, once fine-tuned on medical image datasets, to perform latent disentanglement for factorized medical image generation and interpolation. Through extensive experiments on chest X-ray and skin datasets, we illustrate that fine-tuned, language-guided Stable Diffusion inherently learns to factorize key attributes for image generation, such as the patient’s anatomical structures or disease diagnostic features. We devise a framework to identify, isolate, and manipulate key attributes through latent space trajectory traversal of generative models, facilitating precise control over medical image synthesis.

[CV-95] Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging

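[Quick read]: This paper asks whether fine-tuned vision-language foundation models (VLMs) can help identify critical, and possibly unknown, data properties in a dataset. It proposes a method and evaluates it on a chest X-ray dataset, showing that fine-tuned VLMs can generate high-resolution, precisely edited images and outperform methods based on Structural Causal Models (SCMs) on numerous metrics. It also reveals hidden relationships in the dataset that were previously obscured by the granularity of the available metadata and by limits on model capacity.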

链接: https://arxiv.org/abs/2503.23618
作者: Amar Kumar,Anita Kriz,Barak Pertzov,Tal Arbel
机构: McGill University (麦吉尔大学); MILA-Quebec AI Institute (蒙特利尔学习算法研究所); McMaster University (麦克马斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text, with emerging applications in medical imaging. In this work, we are the first to investigate the question: ‘Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?’ By evaluating our proposed method on a chest x-ray dataset, we show that these models can generate high-resolution, precisely edited images compared to methods that rely on Structural Causal Models (SCMs) according to numerous metrics. For the first time, we demonstrate that fine-tuned VLMs can reveal hidden data relationships that were previously obscured due to available metadata granularity and model capacity limitations. Our experiments demonstrate both the potential of these models to reveal underlying dataset properties while also exposing the limitations of fine-tuned VLMs for accurate image editing and susceptibility to biases and spurious correlations.

[CV-96] Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries CVPR2025

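[Quick read]: This paper addresses depth extraction from photon-limited, defocused images. The core difficulty is that depth from defocus (DfD) relies on an accurate estimate of defocus blur, which is highly sensitive to image noise. The key solution is a method that robustly measures object depth along defocused boundaries. It builds on a new image patch representation, Blurry-Edges, that explicitly stores and visualizes rich low-level patch information such as boundaries, color, and smoothness. A deep neural network predicts the Blurry-Edges representation from a pair of differently defocused images, and depth is then computed with a closed-form DfD relation derived in the paper. Experiments show higher depth estimation accuracy on photon-limited images than a broad range of state-of-the-art DfD methods.

Link: https://arxiv.org/abs/2503.23606
Authors: Wei Xu, Charles James Wagner, Junjie Luo, Qi Guo
Affiliations: Elmore Family School of Electrical and Computer Engineering, Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025. Project page: this https URL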

Abstract:Extracting depth information from photon-limited, defocused images is challenging because depth from defocus (DfD) relies on accurate estimation of defocus blur, which is fundamentally sensitive to image noise. We present a novel approach to robustly measure object depths from photon-limited images along the defocused boundaries. It is based on a new image patch representation, Blurry-Edges, that explicitly stores and visualizes a rich set of low-level patch information, including boundaries, color, and smoothness. We develop a deep neural network architecture that predicts the Blurry-Edges representation from a pair of differently defocused images, from which depth can be calculated using a closed-form DfD relation we derive. The experimental results on synthetic and real data show that our method achieves the highest depth estimation accuracy on photon-limited images compared to a broad range of state-of-the-art DfD methods.

[CV-97] GenVP: Generating Visual Puzzles with Contrastive Hierarchical VAEs ICLR2025

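[Quick read]: Existing algorithms that solve Raven's Progressive Matrices (RPMs) remain locked to a fixed set of puzzles and lack the human ability to generalize from rules and create new puzzles. The key solution is Generative Visual Puzzles (GenVP), a framework that models the entire RPM generation process: it can generate multiple solutions for a given problem prompt and create entirely new puzzles from a set of abstract rules. Experiments show that GenVP achieves state-of-the-art puzzle-solving accuracy and out-of-distribution generalization across 22 OOD scenarios, and that, unlike other generative models, it still generalizes efficiently as the feasible solution space grows. By capturing the relationships between abstract rules and visual object properties, GenVP can also produce a wide range of complete RPMs from abstract rules.

Link: https://arxiv.org/abs/2503.23598
Authors: Kalliopi Basioti, Pritish Sahu, Qingze Tony Liu, Zihao Xu, Hao Wang, Vladimir Pavlovic
Affiliations: Rutgers University; SRI International
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2025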

Abstract:Raven’s Progressive Matrices (RPMs) is an established benchmark to examine the ability to perform high-level abstract visual reasoning (AVR). Despite the current success of algorithms that solve this task, humans can generalize beyond a given puzzle and create new puzzles given a set of rules, whereas machines remain locked in solving a fixed puzzle from a curated choice list. We propose Generative Visual Puzzles (GenVP), a framework to model the entire RPM generation process, a substantially more challenging task. Our model’s capability spans from generating multiple solutions for one specific problem prompt to creating complete new puzzles out of the desired set of rules. Experiments on five different datasets indicate that GenVP achieves state-of-the-art (SOTA) performance both in puzzle-solving accuracy and out-of-distribution (OOD) generalization in 22 OOD scenarios. Compared to SOTA generative approaches, which struggle to solve RPMs when the feasible solution space increases, GenVP efficiently generalizes to these challenging setups. Moreover, our model demonstrates the ability to produce a wide range of complete RPMs given a set of abstract rules by effectively capturing the relationships between abstract rules and visual object properties.

[CV-98] PhysPose: Refining 6D Object Poses with Physical Constraints

[Quick read]: This paper addresses accurate 6D object pose estimation from images. Existing methods often produce physically inconsistent pose estimates, which limits their use in real-world scenarios. The paper proposes PhysPose, which integrates physical reasoning into pose estimation through a post-processing optimization that enforces non-penetration and gravitational constraints. The key is to refine pose estimates using scene geometry so that they are physically plausible. PhysPose achieves state-of-the-art accuracy on the YCB-Video dataset, improves over state-of-the-art methods on HOPE-Video, and significantly raises the success rate of a robotic pick-and-place task.

Link: https://arxiv.org/abs/2503.23587
Authors: Martin Malenický, Martin Cífka, Médéric Fourmy, Louis Montaut, Justin Carpentier, Josef Sivic, Vladimir Petrik
Affiliations: Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague; Faculty of Electrical Engineering, Czech Technical University in Prague; Inria - Département d’Informatique de l’École normale supérieure, PSL Research University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Accurate 6D object pose estimation from images is a key problem in object-centric scene understanding, enabling applications in robotics, augmented reality, and scene reconstruction. Despite recent advances, existing methods often produce physically inconsistent pose estimates, hindering their deployment in real-world scenarios. We introduce PhysPose, a novel approach that integrates physical reasoning into pose estimation through a postprocessing optimization enforcing non-penetration and gravitational constraints. By leveraging scene geometry, PhysPose refines pose estimates to ensure physical plausibility. Our approach achieves state-of-the-art accuracy on the YCB-Video dataset from the BOP benchmark and improves over the state-of-the-art pose estimation methods on the HOPE-Video dataset. Furthermore, we demonstrate its impact in robotics by significantly improving success rates in a challenging pick-and-place task, highlighting the importance of physical consistency in real-world applications.
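
To make the two constraints more concrete, here is a minimal sketch, assuming the object is represented by a set of 3D points in world coordinates and the supporting surface is the horizontal plane `z = ground_z`. The function `physical_penalty` and both of its terms are hypothetical stand-ins, not the optimization used in PhysPose.

```python
import numpy as np

def physical_penalty(points_world, ground_z=0.0):
    """Zero when the object rests exactly on the plane z = ground_z (toy model)."""
    heights = points_world[:, 2] - ground_z
    # Non-penetration: squared depth of every point that lies below the plane.
    penetration = np.sum(np.clip(-heights, 0.0, None) ** 2)
    # Gravity: squared gap between the lowest point and the plane (floating object).
    floating = max(float(heights.min()), 0.0) ** 2
    return float(penetration + floating)

# Corners of a unit cube resting on the ground plane.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=float)
resting = physical_penalty(cube)
floating_cube = physical_penalty(cube + np.array([0.0, 0.0, 0.5]))
sunk_cube = physical_penalty(cube - np.array([0.0, 0.0, 0.5]))
```

A pose refinement step could minimize such a penalty over the object's pose; handling object-to-object contact would need scene geometry that this toy example leaves out.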

[CV-99] DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution

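[Quick read]: This paper explores whether an advanced diffusion transformer (DiT) based diffusion model can be adopted for Real-World Image Super-Resolution (Real-ISR). It proposes DiT4SR, one of the pioneering works to adapt a large-scale DiT model to Real-ISR. Rather than directly injecting embeddings extracted from the low-resolution (LR) image, as ControlNet does, DiT4SR integrates the LR embeddings into DiT's original attention mechanism, enabling bidirectional information flow between the LR latent and the generated latent. This interaction lets the LR stream evolve with the diffusion process and provide progressively refined guidance that better aligns with the generated latent at each diffusion step. In addition, LR guidance is injected into the generated latent through a cross-stream convolution layer, which compensates for DiT's limited ability to capture local information. Extensive experiments show that these designs give the DiT model superior performance on Real-ISR.

Link: https://arxiv.org/abs/2503.23580
Authors: Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy Ren, Chun-Le Guo, Chongyi Li
Affiliations: Nankai University; SenseTime Research; PBVR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: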

Abstract:Large-scale pre-trained diffusion models are becoming increasingly popular in solving the Real-World Image Super-Resolution (Real-ISR) problem because of their rich generative priors. The recent development of diffusion transformer (DiT) has witnessed overwhelming performance over the traditional UNet-based architecture in image generation, which also raises the question: Can we adopt the advanced DiT-based diffusion model for Real-ISR? To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR. Instead of directly injecting embeddings extracted from low-resolution (LR) images like ControlNet, we integrate the LR embeddings into the original attention mechanism of DiT, allowing for the bidirectional flow of information between the LR latent and the generated latent. The sufficient interaction of these two streams allows the LR stream to evolve with the diffusion process, producing progressively refined guidance that better aligns with the generated latent at each diffusion step. Additionally, the LR guidance is injected into the generated latent via a cross-stream convolution layer, compensating for DiT’s limited ability to capture local information. These simple but effective designs endow the DiT model with superior performance in Real-ISR, which is demonstrated by extensive experiments. Project Page: this https URL.
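
The bidirectional flow described above can be pictured as self-attention over the concatenation of both token streams, so that LR tokens attend to generated tokens and vice versa. The sketch below is a single-head NumPy toy under that assumption; the function `joint_attention`, the shared projection matrices, and all sizes are illustrative choices and not the DiT4SR architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(lr_tokens, gen_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the concatenation of both token streams."""
    tokens = np.concatenate([lr_tokens, gen_tokens], axis=0)  # (N_lr + N_gen, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))            # every token sees both streams
    out = attn @ v
    n_lr = lr_tokens.shape[0]
    return out[:n_lr], out[n_lr:], attn                       # updated LR and generated streams

rng = np.random.default_rng(0)
d = 8
lr = rng.normal(size=(3, d))     # hypothetical LR tokens
gen = rng.normal(size=(5, d))    # hypothetical generated-latent tokens
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
lr_out, gen_out, attn = joint_attention(lr, gen, w_q, w_k, w_v)
```

The off-diagonal blocks of `attn` carry the cross-stream weights: `attn[:3, 3:]` is LR attending to generated tokens and `attn[3:, :3]` is the reverse direction.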

[CV-100] Multiview Image-Based Localization

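[Quick read]: This paper targets the relatively poor position and orientation accuracy of image retrieval (IR) based localization, while avoiding the complexity, privacy concerns, and computational cost of 3D-reconstruction and deep learning (DNN) approaches. It proposes a hybrid method that stores only image features in the database, as some IR methods do, but localizes through a latent 3D reconstruction without retaining a full 3D scene reconstruction. The solution rests on two ideas: (i) estimating the query camera center from relative translation estimates only, not relative rotation estimates, by decoupling the two; and (ii) computing the optimal pose directly from multiview correspondences rather than from estimated relative poses, which removes an intermediate step. Experiments on the 7-Scenes and Cambridge Landmarks datasets show improved localization, with better run time and memory footprint than state-of-the-art methods.

Link: https://arxiv.org/abs/2503.23577
Authors: Cameron Fiore, Hongyi Fan, Benjamin Kimia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: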

Abstract:The image retrieval (IR) approach to image localization has distinct advantages to the 3D and the deep learning (DNN) approaches: it is seen-agnostic, simpler to implement and use, has no privacy issues, and is computationally efficient. The main drawback of this approach is relatively poor localization in both position and orientation of the query camera when compared to the competing approaches. This paper represents a hybrid approach that stores only image features in the database like some IR methods, but relies on a latent 3D reconstruction, like 3D methods but without retaining a 3D scene reconstruction. The approach is based on two ideas: (i) a novel proposal where query camera center estimation relies only on relative translation estimates but not relative rotation estimates through a decoupling of the two, and (ii) a shift from computing optimal pose from estimated relative pose to computing optimal pose from multiview correspondences, thus cutting out the "middle-man". Our approach shows improved performance on the 7-Scenes and Cambridge Landmarks datasets while also improving on timing and memory footprint as compared to state-of-the-art.
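
One way to see how a camera center can be recovered from translation directions alone is a least-squares intersection of rays: each database camera contributes its known center and a direction pointing toward the query camera. The sketch below is a generic textbook construction under that assumption (directions already expressed in the world frame); `intersect_rays` is a hypothetical helper, not the authors' estimator.

```python
import numpy as np

def intersect_rays(centers, directions):
    """Point minimizing the summed squared distance to the rays (c_i, d_i)."""
    a = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        d = d / np.linalg.norm(d)
        proj = np.eye(3) - np.outer(d, d)   # projector onto the plane orthogonal to d
        a += proj
        b += proj @ c
    return np.linalg.solve(a, b)            # needs at least two non-parallel rays

query_center = np.array([1.0, 2.0, 3.0])    # ground truth used to build the toy rays
db_centers = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 1.0]])
db_directions = query_center - db_centers   # only the direction matters, not the length
estimate = intersect_rays(db_centers, db_directions)
```

Because only directions enter the solve, the unknown scale of each relative translation does not matter, which is the property this toy example is meant to show.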

[CV-101] DASH: Detection and Assessment of Systematic Hallucinations of VLMs

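[Quick read]: This paper addresses object hallucinations of vision-language models (VLMs) in open-world settings. Existing benchmarks quantify hallucinations with small labeled datasets, which is insufficient for assessing hallucinations in open-world settings and inadequate for detecting systematic errors in VLMs. The paper proposes DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline that identifies systematic hallucinations of VLMs on real-world images. A key component is DASH-OPT, which optimizes over the "natural image manifold" to generate images that mislead the VLM. Applying DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes yields more than 19k clusters containing 950k images. The paper also shows that fine-tuning PaliGemma on the model-specific images obtained with DASH mitigates object hallucinations.

Link: https://arxiv.org/abs/2503.23573
Authors: Maximilian Augustin, Yannic Neuhaus, Matthias Hein
Affiliations: Tübingen AI Center – University of Tübingen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: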

Abstract:Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the "natural image manifold" to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than 19k clusters with 950k images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations. Code and data are available at this https URL.

[CV-102] Enhancing Creative Generation on Stable Diffusion-based Models CVPR2025

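[Quick read]: Existing text-to-image generative models have limited creative capability: adding "creative" to the prompt seldom yields the desired result. The paper proposes C3 (Creative Concept Catalyst), a training-free method that selectively amplifies features during the denoising process to encourage more creative outputs, and it gives guidelines for choosing amplification factors based on two main aspects of creativity. The authors describe C3 as the first study to enhance creativity in diffusion models without extensive computational cost, and they demonstrate it across several Stable Diffusion-based models.

Link: https://arxiv.org/abs/2503.23538
Authors: Jiyeon Han, Dahee Kwon, Gayoung Lee, Junho Kim, Jaesik Choi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 accepted paper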

Abstract:Recent text-to-image generative models, particularly Stable Diffusion and its distilled variants, have achieved impressive fidelity and strong text-image alignment. However, their creative capability remains constrained, as including "creative" in prompts seldom yields the desired results. This paper introduces C3 (Creative Concept Catalyst), a training-free approach designed to enhance creativity in Stable Diffusion-based models. C3 selectively amplifies features during the denoising process to foster more creative outputs. We offer practical guidelines for choosing amplification factors based on two main aspects of creativity. C3 is the first study to enhance creativity in diffusion models without extensive computational costs. We demonstrate its effectiveness across various Stable Diffusion-based models.

[CV-103] BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation

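[Quick read]: Medical image segmentation typically relies on visual data alone and overlooks the rich textual information clinicians use. Existing approaches usually process visual and textual features independently, which weakens cross-modal alignment, and simple fusion techniques fail because of the inherent differences between spatial visual features and sequential text embeddings. Medical terminology also deviates from general language, which limits off-the-shelf text encoders and further hinders vision-language alignment. The paper proposes BiPVL-Seg, an end-to-end framework that combines vision-language fusion and embedding alignment through architectural and training innovations. Specifically, it introduces bidirectional progressive fusion, which exchanges information stage by stage between the vision and text encoders, and a global-local contrastive alignment objective that improves the text encoder's comprehension at both class and concept levels, improving segmentation performance.

Link: https://arxiv.org/abs/2503.23534
Authors: Rafi Ibn Sultan, Hui Zhu, Chengyin Li, Dongxiao Zhu
Affiliations: Wayne State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: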

Abstract:Medical image segmentation typically relies solely on visual data, overlooking the rich textual information clinicians use for diagnosis. Vision-language models attempt to bridge this gap, but existing approaches often process visual and textual features independently, resulting in weak cross-modal alignment. Simple fusion techniques fail due to the inherent differences between spatial visual features and sequential text embeddings. Additionally, medical terminology deviates from general language, limiting the effectiveness of off-the-shelf text encoders and further hindering vision-language alignment. We propose BiPVL-Seg, an end-to-end framework that integrates vision-language fusion and embedding alignment through architectural and training innovations, where both components reinforce each other to enhance medical image segmentation. BiPVL-Seg introduces bidirectional progressive fusion in the architecture, which facilitates stage-wise information exchange between vision and text encoders. Additionally, it incorporates global-local contrastive alignment, a training objective that enhances the text encoder’s comprehension by aligning text and vision embeddings at both class and concept levels. Extensive experiments on diverse medical imaging benchmarks across CT and MR modalities demonstrate BiPVL-Seg’s superior performance when compared with state-of-the-art methods in complex multi-class segmentation. Source code is available in this GitHub repository.

[CV-104] ViLAaD: Enhancing "Attracting and Dispersing" Source-Free Domain Adaptation with Vision-and-Language Model

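[Quick read]: This paper addresses Source-Free Domain Adaptation (SFDA): adapting a pre-trained source model to a target dataset from a different domain without access to the source data. Conventional methods are limited by the information encoded in the source model and the unlabeled target data, and recent methods that use auxiliary resources are still at an early stage. The paper proposes ViL-enhanced AaD (ViLAaD), which extends an existing SFDA framework with Vision-and-Language (ViL) models to incorporate auxiliary information. It builds on the widely adopted Attracting and Dispersing (AaD) technique and generalizes its core principle so that ViL models serve naturally as a strong initialization for target adaptation. ViLAaD keeps the simplicity and flexibility of AaD while using ViL models to boost adaptation performance. It can also be incorporated into an alternating optimization framework with ViL prompt tuning and extended with additional objectives for target model adaptation. Experiments show that this enhanced version, ViLAaD++, achieves state-of-the-art performance across multiple SFDA scenarios.

Link: https://arxiv.org/abs/2503.23529
Authors: Shuhei Tarashima, Xinqi Shu, Norio Tagawa
Affiliations: NTT Communications Corporation; Tokyo Metropolitan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages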

Abstract:Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to a target dataset from a different domain without access to the source data. Conventional SFDA methods are limited by the information encoded in the pre-trained source model and the unlabeled target data. Recently, approaches leveraging auxiliary resources have emerged, yet remain in their early stages, offering ample opportunities for research. In this work, we propose a novel method that incorporates auxiliary information by extending an existing SFDA framework using Vision-and-Language (ViL) models. Specifically, we build upon Attracting and Dispersing (AaD), a widely adopted SFDA technique, and generalize its core principle to naturally integrate ViL models as a powerful initialization for target adaptation. Our approach, called ViL-enhanced AaD (ViLAaD), preserves the simplicity and flexibility of the AaD framework, while leveraging ViL models to significantly boost adaptation performance. We validate our method through experiments using various ViL models, demonstrating that ViLAaD consistently outperforms both AaD and zero-shot classification by ViL models, especially when both the source model and ViL model provide strong initializations. Moreover, the flexibility of ViLAaD allows it to be seamlessly incorporated into an alternating optimization framework with ViL prompt tuning and extended with additional objectives for target model adaptation. Extensive experiments on four SFDA benchmarks show that this enhanced version, ViLAaD++, achieves state-of-the-art performance across multiple SFDA scenarios, including Closed-set SFDA, Partial-set SFDA, and Open-set SFDA.
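
For readers unfamiliar with the attracting-and-dispersing idea that ViLAaD builds on, the sketch below shows one common way to write such an objective: predictions are pulled toward those of their feature-space neighbours and pushed away from the other predictions in the batch. The function `aad_loss`, the weight `lam`, and the way neighbours are supplied are assumptions for illustration; the neighbour retrieval step and the ViL extension described in the abstract are not modeled.

```python
import numpy as np

def aad_loss(probs, neighbor_probs, lam=1.0):
    """probs: (B, C) softmax outputs; neighbor_probs: (B, K, C) outputs of K neighbours."""
    # Attract: negative dot product with each neighbour's prediction.
    attract = -np.einsum("bc,bkc->b", probs, neighbor_probs)
    # Disperse: dot product with every other prediction in the batch.
    sim = probs @ probs.T
    disperse = sim.sum(axis=1) - np.diag(sim)
    return float(np.mean(attract + lam * disperse))

# Two samples confidently assigned to different classes, each agreeing with its neighbour.
separated = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_separated = aad_loss(separated, separated[:, None, :])

# Both samples collapsed onto the same class.
collapsed = np.array([[1.0, 0.0], [1.0, 0.0]])
loss_collapsed = aad_loss(collapsed, collapsed[:, None, :])
```

The dispersing term is what penalizes the collapsed case: both batches agree perfectly with their neighbours, but only the separated one avoids similarity with the rest of the batch.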

[CV-105] BoundMatch: Boundary detection applied to semi-supervised segmentation for urban-driving scenes

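[Quick read]: This paper addresses a key challenge in semi-supervised semantic segmentation (SS-SS): the precise delineation of object boundaries, which current teacher-student consistency regularization methods often overlook despite their strong results. It proposes BoundMatch, whose core mechanism, Boundary Consistency Regularized Multi-Task Learning (BCRM), enforces prediction agreement between teacher and student on both segmentation masks and detailed semantic boundaries. Two lightweight fusion modules are added: Boundary-Semantic Fusion (BSF) injects learned boundary cues into the segmentation decoder, and Spatial Gradient Fusion (SGF) refines boundary predictions using mask gradients, producing higher-quality boundary pseudo-labels. The framework is built on SAMTH, a strong teacher-student baseline with a Harmonious Batch Normalization (HBN) update strategy for improved stability. Experiments on multiple datasets show performance competitive with state-of-the-art methods and significantly better boundary-specific metrics.

Link: https://arxiv.org/abs/2503.23519
Authors: Haruya Ishikawa, Yoshimitsu Aoki
Affiliations: Department of Electrical Engineering, Keio University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures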

Abstract:Semi-supervised semantic segmentation (SS-SS) aims to mitigate the heavy annotation burden of dense pixel labeling by leveraging abundant unlabeled images alongside a small labeled set. While current teacher-student consistency regularization methods achieve strong results, they often overlook a critical challenge: the precise delineation of object boundaries. In this paper, we propose BoundMatch, a novel multi-task SS-SS framework that explicitly integrates semantic boundary detection into the consistency regularization pipeline. Our core mechanism, Boundary Consistency Regularized Multi-Task Learning (BCRM), enforces prediction agreement between teacher and student models on both segmentation masks and detailed semantic boundaries. To further enhance performance and sharpen contours, BoundMatch incorporates two lightweight fusion modules: Boundary-Semantic Fusion (BSF) injects learned boundary cues into the segmentation decoder, while Spatial Gradient Fusion (SGF) refines boundary predictions using mask gradients, leading to higher-quality boundary pseudo-labels. This framework is built upon SAMTH, a strong teacher-student baseline featuring a Harmonious Batch Normalization (HBN) update strategy for improved stability. Extensive experiments on diverse datasets including Cityscapes, BDD100K, SYNTHIA, ADE20K, and Pascal VOC show that BoundMatch achieves competitive performance against state-of-the-art methods while significantly improving boundary-specific evaluation metrics. We also demonstrate its effectiveness in realistic large-scale unlabeled data scenarios and on lightweight architectures designed for mobile deployment.
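
The link between mask gradients and boundaries can be seen in a few lines: the spatial gradient of a segmentation probability map is large only where the prediction changes, which is where the boundary lies. The sketch below illustrates just that observation; `boundary_from_mask` and its threshold are hypothetical and do not reproduce the SGF module.

```python
import numpy as np

def boundary_from_mask(prob_map, threshold=0.25):
    """Mark pixels where the spatial gradient magnitude of the mask is large."""
    gy, gx = np.gradient(prob_map.astype(float))  # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold

# Toy mask: left half background, right half foreground.
mask = np.zeros((6, 6))
mask[:, 3:] = 1.0
boundary = boundary_from_mask(mask)
```

With central differences the edge between columns 2 and 3 produces a gradient of 0.5 in both columns, so the detected boundary is two pixels wide.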

[CV-106] ReferDINO-Plus: 2nd Solution for 4th PVUW MeViS Challenge at CVPR 2025

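[Quick read]: This paper addresses Referring Video Object Segmentation (RVOS), which segments target objects throughout a video based on a text description. The task has attracted growing attention because of its potential in video editing and human-agent interaction. The paper proposes ReferDINO-Plus: first, it strengthens ReferDINO with SAM2's advantages in mask quality and object consistency; second, it introduces a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2 to balance performance between single-object and multi-object scenarios. ReferDINO-Plus achieves 60.43 $\mathcal{J}\&\mathcal{F}$ on the MeViS test set and takes 2nd place in the MeViS PVUW challenge at CVPR 2025.

Link: https://arxiv.org/abs/2503.23509
Authors: Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu
Affiliations: Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: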

Abstract:Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human-agent interaction. Recently, ReferDINO has demonstrated promising performance in this task by adapting object-level vision-language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single-object and multi-object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO-Plus, achieves 60.43 $\mathcal{J}\&\mathcal{F}$ on MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: this https URL.
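
A conditional fusion of two masks can be as simple as an area test that decides whether to keep one mask or take the union of both. The sketch below shows that pattern; the rule, the `ratio` threshold, and the roles assigned to `mask_a` and `mask_b` are hypothetical, since the abstract does not state the actual condition used in ReferDINO-Plus.

```python
import numpy as np

def conditional_fuse(mask_a, mask_b, ratio=1.5):
    """Keep mask_b alone unless mask_a covers much more area, then return the union."""
    area_a, area_b = mask_a.sum(), mask_b.sum()
    if area_a > ratio * area_b:
        return mask_a | mask_b   # mask_b likely missed part of the target(s)
    return mask_b                # the two masks roughly agree; keep mask_b

# Hypothetical boolean masks for one frame.
coarse = np.zeros((4, 4), dtype=bool)
coarse[:, :3] = True             # covers 12 pixels, e.g. several objects
refined = np.zeros((4, 4), dtype=bool)
refined[:2, :2] = True           # covers 4 pixels, e.g. a single object

fused_multi = conditional_fuse(coarse, refined)
fused_single = conditional_fuse(refined.copy(), refined)
```

The intent of such a rule is to keep the higher-quality mask in single-object cases while falling back to a union when one model has clearly dropped objects.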

[CV-107] Re-Aligning Language to Visual Objects with an Agentic Workflow ICLR2025

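[Quick read]: This paper addresses degraded vision-language alignment in language-based object detection (LOD). When vision-language models (VLMs) are used to generate object descriptions automatically, their hallucinations produce inaccurate descriptions (e.g., object name, color, and shape) that harm vision-language (VL) alignment quality. The key solution is Real-LOD, an agentic workflow controlled by a large language model (LLM) that cycles through planning, tool use, and reflection steps and adaptively adjusts image and text prompts to refine the VLM-generated object descriptions. Built on neural-symbolic designs, the workflow gradually improves the language expressions; a prevalent LOD model trained on the resulting data surpasses existing LOD methods by around 50% on standard benchmarks. The central point is that automatic refinement can preserve data quality while scaling up data quantity, improving LOD from a data-alignment perspective.

Link: https://arxiv.org/abs/2503.23508
Authors: Yuming Chen, Jiangyan Feng, Haodong Zhang, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou, Ming-Ming Cheng, Yibing Song
Affiliations: VCIP, Nankai University; SenseTime Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 20 figures, 17 tables, ICLR 2025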

Abstract:Language-based object detection (LOD) aims to align visual objects with language expressions. A large amount of paired data is utilized to improve LOD model generalizations. During the training process, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects, facilitating training data scaling up. In this process, we observe that VLM hallucinations bring inaccurate object descriptions (e.g., object name, color, and shape) to deteriorate VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM to re-align language to visual objects via adaptively adjusting image and text prompts. We name this workflow Real-LOD, which includes planning, tool use, and reflection steps. Given an image with detected objects and VLM raw language expressions, Real-LOD reasons its state automatically and arranges action based on our neural symbolic designs (i.e., planning). The action will adaptively adjust the image and text prompts and send them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted in a cyclic form to gradually improve language descriptions for re-aligning to visual objects. We construct a dataset that contains a tiny amount of 0.18M images with re-aligned language expression and train a prevalent LOD model to surpass existing LOD methods by around 50% on the standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, reveals a potential to preserve data quality along with scaling up data quantity, which further improves LOD performance from a data-alignment perspective.

[CV-108] Federated Self-Supervised Learning for One-Shot Cross-Modal and Cross-Imaging Technique Segmentation

[Quick read]: This paper addresses self-supervised few-shot segmentation under federated learning when data is scarce (in particular, one-shot segmentation) and clients hold data from different modalities and imaging techniques such as MR and CT. It adapts the existing self-supervised few-shot segmentation framework CoWPro to the federated setting and strengthens it with a fused Dice loss, which considerably improves performance over the baseline CoWPro. The framework is evaluated on a completely unseen held-out part of the local client datasets, where it performs on par with or better than the FedAvg version of CoWPro.

Link: https://arxiv.org/abs/2503.23507
Authors: Siladittya Manna, Suresh Das, Sayantari Ghosh, Saumik Bhattacharya
Affiliations: Hong Kong Baptist University; Narayana Superspeciality Hospital, Howrah, India; National Institute of Technology Durgapur; Indian Institute of Technology, Kharagpur
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)
Comments:

Abstract:Decentralized federated learning enables learning of data representations from multiple sources without compromising the privacy of the clients. In applications like medical image segmentation, where obtaining a large annotated dataset from a single source is a distressing problem, federated self-supervised learning can provide some solace. In this work, we push the limits further by exploring a federated self-supervised one-shot segmentation task representing a more data-scarce scenario. We adopt a pre-existing self-supervised few-shot segmentation framework CoWPro and adapt it to the federated learning scenario. To the best of our knowledge, this work is the first to attempt a self-supervised few-shot segmentation task in the federated learning domain. Moreover, we consider the clients to be constituted of data from different modalities and imaging techniques like MR or CT, which makes the problem even harder. Additionally, we reinforce and improve the baseline CoWPro method using a fused dice loss which shows considerable improvement in performance over the baseline CoWPro. Finally, we evaluate this novel framework on a completely unseen held-out part of the local client dataset. We observe that the proposed framework can achieve performance at par or better than the FedAvg version of the CoWPro framework on the held-out validation dataset.
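
As background for the Dice-based loss mentioned above, the sketch below implements the standard soft Dice loss on a pair of masks. It is the generic formulation only; how the paper fuses Dice terms is not described in the abstract, and the function name `dice_loss` and the smoothing constant `eps` are choices made here.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 0 for perfect overlap, close to 1 for disjoint masks."""
    intersection = np.sum(pred * target)
    denominator = np.sum(pred) + np.sum(target)
    return float(1.0 - (2.0 * intersection + eps) / (denominator + eps))

target = np.array([[1.0, 1.0], [0.0, 0.0]])
perfect = dice_loss(target, target)            # prediction equals the target
disjoint = dice_loss(1.0 - target, target)     # prediction covers only the background
```

Because the loss depends on overlap relative to mask size, it is less sensitive to foreground-background imbalance than a per-pixel loss, which is one reason it is common in medical segmentation.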

[CV-109] Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model

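[Quick read]: Existing camera-based omnidirectional stereo matching methods achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, mainly because real-world data is scarce. The paper proposes DFI-OmniStereo, which embeds a large-scale pre-trained foundation model for relative monocular depth estimation into an iterative optimization-based stereo matching architecture. A dedicated two-stage training strategy first uses the relative monocular depth features for omnidirectional stereo matching and then applies scale-invariant fine-tuning. On the Helvipad dataset, DFI-OmniStereo reduces the disparity mean absolute error (MAE) by approximately 16% compared to the previous best omnidirectional stereo method.

Link: https://arxiv.org/abs/2503.23502
Authors: Jannik Endres, Oliver Hahn, Charles Corbière, Simone Schaub-Meyer, Stefan Roth, Alexandre Alahi
Affiliations: EPFL, Ecole Polytechnique Fédérale de Lausanne; Technical University of Darmstadt, Department of Computer Science; hessian.AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL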

Abstract:Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360° field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.
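
The headline metric here, disparity MAE, is the mean absolute error between predicted and ground-truth disparity over pixels that have valid ground truth. A minimal sketch follows, assuming invalid ground-truth pixels are marked with NaN; the function name and the masking convention are assumptions, not the benchmark's official evaluation code.

```python
import numpy as np

def disparity_mae(pred, gt, valid=None):
    """Mean absolute disparity error over valid ground-truth pixels."""
    if valid is None:
        valid = np.isfinite(gt)          # assumption: NaN marks missing ground truth
    return float(np.mean(np.abs(pred[valid] - gt[valid])))

gt = np.array([[1.0, 2.0], [np.nan, 4.0]])
pred = np.array([[1.5, 2.0], [0.0, 3.0]])
mae = disparity_mae(pred, gt)            # errors 0.5, 0.0, 1.0 on the three valid pixels
```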

[CV-110] Embedding Shift Dissection on CLIP: Effects of Augmentations on VLMs Representation Learning CVPR2025

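[Quick read]: This paper studies how the representations of vision-language models (VLMs) such as CLIP shift under different data augmentations, with the aim of informing mechanistic interpretability. It analyzes the shift in CLIP embeddings under 9 common augmentations: noise, blur, color jitter, scale and rotate, flip, elastic and perspective transforms, random brightness and contrast, and coarse dropout of pixel blocks. The shifts are examined through attention-map similarity, patch similarity, edge preservation, detail preservation, cosine similarity, L2 distance, pairwise distance, and dendrogram clusters, along with qualitative analysis of sample images. The findings suggest that augmentations such as noise, perspective transform, and shift scaling have a more drastic effect on the embeddings. The work is intended as a foundation for future research on VLM robustness and defense against adversarial data.

Link: https://arxiv.org/abs/2503.23495
Authors: Ashim Dahal, Saydul Akbar Murad, Nick Rahimi
Affiliations: University of Southern Mississippi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted at MIV at CVPR 2025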

Abstract:Understanding the representation shift on Vision Language Models like CLIP under different augmentations provides valuable insights on Mechanistic Interpretability. In this study, we show the shift on CLIP’s embeddings on 9 common augmentation techniques: noise, blur, color jitter, scale and rotate, flip, elastic and perspective transforms, random brightness and contrast, and coarse dropout of pixel blocks. We scrutinize the embedding shifts under similarity on attention map, patch, edge, detail preservation, cosine similarity, L2 distance, pairwise distance and dendrogram clusters and provide qualitative analysis on sample images. Our findings suggest certain augmentations like noise, perspective transform and shift scaling have higher degree of drastic impact on embedding shift. This study provides a concrete foundation for future work on VLM’s robustness for mechanical interpretation and adversarial data defense.
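
Two of the listed measures, cosine similarity and L2 distance, are easy to state in code. The sketch below compares an embedding of an original image with an embedding of its augmented version; the vectors are toy placeholders rather than real CLIP outputs, and `embedding_shift` is a name chosen here.

```python
import numpy as np

def embedding_shift(original, augmented):
    """Return (cosine similarity, L2 distance) between two embedding vectors."""
    cosine = float(original @ augmented /
                   (np.linalg.norm(original) * np.linalg.norm(augmented)))
    l2 = float(np.linalg.norm(original - augmented))
    return cosine, l2

emb_original = np.array([1.0, 0.0])      # hypothetical embedding of the clean image
emb_augmented = np.array([0.0, 1.0])     # hypothetical embedding after augmentation
cos_shift, l2_shift = embedding_shift(emb_original, emb_augmented)
cos_same, l2_same = embedding_shift(emb_original, emb_original)
```

A cosine similarity near 1 with a small L2 distance indicates the augmentation barely moved the embedding; lower similarity and larger distance indicate a stronger shift.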

[CV-111] Efficient Dynamic Attention 3D Convolution for Hyperspectral Image Classification

【速读】：该论文旨在解决高光谱图像分类中的三个主要挑战：空间-光谱联合信息利用不足、深度增加导致的梯度消失以及过拟合问题。为提升特征提取效率并跳过冗余信息，论文提出了一种基于改进的3D-DenseNet模型的动态注意力卷积设计（Dynamic Attention Convolution, DAC）。其关键是通过在空间维度上根据图像的空间特性实现自适应特征响应，集中关注关键的空间结构，并在光谱维度上对不同波段进行动态区分，从而缓解因高光谱维数引起的冗余信息和计算复杂性。此外，DAC模块通过基于注意力机制聚合多个卷积核来增强模型的表征能力，而不增加网络的深度或宽度。这种方法在IN、UP和KSC数据集上的推断速度和准确性方面表现出色，优于主流方法。

Link: https://arxiv.org/abs/2503.23472
Authors: Guandong Li, Mengxia Ye
Affiliations: aiFLYTEK; Aegon THTF
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Deep neural networks face several challenges in hyperspectral image classification, including insufficient utilization of joint spatial-spectral information, gradient vanishing with increasing depth, and overfitting. To enhance feature extraction efficiency while skipping redundant information, this paper proposes a dynamic attention convolution design based on an improved 3D-DenseNet model. The design employs multiple parallel convolutional kernels instead of a single kernel and assigns dynamic attention weights to these parallel convolutions. This dynamic attention mechanism achieves adaptive feature response based on spatial characteristics in the spatial dimension of hyperspectral images, focusing more on key spatial structures. In the spectral dimension, it enables dynamic discrimination of different bands, alleviating information redundancy and computational complexity caused by high spectral dimensionality. The DAC module enhances model representation capability by attention-based aggregation of multiple convolutional kernels without increasing network depth or width. The proposed method demonstrates superior performance in both inference speed and accuracy, outperforming mainstream hyperspectral image classification methods on the IN, UP, and KSC datasets.
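
The idea of weighting several parallel kernels with input-dependent attention can be sketched as follows. This follows the general dynamic-convolution pattern and is not the paper's DAC module: the pooling choice, the single linear attention layer `w_att`, and all shapes are assumptions made for illustration.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def aggregate_kernels(x: np.ndarray, kernels: np.ndarray, w_att: np.ndarray):
    """Blend K parallel 3D kernels into one, using attention computed from x.

    x:       (C, D, H, W) input feature volume
    kernels: (K, O, C, kd, kh, kw) parallel convolution kernels
    w_att:   (K, C) hypothetical attention projection
    """
    pooled = x.mean(axis=(1, 2, 3))               # global average pool -> (C,)
    attn = softmax(w_att @ pooled)                # one weight per kernel -> (K,)
    kernel = np.tensordot(attn, kernels, axes=1)  # weighted sum -> (O, C, kd, kh, kw)
    return attn, kernel

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 5, 5))
kernels = rng.normal(size=(3, 8, 4, 3, 3, 3))
w_att = rng.normal(size=(3, 4))
attn, kernel = aggregate_kernels(x, kernels, w_att)
```

Because the blend happens in kernel space, only one convolution is applied per input regardless of K, which is how this family of methods adds capacity without adding depth or width.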

[CV-112] Internal Organ Localization Using Depth Images

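Quick read: This paper addresses patient positioning, a key step in automating magnetic resonance imaging (MRI) workflows and raising patient throughput. It proposes a learning-based framework that infers the approximate positions of internal organs from body-surface depth information captured by an RGB-D camera. The key is training a deep learning model on a large-scale dataset of MRI scans so that it predicts organ positions and shapes from depth images alone, localizing multiple internal structures including bones and soft tissues. The results suggest that integrating RGB-D camera systems into MRI workflows could streamline scanning procedures and improve patient experience through accurate, automated positioning.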

Link: https://arxiv.org/abs/2503.23468
Authors: Eytan Kats, Kai Geißler, Jochen G. Hirsch, Stefan Heldman, Mattias P. Heinrich
Affiliations: Universität zu Lübeck
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted for German Conference on Medical Image Computing 2025 (BVM 2025)

Abstract:Automated patient positioning is a crucial step in streamlining MRI workflows and enhancing patient throughput. RGB-D camera-based systems offer a promising approach to automate this process by leveraging depth information to estimate internal organ positions. This paper investigates the feasibility of a learning-based framework to infer approximate internal organ positions from the body surface. Our approach utilizes a large-scale dataset of MRI scans to train a deep learning model capable of accurately predicting organ positions and shapes from depth images alone. We demonstrate the effectiveness of our method in localization of multiple internal organs, including bones and soft tissues. Our findings suggest that RGB-D camera-based systems integrated into MRI workflows have the potential to streamline scanning procedures and improve patient experience by enabling accurate and automated patient positioning.

[CV-113] OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model

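Quick read: This paper presents OpenDriveVLA, a Vision-Language Action (VLA) model for end-to-end autonomous driving that builds on open-source pre-trained large vision-language models (VLMs) to generate reliable driving actions conditioned on 3D environmental perception, ego vehicle states, and driver commands. Its key contribution is a hierarchical vision-language alignment process that projects 2D and 3D structured visual tokens into a unified semantic space, bridging the modality gap between driving visual representations and language embeddings. In addition, an autoregressive agent-env-ego interaction process models the dynamic relationships among the ego vehicle, surrounding agents, and static road elements, so that trajectory planning is informed both spatially and behaviorally. Together these give OpenDriveVLA state-of-the-art results on open-loop trajectory planning and driving-related question answering on the nuScenes dataset.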

Link: https://arxiv.org/abs/2503.23463
Authors: Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Alois C. Knoll
Affiliations: Technical University of Munich; Ludwig Maximilian University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving. OpenDriveVLA builds upon open-source pre-trained large Vision-Language Models (VLMs) to generate reliable driving actions, conditioned on 3D environmental perception, ego vehicle states, and driver commands. To bridge the modality gap between driving visual representations and language embeddings, we propose a hierarchical vision-language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Besides, OpenDriveVLA models the dynamic relationships between the ego vehicle, surrounding agents, and static road elements through an autoregressive agent-env-ego interaction process, ensuring both spatially and behaviorally informed trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question-answering tasks. Qualitative analyses further illustrate OpenDriveVLA’s superior capability to follow high-level driving commands and robustly generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving. We will release our code to facilitate further research in this domain.

[CV-114] TextCrafter: Accurately Rendering Multiple Texts in Complex Visual Scenes

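Quick read: This paper tackles Complex Visual Text Generation (CVTG): generating intricate text distributed across different regions of an image, where existing models often render text that is distorted, blurred, or missing. It proposes TextCrafter, a multi-visual text rendering method. TextCrafter uses a progressive strategy to decompose complex visual text into distinct components while keeping each piece of text robustly aligned with its visual carrier, and adds a token focus enhancement mechanism that makes visual text more prominent during generation. The paper also introduces CVTG-2K, a benchmark dataset for evaluating generative models on CVTG tasks. Experiments show the method surpasses state-of-the-art approaches on text confusion, omission, and blurriness.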

Link: https://arxiv.org/abs/2503.23461
Authors: Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, Ying Tai
Affiliations: Nanjing University; China Mobile; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:This paper explores the task of Complex Visual Text Generation (CVTG), which centers on generating intricate textual content distributed across diverse regions within visual images. In CVTG, image generation models often render distorted and blurred visual text or miss some visual text. To tackle these challenges, we propose TextCrafter, a novel multi-visual text rendering method. TextCrafter employs a progressive strategy to decompose complex visual text into distinct components while ensuring robust alignment between textual content and its visual carrier. Additionally, it incorporates a token focus enhancement mechanism to amplify the prominence of visual text during the generation process. TextCrafter effectively addresses key challenges in CVTG tasks, such as text confusion, omissions, and blurriness. Moreover, we present a new benchmark dataset, CVTG-2K, tailored to rigorously evaluate the performance of generative models on CVTG tasks. Extensive experiments demonstrate that our method surpasses state-of-the-art approaches.

[CV-115] Reinforcement Learning-based Token Pruning in Vision Transformers: A Markov Game Approach ICME

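Quick read: Vision Transformers (ViTs) have computational costs that scale quadratically with the number of tokens, which calls for effective token pruning. Most existing pruning policies are handcrafted, do not adapt to varying inputs, and ignore the sequential nature of pruning across multiple layers. This paper is, to the authors' knowledge, the first to use reinforcement learning (RL) to learn a data-adaptive pruning policy: it models token pruning as a Markov Game and applies Multi-Agent Proximal Policy Optimization (MAPPO), with each agent making an individualized pruning decision for a single token. Reward functions are designed so that the agents both collaborate and compete, balancing efficiency against accuracy. On ImageNet-1k, the method speeds up inference by up to 44% with an accuracy drop of only 0.4%.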

Link: https://arxiv.org/abs/2503.23459
Authors: Chenglong Lu, Shen Liang, Xuewei Wang, Wei Wang
Affiliations: School of Computer Science, Fudan University, China; Data Intelligence Institute of Paris (diiP) & LIPADE, Université Paris Cité, Paris, France; State Key Laboratory of Mechanical Behavior and System Safety of Traffic Engineering, Shijiazhuang Tiedao University, Shijiazhuang, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by IEEE International Conference on Multimedia and Expo (ICME) 2025

Abstract:Vision Transformers (ViTs) have computational costs scaling quadratically with the number of tokens, calling for effective token pruning policies. Most existing policies are handcrafted, lacking adaptivity to varying inputs. Moreover, they fail to consider the sequential nature of token pruning across multiple layers. In this work, for the first time (as far as we know), we exploit Reinforcement Learning (RL) to data-adaptively learn a pruning policy. Formulating token pruning as a sequential decision-making problem, we model it as a Markov Game and utilize Multi-Agent Proximal Policy Optimization (MAPPO) where each agent makes an individualized pruning decision for a single token. We also develop reward functions that enable simultaneous collaboration and competition of these agents to balance efficiency and accuracy. On the well-known ImageNet-1k dataset, our method improves the inference speed by up to 44% while incurring only a negligible accuracy drop of 0.4%. The source code is available at this https URL.
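
The quadratic scaling that motivates pruning can be made concrete with a little arithmetic. The sketch below counts only the two token-by-token matrix products in self-attention and applies a fixed keep ratio after each layer; the function names, the fixed ratios, and the 197-token, 384-dimension example are assumptions for illustration, whereas the paper learns per-token decisions with RL.

```python
def attention_cost(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for QK^T plus the attention-weighted sum of V: 2 * N^2 * d."""
    return 2 * num_tokens * num_tokens * dim

def pruned_cost(num_tokens: int, dim: int, keep_ratios: list) -> int:
    """Total attention cost over layers when a fraction of tokens survives each layer."""
    total = 0
    n = num_tokens
    for ratio in keep_ratios:
        total += attention_cost(n, dim)   # cost of this layer with the tokens it receives
        n = max(1, int(n * ratio))        # tokens passed on to the next layer
    return total

baseline = pruned_cost(197, 384, [1.0] * 12)   # no pruning
pruned = pruned_cost(197, 384, [0.9] * 12)     # keep 90% of tokens after every layer
saving = 1 - pruned / baseline
```

Halving the token count cuts this term to a quarter, which is why early pruning pays off most. Projections and MLP blocks scale linearly with the token count and are left out of this estimate.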

[CV-116] CADFormer: Fine-Grained Cross-modal Alignment and Decoding Transformer for Referring Remote Sensing Image Segmentation

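Quick read: This paper addresses Referring Remote Sensing Image Segmentation (RRSIS), the task of segmenting specific target objects in remote sensing images from a language expression. Existing methods typically obtain multimodal features through coarse-grained, unidirectional alignment and overlook the role of language features as context during decoding. The result is weak object-level correspondence between visual and language features and incomplete or erroneous masks, especially for complex expressions and intricate scenes. The paper proposes CADFormer, a fine-grained cross-modal alignment and decoding Transformer. A semantic mutual guidance alignment module (SMGAM) performs both vision-to-language and language-to-vision alignment to integrate visual and textual features, and a textual-enhanced cross-modal decoder (TCMD) feeds refined textual information into decoding as context to strengthen the relationship between cross-modal features. To evaluate CADFormer on inconspicuous targets in complex scenes, the authors build a new dataset, RRSIS-HR, with larger high-resolution image patches and semantically richer expressions; experiments on RRSIS-HR and RRSIS-D show its effectiveness.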

Link: https://arxiv.org/abs/2503.23456
Authors: Maofu Liu, Xin Jiang, Xiaokang Zhang
Affiliations: School of Computer Science and Technology, Wuhan University of Science and Technology; School of Information Science and Engineering, Wuhan University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task, aiming to segment specific target objects in remote sensing (RS) images based on a given language expression. Existing RRSIS methods typically employ coarse-grained unidirectional alignment approaches to obtain multimodal features, and they often overlook the critical role of language features as contextual information during the decoding process. Consequently, these methods exhibit weak object-level correspondence between visual and language features, leading to incomplete or erroneous predicted masks, especially when handling complex expressions and intricate RS image scenes. To address these challenges, we propose a fine-grained cross-modal alignment and decoding Transformer, CADFormer, for RRSIS. Specifically, we design a semantic mutual guidance alignment module (SMGAM) to achieve both vision-to-language and language-to-vision alignment, enabling comprehensive integration of visual and textual features for fine-grained cross-modal alignment. Furthermore, a textual-enhanced cross-modal decoder (TCMD) is introduced to incorporate language features during decoding, using refined textual information as context to enhance the relationship between cross-modal features. To thoroughly evaluate the performance of CADFormer, especially for inconspicuous targets in complex scenes, we constructed a new RRSIS dataset, called RRSIS-HR, which includes larger high-resolution RS image patches and semantically richer language expressions. Extensive experiments on the RRSIS-HR dataset and the popular RRSIS-D dataset demonstrate the effectiveness and superiority of CADFormer. Datasets and source codes will be available at this https URL.

[CV-117] Efficient Token Compression for Vision Transformer with Spatial Information Preserved

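Quick read: This paper aims to reduce the computational and memory requirements of transformer models so they can be deployed in resource-constrained environments, proposing an efficient, hardware-compatible token compression method called Prune and Merge. The method integrates token pruning and merging into the transformer to compress tokens layer by layer. Trainable merge and reconstruct matrices, combined with shortcut connections, merge tokens efficiently while preserving important information and allowing pruned tokens to be restored. A gradient-weighted attention scoring mechanism computes token importance during training, so no separate importance computation is needed at inference. Gradient information is also used to capture the global impact of tokens and to identify compression structures automatically. On ImageNet-1k and ADE20K the method outperforms existing techniques; on DeiT-Small, for example, it reaches a 1.64× speed-up with only a 0.2% accuracy drop on ImageNet-1k.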

Link: https://arxiv.org/abs/2503.23455
Authors: Junzhu Mao, Yang Shen, Jinyang Guo, Yazhou Yao, Xiansheng Hua
Affiliations: School of Computer Science and Engineering, Nanjing University of Science and Technology; State Key Laboratory of Software Development Environment, Institute of Artificial Intelligence, Beihang University; Terminus Group; School of Computer Science and Engineering, University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Notes: accepted by IEEE Transactions on Multimedia

Abstract:Token compression is essential for reducing the computational and memory requirements of transformer models, enabling their deployment in resource-constrained environments. In this work, we propose an efficient and hardware-compatible token compression method called Prune and Merge. Our approach integrates token pruning and merging operations within transformer models to achieve layer-wise token compression. By introducing trainable merge and reconstruct matrices and utilizing shortcut connections, we efficiently merge tokens while preserving important information and enabling the restoration of pruned tokens. Additionally, we introduce a novel gradient-weighted attention scoring mechanism that computes token importance scores during the training phase, eliminating the need for separate computations during inference and enhancing compression efficiency. We also leverage gradient information to capture the global impact of tokens and automatically identify optimal compression structures. Extensive experiments on the ImageNet-1k and ADE20K datasets validate the effectiveness of our approach, achieving significant speed-ups with minimal accuracy degradation compared to state-of-the-art methods. For instance, on DeiT-Small, we achieve a 1.64× speed-up with only a 0.2% drop in accuracy on ImageNet-1k. Moreover, by compressing segmenter models and comparing with existing methods, we demonstrate the superior performance of our approach in terms of efficiency and effectiveness. Code and models have been made available at this https URL.

[CV-118] Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning

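Quick read: This paper addresses two weaknesses of remote sensing image captioning: limited semantic accuracy and difficulty locating the objects most relevant to the image context. Existing methods emphasize fine-grained visual feature extraction and global information capture but overlook how textual information can complement visual semantics. The paper proposes semantic-spatial feature fusion with dynamic graph refinement (SFDR), which combines a semantic-spatial feature fusion (SSFF) module with a dynamic graph feature refinement (DGFR) module. SSFF builds a multi-level feature representation from pre-trained CLIP features, grid features, and region-of-interest (ROI) features to integrate semantic and spatial information. DGFR uses a graph attention network to capture relationships between feature nodes, and a dynamic weighting mechanism to prioritize the objects most relevant to the current scene while suppressing less significant ones. Experiments on three benchmark datasets show that the method improves the quality of the generated descriptions.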

Link: https://arxiv.org/abs/2503.23453
Authors: Maofu Liu, Jiahui Liu, Xiaokang Zhang
Affiliations: School of Computer Science and Technology, Wuhan University of Science and Technology; School of Information Science and Engineering, Wuhan University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Remote sensing image captioning aims to generate semantically accurate descriptions that are closely linked to the visual features of remote sensing images. Existing approaches typically emphasize fine-grained extraction of visual features and capturing global information. However, they often overlook the complementary role of textual information in enhancing visual semantics and face challenges in precisely locating objects that are most relevant to the image context. To address these challenges, this paper presents a semantic-spatial feature fusion with dynamic graph refinement (SFDR) method, which integrates the semantic-spatial feature fusion (SSFF) and dynamic graph feature refinement (DGFR) modules. The SSFF module utilizes a multi-level feature representation strategy by leveraging pre-trained CLIP features, grid features, and ROI features to integrate rich semantic and spatial information. In the DGFR module, a graph attention network captures the relationships between feature nodes, while a dynamic weighting mechanism prioritizes objects that are most relevant to the current scene and suppresses less significant ones. Therefore, the proposed SFDR method significantly enhances the quality of the generated descriptions. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method. The source code will be available at this https URL.

[CV-119] VideoGen-Eval: Agent-based System for Video Generation Evaluation

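Quick read: Existing video generation evaluation systems can no longer assess state-of-the-art models, for three reasons: simple prompts fail to showcase model capabilities, fixed evaluation operators struggle with out-of-distribution (OOD) cases, and computed metrics are misaligned with human preferences. This paper proposes VideoGen-Eval, an agent-based evaluation system that integrates LLM-based content structuring, MLLM-based (multimodal LLM) content judgment, and patch tools designed for temporal-dense dimensions, making evaluation dynamic, flexible, and expandable. It also introduces a benchmark of 700 structured, content-rich prompts (both text-to-video, T2V, and image-to-video, I2V) and more than 12,000 videos generated by over 20 models, 8 of which are selected for quantitative comparison between the agent and human raters. Experiments show that the agent-based system aligns strongly with human preferences and completes evaluations reliably.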

Link: https://arxiv.org/abs/2503.23452
Authors: Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha
Affiliations: USTC (University of Science and Technology of China); SJTU (Shanghai Jiao Tong University); PKUSZ (Peking University Shenzhen Graduate School); Tencent; BFA (Beijing Film Academy)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: project: this https URL

Abstract:The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase the model’s capabilities, fixed evaluation operators struggling with Out-of-Distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge the gap, we propose VideoGen-Eval, an agent evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporal-dense dimensions, to achieve a dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by 20+ models, among them, 8 cutting-edge models are selected as quantitative evaluation for the agent and human. Extensive experiments validate that our proposed agent-based evaluation system demonstrates strong alignment with human preferences and reliably completes the evaluation, as well as the diversity and richness of the benchmark.

[CV-120] Beyond Academic Benchmarks: Critical Analysis and Best Practices for Visual Industrial Anomaly Detection

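Quick read: This paper addresses the disconnect between academic research and industrial use in visual anomaly detection (VAD). Popular datasets are produced in controlled lab environments with artificially created defects, and new methods often fail in production, either degrading sharply or demanding impractical computational resources. The paper makes three contributions: (1) it demonstrates the importance of real-world datasets and establishes benchmarks on actual production data; (2) it compares existing state-of-the-art methods fairly across diverse tasks using metrics that matter in practical applications; and (3) it analyzes recent advances, discussing open challenges and new perspectives for bridging the academia-industry gap. The code is publicly available.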

Link: https://arxiv.org/abs/2503.23451
Authors: Aimira Baitieva, Yacine Bouaouni, Alexandre Briot, Dick Ameln, Souhaiel Khalfaoui, Samet Akcay
Affiliations: Valeo; Intel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Anomaly detection (AD) is essential for automating visual inspection in manufacturing. This field of computer vision is rapidly evolving, with increasing attention towards real-world applications. Meanwhile, popular datasets are typically produced in controlled lab environments with artificially created defects, unable to capture the diversity of real production conditions. New methods often fail in production settings, showing significant performance degradation or requiring impractical computational resources. This disconnect between academic results and industrial viability threatens to misdirect visual anomaly detection research. This paper makes three key contributions: (1) we demonstrate the importance of real-world datasets and establish benchmarks using actual production data, (2) we provide a fair comparison of existing SOTA methods across diverse tasks by utilizing metrics that are valuable for practical applications, and (3) we present a comprehensive analysis of recent advancements in this field by discussing important challenges and new perspectives for bridging the academia-industry gap. The code is publicly available at this https URL

[CV-121] AU-TTT: Vision Test-Time Training model for Facial Action Unit Detection

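Quick read: This paper addresses overfitting in facial Action Unit (AU) detection, which stems from costly AU annotation and limited datasets and causes substantial performance drops across datasets. It also notes that Transformers, though effective for long-context modeling, are hindered by the quadratic complexity of self-attention. The paper proposes AU-TTT, a vision backbone for AU detection built from bidirectional Test-Time Training (TTT) blocks. It introduces TTT Linear to the AU detection task and optimizes the image scanning mechanism; because TTT applies self-supervised updates during both training and inference, it offers a potential route to better generalization. An AU-specific Region of Interest (RoI) scanning mechanism captures the fine-grained facial features that AU detection depends on. Experiments show competitive performance in both within-domain and cross-domain settings.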

Link: https://arxiv.org/abs/2503.23450
Authors: Bohao Xing, Kaishen Yuan, Zitong Yu, Xin Liu, Heikki Kälviäinen
Affiliations: Lappeenranta-Lahti University of Technology LUT; The Hong Kong University of Science and Technology (Guangzhou); Great Bay University; Rensselaer Polytechnic Institute; Brno University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Facial Action Units (AUs) detection is a cornerstone of objective facial expression analysis and a critical focus in affective computing. Despite its importance, AU detection faces significant challenges, such as the high cost of AU annotation and the limited availability of datasets. These constraints often lead to overfitting in existing methods, resulting in substantial performance degradation when applied across diverse datasets. Addressing these issues is essential for improving the reliability and generalizability of AU detection methods. Moreover, many current approaches leverage Transformers for their effectiveness in long-context modeling, but they are hindered by the quadratic complexity of self-attention. Recently, Test-Time Training (TTT) layers have emerged as a promising solution for long-sequence modeling. Additionally, TTT applies self-supervised learning for iterative updates during both training and inference, offering a potential pathway to mitigate the generalization challenges inherent in AU detection tasks. In this paper, we propose a novel vision backbone tailored for AU detection, incorporating bidirectional TTT blocks, named AU-TTT. Our approach introduces TTT Linear to the AU detection task and optimizes image scanning mechanisms for enhanced performance. Additionally, we design an AU-specific Region of Interest (RoI) scanning mechanism to capture fine-grained facial features critical for AU detection. Experimental results demonstrate that our method achieves competitive performance in both within-domain and cross-domain scenarios.

[CV-122] CA2ST: Cross-Attention in Audio Space and Time for Holistic Video Recognition

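Quick read: This paper addresses the unbalanced spatial and temporal understanding of most video recognition models. It proposes Cross-Attention in Audio, Space, and Time (CA²ST), a transformer-based method for holistic video recognition. Its core is Cross-Attention in Space and Time (CAST), a two-stream architecture that uses only RGB input. In each CAST layer, a Bottleneck Cross-Attention (B-CA) module lets spatial and temporal experts exchange information and make synergistic predictions. For holistic video understanding, CAST is extended with an audio expert to form Cross-Attention in Visual and Audio (CAVA), whose results across audio-visual benchmarks demonstrate effective information exchange among multiple experts. In short, CA²ST combines spatial, temporal, and audio experts through cross-attention to reach balanced and holistic video understanding.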

Link: https://arxiv.org/abs/2503.23447
Authors: Jongseo Lee, Joohyun Chang, Dongho Lee, Jinwoo Choi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 27 pages including appendix, TPAMI under review

Abstract:We propose Cross-Attention in Audio, Space, and Time (CA²ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate the CAST on benchmarks with different characteristics, EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400, consistently showing balanced performance. We also validate the CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. With a favorable performance of CAVA across these datasets, we demonstrate the effective information exchange among multiple experts within the B-CA module. In summary, CA²ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.
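
The information exchange between experts rests on cross-attention, which can be sketched in a few lines. This is plain scaled dot-product cross-attention with no learned projections and no bottleneck, so it is not the paper's B-CA module; the token counts and the residual update are assumptions for illustration.

```python
import numpy as np

def cross_attention(queries: np.ndarray, context: np.ndarray):
    """Let one expert's tokens (queries) attend to another expert's tokens (context).

    queries: (Nq, d), context: (Nk, d). Returns the attended features (Nq, d)
    and the attention weights (Nq, Nk).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over context tokens
    return weights @ context, weights

rng = np.random.default_rng(0)
spatial = rng.normal(size=(6, 32))    # hypothetical spatial-expert tokens
temporal = rng.normal(size=(10, 32))  # hypothetical temporal-expert tokens

from_temporal, weights = cross_attention(spatial, temporal)
spatial_updated = spatial + from_temporal   # residual exchange in one direction
```

Running the same call with the arguments swapped gives the other direction of the exchange, and an audio expert would be added by attending to a third token set in the same way.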

[CV-123] Improving underwater semantic segmentation with underwater image quality attention and muti-scale aggregation attention

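Quick read: This paper addresses the degradation of underwater semantic segmentation caused by low illumination, which lowers imaging quality and especially harms the delineation of object region boundaries. It presents UnderWater SegFormer (UWSegFormer), a transformer-based framework with three key components: (1) an Underwater Image Quality Attention (UIQA) module, which uses channel self-attention to strengthen the representation of high-quality semantic information; (2) a Multi-scale Aggregation Attention (MAA) module, which extracts discriminative information from high-level features to aggregate semantic features at different scales, compensating for lost detail in underwater objects; and (3) an Edge Learning Loss (ELL), applied during training to improve learning of object edges and prediction accuracy. Experiments on the SUIM and DUT-USEG datasets show advantages in segmentation completeness, boundary clarity, and subjective perceptual detail, with mIoU of 82.12 and 71.41 respectively.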

Link: https://arxiv.org/abs/2503.23422
Authors: Xin Zuo, Jiaran Jiang, Jifeng Shen, Wankou Yang
Affiliations: School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, 212003, China; School of Electronic and Informatics Engineering, Jiangsu University, Zhenjiang, 212013, China; School of Automation, Southeast University, Nanjing, Jiangsu, 210096, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by Pattern Analysis and Applications

Abstract:Underwater image understanding is crucial for both submarine navigation and seabed exploration. However, the low illumination in underwater environments degrades the imaging quality, which in turn seriously deteriorates the performance of underwater semantic segmentation, particularly for outlining the object region boundaries. To tackle this issue, we present UnderWater SegFormer (UWSegFormer), a transformer-based framework for semantic segmentation of low-quality underwater images. Firstly, we propose the Underwater Image Quality Attention (UIQA) module. This module enhances the representation of high-quality semantic information in underwater image feature channels through a channel self-attention mechanism. In order to address the issue of loss of imaging details due to the underwater environment, the Multi-scale Aggregation Attention (MAA) module is proposed. This module aggregates sets of semantic features at different scales by extracting discriminative information from high-level features, thus compensating for the semantic loss of detail in underwater objects. Finally, during training, we introduce Edge Learning Loss (ELL) in order to enhance the model’s learning of underwater object edges and improve the model’s prediction accuracy. Experiments conducted on the SUIM and DUT-USEG (DUT) datasets have demonstrated that the proposed method has advantages in terms of segmentation completeness, boundary clarity, and subjective perceptual details when compared to SOTA methods. In addition, the proposed method achieves the highest mIoU of 82.12 and 71.41 on the SUIM and DUT datasets, respectively. Code will be available at this https URL.
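
The mIoU figures quoted above are mean intersection-over-union scores. A minimal sketch of the metric on a single label map follows, assuming integer class labels; the tiny arrays are made up for illustration, and benchmark numbers are normally accumulated over a whole dataset and reported as percentages.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Average per-class intersection-over-union, skipping classes absent from both maps."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class appears in neither map, so it has no defined IoU
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

target = np.array([[0, 0, 1],
                   [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)   # per-class IoUs: 1/3, 2/3, 1/2
```

Because every class counts equally, small or thin classes such as object boundaries pull the average down quickly when they are missed, which is why the metric is sensitive to the boundary errors the paper targets.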

[CV-124] Visual Acuity Consistent Foveated Rendering towards Retinal Resolution

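Quick read: In prior foveated rendering methods, the shading load grows with display resolution, so efficiency falls off, particularly at retinal-level resolutions. Starting from how the human visual system (HVS) perceives, this paper presents visual acuity-consistent foveated rendering (VaFR), which targets high rendering performance at retinal-level resolutions. The key is a novel log-polar mapping function derived from a model of human visual acuity, which matches the natural bandwidth of the visual system. This mapping and its associated shading rate keep the amount of rendered information consistent regardless of display resolution. VaFR improves rendering speed while preserving perceptual visual quality: at retinal resolution it achieves speed-ups of 6.5× to 9.29× for deferred rendering and 10.4× to 16.4× for ray casting, and it substantially improves binocular 8K path tracing.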

Link: https://arxiv.org/abs/2503.23410
Authors: Zhi Zhang, Meng Gai, Sheng Li
Affiliations: School of Software and Microelectronics, Peking University, China; School of Computer Science, Peking University, China; National Key Laboratory of Intelligent Parallel Technology
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Prior foveated rendering methods often suffer from a limitation where the shading load escalates with increasing display resolution, leading to decreased efficiency, particularly when dealing with retinal-level resolutions. To tackle this challenge, we begin with the essence of the human visual system (HVS) perception and present visual acuity-consistent foveated rendering (VaFR), aiming to achieve exceptional rendering performance at retinal-level resolutions. Specifically, we propose a method with a novel log-polar mapping function derived from the human visual acuity model, which accommodates the natural bandwidth of the visual system. This mapping function and its associated shading rate guarantee a consistent output of rendering information, regardless of variations in the display resolution of the VR HMD. Consequently, our VaFR outperforms alternative methods, improving rendering speed while preserving perceptual visual quality, particularly when operating at retinal resolutions. We validate our approach using both the rasterization and ray-casting rendering pipelines. We also validate our approach using different binocular rendering strategies for HMD devices. In diverse testing scenarios, our approach delivers better perceptual visual quality than prior foveated rendering while achieving an impressive speedup of 6.5×-9.29× for deferred rendering of 3D scenarios and an even more powerful speedup of 10.4×-16.4× for ray-casting at retinal resolution. Additionally, our approach significantly enhances the rendering performance of binocular 8K path tracing, achieving smooth frame rates.
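
For readers unfamiliar with log-polar foveation, the sketch below shows the plain log-polar mapping around a gaze point and its inverse. It is not the acuity-derived function the paper proposes; the function names, the one-pixel clamp, and the `r_max` parameter are assumptions for illustration.

```python
import math

def to_log_polar(x: float, y: float, r_max: float):
    """Map a screen offset from the gaze point to normalized log-polar coordinates (u, v)."""
    r = max(math.hypot(x, y), 1.0)          # clamp at 1 pixel to avoid log(0)
    u = math.log(r) / math.log(r_max)       # 0 at one pixel from the gaze, 1 at r_max
    v = (math.atan2(y, x) % (2 * math.pi)) / (2 * math.pi)
    return u, v

def from_log_polar(u: float, v: float, r_max: float):
    """Inverse mapping back to a screen offset from the gaze point."""
    r = math.exp(u * math.log(r_max))
    theta = v * 2 * math.pi
    return r * math.cos(theta), r * math.sin(theta)

u, v = to_log_polar(30.0, 40.0, r_max=1000.0)
x_back, y_back = from_log_polar(u, v, r_max=1000.0)
```

Equal steps in `u` cover radial bands whose width grows geometrically with eccentricity, so a uniformly sampled log-polar buffer spends most of its samples near the gaze point. Offsets closer than one pixel to the gaze collapse onto the clamp and do not round-trip.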

[CV-125] GMapLatent: Geometric Mapping in Latent Space

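Quick read: This paper addresses domain alignment in cross-domain generative models built on encoder-decoder architectures. Conventional alignment methods operate directly on the initial distribution, but mismatched or mixed clusters can cause mode collapse and mixture problems in the decoder, hurting generalization. The paper proposes GMapLatent, a cross-domain alignment and generation model that introduces a canonical latent space representation based on geometric mapping, aligning cross-domain latent spaces rigorously and precisely. The core idea is to align latent spaces under strict cluster correspondence constraints using canonical parameterizations of cluster-decorated latent spaces. The method (1) transforms each latent space to a canonical parameter domain by composing barycenter translation, optimal transport merging, and constrained harmonic mapping, and then (2) computes a geometric registration with cluster constraints over the canonical parameter domains. This yields a bijective mapping between the transformed latent spaces and a precise alignment of cluster pairs. Cross-domain generation then runs through the aligned latent spaces embedded in the encoder-decoder pipeline. Experiments on gray-scale and color images validate its efficiency, efficacy, and applicability, and show better performance than existing models.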

Link: https://arxiv.org/abs/2503.23407
Authors: Wei Zeng, Xuebin Chang, Jianghao Su, Xiang Gu, Jian Sun, Zongben Xu
Affiliations: Dept. of Information Science, School of Mathematics and Statistics, Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes:

Abstract:Cross-domain generative models based on encoder-decoder AI architectures have attracted much attention in generating realistic images, where domain alignment is crucial for generation accuracy. Domain alignment methods usually deal directly with the initial distribution; however, mismatched or mixed clusters can lead to mode collapse and mixture problems in the decoder, compromising model generalization capabilities. In this work, we innovate a cross-domain alignment and generation model that introduces a canonical latent space representation based on geometric mapping to align the cross-domain latent spaces in a rigorous and precise manner, thus avoiding mode collapse and mixture in the encoder-decoder generation architectures. We name this model GMapLatent. The core of the method is to seamlessly align latent spaces with strict cluster correspondence constraints using the canonical parameterizations of cluster-decorated latent spaces. We first (1) transform the latent space to a canonical parameter domain by composing barycenter translation, optimal transport merging and constrained harmonic mapping, and then (2) compute geometric registration with cluster constraints over the canonical parameter domains. This process realizes a bijective (one-to-one and onto) mapping between newly transformed latent spaces and generates a precise alignment of cluster pairs. Cross-domain generation is then achieved through the aligned latent spaces embedded in the encoder-decoder pipeline. Experiments on gray-scale and color images validate the efficiency, efficacy and applicability of GMapLatent, and demonstrate that the proposed model has superior performance over existing models.

[CV-126] Diffusion Meets Few-shot Class Incremental Learning

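Quick read: This paper addresses the challenge of few-shot class-incremental learning (FSCIL): mitigating catastrophic forgetting while learning new information from extremely limited training data. It proposes Diffusion-FSCIL, which uses a frozen text-to-image diffusion model as a fixed backbone. The key idea is to exploit the capabilities of a large generative model: (1) generation ability acquired through large-scale pre-training, (2) multi-scale representation, and (3) representational flexibility through the text encoder. To maximize representation capability, the authors extract multiple complementary diffusion features to serve as latent replay, supported by lightweight feature distillation to prevent generative bias. The framework achieves efficiency through a frozen backbone, minimal trainable components, and batched extraction of multiple features. Experiments show that Diffusion-FSCIL surpasses state-of-the-art methods on CUB-200, miniImageNet, and CIFAR-100, preserving performance on previously learned classes while adapting effectively to new ones.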

Link: https://arxiv.org/abs/2503.23402
Authors: Junsu Kim, Yunhoe Ku, Dongyoon Han, Seungryul Baek
Affiliations: UNIST; NAVER AI Lab; DeepBrain AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: pre-print

Abstract:Few-shot class-incremental learning (FSCIL) is challenging because a model must reduce catastrophic forgetting and learn new information from extremely limited training data. We propose Diffusion-FSCIL, a novel approach that employs a text-to-image diffusion model as a frozen backbone. Our conjecture is that FSCIL can be tackled using a large generative model’s capabilities, benefiting from 1) generation ability via large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize the representation capability, we propose to extract multiple complementary diffusion features that act as latent replay, with slight support from feature distillation to prevent generative biases. Our framework realizes efficiency through 1) using a frozen backbone; 2) minimal trainable components; 3) batch processing of multiple feature extractions. Extensive experiments on CUB-200, miniImageNet, and CIFAR-100 show that Diffusion-FSCIL surpasses state-of-the-art methods, preserving performance on previously learned classes and adapting effectively to new ones.

[CV-127] A Large Scale Analysis of Gender Biases in Text-to-Image Generative Models

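Quick read: This paper studies gender bias in text-to-image (T2I) models, focusing on gender associations in everyday situations. The core of the approach is a dataset of 3,217 gender-neutral prompts, with 200 images generated per prompt from five leading T2I models; after filtering, 2,293,295 images remain. The perceived gender of people in the generated images is detected automatically, prompts are grouped into semantically similar concepts, and the proportion of male- and female-gendered images is computed for each prompt. The analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes, and underrepresent women in finance-related activities; women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical-labor scenarios.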

Link: https://arxiv.org/abs/2503.23398
Authors: Leander Girrbach, Stephan Alaniz, Genevieve Smith, Zeynep Akata
Affiliations: Technical University of Munich; Munich Center for Machine Learning; MDSI; Helmholtz Munich; University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Notes:

Abstract:With the increasing use of image generation technology, understanding its social biases, including gender bias, is essential. This paper presents the first large-scale study on gender bias in text-to-image (T2I) models, focusing on everyday situations. While previous research has examined biases in occupations, we extend this analysis to gender associations in daily activities, objects, and contexts. We create a dataset of 3,217 gender-neutral prompts and generate 200 images per prompt from five leading T2I models. We automatically detect the perceived gender of people in the generated images and filter out images with no person or multiple people of different genders, leaving 2,293,295 images. To enable a broad analysis of gender bias in T2I models, we group prompts into semantically similar concepts and calculate the proportion of male- and female-gendered images for each prompt. Our analysis shows that T2I models reinforce traditional gender roles, reflect common gender stereotypes in household roles, and underrepresent women in financial related activities. Women are predominantly portrayed in care- and human-centered scenarios, and men in technical or physical labor scenarios.
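
The per-prompt statistic described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the record format, the function name `gender_shares`, and the toy prompts are assumptions, and the perceived-gender labels are assumed to come from an upstream detector after filtering.

```python
from collections import Counter, defaultdict


def gender_shares(records):
    """Return {prompt: (male_share, female_share)}.

    `records` is an iterable of (prompt, perceived_gender) pairs, where
    perceived_gender is "male" or "female". This format is hypothetical;
    images without a single detected gender are assumed to be filtered out.
    """
    counts = defaultdict(Counter)
    for prompt, gender in records:
        counts[prompt][gender] += 1

    shares = {}
    for prompt, c in counts.items():
        total = c["male"] + c["female"]
        if total == 0:
            continue  # no usable images for this prompt
        shares[prompt] = (c["male"] / total, c["female"] / total)
    return shares


# Toy data, invented for illustration only.
records = (
    [("a person cooking", "female")] * 3
    + [("a person cooking", "male")] * 1
    + [("a person repairing a car", "male")] * 4
)
shares = gender_shares(records)
```

Grouping prompts into semantically similar concepts would then amount to averaging these shares over the prompts assigned to each concept.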

[CV-128] COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation CVPR2025

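Quick read: This paper addresses the difficulty that vision-language models (VLMs) face when adapting to novel domains at test time. Cache-based methods show promise by leveraging historical information, but they have two main limitations: they cache unreliable feature-label pairs, and they rely indiscriminately on single-class information during querying, which significantly reduces adaptation accuracy. To overcome these limitations, the paper proposes COSMIC (Clique-Oriented Semantic Multi-space Integration for CLIP), a robust test-time adaptation framework. Its key innovations are two components, the Dual Semantics Graph (DSG) and the Clique Guided Hyper-class (CGH). DSG constructs complementary semantic spaces by combining textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships; building on these graphs, CGH leverages structured class relationships to make predictions more robust through correlated class selection. Experiments show that COSMIC outperforms existing methods on multiple benchmarks, with a 15.81% gain on out-of-distribution tasks and a 5.33% gain on cross-domain generation with CLIP RN-50.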

Link: https://arxiv.org/abs/2503.23388
Authors: Fanding Huang, Jingyan Jiang, Qinting Jiang, Hebei Li, Faisal Nadeem Khan, Zhi Wang
Affiliations: Shenzhen International Graduate School, Tsinghua University; Shenzhen Technology University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Notes: Accepted to CVPR 2025

Abstract:Recent vision-language models (VLMs) face significant challenges in test-time adaptation to novel domains. While cache-based methods show promise by leveraging historical information, they struggle with both caching unreliable feature-label pairs and indiscriminately using single-class information during querying, significantly compromising adaptation accuracy. To address these limitations, we propose COSMIC (Clique-Oriented Semantic Multi-space Integration for CLIP), a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching and graph-based querying mechanisms. Our framework introduces two key innovations: Dual Semantics Graph (DSG) and Clique Guided Hyper-class (CGH). The Dual Semantics Graph constructs complementary semantic spaces by incorporating textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships. Building upon these dual graphs, the Clique Guided Hyper-class component leverages structured class relationships to enhance prediction robustness through correlated class selection. Extensive experiments demonstrate COSMIC’s superior performance across multiple benchmarks, achieving significant improvements over state-of-the-art methods: 15.81% gain on out-of-distribution tasks and 5.33% on cross-domain generation with CLIP RN-50. Code is available at this http URL.
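
To make the term "clique" concrete, the sketch below groups classes whose pairwise similarity passes a threshold and enumerates the maximal cliques of the resulting graph. It is only an illustration of the graph-theoretic idea: how COSMIC actually builds its graphs or selects correlated classes is not stated in the abstract, and the similarity values, the threshold, and both function names are assumptions.

```python
def similarity_graph(similarity, threshold):
    """Connect two classes when their similarity reaches the threshold."""
    adjacency = {}
    for (a, b), score in similarity.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))  # nothing can extend this clique
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return sorted(cliques)


# Hypothetical class-to-class similarities (for example, cosine similarity
# between class embeddings); the numbers are invented.
similarity = {
    ("cat", "dog"): 0.90,
    ("cat", "wolf"): 0.80,
    ("dog", "wolf"): 0.85,
    ("car", "truck"): 0.90,
    ("cat", "car"): 0.10,
    ("dog", "truck"): 0.20,
}
hyper_classes = maximal_cliques(similarity_graph(similarity, threshold=0.7))
```

Each clique is a set of mutually similar classes, which is one plausible reading of a "hyper-class" whose members can be queried together rather than one class at a time.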

[CV-129] Enhancing Human Motion Prediction via Multi-range Decoupling Decoding with Gating-adjusting Aggregation

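Quick read: This paper addresses how motion representation learning in human motion prediction (HMP) is affected by the varying temporal correlation between historical information and future moments. Deep learning-based methods have shown promise in learning motion representations, but they tend to overlook that this correlation is stronger for short-term predictions and weaker for distant-future predictions. This limitation hinders effective learning of motion representations and hurts prediction performance.

The key solution is a new method called Multi-Range Decoupling Decoding with Gating-Adjusting Aggregation (MD2GA), which uses temporal correlations to refine motion representation learning. The method follows a two-stage strategy: in the first stage, multi-range decoupling decoding decodes shared features into distinct future lengths, so that different decoders offer diverse insights into motion patterns; in the second stage, gating-adjusting aggregation dynamically combines these insights, guided by the input motion data. Experiments show that the method can be easily integrated into other motion prediction methods and improves their prediction performance.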

Link: https://arxiv.org/abs/2503.23381
Authors: Jiexin Wang, Wenwen Qiang, Zhao Yang, Bing Su
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Institute of Software, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Expressive representation of pose sequences is crucial for accurate motion modeling in human motion prediction (HMP). While recent deep learning-based methods have shown promise in learning motion representations, these methods tend to overlook the varying relevance and dependencies between historical information and future moments, with a stronger correlation for short-term predictions and weaker for distant future predictions. This limits the learning of motion representation and then hampers prediction performance. In this paper, we propose a novel approach called multi-range decoupling decoding with gating-adjusting aggregation (MD2GA), which leverages the temporal correlations to refine motion representation learning. This approach employs a two-stage strategy for HMP. In the first stage, a multi-range decoupling decoding adeptly adjusts feature learning by decoding the shared features into distinct future lengths, where different decoders offer diverse insights into motion patterns. In the second stage, a gating-adjusting aggregation dynamically combines the diverse insights guided by input motion data. Extensive experiments demonstrate that the proposed method can be easily integrated into other motion prediction methods and enhance their prediction performance.
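
The two-stage idea can be sketched as follows, under strong simplifications: each decoder is assumed to have already produced a forecast for its own future length, poses are reduced to single numbers, and the gate logits (which in the paper's setting depend on the input motion) are passed in directly. The function names and the per-step renormalization of gate weights are assumptions, not details taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(predictions, gate_logits):
    """Fuse forecasts from decoders with different future lengths.

    predictions[k] is decoder k's forecast for the first len(predictions[k])
    future steps. At each step, only decoders that cover the step contribute,
    weighted by a softmax over their gate logits (an assumed design).
    """
    horizon = max(len(p) for p in predictions)
    fused = []
    for t in range(horizon):
        covering = [k for k, p in enumerate(predictions) if len(p) > t]
        weights = softmax([gate_logits[k] for k in covering])
        fused.append(sum(w * predictions[k][t] for w, k in zip(weights, covering)))
    return fused


# Toy example: a short-range decoder (2 steps) and a long-range one (4 steps).
predictions = [[1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
fused = aggregate(predictions, gate_logits=[0.0, 0.0])
```

With equal gate logits, the first two steps average both decoders and the last two steps fall back to the only decoder that reaches them.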

[CV-130] KernelDNA: Dynamic Kernel Sharing via Decoupled Naive Adapters

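Quick read: This paper addresses the trade-offs that dynamic convolution faces between parameter overhead and inference speed when increasing model capacity. Prior approaches either incur significant parameter overhead by scaling the number of kernels linearly, compromise inference efficiency through complex kernel interactions, or struggle to jointly optimize dynamic attention and static kernels. The authors also observe that pre-trained convolutional neural networks (CNNs) exhibit inter-layer redundancy similar to that in large language models (LLMs): dense convolutional layers can be efficiently replaced by "child" layers derived from a shared "parent" convolutional kernel. The paper therefore proposes KernelDNA, a lightweight convolution kernel plug-in that decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation, giving parameter-efficient and hardware-friendly inference. The key is cross-layer weight sharing combined with adapter-based modulation, which enables dynamic kernel specialization without altering the standard convolution structure, preserving the computational efficiency of standard convolution while enhancing representation power. Experiments show that KernelDNA achieves a state-of-the-art accuracy-efficiency balance among dynamic convolution variants.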

Link: https://arxiv.org/abs/2503.23379
Authors: Haiduo Huang, Yadong Zhang, Pengju Ren
Affiliations: Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Dynamic convolution enhances model capacity by adaptively combining multiple kernels, yet faces critical trade-offs: prior works either (1) incur significant parameter overhead by scaling kernel numbers linearly, (2) compromise inference speed through complex kernel interactions, or (3) struggle to jointly optimize dynamic attention and static kernels. We also observe that pre-trained Convolutional Neural Networks (CNNs) exhibit inter-layer redundancy akin to that in Large Language Models (LLMs). Specifically, dense convolutional layers can be efficiently replaced by derived "child" layers generated from a shared "parent" convolutional kernel through an adapter. To address these limitations and implement the weight-sharing mechanism, we propose a lightweight convolution kernel plug-in, named KernelDNA. It decouples kernel adaptation into input-dependent dynamic routing and pre-trained static modulation, ensuring both parameter efficiency and hardware-friendly inference. Unlike existing dynamic convolutions that expand parameters via multi-kernel ensembles, our method leverages cross-layer weight sharing and adapter-based modulation, enabling dynamic kernel specialization without altering the standard convolution structure. This design preserves the native computational efficiency of standard convolutions while enhancing representation power through input-adaptive kernel adjustments. Experiments on image classification and dense prediction tasks demonstrate that KernelDNA achieves state-of-the-art accuracy-efficiency balance among dynamic convolution variants. Our codes are available at this https URL.
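
The parent/child idea can be illustrated with a toy 1-D convolution. This sketch is not KernelDNA's actual adapter: the element-wise static scale, the sigmoid gate computed from the input mean, and all names (`child_kernel`, `route_weight`, and so on) are assumptions chosen to show how one shared kernel could be specialized per layer and per input without changing the convolution itself.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def child_kernel(parent, static_scale, signal, route_weight):
    """Derive a layer-specific kernel from a shared parent kernel.

    static_scale: per-layer, input-independent modulation (assumed learned).
    route_weight: weight of a scalar input-dependent gate (assumed form).
    """
    gate = sigmoid(route_weight * (sum(signal) / len(signal)))
    return [w * s * gate for w, s in zip(parent, static_scale)]


def conv1d(signal, kernel):
    """Plain 'valid' 1-D correlation; the convolution op itself is unchanged."""
    steps = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(steps)
    ]


parent = [1.0, 0.0, -1.0]      # shared across layers
signal = [1.0, 2.0, 4.0, 7.0]  # toy input

# Two "child" layers reuse the same parent with different static scales.
kernel_a = child_kernel(parent, [1.0, 1.0, 1.0], signal, route_weight=0.0)
kernel_b = child_kernel(parent, [2.0, 2.0, 2.0], signal, route_weight=0.0)
out_a = conv1d(signal, kernel_a)
out_b = conv1d(signal, kernel_b)
```

Only the small scale vectors and the gate weight are layer-specific here; the parent kernel is stored once, which is where the parameter saving in such a scheme would come from.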

[CV-131] JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

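Quick read: This paper tackles high-quality generation and precise synchronization in joint audio-video generation (JAVG). It proposes JavisDiT, a joint audio-video diffusion transformer built on the Diffusion Transformer (DiT) architecture. The key innovation is a fine-grained spatio-temporal alignment mechanism: a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator that extracts both global and fine-grained spatio-temporal priors to guide synchronization between the visual and auditory components. The paper also builds a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos, and proposes a robust metric for evaluating the synchronization of generated audio-video pairs in complex real-world content. Experiments show that JavisDiT achieves both high generation quality and precise synchronization, setting a new standard for JAVG tasks.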

Link: https://arxiv.org/abs/2503.23377
Authors: Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua
Affiliations: Zhejiang University; National University of Singapore; University of Science and Technology of China; University of Rochester
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes: Work in progress. Homepage: this https URL

Abstract:This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. Further, we specifically devise a robust metric for evaluating the synchronization between generated audio-video pairs in real-world complex content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at this https URL.

[CV-132] Map Feature Perception Metric for Map Generation Quality Assessment and Loss Optimization

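Quick read: This paper addresses how to assess the authenticity of maps synthesized by generative models in intelligent cartography tasks. Current methods mainly rely on computer vision-based image assessment metrics (such as L1, L2, SSIM, and FID), which primarily compare at the pixel level and fail to capture the global features and spatial correlations of maps, leading to semantic-structural artifacts in generated results. The key solution is a new Map Feature Perception Metric (MFP), which extracts element-level deep features that encode a map's structural integrity and topological relationships, enabling evaluation of global characteristics and spatial congruence. Experiments show that MFP is better suited to evaluating cartographic semantic features than conventional loss functions, with performance gains of 2% to 50% across multiple benchmarks. The study concludes that explicitly considering global map attributes and spatial coherence substantially improves generative model optimization and the geographical plausibility of synthesized maps.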

Link: https://arxiv.org/abs/2503.23370
Authors: Chenxing Sun, Jing Bai
Affiliations: Laboratory of Geological Survey and Evaluation of Ministry of Education, China University of Geosciences, Wuhan 430074, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:In intelligent cartographic generation tasks empowered by generative models, the authenticity of synthesized maps constitutes a critical determinant. Concurrently, the selection of appropriate evaluation metrics to quantify map authenticity emerges as a pivotal research challenge. Current methodologies predominantly adopt computer vision-based image assessment metrics to compute discrepancies between generated and reference maps. However, conventional visual similarity metrics (including L1, L2, SSIM, and FID) primarily operate at pixel-level comparisons, inadequately capturing cartographic global features and spatial correlations, consequently inducing semantic-structural artifacts in generated outputs. This study introduces a novel Map Feature Perception Metric (MFP) designed to evaluate global characteristics and spatial congruence between synthesized and target maps. Diverging from pixel-wise metrics, our approach extracts elemental-level deep features that comprehensively encode cartographic structural integrity and topological relationships. Experimental validation demonstrates MFP’s superior capability in evaluating cartographic semantic features, with classification-enhanced implementations outperforming conventional loss functions across diverse generative frameworks. When employed as optimization objectives, our metric achieves performance gains ranging from 2% to 50% across multiple benchmarks compared to traditional L1, L2, and SSIM baselines. This investigation concludes that explicit consideration of cartographic global attributes and spatial coherence substantially enhances generative model optimization, thereby significantly improving the geographical plausibility of synthesized maps.

[CV-133] Towards Physically Plausible Video Generation via VLM Planning

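Quick read: This paper addresses a limitation of video diffusion models (VDMs): lacking an understanding of physical laws, they often generate videos with incorrect dynamics and event sequences. The paper proposes a novel two-stage image-to-video generation framework whose key is to incorporate physics explicitly. In the first stage, a vision language model (VLM) acts as a coarse-grained motion planner, using chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, the predicted motion trajectories/changes guide video generation by a VDM. Because the predicted trajectories/changes are rough, noise is added during inference to give the VDM freedom to generate finer motion details. Experiments show that the framework produces physically plausible motion and compares favorably with existing methods.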

Link: https://arxiv.org/abs/2503.23368
Authors: Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, Xu Jia
Affiliations: Monash University; Dalian University of Technology; Shanghai Artificial Intelligence Laboratory; Oxford University; The University of Sydney; ZMO AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 18 pages, 11 figures

Abstract:Video diffusion models (VDMs) have advanced significantly in recent years, enabling the generation of highly realistic videos and drawing the attention of the community to their potential as world simulators. However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics. In the first stage, we employ a Vision Language Model (VLM) as a coarse-grained motion planner, integrating chain-of-thought and physics-aware reasoning to predict rough motion trajectories/changes that approximate real-world physical dynamics while ensuring inter-frame consistency. In the second stage, we use the predicted motion trajectories/changes to guide the video generation of a VDM. As the predicted motion trajectories/changes are rough, noise is added during inference to provide freedom to the VDM in generating motion with more fine details. Extensive experimental results demonstrate that our framework can produce physically plausible motion, and comparative evaluations highlight the notable superiority of our approach over existing methods. More video results are available on our Project Page: this https URL.

[CV-134] FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning

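Quick read: This paper addresses the problem that, in existing Visual Autoregressive (VAR) models, computational complexity and runtime grow dramatically with image resolution. It proposes FastVAR, a post-training acceleration method for efficient resolution scaling with VAR models. The key innovation is a cached token pruning strategy: only pivotal tokens are forwarded for scale-specific modeling, while cached tokens from previous scale steps restore the pruned slots. This significantly reduces the number of forwarded tokens and improves efficiency at large resolutions. Experiments show that FastVAR further speeds up FlashAttention-accelerated VAR by 2.7× with a negligible performance drop of 1%. FastVAR also enables zero-shot generation of higher-resolution images, producing one 2K image in 1.5 s with a 15GB memory footprint on a single NVIDIA 3090 GPU.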

Link: https://arxiv.org/abs/2503.23367
Authors: Hang Guo, Yawei Li, Taolin Zhang, Jiangshan Wang, Tao Dai, Shu-Tao Xia, Luca Benini
Affiliations: Tsinghua University; ETH Zürich; Shenzhen University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Technical Report

Abstract:Visual Autoregressive (VAR) modeling has gained popularity for its shift towards next-scale prediction. However, existing VAR paradigms process the entire token map at each scale step, so complexity and runtime scale dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Our key finding is that the majority of latency arises from the large-scale step where most tokens have already converged. Leveraging this observation, we develop the cached token pruning strategy that only forwards pivotal tokens for scale-specific modeling while using cached tokens from previous scale steps to restore the pruned slots. This significantly reduces the number of forwarded tokens and improves the efficiency at larger resolutions. Experiments show the proposed FastVAR can further speed up FlashAttention-accelerated VAR by 2.7× with a negligible performance drop of 1%. We further extend FastVAR to zero-shot generation of higher resolution images. In particular, FastVAR can generate one 2K image with a 15GB memory footprint in 1.5s on a single NVIDIA 3090 GPU. Code is available at this https URL.
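
The prune-then-restore step can be sketched on a toy token list. This is a simplified illustration, not FastVAR's implementation: tokens are single numbers rather than vectors, `block` stands in for the expensive per-scale computation, and the pivotal-token score (distance from the cached value) is an assumption, since the abstract does not state the selection criterion.

```python
def prune_and_restore(tokens, cached, keep_ratio, block):
    """Forward only pivotal tokens; fill pruned slots from the cache.

    tokens: current-scale token values.
    cached: values carried over from the previous scale step, assumed to be
            already resized to the current token map.
    block:  the expensive computation applied to forwarded tokens.
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Assumed pivotal score: how far a token deviates from its cached value.
    order = sorted(
        range(len(tokens)),
        key=lambda i: abs(tokens[i] - cached[i]),
        reverse=True,
    )
    kept = sorted(order[:n_keep])

    output = list(cached)  # pruned slots are restored from the cache
    for i in kept:
        output[i] = block(tokens[i])  # only pivotal tokens cost compute
    return output, kept


tokens = [1.0, 5.0, 1.1, 9.0]
cached = [1.0, 1.0, 1.0, 1.0]
output, kept = prune_and_restore(tokens, cached, keep_ratio=0.5, block=lambda x: 2 * x)
```

Here half of the tokens skip `block` entirely, which is where the saving at large scale steps would come from when most tokens have already converged.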

[CV-135] OnSiteVRU: A High-Resolution Trajectory Dataset for High-Density Vulnerable Road Users

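Quick read: This paper addresses the shortcomings of existing traffic datasets in capturing the diversity and dynamics of vulnerable road user (VRU) behavior, in order to support the development and optimization of autonomous driving systems in complex traffic environments. The key contribution is the OnSiteVRU dataset, which covers a variety of scenarios (intersections, road segments, and urban villages) and provides approximately 17,429 trajectories of motor vehicles, electric bicycles, and human-powered bicycles at a temporal precision of 0.04 seconds. The dataset integrates aerial-view natural driving data, onboard real-time dynamic detection data, and environmental information (such as traffic signals, obstacles, and real-time maps), enabling comprehensive reconstruction of interaction events. It improves on traditional datasets in VRU density and scene coverage, providing important support for traffic flow modeling, trajectory prediction, and virtual testing of autonomous driving.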

Link: https://arxiv.org/abs/2503.23365
Authors: Zhangcun Yan, Jianqing Li, Peng Hang, Jian Sun
Affiliations: Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes:

Abstract:With the acceleration of urbanization and the growth of transportation demands, the safety of vulnerable road users (VRUs, such as pedestrians and cyclists) in mixed traffic flows has become increasingly prominent, necessitating high-precision and diverse trajectory data to support the development and optimization of autonomous driving systems. However, existing datasets fall short in capturing the diversity and dynamics of VRU behaviors, making it difficult to meet the research demands of complex traffic environments. To address this gap, this study developed the OnSiteVRU datasets, which cover a variety of scenarios, including intersections, road segments, and urban villages. These datasets provide trajectory data for motor vehicles, electric bicycles, and human-powered bicycles, totaling approximately 17,429 trajectories with a precision of 0.04 seconds. The datasets integrate both aerial-view natural driving data and onboard real-time dynamic detection data, along with environmental information such as traffic signals, obstacles, and real-time maps, enabling a comprehensive reconstruction of interaction events. The results demonstrate that VRU_Data outperforms traditional datasets in terms of VRU density and scene coverage, offering a more comprehensive representation of VRU behavioral characteristics. This provides critical support for traffic flow modeling, trajectory prediction, and autonomous driving virtual testing. The dataset is publicly available for download at: this https URL.
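
If the 0.04-second precision is read as a fixed sampling interval (25 samples per second), per-step speeds follow directly from consecutive positions. The sketch below assumes a trajectory is a list of (x, y) positions in meters; that layout, the constant `DT`, and the function name are illustrative assumptions, not the dataset's documented format.

```python
import math

DT = 0.04  # seconds between samples, assuming "precision" means sampling interval


def speeds(points, dt=DT):
    """Speed in m/s between consecutive (x, y) positions given in meters."""
    return [
        math.hypot(x1 - x0, y1 - y0) / dt
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


# Toy trajectory: 0.2 m along x, then 0.4 m along y.
trajectory = [(0.0, 0.0), (0.2, 0.0), (0.2, 0.4)]
step_speeds = speeds(trajectory)
```

At this sampling rate a 0.2 m displacement between frames already corresponds to about 5 m/s, so small localization errors translate into noticeable speed noise; smoothing before differentiation is a common precaution.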

[CV-136] VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion and Restoration

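Quick read: This paper targets two difficulties in multi-modal video fusion: (1) the scarcity of large-scale multi-sensor video datasets, which limits research in video fusion, and (2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. It makes two key contributions. First, it constructs M3SVD, a benchmark dataset of 220 temporally synchronized and spatially registered infrared-visible video pairs (153,797 frames in total), filling the data gap. Second, it proposes VideoFusion, a multi-modal video fusion model. The model uses a differential reinforcement module for cross-modal information interaction and enhancement, a complete modality-guided fusion strategy to adaptively integrate multi-modal features, and a bi-temporal co-attention mechanism that dynamically aggregates forward and backward temporal context to reinforce cross-frame feature representations, thereby mitigating temporal inconsistency and interference.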

Link: https://arxiv.org/abs/2503.23359
Authors: Linfeng Tang, Yeda Wang, Meiqi Gong, Zizhuo Li, Yuxin Deng, Xunpeng Yi, Chunyu Li, Han Xu, Hao Zhang, Jiayi Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly integrates complementary context from multiple images rather than videos. This primarily stems from two factors: 1) the scarcity of large-scale multi-sensor video datasets, limiting research in video fusion, and 2) the inherent difficulty of jointly modeling spatial and temporal dependencies in a unified framework. This paper proactively compensates for the dilemmas. First, we construct M3SVD, a benchmark dataset with 220 temporally synchronized and spatially registered infrared-visible video pairs comprising 153,797 frames, filling the data gap for the video fusion community. Secondly, we propose VideoFusion, a multi-modal video fusion model that fully exploits cross-modal complementarity and temporal dynamics to generate spatio-temporally coherent videos from (potentially degraded) multi-modal inputs. Specifically, 1) a differential reinforcement module is developed for cross-modal information interaction and enhancement, 2) a complete modality-guided fusion strategy is employed to adaptively integrate multi-modal features, and 3) a bi-temporal co-attention mechanism is devised to dynamically aggregate forward-backward temporal contexts to reinforce cross-frame feature representations. Extensive experiments reveal that VideoFusion outperforms existing image-oriented fusion paradigms in sequential scenarios, effectively mitigating temporal inconsistency and interference.

[CV-137] ControlFusion: A Controllable Image Fusion Framework with Language-Vision Degradation Prompts

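Quick read: This paper addresses two weaknesses of current image fusion methods: they handle the composite degradations found in real-world imaging poorly, and they lack the flexibility to accommodate user-specific requirements. It proposes ControlFusion, a controllable image fusion framework whose key is to adaptively neutralize composite degradations through language-vision prompts. On the one hand, the authors develop a degraded imaging model that integrates physical imaging mechanisms (including Retinex theory and the atmospheric scattering principle) to simulate composite degradations, making it possible to address complex real-world degradations at the data level. On the other hand, they design a prompt-modulated restoration and fusion network that dynamically enhances features to handle composite degradations of varying levels. To account for individual differences in users' quality perception, a text encoder embeds user-specified degradation types and severity levels as degradation prompts, and a spatial-frequency collaborative visual adapter autonomously perceives degradations in the source images, removing complete dependence on user instructions. Experiments show that ControlFusion outperforms state-of-the-art fusion methods in fusion quality and degradation handling, particularly for real-world and compound degradations of various levels.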

Link: https://arxiv.org/abs/2503.23356
Authors: Linfeng Tang, Yeda Wang, Zhanchuan Cai, Junjun Jiang, Jiayi Ma
Affiliations: Wuhan University; Macau University of Science and Technology; Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Current image fusion methods struggle to address the composite degradations encountered in real-world imaging scenarios and lack the flexibility to accommodate user-specific requirements. In response to these challenges, we propose a controllable image fusion framework with language-vision prompts, termed ControlFusion, which adaptively neutralizes composite degradations. On the one hand, we develop a degraded imaging model that integrates physical imaging mechanisms, including the Retinex theory and atmospheric scattering principle, to simulate composite degradations, thereby providing potential for addressing real-world complex degradations from the data level. On the other hand, we devise a prompt-modulated restoration and fusion network that dynamically enhances features with degradation prompts, enabling our method to accommodate composite degradation of varying levels. Specifically, considering individual variations in quality perception of users, we incorporate a text encoder to embed user-specified degradation types and severity levels as degradation prompts. We also design a spatial-frequency collaborative visual adapter that autonomously perceives degradations in source images, thus eliminating the complete dependence on user instructions. Extensive experiments demonstrate that ControlFusion outperforms SOTA fusion methods in fusion quality and degradation handling, particularly in countering real-world and compound degradations with various levels.
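
The two physical mechanisms named in the abstract have standard textbook forms, which the sketch below combines on a tiny grayscale image. Under the Retinex view an image is reflectance times illumination, so low light can be simulated by scaling illumination; under the atmospheric scattering model a hazy pixel is `J * t + A * (1 - t)` with transmission `t = exp(-beta * depth)`. How ControlFusion parameterizes or orders these degradations is not stated in the abstract, so the composition order, the default values, and the function names are assumptions.

```python
import math


def low_light(pixel, illumination_gain):
    """Retinex view: observed = reflectance * illumination.

    Darkening is simulated by scaling the illumination component.
    """
    return pixel * illumination_gain


def haze(pixel, depth, beta, airlight):
    """Atmospheric scattering: I = J * t + A * (1 - t), t = exp(-beta * depth)."""
    transmission = math.exp(-beta * depth)
    return pixel * transmission + airlight * (1.0 - transmission)


def degrade(image, depth_map, gain=0.5, beta=1.0, airlight=0.8):
    """Apply low light, then haze, to a grayscale image with values in [0, 1]."""
    return [
        [haze(low_light(p, gain), d, beta, airlight) for p, d in zip(row, depth_row)]
        for row, depth_row in zip(image, depth_map)
    ]


image = [[0.8, 0.4]]
near = degrade(image, [[0.0, 0.0]])        # zero depth: no haze, only darkening
far = degrade(image, [[1000.0, 1000.0]])   # very distant: airlight dominates
```

Varying `gain` and `beta` gives degradations of different severity, which is the kind of controllable training data a prompt-conditioned restoration network would need.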

[CV-138] DSPFusion: Image Fusion via Degradation and Semantic Dual-Prior Guidance

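Quick read: This paper addresses the fact that existing image fusion methods perform well on high-quality images but struggle with degraded images captured under harsh conditions, which limits the practical potential of image fusion. It proposes DSPFusion, a Degradation and Semantic Prior dual-guided framework for degraded image fusion. The key is to use degradation priors together with high-quality scene semantic priors restored via a diffusion model to guide both information recovery and fusion. Specifically, the method first extracts modality-specific degradation priors individually while jointly capturing comprehensive low-quality semantic priors; a diffusion model then iteratively restores high-quality semantic priors in a compact latent space, making the method over 20× faster than mainstream diffusion-based image fusion schemes; finally, the degradation priors and high-quality semantic priors guide information enhancement and aggregation through the dual-prior guidance module and the prior-guided fusion module. Extensive experiments show that DSPFusion mitigates most typical degradations while integrating complementary context at low computational cost, greatly broadening the application scope of image fusion.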

Link: https://arxiv.org/abs/2503.23355
Authors: Linfeng Tang, Chunyu Li, Guoqing Wang, Yixuan Yuan, Jiayi Ma
Affiliations: Wuhan University; University of Electronic Science and Technology of China; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Existing fusion methods are tailored for high-quality images but struggle with degraded images captured under harsh circumstances, thus limiting the practical potential of image fusion. This work presents a Degradation and Semantic Prior dual-guided framework for degraded image Fusion (DSPFusion), utilizing degradation priors and high-quality scene semantic priors restored via diffusion models to guide both information recovery and fusion in a unified model. Specifically, it first individually extracts modality-specific degradation priors, while jointly capturing comprehensive low-quality semantic priors. Subsequently, a diffusion model is developed to iteratively restore high-quality semantic priors in a compact latent space, enabling our method to be over 20× faster than mainstream diffusion model-based image fusion schemes. Finally, the degradation priors and high-quality semantic priors are employed to guide information enhancement and aggregation via the dual-prior guidance and prior-guided fusion modules. Extensive experiments demonstrate that DSPFusion mitigates most typical degradations while integrating complementary context with minimal computational cost, greatly broadening the application scope of image fusion.

[CV-139] Object Isolated Attention for Consistent Story Visualization

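Quick read: This paper addresses the problem of maintaining character consistency in open-ended story visualization, a challenge for many existing methods. The key solution is an enhanced Transformer module that uses separate (isolated) self-attention and cross-attention mechanisms and leverages prior knowledge from pre-trained diffusion models to ensure logical scene creation. The isolated self-attention mechanism refines attention maps to reduce focus on irrelevant areas and highlight key features of the same character, improving character consistency; the isolated cross-attention mechanism processes each character's features independently, avoiding feature fusion and further strengthening consistency. The method is training-free, allowing continuous generation of new characters and storylines without re-tuning.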

Link: https://arxiv.org/abs/2503.23353
Authors: Xiangyang Luo, Junhao Cheng, Yifan Xie, Xin Zhang, Tao Feng, Zhou Liu, Fei Ma, Fei Yu
Affiliations: Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Sun Yat-sen University, Shenzhen, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 6 pages, 4 figures

Abstract:Open-ended story visualization is a challenging task that involves generating coherent image sequences from a given storyline. One of the main difficulties is maintaining character consistency while creating natural and contextually fitting scenes–an area where many existing methods struggle. In this paper, we propose an enhanced Transformer module that uses separate self attention and cross attention mechanisms, leveraging prior knowledge from pre-trained diffusion models to ensure logical scene creation. The isolated self attention mechanism improves character consistency by refining attention maps to reduce focus on irrelevant areas and highlight key features of the same character. Meanwhile, the isolated cross attention mechanism independently processes each character’s features, avoiding feature fusion and further strengthening consistency. Notably, our method is training-free, allowing the continuous generation of new characters and storylines without re-tuning. Both qualitative and quantitative evaluations show that our approach outperforms current methods, demonstrating its effectiveness.
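
One way to picture "isolated" attention is as ordinary softmax attention restricted by a mask, so that each token only attends to tokens belonging to the same character. The sketch below shows that masking on scalar features; it is an assumption-laden illustration, since the abstract does not say how the paper derives its masks or refines attention maps, and the `groups` labels and function name are hypothetical.

```python
import math


def isolated_attention(queries, keys, values, groups):
    """Softmax attention where token i only sees tokens with the same group.

    queries, keys, values: lists of floats (scalar features for brevity).
    groups: a label per token, e.g. a character id (hypothetical input).
    """
    outputs = []
    for i, q in enumerate(queries):
        allowed = [j for j in range(len(keys)) if groups[j] == groups[i]]
        scores = [q * keys[j] for j in allowed]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        outputs.append(sum(w / norm * values[j] for w, j in zip(weights, allowed)))
    return outputs


# Tokens 0 and 1 belong to character "a", token 2 to character "b".
mixed = isolated_attention(
    queries=[0.0, 0.0, 0.0],
    keys=[1.0, 1.0, 1.0],
    values=[1.0, 3.0, 10.0],
    groups=["a", "a", "b"],
)
```

Character "b" keeps its own value of 10.0 while the two "a" tokens average only with each other, which shows how a mask of this kind prevents features of different characters from blending.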

[CV-140] Physically Ground Commonsense Knowledge for Articulated Object Manipulation with Analytic Concepts

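Quick read: This paper addresses the challenge of effectively grounding the semantic-level commonsense knowledge produced by large language models (LLMs) in the physical world so that it can thoroughly guide robots in generalized articulated object manipulation. The key is to introduce analytic concepts, defined on mathematical symbolism, that machines can directly compute and simulate. Using these analytic concepts as a bridge between the semantic-level knowledge inferred by LLMs and the physical world in which real robots operate, the paper represents knowledge of object structure and functionality in a physics-informed way and uses it to instruct robot control policies that are generalizable, interpretable, and accurate. Experiments support the superiority of the proposed approach.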

Link: https://arxiv.org/abs/2503.23348
Authors: Jianhua Sun, Jiude Wei, Yuxuan Li, Cewu Lu
Affiliations: Shanghai Jiao Tong University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:We humans rely on a wide range of commonsense knowledge to interact with an extensive number and variety of objects in the physical world. Likewise, such commonsense knowledge is also crucial for robots to successfully develop generalized object manipulation skills. While recent advancements in Large Language Models (LLMs) have showcased their impressive capabilities in acquiring commonsense knowledge and conducting commonsense reasoning, effectively grounding this semantic-level knowledge produced by LLMs to the physical world to thoroughly guide robots in generalized articulated object manipulation remains a challenge that has not been sufficiently addressed. To this end, we introduce analytic concepts, procedurally defined upon mathematical symbolism that can be directly computed and simulated by machines. By leveraging the analytic concepts as a bridge between the semantic-level knowledge inferred by LLMs and the physical world where real robots operate, we are able to figure out the knowledge of object structure and functionality with physics-informed representations, and then use the physically grounded knowledge to instruct robot control policies for generalized, interpretable and accurate articulated object manipulation. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our approach.

[CV-141] From Panels to Prose: Generating Literary Narratives from Comics

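[Quick read]: This paper aims to make comics, manga in particular, accessible to visually impaired readers by developing an automated system that converts comics into text-based literary narratives. The key to the solution is a unified model, Magiv3, which performs well on comic-understanding tasks including panel localisation, character and text detection, OCR, and character grounding, and which is combined with large vision-language models to generate coherent literary narratives, letting visually impaired audiences experience the depth and richness of comic stories.

Link: https://arxiv.org/abs/2503.23344
Authors: Ragav Sachdeva, Andrew Zisserman
Affiliations: Visual Geometry Group, Dept. of Engineering Science, University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: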

Abstract:Comics have long been a popular form of storytelling, offering visually engaging narratives that captivate audiences worldwide. However, the visual nature of comics presents a significant barrier for visually impaired readers, limiting their access to these engaging stories. In this work, we provide a pragmatic solution to this accessibility challenge by developing an automated system that generates text-based literary narratives from manga comics. Our approach aims to create evocative and immersive prose that not only conveys the original narrative but also captures the depth and complexity of characters, their interactions, and the vivid settings in which they reside. To this end we make the following contributions: (1) We present a unified model, Magiv3, that excels at various functional tasks pertaining to comic understanding, such as localising panels, characters, texts, and speech-bubble tails, performing OCR, grounding characters etc. (2) We release human-annotated captions for over 3300 Japanese comic panels, along with character grounding annotations, and benchmark large vision-language models in their ability to understand comic images. (3) Finally, we demonstrate how integrating large vision-language models with Magiv3 can generate seamless literary narratives that allow visually impaired audiences to engage with the depth and richness of comic storytelling.

[CV-142] Enhancing 3D Gaussian Splatting Compression via Spatial Condition-based Prediction ICME2025

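[Quick read]: This paper addresses the excessive storage and transmission cost (hundreds of megabytes or even gigabytes) of anchor-based 3D Gaussian representations in novel view synthesis (NVS). To this end it introduces prediction: a spatial condition-based prediction module uses grid-captured scene information for prediction, and a residual compensation strategy learns the missing fine-grained information. To compress the residual further, it proposes an instance-aware hyperprior, yielding a structure-aware and instance-aware entropy model. Experiments show that the prediction-based compression framework and each of its components are effective, achieving a 24.42% bit rate saving even compared with the state-of-the-art compression method. The key is the combination of prediction with residual compensation and the new entropy modeling approach.

Link: https://arxiv.org/abs/2503.23337
Authors: Jingui Ma, Yang Hu, Luyang Tang, Jiayu Yang, Yongqi Zhai, Ronggang Wang
Affiliations: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; Pengcheng Laboratory, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: The paper has been accepted by ICME2025 in March 2025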

Abstract:Recently, 3D Gaussian Splatting (3DGS) has gained widespread attention in Novel View Synthesis (NVS) due to its remarkable real-time rendering performance. However, the substantial cost of storage and transmission of vanilla 3DGS hinders its further application (hundreds of megabytes or even gigabytes for a single scene). Motivated by the achievements of prediction in video compression, we introduce the prediction technique into the anchor-based Gaussian representation to effectively reduce the bit rate. Specifically, we propose a spatial condition-based prediction module to utilize the grid-captured scene information for prediction, with a residual compensation strategy designed to learn the missing fine-grained information. Besides, to further compress the residual, we propose an instance-aware hyper prior, developing a structure-aware and instance-aware entropy model. Extensive experiments demonstrate the effectiveness of our prediction-based compression framework and each technical component. Even compared with the SOTA compression method, our framework still achieves a bit rate saving of 24.42 percent. Code is to be released!
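
The prediction-plus-residual idea borrowed from video compression can be illustrated with a toy sketch. This is not the paper's module: the `predicted` values, the uniform quantizer, and the step size below are all illustrative assumptions, chosen only to show why coding a small residual instead of the raw value reduces what must be stored.

```python
# Toy predictive coding with residual compensation (illustrative only).
# `predicted` stands in for the output of some predictor; the values and
# the uniform quantizer are assumptions, not the paper's design.

def encode_residuals(actual, predicted, step):
    """Quantize (actual - predicted) to integer symbols with a uniform step."""
    return [round((a - p) / step) for a, p in zip(actual, predicted)]

def decode(predicted, symbols, step):
    """Reconstruct each value as prediction plus dequantized residual."""
    return [p + s * step for p, s in zip(predicted, symbols)]

actual = [1.03, -0.48, 2.51]
predicted = [1.00, -0.50, 2.40]
step = 0.05

symbols = encode_residuals(actual, predicted, step)
reconstructed = decode(predicted, symbols, step)
max_error = max(abs(a - r) for a, r in zip(actual, reconstructed))
print(symbols, max_error)
```

Only the small integer symbols would need to be entropy coded; the better the predictor, the more tightly they cluster around zero, which is what an entropy model can exploit.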

[CV-143] TraceMark-LDM: Authenticatable Watermarking for Latent Diffusion Models via Binary-Guided Rearrangement

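[Quick read]: This paper addresses the problem that existing attribution methods for Latent Diffusion Models (LDMs) in generative AI, such as directly embedding watermarks, degrade the quality and robustness of generated content. The key solution is a new algorithm, TraceMark-LDM, which uses the watermark as guidance to rearrange random variables sampled from a Gaussian distribution and tunes the LDM encoder to make the watermark more robust. The method guarantees non-destructive performance for generated content, improves image quality and attribution accuracy, and shows strong robustness against a variety of common attacks.

Link: https://arxiv.org/abs/2503.23332
Authors: Wenhao Luo, Zhangyi Shen, Ye Yao, Feng Ding, Guopu Zhu, Weizhi Meng
Affiliations: Hangzhou Dianzi University; Nanchang University; School of Computer Science and Technology, Harbin Institute of Technology; School of Computing and Communications, Lancaster University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 6 figures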

Abstract:Image generation algorithms are increasingly integral to diverse aspects of human society, driven by their practical applications. However, insufficient oversight in artificial intelligence generated content (AIGC) can facilitate the spread of malicious content and increase the risk of copyright infringement. Among the diverse range of image generation models, the Latent Diffusion Model (LDM) is currently the most widely used, dominating the majority of the Text-to-Image model market. Currently, most attribution methods for LDMs rely on directly embedding watermarks into the generated images or their intermediate noise, a practice that compromises both the quality and the robustness of the generated content. To address these limitations, we introduce TraceMark-LDM, a novel algorithm that integrates watermarking to attribute generated images while guaranteeing non-destructive performance. Unlike current methods, TraceMark-LDM leverages watermarks as guidance to rearrange random variables sampled from a Gaussian distribution. To mitigate potential deviations caused by inversion errors, the elements with small absolute values are grouped and rearranged. Additionally, we fine-tune the LDM encoder to enhance the robustness of the watermark. Experimental results show that images synthesized using TraceMark-LDM exhibit superior quality and attribution accuracy compared to state-of-the-art (SOTA) techniques. Notably, TraceMark-LDM demonstrates exceptional robustness against various common attack methods, consistently outperforming SOTA methods.

[CV-144] HiPART: Hierarchical Pose AutoRegressive Transformer for Occluded 3D Human Pose Estimation CVPR2025

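[Quick read]: This paper addresses the difficulty existing 2D-to-3D human pose estimation (HPE) methods have with occlusion. These methods enrich the lifting stage with extra information such as temporal and visual cues but overlook the limitation of the sparse skeleton 2D input representation, which fundamentally restricts 2D-to-3D lifting and worsens the occlusion problem. The key solution is a new two-stage generative densification method, the Hierarchical Pose AutoRegressive Transformer (HiPART), which generates hierarchical dense poses from the original sparse 2D pose. Specifically, a multi-scale skeleton tokenization module quantizes the highly dense 2D pose into hierarchical tokens, with a Skeleton-aware Alignment to strengthen token connections; a Hierarchical AutoRegressive Modeling scheme then generates the hierarchical 2D poses. Using the generated hierarchical poses as input to 2D-to-3D lifting, the method is robust in occluded scenarios and reaches state-of-the-art performance on single-frame 3D HPE. It also outperforms many multi-frame methods with fewer parameters and lower computational complexity, and can complement them to further improve performance and robustness.

Link: https://arxiv.org/abs/2503.23331
Authors: Hongwei Zheng, Han Li, Wenrui Dai, Ziyang Zheng, Chenglin Li, Junni Zou, Hongkai Xiong
Affiliations: Shanghai Jiao Tong University, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: CVPR2025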

Abstract:Existing 2D-to-3D human pose estimation (HPE) methods struggle with the occlusion issue by enriching information like temporal and visual cues in the lifting stage. In this paper, we argue that these methods ignore the limitation of the sparse skeleton 2D input representation, which fundamentally restricts the 2D-to-3D lifting and worsens the occlusion issue. To address these, we propose a novel two-stage generative densification method, named Hierarchical Pose AutoRegressive Transformer (HiPART), to generate hierarchical 2D dense poses from the original sparse 2D pose. Specifically, we first develop a multi-scale skeleton tokenization module to quantize the highly dense 2D pose into hierarchical tokens and propose a Skeleton-aware Alignment to strengthen token connections. We then develop a Hierarchical AutoRegressive Modeling scheme for hierarchical 2D pose generation. With generated hierarchical poses as inputs for 2D-to-3D lifting, the proposed method shows strong robustness in occluded scenarios and achieves state-of-the-art performance on the single-frame-based 3D HPE. Moreover, it outperforms numerous multi-frame methods while reducing parameter and computational complexity and can also complement them to further enhance performance and robustness.

[CV-145] EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing

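[Quick read]: This paper addresses the shortcomings of existing multimodal large language models (MLLMs) in precise object localization and fine-grained attribute description on remote sensing (RS) tasks, where images are high-resolution and objects occupy a small proportion of the image. These models lag behind classical visual perception models on object-centric tasks because they provide only coarse image understanding, which limits gains in real-world scenarios. The key solution is EagleVision, an MLLM tailored for remote sensing with strong object detection and attribute comprehension. Its core innovation is the Attribute Disentangle module, which learns disentangled vision tokens to express distinct attributes; the authors also build the EVAttrs-95K dataset and a new evaluation benchmark, EVBench, to support object-level visual-language alignment, achieving leading performance on fine-grained object detection and attribute understanding.

Link: https://arxiv.org/abs/2503.23330
Authors: Hongxiang Jiang, Jihao Yin, Qixiong Wang, Jiaqi Feng, Guo Chen
Affiliations: Beihang University; XiaoHongShu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review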

Abstract:Recent advances in multimodal large language models (MLLMs) have demonstrated impressive results in various visual tasks. However, in remote sensing (RS), high resolution and small proportion of objects pose challenges to existing MLLMs, which struggle with object-centric tasks, particularly in precise localization and fine-grained attribute description for each object. These RS MLLMs have not yet surpassed classical visual perception models, as they only provide coarse image understanding, leading to limited gains in real-world scenarios. To address this gap, we establish EagleVision, an MLLM tailored for remote sensing that excels in object detection and attribute comprehension. Equipped with the Attribute Disentangle module, EagleVision learns disentanglement vision tokens to express distinct attributes. To support object-level visual-language alignment, we construct EVAttrs-95K, the first large-scale object attribute understanding dataset in RS for instruction tuning, along with a novel evaluation benchmark, EVBench. EagleVision achieves state-of-the-art performance on both fine-grained object detection and object attribute understanding tasks, highlighting the mutual promotion between detection and understanding capabilities in MLLMs. The code, model, data, and demo will be available at this https URL.

[CV-146] SpINR: Neural Volumetric Reconstruction for FMCW Radars

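[Quick read]: This paper addresses the limits in resolution and generalization of traditional radar imaging techniques such as backprojection, which stem from assuming ideal signal models and relying on dense aperture sampling. The key to the solution is the SpINR framework, which combines a fully differentiable frequency-domain forward model with implicit neural representations (INRs). The approach exploits the linear relationship between beat frequency and scatterer distance in FMCW radar systems to learn scene geometry more efficiently and accurately. In addition, by computing outputs only for the relevant frequency bins, SpINR's forward model is more computationally efficient than time-domain approaches. Experiments show that SpINR clearly outperforms classical methods and existing learning-based techniques in high-resolution reconstruction of complex scenes.

Link: https://arxiv.org/abs/2503.23313
Authors: Harshvardhan Takawale, Nirupam Roy
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: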

Abstract:In this paper, we introduce SpINR, a novel framework for volumetric reconstruction using Frequency-Modulated Continuous-Wave (FMCW) radar data. Traditional radar imaging techniques, such as backprojection, often assume ideal signal models and require dense aperture sampling, leading to limitations in resolution and generalization. To address these challenges, SpINR integrates a fully differentiable forward model that operates natively in the frequency domain with implicit neural representations (INRs). This integration leverages the linear relationship between beat frequency and scatterer distance inherent in FMCW radar systems, facilitating more efficient and accurate learning of scene geometry. Additionally, by computing outputs for only the relevant frequency bins, our forward model achieves greater computational efficiency compared to time-domain approaches that process the entire signal before transformation. Through extensive experiments, we demonstrate that SpINR significantly outperforms classical backprojection methods and existing learning-based approaches, achieving higher resolution and more accurate reconstructions of complex scenes. This work represents the first application of neural volumetric reconstruction in the radar domain, offering a promising direction for future research in radar-based imaging and perception systems.
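
The linear relationship between beat frequency and scatterer distance mentioned in the abstract is the standard FMCW relation f_b = 2RS/c, where S = B/T is the chirp slope. The sketch below only illustrates that relation; the bandwidth and chirp duration are made-up example values, not SpINR's radar settings, and the function names are hypothetical.

```python
# Range <-> beat frequency for a linear FMCW chirp, using the standard
# relation f_b = 2 * R * S / c with chirp slope S = B / T.
# The parameter values below are illustrative assumptions only.

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def beat_frequency(distance_m, bandwidth_hz, chirp_s):
    """Beat frequency (Hz) produced by a static scatterer at distance_m."""
    slope = bandwidth_hz / chirp_s
    return 2.0 * distance_m * slope / SPEED_OF_LIGHT

def distance_from_beat(beat_hz, bandwidth_hz, chirp_s):
    """Invert the relation: scatterer distance (m) from a beat frequency."""
    slope = bandwidth_hz / chirp_s
    return SPEED_OF_LIGHT * beat_hz / (2.0 * slope)

bandwidth_hz, chirp_s = 4e9, 1e-3   # hypothetical 4 GHz sweep over 1 ms
f1 = beat_frequency(1.0, bandwidth_hz, chirp_s)
f2 = beat_frequency(2.0, bandwidth_hz, chirp_s)
recovered = distance_from_beat(f1, bandwidth_hz, chirp_s)
print(f1, f2, recovered)
```

Because doubling the distance doubles the beat frequency, each frequency bin corresponds to a fixed range interval, which is why a forward model can evaluate only the bins that cover the scene.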

[CV-147] LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation

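[Quick read]: This paper addresses the problem that, in visually driven recommendation domains, text alone cannot fully capture product attributes such as color, style, or design. It proposes LaViC (Large Vision-Language Conversational Recommendation Framework), whose key idea is a two-stage process for integrating compact image representations into dialogue-based recommendation: stage one is visual knowledge self-distillation, which condenses product images from hundreds of tokens into a small set of visual tokens and greatly reduces computational overhead; stage two is recommendation prompt tuning, which lets the model combine dialogue context with the distilled visual tokens, giving a unified mechanism for capturing textual and visual features.

Link: https://arxiv.org/abs/2503.23312
Authors: Hyunsik Jeon, Satoshi Koide, Yu Wang, Zhankui He, Julian McAuley
Affiliations: UC San Diego; Toyota Motor North America, Inc.
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: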

Abstract:Conversational recommender systems engage users in dialogues to refine their needs and provide more personalized suggestions. Although textual information suffices for many domains, visually driven categories such as fashion or home decor potentially require detailed visual information related to color, style, or design. To address this challenge, we propose LaViC (Large Vision-Language Conversational Recommendation Framework), a novel approach that integrates compact image representations into dialogue-based recommendation systems. LaViC leverages a large vision-language model in a two-stage process: (1) visual knowledge self-distillation, which condenses product images from hundreds of tokens into a small set of visual tokens in a self-distillation manner, significantly reducing computational overhead, and (2) recommendation prompt tuning, which enables the model to incorporate both dialogue context and distilled visual tokens, providing a unified mechanism for capturing textual and visual features. To support rigorous evaluation of visually-aware conversational recommendation, we construct a new dataset by aligning Reddit conversations with Amazon product listings across multiple visually oriented categories (e.g., fashion, beauty, and home). This dataset covers realistic user queries and product appearances in domains where visual details are crucial. Extensive experiments demonstrate that LaViC significantly outperforms text-only conversational recommendation methods and open-source vision-language baselines. Moreover, LaViC achieves competitive or superior accuracy compared to prominent proprietary baselines (e.g., GPT-3.5-turbo, GPT-4o-mini, and GPT-4o), demonstrating the necessity of explicitly using visual data for capturing product attributes and showing the effectiveness of our vision-language integration. Our code and dataset are available at this https URL.

[CV-148] MoCha: Towards Movie-Grade Talking Character Synthesis

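[Quick read]: This paper addresses the widespread neglect of character-driven storytelling in video generation, a key task for automated film and animation generation. It proposes a new task, Talking Characters, which aims to generate realistic character animation directly from speech and text, going beyond the facial region to show the full portrait of the character. The key solution is the MoCha model, the first to tackle this task, which introduces a speech-video window attention mechanism to ensure precise synchronization between video and speech. To cope with the scarcity of large-scale speech-labeled video datasets, the method uses a joint training strategy that leverages both speech-labeled and text-labeled video data, markedly improving generalization across diverse character actions. Structured prompt templates with character tags enable, for the first time, multi-character interaction with turn-based dialogue, so that AI-generated characters can hold context-aware conversations with cinematic coherence. These innovations let MoCha set a new standard in realism, expressiveness, controllability, and generalization.

Link: https://arxiv.org/abs/2503.23307
Authors: Cong Wei, Bo Sun, Haoyu Ma, Ji Hou, Felix Juefei-Xu, Zecheng He, Xiaoliang Dai, Luxin Zhang, Kunpeng Li, Tingbo Hou, Animesh Sinha, Peter Vajda, Wenhu Chen
Affiliations: University of Waterloo; GenAI, Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL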

Abstract:Recent advancements in video generation have achieved impressive motion realism, yet they often overlook character-driven storytelling, a crucial task for automated film and animation generation. We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking heads, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. In this paper, we propose MoCha, the first of its kind to generate talking characters. To ensure precise synchronization between video and speech, we propose a speech-video window attention mechanism that effectively aligns speech and video tokens. To address the scarcity of large-scale speech-labeled video datasets, we introduce a joint training strategy that leverages both speech-labeled and text-labeled video data, significantly improving generalization across diverse character actions. We also design structured prompt templates with character tags, enabling, for the first time, multi-character conversation with turn-based dialogue, allowing AI-generated characters to engage in context-aware conversations with cinematic coherence. Extensive qualitative and quantitative evaluations, including human preference studies and benchmark comparisons, demonstrate that MoCha sets a new standard for AI-generated cinematic storytelling, achieving superior realism, expressiveness, controllability and generalization.
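
The abstract does not spell out how the speech-video window attention is built, so the following is only one plausible reading, offered as a sketch: each video token is allowed to attend to the speech tokens that fall within a small temporal window around its aligned position. The function name `window_mask`, the linear index alignment, and the window size are all assumptions for illustration, not MoCha's implementation.

```python
# Hypothetical local attention mask between video and speech tokens.
# The linear index mapping and the window size are assumptions made for
# illustration; the paper's actual mechanism may differ.

def window_mask(num_video, num_speech, window):
    """mask[i][j] is True when video token i may attend to speech token j."""
    mask = []
    for i in range(num_video):
        # Map video index i to the nearest speech index on a shared timeline.
        center = round(i * (num_speech - 1) / max(num_video - 1, 1))
        mask.append([abs(j - center) <= window for j in range(num_speech)])
    return mask

mask = window_mask(num_video=4, num_speech=10, window=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Restricting attention in this way ties each frame to the audio spoken at roughly the same moment, which is the kind of locality a lip-sync constraint would need.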

[CV-149] Learning Predictive Visuomotor Coordination

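[Quick read]: This paper addresses understanding and predicting human visuomotor coordination, which matters for robotics, human-computer interaction, and assistive technologies. It proposes a forecasting-based task whose goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. The key to the solution is a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals, combined with a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences to produce temporally coherent and accurate visuomotor predictions. Evaluation on the large-scale EgoExo4D dataset shows strong generalization across diverse real-world activities.

Link: https://arxiv.org/abs/2503.23300
Authors: Wenqi Jia, Bolin Lai, Miao Liu, Danfei Xu, James M. Rehg
Affiliations: University of Illinois Urbana-Champaign; Georgia Tech; Meta AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: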

Abstract:Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling.

[CV-150] ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

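[Quick read]: This paper addresses open-vocabulary 3D visual grounding and reasoning, in particular the challenge of localizing occluded objects in a scene. Existing methods depend on 3D annotations and mask proposals, so they handle diverse semantics and commonsense knowledge poorly, which limits their performance in practice. To overcome these limitations the paper proposes ReasonGrounder, a framework whose key idea is to use hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning guided by a Large Vision-Language Model (LVLM). By incorporating 2D segmentation masks from SAM and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups according to object scale and localizes objects accurately through both explicit and implicit language understanding, even in novel or occluded views. The paper also contributes ReasoningGD, a new dataset with over 10,000 scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.

Link: https://arxiv.org/abs/2503.23297
Authors: Zhenyang Liu, Yikai Wang, Sixiao Zheng, Tongying Pan, Longfei Liang, Yanwei Fu, Xiangyang Xue
Affiliations: Fudan University; Nanyang Technological University; Shanghai Innovation Institute; NeuHelium Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: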

Abstract:Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle diverse semantics and common knowledge required for effective reasoning. In this work, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLM) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from the SAM and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views. We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.

[CV-151] SketchVideo: Sketch-based Video Generation and Editing CVPR2025

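[Quick read]: This paper addresses the challenges of precisely controlling global layout and geometric detail through text, and of supporting motion control and local modification through images. Its key contribution is a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks; sketches drawn on keyframes at arbitrary time points give easy, interactive spatio-temporal control. To propagate the sparse sketch conditions across all frames, an inter-frame attention mechanism analyzes the relationship between the keyframes and each video frame. For sketch-based video editing, an additional video insertion module keeps the newly edited content consistent with the original video's spatial features and dynamic motion. Latent fusion is used at inference to preserve unedited regions accurately. Experiments show that SketchVideo performs well in controllable video generation and editing.

Link: https://arxiv.org/abs/2503.23284
Authors: Feng-Lin Liu, Hongbo Fu, Xintao Wang, Weicai Ye, Pengfei Wan, Di Zhang, Lin Gao
Affiliations: Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Hong Kong University of Science and Technology; Kuaishou Technology
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025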

Abstract:Video generation and editing conditioned on text prompts or images have undergone significant advancements. However, challenges remain in accurately controlling global layout and geometry details solely by texts, and supporting motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism to analyze the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video’s spatial feature and dynamic motion. During inference, we use latent fusion for the accurate preservation of unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing.

[CV-152] Language Guided Concept Bottleneck Models for Interpretable Continual Learning CVPR2025

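[Quick read]: This paper addresses two core challenges in continual learning (CL): mitigating catastrophic forgetting while maintaining interpretability across tasks. Most existing CL methods focus on preserving learned knowledge to improve model performance, yet as new information is introduced, the interpretability of the learning process becomes crucial and is rarely explored. The key solution is a new framework that integrates language-guided Concept Bottleneck Models (CBMs) into continual learning. Its central innovation is the Concept Bottleneck Layer, which is aligned with CLIP models to learn human-understandable concepts that generalize across tasks. By focusing on interpretable concepts, the method both strengthens the model's ability to retain knowledge over time and provides transparent insight into its decisions. It outperforms existing state-of-the-art methods on several datasets, improving final average accuracy by up to 3.06% on ImageNet-subset. The paper also provides concept visualizations for model predictions, furthering the understanding of interpretable continual learning.

Link: https://arxiv.org/abs/2503.23283
Authors: Lu Yu, Haoyu Han, Zhe Tao, Hantao Yao, Changsheng Xu
Affiliations: School of Computer Science and Engineering, Tianjin University of Technology; School of Information Science and Technology, University of Science and Technology of China; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025; Project Page: this https URL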

Abstract:Continual learning (CL) aims to enable learning systems to acquire new knowledge constantly without forgetting previously learned information. CL faces the challenge of mitigating catastrophic forgetting while maintaining interpretability across tasks. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance. However, as new information is introduced, the interpretability of the learning process becomes crucial for understanding the evolving decision-making process, yet it is rarely explored. In this paper, we introduce a novel framework that integrates language-guided Concept Bottleneck Models (CBMs) to address both challenges. Our approach leverages the Concept Bottleneck Layer, aligning semantic consistency with CLIP models to learn human-understandable concepts that can generalize across tasks. By focusing on interpretable concepts, our method not only enhances the model's ability to retain knowledge over time but also provides transparent decision-making insights. We demonstrate the effectiveness of our approach by achieving superior performance on several datasets, outperforming state-of-the-art methods with an improvement of up to 3.06% in final average accuracy on ImageNet-subset. Additionally, we offer concept visualizations for model predictions, further advancing the understanding of interpretable continual learning.
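
As a rough illustration of what a language-guided concept bottleneck does, the sketch below scores an image embedding against a few concept text embeddings by cosine similarity and classifies from those scores alone. Everything here is a toy assumption: the two-dimensional vectors stand in for CLIP-style embeddings, the concept and class names are invented, and the hand-set weights replace whatever the paper actually learns.

```python
import math

# Toy concept bottleneck: image embedding -> concept scores -> class.
# The 2-D vectors stand in for CLIP-style embeddings; concept names,
# class names and weights are invented for illustration.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def concept_scores(image_emb, concept_embs):
    """One human-readable score per concept (the bottleneck)."""
    return [cosine(image_emb, c) for c in concept_embs]

def classify(scores, class_weights):
    """Linear layer on concept scores; returns the index of the top class."""
    logits = [sum(w * s for w, s in zip(row, scores)) for row in class_weights]
    return max(range(len(logits)), key=logits.__getitem__)

concepts = ["striped", "spotted"]
concept_embs = [[1.0, 0.0], [0.0, 1.0]]
classes = ["zebra", "leopard"]
class_weights = [[1.0, 0.0], [0.0, 1.0]]

scores = concept_scores([0.9, 0.1], concept_embs)
predicted = classes[classify(scores, class_weights)]
print(dict(zip(concepts, scores)), predicted)
```

Because the prediction depends only on the named concept scores, each decision can be read off as "high on striped, low on spotted", which is the interpretability property the abstract emphasizes.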

[CV-153] AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos CVPR2025

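[Quick read]: This paper addresses a core computer vision challenge: estimating camera motion and intrinsics from casually captured dynamic videos. Traditional bundle-adjustment based methods such as SfM and SLAM struggle to run reliably on arbitrary data, while SfM methods specialized for dynamic scenes either require known intrinsics or need expensive test-time optimization, and often fall short in performance. Recent data-driven methods improve on this but remain insufficiently robust to dynamic objects and depend on labeled data for supervised training.

The key to the proposed solution is AnyCam, a fast transformer model that estimates camera poses and intrinsics directly from a dynamic video sequence in a feed-forward fashion. The core idea is that such a network can learn priors over realistic camera poses. To scale up training, it relies on an uncertainty-based loss and pre-trained depth and flow networks instead of conventional motion or trajectory supervision, which makes it possible to use large unlabelled video datasets from platforms such as YouTube. A lightweight trajectory refinement step ensures that the predicted trajectory does not accumulate drift over time. Experiments show that AnyCam delivers accurate camera poses and intrinsics on established datasets and is significantly faster than existing methods for SfM in dynamic settings; by combining camera information, uncertainty, and depth, it can also produce high-quality 4D point clouds.

Link: https://arxiv.org/abs/2503.23282
Authors: Felix Wimbauer, Weirong Chen, Dominik Muhle, Christian Rupprecht, Daniel Cremers
Affiliations: Technical University of Munich; MCML; University of Oxford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 - For more details and code, please check out our project page under this https URL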

Abstract:Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they are still 1) not robust towards dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera poses. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabelled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. Furthermore, even with trajectory refinement, AnyCam is significantly faster than existing works for SfM in dynamic settings. Finally, by combining camera information, uncertainty, and depth, our model can produce high-quality 4D pointclouds.

[CV-154] Improved Ear Verification with Vision Transformers and Overlapping Patches

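[Quick read]: This paper addresses the efficiency of Vision Transformers (ViTs) in ear recognition, which has been hampered by a lack of attention to overlapping patches, even though overlapping patches are crucial for capturing intricate ear features. The key solution is an overlapping patch selection strategy, evaluated with ViT-Tiny (ViT-T), ViT-Small (ViT-S), ViT-Base (ViT-B), and ViT-Large (ViT-L) configurations on several datasets (OPIB, AWE, WPUT, and EarVN1.0). The overlapping patch strategy improves performance markedly, giving better results in 44 of 48 experiments, with gains of up to 10% on EarVN1.0. The ViT-T model outperforms the larger models on the AWE, WPUT, and EarVN1.0 datasets, with its best performance at a patch size of 28x28 and a stride of 14 pixels. The study confirms that transformer architectures with overlapping patch selection are an efficient, high-performing option for ear-based biometric recognition in verification scenarios.

Link: https://arxiv.org/abs/2503.23275
Authors: Deeksha Arun, Kagan Ozturk, Kevin W. Bowyer, Patrick Flynn
Affiliations: Department of Computer Science and Engineering, University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: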

Abstract:Ear recognition has emerged as a promising biometric modality due to the relative stability in appearance during adulthood. Although Vision Transformers (ViTs) have been widely used in image recognition tasks, their efficiency in ear recognition has been hampered by a lack of attention to overlapping patches, which is crucial for capturing intricate ear features. In this study, we evaluate ViT-Tiny (ViT-T), ViT-Small (ViT-S), ViT-Base (ViT-B) and ViT-Large (ViT-L) configurations on a diverse set of datasets (OPIB, AWE, WPUT, and EarVN1.0), using an overlapping patch selection strategy. Results demonstrate the critical importance of overlapping patches, yielding superior performance in 44 of 48 experiments in a structured study. Moreover, upon comparing the results of the overlapping patches with the non-overlapping configurations, the increase is significant, reaching up to 10% for the EarVN1.0 dataset. In terms of model performance, the ViT-T model consistently outperformed the ViT-S, ViT-B, and ViT-L models on the AWE, WPUT, and EarVN1.0 datasets. The highest scores were achieved in a configuration with a patch size of 28x28 and a stride of 14 pixels. This patch-stride configuration represents 25% of the normalized image area (112x112 pixels) for the patch size and 12.5% of the row or column size for the stride. This study confirms that transformer architectures with overlapping patch selection can serve as an efficient and high-performing option for ear-based biometric recognition tasks in verification scenarios.
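
The effect of the reported patch and stride sizes can be checked with a little arithmetic. The sketch below assumes patches are extracted by a sliding window without padding, which the abstract does not state; under that assumption a 28-pixel patch with a 14-pixel stride on a 112x112 image yields many more tokens than the non-overlapping layout.

```python
# Patch-count arithmetic for a 112x112 image, assuming a sliding window
# with no padding (an assumption; the paper's tokenizer may differ).

def patches_per_side(image_size, patch_size, stride):
    """Number of patch positions along one axis."""
    return (image_size - patch_size) // stride + 1

image_size, patch_size = 112, 28

overlap_side = patches_per_side(image_size, patch_size, stride=14)
plain_side = patches_per_side(image_size, patch_size, stride=28)

side_fraction = patch_size / image_size     # patch side vs. image side
stride_fraction = 14 / image_size           # stride vs. image side
area_fraction = side_fraction ** 2          # patch area vs. image area

print(overlap_side ** 2, plain_side ** 2)
print(side_fraction, stride_fraction, area_fraction)
```

Under these assumptions the overlapping layout gives 7 x 7 = 49 patches against 4 x 4 = 16 without overlap. The patch side is 25% and the stride 12.5% of the 112-pixel image side, while a single patch covers 6.25% of the image area.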

[CV-155] OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition

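[Quick read]: This paper addresses the limited performance of human action recognition in low-light environments, where existing methods fail to fully exploit brightness information during training. To overcome this, it proposes OwlSight, a biomimetic-inspired framework whose key idea is whole-stage illumination enhancement that interacts with action classification to achieve accurate dark-video human action recognition. Specifically, OwlSight uses a Time-Consistency Module (TCM) to capture shallow spatiotemporal features while maintaining temporal coherence, followed by a Luminance Adaptation Module (LAM) that dynamically adjusts brightness. A Reflect Augmentation Module (RAM) is designed to maximize illumination utilization and simultaneously enhance action recognition through two interactive paths. The paper also builds Dark-101, a large-scale dataset of 18,310 dark videos across 101 action categories, which significantly exceeds existing datasets. Experiments show that OwlSight achieves state-of-the-art performance on four low-light action recognition benchmarks.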

Link: https://arxiv.org/abs/2503.23266
Authors: Shihao Cheng, Jinlu Zhang, Yue Liu, Zhigang Tu
Affiliations: Wuhan University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human action recognition in low-light environments is crucial for various real-world applications. However, the existing approaches overlook the full utilization of brightness information throughout the training phase, leading to suboptimal performance. To address this limitation, we propose OwlSight, a biomimetic-inspired framework with whole-stage illumination enhancement to interact with action classification for accurate dark video human action recognition. Specifically, OwlSight incorporates a Time-Consistency Module (TCM) to capture shallow spatiotemporal features meanwhile maintaining temporal coherence, which are then processed by a Luminance Adaptation Module (LAM) to dynamically adjust the brightness based on the input luminance distribution. Furthermore, a Reflect Augmentation Module (RAM) is presented to maximize illumination utilization and simultaneously enhance action recognition via two interactive paths. Additionally, we build Dark-101, a large-scale dataset comprising 18,310 dark videos across 101 action categories, significantly surpassing existing datasets (e.g., ARID1.5 and Dark-48) in scale and diversity. Extensive experiments demonstrate that the proposed OwlSight achieves state-of-the-art performance across four low-light action recognition benchmarks. Notably, it outperforms previous best approaches by 5.36% on ARID1.5 and 1.72% on Dark-101, highlighting its effectiveness in challenging dark environments.

[CV-156] FIESTA: Fisher Information-based Efficient Selective Test-time Adaptation

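[Quick read]: This paper addresses facial expression recognition in unconstrained, "in-the-wild" environments, where significant domain shift exists between training and test distributions. Existing test-time adaptation (TTA) methods typically rely on manually selecting which model parameters to update, which can lead to suboptimal adaptation and high computational cost. The paper proposes a novel Fisher-information-based selective adaptation framework that dynamically identifies and updates only the most critical parameters, with importance quantified by Fisher information. The key is combining this parameter-importance estimate with temporal consistency constraints, which makes the method particularly suited to video-based facial expression recognition. On the AffWild2 benchmark, the method outperforms existing TTA methods and achieves a 7.7% improvement in F1 score over the base model while adapting only 22,000 parameters, more than 20 times fewer than comparable methods. Ablation studies further show that parameter importance can be estimated from very little data: sampling just 1-3 frames is enough for substantial gains. The method improves recognition accuracy and sharply reduces computational overhead, making test-time adaptation more practical for real-world affective computing applications.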

Link: https://arxiv.org/abs/2503.23257
Authors: Mohammadmahdi Honarmand, Onur Cezmi Mutlu, Parnian Azizian, Saimourya Surabhi, Dennis P. Wall
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Robust facial expression recognition in unconstrained, "in-the-wild" environments remains challenging due to significant domain shifts between training and testing distributions. Test-time adaptation (TTA) offers a promising solution by adapting pre-trained models during inference without requiring labeled test data. However, existing TTA approaches typically rely on manually selecting which parameters to update, potentially leading to suboptimal adaptation and high computational costs. This paper introduces a novel Fisher-driven selective adaptation framework that dynamically identifies and updates only the most critical model parameters based on their importance as quantified by Fisher information. By integrating this principled parameter selection approach with temporal consistency constraints, our method enables efficient and effective adaptation specifically tailored for video-based facial expression recognition. Experiments on the challenging AffWild2 benchmark demonstrate that our approach significantly outperforms existing TTA methods, achieving a 7.7% improvement in F1 score over the base model while adapting only 22,000 parameters, more than 20 times fewer than comparable methods. Our ablation studies further reveal that parameter importance can be effectively estimated from minimal data, with sampling just 1-3 frames sufficient for substantial performance gains. The proposed approach not only enhances recognition accuracy but also dramatically reduces computational overhead, making test-time adaptation more practical for real-world affective computing applications.
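To make the idea of Fisher-based parameter selection concrete, here is a minimal sketch of one common approximation: the empirical diagonal Fisher information, taken as the mean squared gradient per parameter, followed by a top-k selection. This is an illustration under stated assumptions, not the paper's implementation; the function names, the scalar "parameters", and the toy gradient values are all hypothetical, and the abstract does not say at which granularity the authors select parameters.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared gradients for each parameter.

    per_sample_grads is a list of dicts mapping parameter name -> gradient,
    one dict per sampled frame (toy scalars here, tensors in practice).
    """
    n = len(per_sample_grads)
    names = per_sample_grads[0].keys()
    return {name: sum(g[name] ** 2 for g in per_sample_grads) / n for name in names}


def select_top_k(fisher, k):
    """Names of the k parameters with the largest Fisher scores."""
    return sorted(fisher, key=fisher.get, reverse=True)[:k]


# Hypothetical gradients from two sampled frames.
grads = [
    {"w1": 0.1, "w2": -2.0, "w3": 0.5},
    {"w1": -0.2, "w2": 1.0, "w3": 0.4},
]
fisher = diagonal_fisher(grads)
selected = select_top_k(fisher, k=1)
print(selected)  # ['w2']
```

Only the selected parameters would then be unfrozen during adaptation. Because the score is an average over a handful of samples, it is at least plausible that a few frames suffice, which matches the 1-3 frame ablation mentioned in the abstract.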

[CV-157] Context in object detection: a systematic literature review

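[Quick read]: This paper examines how contextual information can improve the precision and effectiveness of object detection. Its focus is on exploring different types of context and how they can be integrated into object detection frameworks to overcome the difficulty of recognizing isolated objects. Surveying more than 265 publications, the study analyzes the role of context from several perspectives across general object detection, video object detection, small object detection, camouflaged object detection, and zero-shot, one-shot, and few-shot object detection. It also compares the most recent context-based detection approaches, addresses research questions, and identifies gaps for further study.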

Link: https://arxiv.org/abs/2503.23249
Authors: Mahtab Jamali, Paul Davidsson, Reza Khoshkangini, Martin Georg Ljungqvist, Radu-Casian Mihailescu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Artificial Intelligence Review Journal

Abstract:Context is an important factor in computer vision as it offers valuable information to clarify and analyze visual data. Utilizing the contextual information inherent in an image or a video can improve the precision and effectiveness of object detectors. For example, where recognizing an isolated object might be challenging, context information can improve comprehension of the scene. This study explores the impact of various context-based approaches to object detection. Initially, we investigate the role of context in object detection and survey it from several perspectives. We then review and discuss the most recent context-based object detection approaches and compare them. Finally, we conclude by addressing research questions and identifying gaps for further studies. More than 265 publications are included in this survey, covering different aspects of context in different categories of object detection, including general object detection, video object detection, small object detection, camouflaged object detection, zero-shot, one-shot, and few-shot object detection. This literature review presents a comprehensive overview of the latest advancements in context-based object detection, providing valuable contributions such as a thorough understanding of contextual information and effective methods for integrating various context types into object detection, thus benefiting researchers.

[CV-158] Geometry in Style: 3D Stylization via Surface Normal Deformation CVPR2025

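[Quick read]: This paper addresses identity-preserving mesh stylization. Existing methods either stick to the original shape through overly restrictive deformations (such as bump maps) or significantly modify the input shape with expressive deformations that may introduce artifacts or alter the identity of the source shape. The key solution is a new method, Geometry in Style, which represents a deformation of a triangle mesh as a target normal vector for each vertex neighborhood. These deformations are realized through a novel differentiable As-Rigid-As-Possible (dARAP) layer, a neural-network-ready adaptation of the classical ARAP algorithm that solves for per-vertex rotations and deformed vertices. The layer is paired with a visual loss from a text-to-image model that drives deformations toward style prompts, producing stylizations that are expressive yet preserve the shape's identity.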

Link: https://arxiv.org/abs/2503.23241
Authors: Nam Anh Dinh, Itai Lang, Hyunwoo Kim, Oded Stein, Rana Hanocka
Affiliations: University of Chicago; University of Southern California
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Our project page is at this https URL

Abstract:We present Geometry in Style, a new method for identity-preserving mesh stylization. Existing techniques either adhere to the original shape through overly restrictive deformations such as bump maps or significantly modify the input shape using expressive deformations that may introduce artifacts or alter the identity of the source shape. In contrast, we represent a deformation of a triangle mesh as a target normal vector for each vertex neighborhood. The deformations we recover from target normals are expressive enough to enable detailed stylizations yet restrictive enough to preserve the shape’s identity. We achieve such deformations using our novel differentiable As-Rigid-As-Possible (dARAP) layer, a neural-network-ready adaptation of the classical ARAP algorithm which we use to solve for per-vertex rotations and deformed vertices. As a differentiable layer, dARAP is paired with a visual loss from a text-to-image model to drive deformations toward style prompts, altogether giving us Geometry in Style. Our project page is at this https URL.

[CV-159] Z-SASLM: Zero-Shot Style-Aligned SLI Blending Latent Manipulation CVPR2025

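[Quick read]: This paper addresses the limitations of existing multi-style blending methods when dealing with a non-linear latent space. Conventional methods rely on linear blending and assume a flat latent space, which leads to suboptimal results when integrating multiple reference styles. The key solution is Z-SASLM, a Zero-Shot Style-Aligned SLI (Spherical Linear Interpolation) Blending Latent Manipulation pipeline. By combining weighted style representations through interpolation along the geodesic on the hypersphere, it preserves the intrinsic structure of the latent space and achieves high-fidelity, coherent blending of diverse styles without fine-tuning. The paper also proposes a new quantitative metric, Weighted Multi-Style DINO ViT-B/8, to evaluate the consistency of the blended styles, and validates the approach experimentally, including in a multi-modal content fusion setting.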

Link: https://arxiv.org/abs/2503.23234
Authors: Alessio Borgi, Luca Maiano, Irene Amerini
Affiliations: Sapienza University of Rome
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the CVPR 2025 Workshop AI for Creative Visual Content Generation Editing and Understanding

Abstract:We introduce Z-SASLM, a Zero-Shot Style-Aligned SLI (Spherical Linear Interpolation) Blending Latent Manipulation pipeline that overcomes the limitations of current multi-style blending methods. Conventional approaches rely on linear blending, assuming a flat latent space leading to suboptimal results when integrating multiple reference styles. In contrast, our framework leverages the non-linear geometry of the latent space by using SLI Blending to combine weighted style representations. By interpolating along the geodesic on the hypersphere, Z-SASLM preserves the intrinsic structure of the latent space, ensuring high-fidelity and coherent blending of diverse styles - all without the need for fine-tuning. We further propose a new metric, Weighted Multi-Style DINO ViT-B/8, designed to quantitatively evaluate the consistency of the blended styles. While our primary focus is on the theoretical and practical advantages of SLI Blending for style manipulation, we also demonstrate its effectiveness in a multi-modal content fusion setting through comprehensive experimental studies. Experimental results show that Z-SASLM achieves enhanced and robust style alignment. The implementation code can be found at: this https URL.
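The difference between linear blending and interpolation along the geodesic can be illustrated with the classical two-vector spherical linear interpolation formula. This is a minimal sketch, not the Z-SASLM pipeline: the paper blends several weighted style representations, whereas the hypothetical `slerp` helper below handles only two unit vectors and assumes they are not antipodal.

```python
import math


def slerp(a, b, t):
    """Spherical linear interpolation between unit vectors a and b at fraction t."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)           # angle between the two vectors
    if omega < 1e-8:                 # nearly parallel: nothing to interpolate
        return list(a)
    s = math.sin(omega)
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


a, b = [1.0, 0.0], [0.0, 1.0]
mid_slerp = slerp(a, b, 0.5)
mid_lerp = [0.5 * x + 0.5 * y for x, y in zip(a, b)]

# The spherical midpoint stays on the unit sphere; the linear midpoint does not.
print(round(norm(mid_slerp), 4), round(norm(mid_lerp), 4))  # 1.0 0.7071
```

The shrinking norm of the linear midpoint is one way to see why a flat-space assumption can move blended latents off the manifold where the model's representations live.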

[CV-160] Synthetic Art Generation and DeepFake Detection A Study on Jamini Roy Inspired Dataset

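[Quick read]: This paper addresses the challenge of identifying synthetic artworks, particularly when deepfakes are of high quality and tied to a specific cultural context. The key step is fine-tuning Stable Diffusion 3 and combining it with techniques such as ControlNet and IPAdapter to generate a new dataset containing both real and AI-generated artworks, which provides a basis for detailed analysis of what generative models can produce. The paper applies qualitative and quantitative methods, including Fourier domain assessments and autocorrelation metrics, to uncover subtle differences between synthetic images and authentic pieces. It notes that existing deepfake detection techniques struggle with high-fidelity, culturally specific fakes, which exposes an important gap in current detection technology. The work both illustrates the growing complexity of generative models and lays a foundation for future research on detecting synthetic art.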

Link: https://arxiv.org/abs/2503.23226
Authors: Kushal Agrawal, Romi Banerjee
Affiliations: IIT Jodhpur
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures, 6 tables

Abstract:The intersection of generative AI and art is a fascinating area that brings both exciting opportunities and significant challenges, especially when it comes to identifying synthetic artworks. This study takes a unique approach by examining diffusion-based generative models in the context of Indian art, specifically focusing on the distinctive style of Jamini Roy. To explore this, we fine-tuned Stable Diffusion 3 and used techniques like ControlNet and IPAdapter to generate realistic images. This allowed us to create a new dataset that includes both real and AI-generated artworks, which is essential for a detailed analysis of what these models can produce. We employed various qualitative and quantitative methods, such as Fourier domain assessments and autocorrelation metrics, to uncover subtle differences between synthetic images and authentic pieces. A key takeaway from recent research is that existing methods for detecting deepfakes face considerable challenges, especially when the deepfakes are of high quality and tailored to specific cultural contexts. This highlights a critical gap in current detection technologies, particularly in light of the challenges identified above, where high-quality and culturally specific deepfakes are difficult to detect. This work not only sheds light on the increasing complexity of generative models but also sets a crucial foundation for future research aimed at effective detection of synthetic art.

[CV-161] Large Self-Supervised Models Bridge the Gap in Domain Adaptive Object Detection CVPR2025

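[Quick read]: This paper addresses the coupling between label generation and model training in domain adaptive object detection (DAOD). Current state-of-the-art methods rely on Mean Teacher self-labelling, in which a teacher derived as an exponential moving average of the student generates target-domain labels that are then used to improve both models. The authors argue that this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can produce accurate target-domain labels and start the positive feedback loop, while a large pretrained network can likely produce much better ones. The paper therefore proposes DINO Teacher. Its key idea is to use the strong generalization of a frozen vision foundation model (DINOv2) to train a labeller that generates more accurate target-domain labels, and to align the student's source and target image patch features with those of a DINO encoder, pulling both representations toward the generalizable DINO representation. The method achieves state-of-the-art performance on multiple DAOD datasets.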

Link: https://arxiv.org/abs/2503.23220
Authors: Marc-Antoine Lavoie, Anas Mahmoud, Steven L. Waslander
Affiliations: University of Toronto Robotics Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages (8 main), 5 figures, accepted at CVPR 2025

Abstract:The current state-of-the-art methods in domain adaptive object detection (DAOD) use Mean Teacher self-labelling, where a teacher model, directly derived as an exponential moving average of the student model, is used to generate labels on the target domain which are then used to improve both models in a positive loop. This couples learning and generating labels on the target domain, and other recent works also leverage the generated labels to add additional domain alignment losses. We believe this coupling is brittle and excessively constrained: there is no guarantee that a student trained only on source data can generate accurate target domain labels and initiate the positive feedback loop, and much better target domain labels can likely be generated by using a large pretrained network that has been exposed to much more data. Vision foundational models are exactly such models, and they have shown impressive task generalization capabilities even when frozen. We want to leverage these models for DAOD and introduce DINO Teacher, which consists of two components. First, we train a new labeller on source data only using a large frozen DINOv2 backbone and show it generates more accurate labels than Mean Teacher. Next, we align the student’s source and target image patch features with those from a DINO encoder, driving source and target representations closer to the generalizable DINO representation. We obtain state-of-the-art performance on multiple DAOD datasets. Code available at this https URL
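The Mean Teacher coupling that the abstract criticizes comes from the teacher being an exponential moving average (EMA) of the student. A minimal sketch of that update follows; the dictionary-of-scalars weights, the `ema_update` name, and the decay values are illustrative assumptions (real implementations apply the same rule to every tensor in the model, typically with a decay close to 1).

```python
def ema_update(teacher, student, alpha=0.999):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}


# Toy weights: the teacher starts at 0 and tracks a student fixed at 1.
teacher = {"w": 0.0}
student = {"w": 1.0}

# A small decay is used here only so the drift is visible in three steps.
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.5)

print(teacher["w"])  # 0.875
```

Because the teacher is a smoothed copy of the student, its pseudo-labels can be no better than what the student lineage has learned; DINO Teacher breaks this dependence by taking labels from a separately trained labeller built on a frozen DINOv2 backbone.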

[CV-162] Action Recognition in Real-World Ambient Assisted Living Environment

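[Quick read]: This paper addresses the challenges of action recognition in practical Ambient Assisted Living (AAL) applications, including occlusions, noisy data, and the need for real-time performance. Although progress has been made on accuracy, robustness, and computational efficiency, balancing all three remains difficult. The paper proposes the Robust and Efficient Temporal Convolution network (RE-TCN), which combines three core elements: Adaptive Temporal Weighting (ATW), Depthwise Separable Convolutions (DSC), and data augmentation techniques. These are designed to improve accuracy and robustness to noise and occlusion in real-world AAL settings while keeping computation efficient. RE-TCN outperforms existing models on four benchmark datasets: NTU RGB+D 60, Northwestern-UCLA, SHREC'17, and DHG-14/28.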

Link: https://arxiv.org/abs/2503.23214
Authors: Vincent Gbouna Zakka, Zhuangzhuang Dai, Luis J. Manso
Affiliations: School of Computer Science and Digital Technologies, Aston University, Birmingham, B4 7ET, United Kingdom
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The growing ageing population and their preference to maintain independence by living in their own homes require proactive strategies to ensure safety and support. Ambient Assisted Living (AAL) technologies have emerged to facilitate ageing in place by offering continuous monitoring and assistance within the home. Within AAL technologies, action recognition plays a crucial role in interpreting human activities and detecting incidents like falls, mobility decline, or unusual behaviours that may signal worsening health conditions. However, action recognition in practical AAL applications presents challenges, including occlusions, noisy data, and the need for real-time performance. While advancements have been made in accuracy, robustness to noise, and computation efficiency, achieving a balance among them all remains a challenge. To address this challenge, this paper introduces the Robust and Efficient Temporal Convolution network (RE-TCN), which comprises three main elements: Adaptive Temporal Weighting (ATW), Depthwise Separable Convolutions (DSC), and data augmentation techniques. These elements aim to enhance the model’s accuracy, robustness against noise and occlusion, and computational efficiency within real-world AAL contexts. RE-TCN outperforms existing models in terms of accuracy, noise and occlusion robustness, and has been validated on four benchmark datasets: NTU RGB+D 60, Northwestern-UCLA, SHREC’17, and DHG-14/28. The code is publicly available at: this https URL
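The efficiency argument for Depthwise Separable Convolutions (DSC) can be made concrete by counting weights. The sketch below uses the standard formulas and ignores biases; the channel counts and the 9x1 temporal kernel are hypothetical values chosen for illustration and are not taken from RE-TCN.

```python
def standard_conv_params(c_in, c_out, k_h, k_w):
    """Weights in a standard convolution: every output channel sees every input channel."""
    return c_in * c_out * k_h * k_w


def depthwise_separable_params(c_in, c_out, k_h, k_w):
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = c_in * k_h * k_w   # one spatial filter per input channel
    pointwise = c_in * c_out       # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 64 channels with a 9x1 temporal kernel.
std = standard_conv_params(64, 64, 9, 1)
dsc = depthwise_separable_params(64, 64, 9, 1)

print(std, dsc)  # 36864 4672
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, which is the kind of saving that makes real-time inference on modest hardware more plausible.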

[CV-163] Convolutional Neural Networks Can (Meta-)Learn the Same-Different Relation

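[Quick read]: This paper addresses the difficulty convolutional neural networks (CNNs) have in generalizing the same-different relation in relational visual tasks. With conventional training, CNNs can learn this relation in specific settings but tend to generalize poorly to other instances of it. The key solution is meta-learning, which explicitly encourages abstraction and generalization across tasks and allows the same CNN architectures to learn and generalize the relation successfully.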

Link: https://arxiv.org/abs/2503.23212
Authors: Max Gupta, Sunayana Rane, R. Thomas McCoy, Thomas L. Griffiths
Affiliations: Princeton University; Yale University; Wu Tsai Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:While convolutional neural networks (CNNs) have come to match and exceed human performance in many settings, the tasks these models optimize for are largely constrained to the level of individual objects, such as classification and captioning. Humans remain vastly superior to CNNs in visual tasks involving relations, including the ability to identify two objects as 'same' or 'different'. A number of studies have shown that while CNNs can be coaxed into learning the same-different relation in some settings, they tend to generalize poorly to other instances of this relation. In this work we show that the same CNN architectures that fail to generalize the same-different relation with conventional training are able to succeed when trained via meta-learning, which explicitly encourages abstraction and generalization across tasks.

[CV-164] A GAN-Enhanced Deep Learning Framework for Rooftop Detection from Historical Aerial Imagery

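[Quick read]: This paper addresses accurate rooftop detection from historical aerial imagery, a task that is vital for studying long-term urban development and human settlement patterns. Black-and-white analog photographs pose significant challenges for modern object detection frameworks because of their limited spatial resolution, lack of color information, and archival degradation. To overcome these limits, the paper proposes a two-stage image enhancement pipeline based on Generative Adversarial Networks (GANs): image colorization with DeOldify, followed by super-resolution enhancement with Real-ESRGAN. The enhanced images are used to train and evaluate rooftop detection models, including Faster R-CNN, DETReg, and YOLOv11n. Combining colorization with super-resolution substantially improves detection: YOLOv11n reaches a mean Average Precision (mAP) above 85%, an improvement of about 40% over the original black-and-white images and about 20% over images enhanced by colorization alone. The method bridges the gap between archival imagery and contemporary deep learning techniques and enables more reliable extraction of building footprints from historical aerial photographs.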

Link: https://arxiv.org/abs/2503.23200
Authors: Pengyu Chen, Sicheng Wang, Cuizhen Wang, Senrong Wang, Beiao Huang, Lu Huang, Zhe Zang
Affiliations: University of South Carolina; Wuhan University of Technology; Wuhan University; Ocean University of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate rooftop detection from historical aerial imagery is vital for examining long-term urban development and human settlement patterns. However, black-and-white analog photographs pose significant challenges for modern object detection frameworks due to their limited spatial resolution, lack of color information, and archival degradation. To address these limitations, this study introduces a two-stage image enhancement pipeline based on Generative Adversarial Networks (GANs): image colorization using DeOldify, followed by super-resolution enhancement with Real-ESRGAN. The enhanced images were then used to train and evaluate rooftop detection models, including Faster R-CNN, DETReg, and YOLOv11n. Results show that combining colorization with super-resolution substantially improves detection performance, with YOLOv11n achieving a mean Average Precision (mAP) exceeding 85%. This reflects an improvement of approximately 40% over original black-and-white images and 20% over images enhanced through colorization alone. The proposed method effectively bridges the gap between archival imagery and contemporary deep learning techniques, enabling more reliable extraction of building footprints from historical aerial photographs.

[CV-165] Real-time Video Prediction With Fast Video Interpolation Model and Prediction Training ICIP2024

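[Quick read]: This paper addresses the significant impact of transmission latency on users' quality of experience in real-time interaction. It proposes Intermediate Feature Refinement Video Prediction (IFRVP) as a step toward zero-latency interaction over networks. The solution has two key parts. First, it proposes three training methods for video prediction that extend frame interpolation models, using a simple convolution-only frame interpolation network based on IFRNet. Second, it introduces ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Together, these achieve the best trade-off between prediction accuracy and computational speed among existing video prediction methods.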

Link: https://arxiv.org/abs/2503.23185
Authors: Shota Hirose, Kazuki Kotoyori, Kasidis Arunruangsirilert, Fangzheng Lin, Heming Sun, Jiro Katto
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICIP 2024

Abstract:Transmission latency significantly affects users’ quality of experience in real-time interaction and actuation. As latency is principally inevitable, video prediction can be utilized to mitigate the latency and ultimately enable zero-latency transmission. However, most of the existing video prediction methods are computationally expensive and impractical for real-time applications. In this work, we therefore propose real-time video prediction towards the zero-latency interaction over networks, called IFRVP (Intermediate Feature Refinement Video Prediction). Firstly, we propose three training methods for video prediction that extend frame interpolation models, where we utilize a simple convolution-only frame interpolation network based on IFRNet. Secondly, we introduce ELAN-based residual blocks into the prediction models to improve both inference speed and accuracy. Our evaluations show that our proposed models perform efficiently and achieve the best trade-off between prediction accuracy and computational speed among the existing video prediction methods. A demonstration movie is also provided at this http URL.

[CV-166] Enhancing Weakly Supervised Video Grounding via Diverse Inference Strategies for Boundary and Prediction Selection

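[Quick read]: This paper addresses two key issues in weakly supervised video grounding. (1) Boundary prediction is inaccurate: existing methods set boundaries at a fixed offset of half a standard deviation on either side of a Gaussian mean, which may not capture the optimal temporal boundaries. (2) At inference, existing methods select the top-1 prediction based largely on intersections with other proposals, without accounting for the varying quality of each proposal. The key solution is to introduce new boundary prediction methods that capture diverse boundaries from multiple Gaussians, together with new selection strategies that take proposal quality into account. Experiments on the ActivityNet Captions and Charades-STA datasets show that the proposed inference strategies improve performance without requiring additional training.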

Link: https://arxiv.org/abs/2503.23181
Authors: Sunoh Kim, Daeho Um
Affiliations: Dankook University; Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Weakly supervised video grounding aims to localize temporal boundaries relevant to a given query without explicit ground-truth temporal boundaries. While existing methods primarily use Gaussian-based proposals, they overlook the importance of (1) boundary prediction and (2) top-1 prediction selection during inference. In their boundary prediction, boundaries are simply set at half a standard deviation away from a Gaussian mean on both sides, which may not accurately capture the optimal boundaries. In the top-1 prediction process, these existing methods rely heavily on intersections with other proposals, without considering the varying quality of each proposal. To address these issues, we explore various inference strategies by introducing (1) novel boundary prediction methods to capture diverse boundaries from multiple Gaussians and (2) new selection methods that take proposal quality into account. Extensive experiments on the ActivityNet Captions and Charades-STA datasets validate the effectiveness of our inference strategies, demonstrating performance improvements without requiring additional training.
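The baseline boundary rule described in the abstract, boundaries placed half a standard deviation on either side of a Gaussian mean, can be sketched as follows. The sketch assumes the Gaussian center and width are expressed as fractions of the video length and that boundaries are clipped to [0, 1]; the function name, the `scale` parameter, and the example values are hypothetical, and the paper's improved multi-Gaussian strategies are not reproduced here.

```python
def gaussian_boundaries(center, width, scale=0.5):
    """Start/end at center -/+ scale * width, clipped to the normalized range [0, 1]."""
    start = max(0.0, center - scale * width)
    end = min(1.0, center + scale * width)
    return start, end


# Hypothetical proposal: mean at 40% of the video, standard deviation of 20%.
start, end = gaussian_boundaries(center=0.4, width=0.2)

# Convert to seconds for a hypothetical 60-second clip.
duration_s = 60.0
segment_s = (start * duration_s, end * duration_s)
print(round(segment_s[0], 2), round(segment_s[1], 2))  # 18.0 30.0
```

Since `scale` is fixed at 0.5 regardless of the query, every proposal with the same center and width maps to the same segment, which is the rigidity the paper's alternative inference strategies aim to relax.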

[CV-167] Intelligent Bear Prevention System Based on Computer Vision: An Approach to Reduce Human-Bear Conflicts in the Tibetan Plateau Area China

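[Quick read]: This paper addresses conflicts between humans and bears on the Tibetan Plateau, which pose substantial threats to local communities and hinder wildlife conservation. It proposes a novel strategy that combines computer vision with Internet of Things (IoT) technologies to alleviate the problem. The key to the solution, designed for the plateau's harsh environment, is pairing the K210 development board with the YOLO object detection framework and a tailored bear-deterrent mechanism, achieving low-power, real-time bear identification and deterrence. In experimental evaluation the model reaches a mean Average Precision (mAP) of 91.4%, indicating high precision and dependability. By using energy-efficient components, the system overcomes the challenges of remote, off-grid areas and can operate without interruption in secluded locations. The study offers a viable, eco-friendly, and expandable solution for mitigating human-bear conflicts, improving human safety and promoting bear conservation in isolated areas such as Yushu, China.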

Link: https://arxiv.org/abs/2503.23178
Authors: Pengyu Chen, Teng Fei, Yunyan Du, Jiawei Yi, Yi Li, John A. Kupfer
Affiliations: School of Resources and Environmental Sciences, Wuhan University, China; Department of Geography, University of South Carolina, USA; State Key Laboratory of Resources and Environmental Information System, Chinese Academy of Sciences, China; Institute of Zoology, Chinese Academy of Sciences, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Conflicts between humans and bears on the Tibetan Plateau present substantial threats to local communities and hinder wildlife preservation initiatives. This research introduces a novel strategy that incorporates computer vision alongside Internet of Things (IoT) technologies to alleviate these issues. Tailored specifically for the harsh environment of the Tibetan Plateau, the approach utilizes the K210 development board paired with the YOLO object detection framework along with a tailored bear-deterrent mechanism, offering minimal energy usage and real-time efficiency in bear identification and deterrence. The model’s performance was evaluated experimentally, achieving a mean Average Precision (mAP) of 91.4%, demonstrating excellent precision and dependability. By integrating energy-efficient components, the proposed system effectively surpasses the challenges of remote and off-grid environments, ensuring uninterrupted operation in secluded locations. This study provides a viable, eco-friendly, and expandable solution to mitigate human-bear conflicts, thereby improving human safety and promoting bear conservation in isolated areas like Yushu, China.

[CV-168] NeuralGS: Bridging Neural Fields and 3D Gaussian Splatting for Compact 3D Representations

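[Quick read]: This paper addresses the high storage and transmission costs of scene representations based on 3D Gaussian Splatting (3DGS). Existing compression methods mainly target Scaffold-GS, which requires an additional voxel structure and complex encoding and quantization strategies. The paper proposes NeuralGS, a simple yet effective method whose key idea is to adopt a neural field representation: Multi-Layer Perceptrons (MLPs) directly encode the attributes of the 3D Gaussians in the original 3DGS, with no voxel structure or complex quantization. Specifically, NeuralGS uses a clustering strategy to assign the 3D Gaussians to clusters and fits a different tiny MLP to each cluster, using the Gaussians' importance scores as fitting weights. Experiments on multiple datasets show an average 45-times reduction in model size without harming visual quality, demonstrating the potential of compressing the original 3DGS directly with neural fields.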

Link: https://arxiv.org/abs/2503.23162
Authors: Zhenyu Tang, Chaoran Feng, Xinhua Cheng, Wangbo Yu, Junwu Zhang, Yuan Liu, Xiaoxiao Long, Wenping Wang, Li Yuan
Affiliations: Institution1; Institution2
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:3D Gaussian Splatting (3DGS) demonstrates superior quality and rendering speed, but with millions of 3D Gaussians and significant storage and transmission costs. Recent 3DGS compression methods mainly concentrate on compressing Scaffold-GS, achieving impressive performance but with an additional voxel structure and a complex encoding and quantization strategy. In this paper, we aim to develop a simple yet effective method called NeuralGS that explores in another way to compress the original 3DGS into a compact representation without the voxel structure and complex quantization strategies. Our observation is that neural fields like NeRF can represent complex 3D scenes with Multi-Layer Perceptron (MLP) neural networks using only a few megabytes. Thus, NeuralGS effectively adopts the neural field representation to encode the attributes of 3D Gaussians with MLPs, only requiring a small storage size even for a large-scale scene. To achieve this, we adopt a clustering strategy and fit the Gaussians with different tiny MLPs for each cluster, based on importance scores of Gaussians as fitting weights. We experiment on multiple datasets, achieving a 45-times average model size reduction without harming the visual quality. The compression performance of our method on original 3DGS is comparable to the dedicated Scaffold-GS-based compression methods, which demonstrate the huge potential of directly compressing original 3DGS with neural fields.

[CV-169] LSNet: See Large Focus Small CVPR2025

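[Quick read]: This paper addresses the deployment challenges that the complex computations of vision networks create for real-time applications. Existing lightweight models rely mainly on self-attention mechanisms and convolutions for token mixing, which limits effectiveness and efficiency in the perception and aggregation processes and makes it hard to balance performance and efficiency under a limited computational budget. Inspired by the dynamic heteroscale vision ability of the efficient human vision system, the paper proposes a "See Large, Focus Small" design strategy and introduces LS (Large-Small) convolution. By combining large-kernel perception with small-kernel aggregation, LS convolution efficiently captures a wide range of perceptual information and achieves precise feature aggregation for dynamic and complex visual representations. Building on it, the paper presents LSNet, a new family of lightweight models, and experiments show that it achieves better performance and efficiency than existing lightweight networks across a range of vision tasks.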

Link: https://arxiv.org/abs/2503.23135
Authors: Ao Wang, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Affiliations: School of Software, Tsinghua University; BNRist, Tsinghua University; Department of Automation, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 Camera-ready Version

Abstract:Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet, their complex computations pose challenges for practical deployments, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly leverage self-attention mechanisms and convolutions for token mixing. This dependence brings limitations in effectiveness and efficiency in the perception and aggregation processes of lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. We introduce LS (Large-Small) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks. Codes and models are available at this https URL.

[CV-170] RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning

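[Quick read]: This paper addresses the weak alignment between visual elements and textual information in chart understanding. Existing methods for Chart Question Answering (ChartQA) focus on answering questions directly and do not explicitly identify the visual elements that support their predictions. To fill this gap, the paper introduces RefChartQA, a new benchmark that integrates ChartQA with visual grounding so that models can refer to elements at multiple granularities within chart images. The key to the solution is incorporating spatial awareness through grounding: experiments show that it improves response accuracy by over 15%, reduces hallucinations, and improves model reliability. The study also finds that architectural improvements, such as the token-merging module used in TinyChart, are among the important factors influencing text-spatial alignment.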

Link: https://arxiv.org/abs/2503.23131
Authors: Alexander Vogel, Omar Moured, Yufan Chen, Jiaming Zhang, Rainer Stiefelhagen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: All models and code will be publicly available at this https URL

Abstract:Recently, Vision Language Models (VLMs) have increasingly emphasized document visual grounding to achieve better human-computer interaction, accessibility, and detailed understanding. However, its application to visualizations such as charts remains under-explored due to the inherent complexity of interleaved visual-numerical relationships in chart images. Existing chart understanding methods primarily focus on answering questions without explicitly identifying the visual elements that support their predictions. To bridge this gap, we introduce RefChartQA, a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding, enabling models to refer elements at multiple granularities within chart images. Furthermore, we conduct a comprehensive evaluation by instruction-tuning 5 state-of-the-art VLMs across different categories. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%, reducing hallucinations, and improving model reliability. Additionally, we identify key factors influencing text-spatial alignment, such as architectural improvements in TinyChart, which leverages a token-merging module for enhanced feature fusion. Our dataset is open-sourced for community development and further advancements. All models and code will be publicly available at this https URL.

[CV-171] Evaluating Compositional Scene Understanding in Multimodal Generative Models

[Quick Read]: This paper asks how well current text-to-image models (such as DALL-E 3) and multimodal vision-language models (such as GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B) handle compositional visual scenes, and compares their performance with that of human participants. Its key contribution is a systematic evaluation that exposes the limitations of these models in generating and understanding complex scenes containing multiple objects and their relations, underscoring the need for further progress toward compositional visual understanding.

Link: https://arxiv.org/abs/2503.23125
Authors: Shuhao Fu,Andrew Jun Lee,Anna Wang,Ida Momennejad,Trevor Bihl,Hongjing Lu,Taylor W. Webb
Affiliations: Department of Psychology, University of California, Los Angeles; Microsoft Research, NYC; Air Force Research Laboratory; Department of Psychology, Department of Statistics, University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many ( 5 ) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.

[CV-172] Efficient Explicit Joint-level Interaction Modeling with Mamba for Text-guided HOI Generation ICME2025

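[Quick Read]: This paper addresses the difficulty existing methods have in precisely modeling joint-level interactions when generating text-guided human-object interactions (HOIs). Prior methods represent the whole human body as a single token, which cannot capture fine-grained joint-level interaction details and leads to unrealistic HOIs. Treating every joint as its own token, however, substantially increases the computational overhead. To resolve this tension, the paper proposes an Efficient Explicit Joint-level Interaction Model (EJIM). The key components of EJIM are a Dual-branch HOI Mamba module that separately and efficiently models spatiotemporal HOI information, and a Dual-branch Condition Injector that integrates text semantics and object geometry into human and object motions. In addition, a Dynamic Interaction Block and a progressive masking mechanism iteratively filter out irrelevant joints to ensure accurate and nuanced interaction modeling. Experiments show that EJIM surpasses previous work by a large margin while using only 5% of the inference time.

Link: https://arxiv.org/abs/2503.23121
Authors: Guohong Huang,Ling-An Zeng,Zexin Zheng,Shengbo Gu,Wei-Shi Zheng
Affiliations: Sun Yat-sen University; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICME 2025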

Abstract:We propose a novel approach for generating text-guided human-object interactions (HOIs) that achieves explicit joint-level interaction modeling in a computationally efficient manner. Previous methods represent the entire human body as a single token, making it difficult to capture fine-grained joint-level interactions and resulting in unrealistic HOIs. However, treating each individual joint as a token would yield over twenty times more tokens, increasing computational overhead. To address these challenges, we introduce an Efficient Explicit Joint-level Interaction Model (EJIM). EJIM features a Dual-branch HOI Mamba that separately and efficiently models spatiotemporal HOI information, as well as a Dual-branch Condition Injector for integrating text semantics and object geometry into human and object motions. Furthermore, we design a Dynamic Interaction Block and a progressive masking mechanism to iteratively filter out irrelevant joints, ensuring accurate and nuanced interaction modeling. Extensive quantitative and qualitative evaluations on public datasets demonstrate that EJIM surpasses previous works by a large margin while using only 5% of the inference time. Code is available at this https URL.
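
The idea of progressively filtering out irrelevant joints can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the per-joint relevance scores, the keep schedule, and the function name `progressive_joint_masks` are hypothetical and are not taken from the EJIM paper, which the abstract does not describe at this level of detail.

```python
import numpy as np

def progressive_joint_masks(relevance, keep_counts):
    """Return one boolean mask per stage, keeping fewer joints each stage.

    relevance   : (num_joints,) hypothetical relevance score per joint
    keep_counts : decreasing list of how many joints survive each stage
    """
    mask = np.ones(relevance.shape[0], dtype=bool)
    masks = []
    for k in keep_counts:
        # Rank only the joints that survived the previous stage.
        scores = np.where(mask, relevance, -np.inf)
        keep = np.argsort(scores)[::-1][:k]
        mask = np.zeros_like(mask)
        mask[keep] = True
        masks.append(mask.copy())
    return masks

# Six joints with made-up relevance scores, filtered down to 4 and then 2.
relevance = np.array([0.9, 0.1, 0.5, 0.7, 0.2, 0.05])
masks = progressive_joint_masks(relevance, keep_counts=[4, 2])
```

Each later mask is a subset of the earlier one, so joints dropped at one stage never reappear at the next.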

[CV-173] Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction

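[Quick Read]: This paper addresses the limited generalization of high-definition (HD) map construction for autonomous driving across unfamiliar driving scenes. Existing methods have improved performance but do poorly on unseen scenes. To tackle this, the paper proposes UIGenMap, whose key idea is an uncertainty-instructed structure injection strategy for generalizable HD map vectorization. Specifically, UIGenMap focuses on uncertainty resampling in the statistical distribution and uses explicit instance features to reduce excessive reliance on training data. To this end, it introduces a perspective-view (PV) detection branch to obtain explicit structural features, with an uncertainty-aware decoder that dynamically samples probability distributions while accounting for differences between scenes. Through probabilistic embedding and selection, UI2DPrompt constructs PV-learnable prompts, which are integrated into the map decoder by hybrid injection to compensate for neglected instance structures. To ensure real-time inference, a lightweight Mimic Query Distillation serves as an efficient alternative to the PV branch. Experiments show that UIGenMap improves mAP by 5.7 points on the nuScenes dataset.

Link: https://arxiv.org/abs/2503.23109
Authors: Xiaolu Liu,Ruizi Yang,Song Wang,Wentong Li,Junbo Chen,Jianke Zhu
Affiliations: Zhejiang University; NUAA; Udeer.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures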

Abstract:Reliable high-definition (HD) map construction is crucial for the driving safety of autonomous vehicles. Although recent studies demonstrate improved performance, their generalization capability across unfamiliar driving scenes remains unexplored. To tackle this issue, we propose UIGenMap, an uncertainty-instructed structure injection approach for generalizable HD map vectorization, which concerns the uncertainty resampling in statistical distribution and employs explicit instance features to reduce excessive reliance on training data. Specifically, we introduce the perspective-view (PV) detection branch to obtain explicit structural features, in which the uncertainty-aware decoder is designed to dynamically sample probability distributions considering the difference in scenes. With probabilistic embedding and selection, UI2DPrompt is proposed to construct PV-learnable prompts. These PV prompts are integrated into the map decoder by designed hybrid injection to compensate for neglected instance structures. To ensure real-time inference, a lightweight Mimic Query Distillation is designed to learn from PV prompts, which can serve as an efficient alternative to the flow of PV branches. Extensive experiments on challenging geographically disjoint (geo-based) data splits demonstrate that our UIGenMap achieves superior performance, with +5.7 mAP improvement on the nuScenes dataset. Source code will be available at this https URL.
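
Sampling from a predicted probability distribution, as the uncertainty-aware decoder is described as doing, is commonly implemented with the Gaussian reparameterization trick. The sketch below shows only that generic trick; it assumes a diagonal Gaussian parameterized by a mean and a log-variance, which the abstract does not confirm, and the name `sample_embedding` is hypothetical.

```python
import numpy as np

def sample_embedding(mean, log_var, rng):
    """Draw one sample from N(mean, diag(exp(log_var))) via reparameterization."""
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mean.shape)
    return mean + std * eps

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0, 0.5])

# Very negative log-variance: the sample collapses onto the mean.
confident = sample_embedding(mean, np.full(3, -60.0), rng)
# Zero log-variance (unit variance): the sample spreads around the mean.
uncertain = sample_embedding(mean, np.zeros(3), rng)
```

The point of the toy is that a confident prediction yields nearly deterministic features, while an uncertain one injects variation that a downstream decoder has to tolerate.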

[CV-174] Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments

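[Quick Read]: This paper addresses two main challenges that autonomous assistive robots face when navigating complex built environments: first, existing deep-learning-based closed-vocabulary detection systems struggle to interpret intuitive and casual human instructions; second, most existing methods ignore the uncertainty of the scene recognition problem, which leads to low success rates in ambiguous and complex environments. To address these, the paper proposes an open-vocabulary scene semantic segmentation and detection pipeline built on Vision Language Models (VLMs) and Large Language Models (LLMs). The key is a "Segment Detect Select" framework for open-vocabulary scene classification, which supports adaptive and intuitive navigation for assistive robots in built environments.

Link: https://arxiv.org/abs/2503.23105
Authors: Yifan Xu,Vineet Kamat,Carol Menassa
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 7 figures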

Abstract:The global rise in the number of people with physical disabilities, in part due to improvements in post-trauma survivorship and longevity, has amplified the demand for advanced assistive technologies to improve mobility and independence. Autonomous assistive robots, such as smart wheelchairs, require robust capabilities in spatial segmentation and semantic recognition to navigate complex built environments effectively. Place segmentation involves delineating spatial regions like rooms or functional areas, while semantic recognition assigns semantic labels to these regions, enabling accurate localization to user-specific needs. Existing approaches often utilize deep learning; however, these close-vocabulary detection systems struggle to interpret intuitive and casual human instructions. Additionally, most existing methods ignore the uncertainty of the scene recognition problem, leading to low success rates, particularly in ambiguous and complex environments. To address these challenges, we propose an open-vocabulary scene semantic segmentation and detection pipeline leveraging Vision Language Models (VLMs) and Large Language Models (LLMs). Our approach follows a ‘Segment Detect Select’ framework for open-vocabulary scene classification, enabling adaptive and intuitive navigation for assistive robots in built environments.

[CV-175] FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video CVPR2025

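[Quick Read]: This paper tackles egocentric motion capture with a head-mounted, body-facing stereo camera, in particular the challenges it raises for virtual reality (VR) and augmented reality (AR) applications, such as heavy occlusions and limited annotated real-world data. Existing methods rely on synthetic pretraining and struggle to produce smooth and accurate predictions in real-world settings, especially for the lower limbs. The key step is a lightweight VR-based data collection setup with on-board, real-time 6D pose tracking, which the authors use to collect the largest real-world dataset of its kind to date in both size and motion variability. To handle the integration of multimodal input (device pose and camera feeds), the paper proposes the FRAME architecture, which achieves state-of-the-art body pose prediction through geometrically sound multimodal integration and runs at 300 FPS on modern hardware. The paper also presents a novel training strategy to improve generalization, exploiting the geometric properties of the problem to deliver high-quality motion capture free of the artifacts common in prior work.

Link: https://arxiv.org/abs/2503.23094
Authors: Andrea Boscolo Camiletto,Jian Wang,Eduardo Alvarado,Rishabh Dabral,Thabo Beeler,Marc Habermann,Christian Theobalt
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025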

Abstract:Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications but presents significant challenges such as heavy occlusions and limited annotated real-world data. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings, particularly for lower limbs. Our work addresses these limitations by introducing a lightweight VR-based data collection setup with on-board, real-time 6D pose tracking. Using this setup, we collected the most extensive real-world dataset for ego-facing ego-mounted cameras to date in size and motion variability. Effectively integrating this multimodal input – device pose and camera feeds – is challenging due to the differing characteristics of each data source. To address this, we propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction through geometrically sound multimodal integration and can run at 300 FPS on modern hardware. Lastly, we showcase a novel training strategy to enhance the model’s generalization capabilities. Our approach exploits the problem’s geometric properties, yielding high-quality motion capture free from common artifacts in prior works. Qualitative and quantitative evaluations, along with extensive comparisons, demonstrate the effectiveness of our method. Data, code, and CAD designs will be available at this https URL

[CV-176] InkFM: A Foundational Model for Full-Page Online Handwritten Note Understanding

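[Quick Read]: This paper asks how to accurately analyze and understand the content of handwritten digital notes, in order to improve the note-taking experience on tablets with styluses and make the workflow more efficient. The key to the solution is a foundational model called InkFM which, by training on a diverse mixture of tasks, unifies text recognition (in 28 different scripts), mathematical expression recognition, and page element segmentation (such as separating text from drawings) in a single model. With fine-tuning or LoRA tuning, InkFM further improves page segmentation quality and achieves state-of-the-art text recognition (on the DeepWriting, CASIA, SCUT, and Mathwriting datasets) and sketch classification (on QuickDraw), demonstrating its adaptability and potential for applications.

Link: https://arxiv.org/abs/2503.23081
Authors: Anastasiia Fadeeva,Vincent Coriou,Diego Antognini,Claudiu Musat,Andrii Maksai
Affiliations: Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: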

Abstract:Tablets and styluses are increasingly popular for taking notes. To optimize this experience and ensure a smooth and efficient workflow, it’s important to develop methods for accurately interpreting and understanding the content of handwritten digital notes. We introduce a foundational model called InkFM for analyzing full pages of handwritten content. Trained on a diverse mixture of tasks, this model offers a unique combination of capabilities: recognizing text in 28 different scripts, recognizing mathematical expressions, and segmenting pages into distinct elements like text and drawings. Our results demonstrate that these tasks can be effectively unified within a single model, achieving state-of-the-art out-of-the-box text line segmentation quality that surpasses public baselines like docTR. Fine- or LoRA-tuning our base model on public datasets further improves the quality of page segmentation, achieves state-of-the-art text recognition (DeepWriting, CASIA, SCUT, and Mathwriting datasets) and sketch classification (QuickDraw). This adaptability of InkFM provides a powerful starting point for developing applications with handwritten input.
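
For readers unfamiliar with LoRA tuning, the following is a minimal NumPy sketch of the general technique: a frozen weight matrix plus a trainable low-rank update scaled by alpha / r. It is not InkFM code; the class name `LoRALinear`, the dimensions, and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                                 # frozen base weight
        self.A = rng.standard_normal((rank, in_dim)) * 0.01  # trainable
        self.B = np.zeros((out_dim, rank))                   # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
layer = LoRALinear(W, rank=2, alpha=4.0, rng=rng)
x = rng.standard_normal((3, 8))
y = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the base model exactly, and only the small `A` and `B` matrices would need to be updated during tuning.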

[CV-177] VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models

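[Quick Read]: This paper addresses the poor performance of Large Vision-Language Models (LVLMs) on puzzles that require precise perception, rule comprehension, and logical reasoning. Existing benchmarks mainly evaluate pre-trained models without additional training or fine-tuning, lack a dedicated focus on reasoning, and do not establish a systematic evaluation framework. To address these limitations, the paper introduces VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark with 20 diverse puzzles. Through systematic experiments, the study reveals fundamental limitations of state-of-the-art LVLMs on such problems and analyzes the key factors that affect their performance, including the number of clues, grid size, and rule complexity. The paper also explores two supervised fine-tuning strategies: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). Both significantly improve performance on trained puzzles but generalize poorly to unseen ones. The core contribution is therefore the VGRP-Bench benchmark and the accompanying analysis, which identify where LVLMs fall short on complex reasoning tasks and examine targeted strategies for improvement.

Link: https://arxiv.org/abs/2503.23064
Authors: Yufan Ren,Konstantinos Tertikas,Shalini Maiti,Junlin Han,Tong Zhang,Sabine Süsstrunk,Filippos Kokkinos
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages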

Abstract:Large Vision-Language Models (LVLMs) struggle with puzzles, which require precise perception, rule comprehension, and logical reasoning. Assessing and enhancing their performance in this domain is crucial, as it reflects their ability to engage in structured reasoning - an essential skill for real-world problem-solving. However, existing benchmarks primarily evaluate pre-trained models without additional training or fine-tuning, often lack a dedicated focus on reasoning, and fail to establish a systematic evaluation framework. To address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple difficulty levels, and includes extensive experiments not only on existing chat LVLMs (e.g., GPT-4o), but also on reasoning LVLMs (e.g., Gemini-Thinking). Our results reveal that even the state-of-the-art LVLMs struggle with these puzzles, highlighting fundamental limitations in their puzzle-solving capabilities. Most importantly, through systematic experiments, we identify and analyze key factors influencing LVLMs’ puzzle-solving performance, including the number of clues, grid size, and rule complexity. Furthermore, we explore two Supervised Fine-Tuning (SFT) strategies that can be used in post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). While both methods significantly improve performance on trained puzzles, they exhibit limited generalization to unseen ones. We will release VGRP-Bench to facilitate further research on LVLMs for complex, real-world problem-solving.
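
To make "rule comprehension" on a grid puzzle concrete, here is a small rule checker for a Sudoku-style grid. The abstract does not list which 20 puzzles the benchmark contains, so Sudoku is used purely as an assumed stand-in, and the helper names `violates_rules` and `count_clues` are hypothetical.

```python
def violates_rules(grid, box=2):
    """Return True if a partially filled Sudoku-style grid breaks a rule.

    grid : square list of lists; 0 marks an empty cell
    box  : side length of each sub-box (2 for a 4x4 grid)
    """
    n = len(grid)

    def has_duplicate(cells):
        filled = [c for c in cells if c != 0]
        return len(filled) != len(set(filled))

    rows = grid
    cols = [[grid[r][c] for r in range(n)] for c in range(n)]
    boxes = [
        [grid[r][c] for r in range(br, br + box) for c in range(bc, bc + box)]
        for br in range(0, n, box)
        for bc in range(0, n, box)
    ]
    return any(has_duplicate(unit) for unit in rows + cols + boxes)

def count_clues(grid):
    """Number of pre-filled cells, one of the difficulty factors named above."""
    return sum(cell != 0 for row in grid for cell in row)

puzzle = [
    [1, 0, 0, 4],
    [0, 4, 1, 0],
    [4, 0, 0, 1],
    [0, 1, 4, 0],
]
```

A checker like this could score a model's proposed fill-in cell by cell; the number of clues and the grid size are then simple properties of the input that an evaluation can vary.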

[CV-178] Shape and Texture Recognition in Large Vision-Language Models

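[Quick Read]: This paper studies shape and texture recognition, a core problem in visual perception, and in particular evaluates how effectively leading Large Vision-Language Models (LVLMs) understand shapes, textures, and materials in 2D and 3D scenes. The key contribution is the Large Shape Textures (LAST) dataset, a large collection of diverse shapes and textures automatically extracted from real-world images. Using this dataset, the paper tests whether models can recognize shapes under changes in orientation, texture, color, or environment, and further analyzes how models differ in texture versus material recognition. The study finds that LVLMs remain significantly below human performance on shape recognition and have particular difficulty with abstract shapes; for materials, they approach human performance in 3D scenes yet underperform on simpler 2D textures. The LAST dataset and benchmark thus provide an important resource for improving shape and texture understanding in future models.

Link: https://arxiv.org/abs/2503.23062
Authors: Sagi Eppel,Mor Bismut,Alona Faktor
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: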

Abstract:Shape and texture recognition is fundamental to visual perception. The ability to identify shapes regardless of orientation, texture, or context, and to recognize textures independently of their associated objects, is essential for general visual understanding of the world. We introduce the Large Shape Textures dataset (LAST), a giant collection of diverse shapes and textures automatically extracted from real-world images. This dataset is used to evaluate how effectively leading Large Vision-Language Models (LVLMs) understand shapes, textures, and materials in both 2D and 3D scenes. For shape recognition, we test models’ ability to match identical shapes that differ in orientation, texture, color, or environment. Our results show that LVLMs’ shape identification capabilities remain significantly below human performance. Single alterations (orientation, texture) cause minor decreases in matching accuracy, while multiple changes precipitate dramatic drops. LVLMs appear to rely predominantly on high-level and semantic features and struggle with abstract shapes lacking clear class associations. For texture and material recognition, we evaluate models’ ability to identify identical textures and materials across different objects and environments. Interestingly, leading LVLMs approach human-level performance in recognizing materials in 3D scenes, yet substantially underperform humans when identifying simpler 2D textures. The LAST dataset and benchmark, the largest and most diverse resource for shape and texture evaluation, is freely available with generation and testing scripts.

[CV-179] Prediction of 30-day hospital readmission with clinical notes and EHR information

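[Quick Read]: This paper addresses the significant costs and health risks associated with high hospital readmission rates. The goal is a predictive model that helps clinicians judge whether a patient is likely to be readmitted within a short period (for example, 30 days). The key is to combine information from structured electronic health records (EHR) with unstructured clinical notes: the multimodal data are represented as nodes of a graph neural network (GNN), and large language models (LLMs) are used to characterize the clinical notes, which improves predictive performance. The resulting model achieves an AUROC of 0.72 and a balanced accuracy of 66.7%, supporting the importance of fusing multimodal information.

Link: https://arxiv.org/abs/2503.23050
Authors: Tiago Almeida,Plinio Moreno,Catarina Barata
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: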

Abstract:High hospital readmission rates are associated with significant costs and health risks for patients. Therefore, it is critical to develop predictive models that can help clinicians determine whether or not a patient will return to the hospital in a relatively short period of time (e.g., 30 days). Nowadays, it is possible to collect both structured (electronic health records - EHR) and unstructured information (clinical notes) about a patient's hospital event, all potentially containing relevant information for a predictive model. However, their integration is challenging. In this work we explore the combination of clinical notes and EHRs to predict 30-day hospital readmissions. We address the representation of the various types of information available in the EHR data, as well as exploring LLMs to characterize the clinical notes. We collect both information sources as the nodes of a graph neural network (GNN). Our model achieves an AUROC of 0.72 and a balanced accuracy of 66.7%, highlighting the importance of combining the multimodal information.
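
The two reported metrics can be computed from scratch as follows. This is a generic sketch of the standard definitions of AUROC and balanced accuracy, not the authors' evaluation code; the labels, scores, and the 0.5 decision threshold are made-up toy values.

```python
import numpy as np

def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive with every negative; ties count as half a win.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)
    specificity = np.mean(y_pred[y_true == 0] == 0)
    return 0.5 * (sensitivity + specificity)

y_true = [0, 0, 1, 1]                     # 1 = readmitted within 30 days
scores = [0.1, 0.4, 0.35, 0.8]            # toy model outputs
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold
```

Balanced accuracy is the natural companion to AUROC when readmissions are a minority class, because plain accuracy would reward always predicting "no readmission".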

[CV-180] CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction

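[Quick Read]: This paper tackles the core challenges of 3D Gaussian Splatting in large-scale scene reconstruction: slow processing, high computational cost, and limited geometric accuracy. These problems stem from its inherently unstructured design and the absence of efficient parallelization. To overcome them simultaneously, the paper proposes CityGS-X, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH^2-3D). The key to the solution is abandoning the cumbersome merge-and-partition process in favor of a newly designed batch-level multi-task rendering process, with efficient multi-GPU rendering enabled by dynamic Level-of-Detail voxel allocations, which significantly improves scalability and performance. Experiments show that CityGS-X outperforms existing methods in training speed, rendering capacity, and the accuracy of geometric detail.

Link: https://arxiv.org/abs/2503.23044
Authors: Yuanyuan Gao,Hao Li,Jiaqi Chen,Zhengyu Zou,Zhihang Zhong,Dingwen Zhang,Xiao Sun,Junwei Han
Affiliations: Northwestern Polytechnical University; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL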

Abstract:Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce CityGS-X, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH^2-3D). As an early attempt, CityGS-X abandons the cumbersome merge-and-partition process and instead adopts a newly-designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. Through extensive experiments, CityGS-X consistently outperforms existing methods in terms of faster training times, larger rendering capacities, and more accurate geometric details in large-scale scenes. Notably, CityGS-X can train and render a scene with 5,000+ images in just 5 hours using only 4 * 4090 GPUs, a task that would make other alternative methods encounter Out-Of-Memory (OOM) issues and fail completely. This implies that CityGS-X is far beyond the capacity of other existing methods.

[CV-181] STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing ICME2025

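[Quick Read]: This paper addresses the semantic ambiguity in existing audio-driven visual dubbing methods for dynamic faces: semantic inconsistency between the spatial and temporal domains significantly degrades the stability of the synthesized results. To solve this, the paper proposes a Spatial-Temporal Semantic Alignment (STSA) method. Its key components are a dual-path alignment mechanism and a differentiable semantic representation. The dual-path alignment mechanism uses a Consistent Information Learning (CIL) module to maximize mutual information at multiple scales, reducing the manifold differences between the spatial and temporal domains. The differentiable semantic representation uses probabilistic heatmaps as ambiguity-tolerant guidance, avoiding the abnormal dynamics in synthesized faces that slight semantic jittering would otherwise cause.

Link: https://arxiv.org/abs/2503.23039
Authors: Zijun Ding,Mingdie Xiong,Congcong Zhu,Jingrun Chen
Affiliations: Key Laboratory of the Ministry of Education for Mathematical Foundations and Applications of Digital Technology, University of Science and Technology of China, Hefei, China; Suzhou Institute for Advanced Research, USTC, Suzhou, China; Harbin Engineering University, Harbin, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICME 2025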

Abstract:Existing audio-driven visual dubbing methods have achieved great success. Despite this, we observe that the semantic ambiguity between spatial and temporal domains significantly degrades the synthesis stability for the dynamic faces. We argue that aligning the semantic features from spatial and temporal domains is a promising approach to stabilizing facial motion. To achieve this, we propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation. The former leverages a Consistent Information Learning (CIL) module to maximize the mutual information at multiple scales, thereby reducing the manifold differences between spatial and temporal domains. The latter utilizes probabilistic heatmap as ambiguity-tolerant guidance to avoid the abnormal dynamics of the synthesized faces caused by slight semantic jittering. Extensive experimental results demonstrate the superiority of the proposed STSA, especially in terms of image quality and synthesis stability. Pre-trained weights and inference code are available at this https URL.
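
Why a probabilistic heatmap tolerates slight jitter better than a hard landmark can be shown with a toy comparison. This is a sketch under assumptions: the Gaussian form, the 32x32 resolution, and sigma = 3 are illustrative choices, not details taken from the STSA paper.

```python
import numpy as np

def gaussian_heatmap(center, size, sigma):
    """Render a landmark at (x, y) as a normalized 2D Gaussian heatmap."""
    ys, xs = np.mgrid[0:size, 0:size]
    sq_dist = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    heat = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return heat / heat.sum()

def one_hot_map(center, size):
    """Render the same landmark as a single active pixel."""
    hard = np.zeros((size, size))
    hard[int(center[1]), int(center[0])] = 1.0
    return hard

# The same landmark before and after a one-pixel jitter.
a, b = (16, 16), (17, 16)
soft_change = np.abs(gaussian_heatmap(a, 32, 3.0) - gaussian_heatmap(b, 32, 3.0)).sum()
hard_change = np.abs(one_hot_map(a, 32) - one_hot_map(b, 32)).sum()
```

A one-pixel shift moves all of the mass of the one-hot map, whereas the two Gaussian heatmaps still overlap heavily, so guidance built on the soft map changes only slightly between frames.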

[CV-182] FreeInv: Free Lunch for Improving DDIM Inversion

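[Quick Read]: This paper addresses the trajectory deviation issue common in naive DDIM (Denoising Diffusion Implicit Models) inversion, where the latent trajectory during reconstruction deviates from the one followed during inversion. Previous methods either learn to mitigate the deviation or design cumbersome compensation strategies to reduce the mismatch error, and both tend to be costly in time and computation. The paper proposes a nearly free method, named FreeInv, that handles the problem more effectively and efficiently. The key to FreeInv is to randomly transform the latent representation and keep that transformation the same at the corresponding inversion and reconstruction time-steps. The motivation is statistical: an ensemble of DDIM inversion processes over multiple trajectories yields a smaller trajectory mismatch error in expectation. Through theoretical analysis and empirical study, the authors show that FreeInv performs such an ensemble of multiple trajectories efficiently. FreeInv can be integrated directly into existing inversion-based image and video editing techniques, and it brings especially large fidelity and efficiency gains when inverting video sequences. Comprehensive quantitative and qualitative evaluation on the PIE benchmark and the DAVIS dataset shows that FreeInv substantially outperforms conventional DDIM inversion and is competitive with previous state-of-the-art inversion methods while being more computationally efficient.

Link: https://arxiv.org/abs/2503.23035
Authors: Yuxiang Bao,Huijie Liu,Xun Gao,Huan Fu,Guoliang Kang
Affiliations: Alibaba Group; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: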

Abstract:Naive DDIM inversion process usually suffers from a trajectory deviation issue, i.e., the latent trajectory during reconstruction deviates from the one during inversion. To alleviate this issue, previous methods either learn to mitigate the deviation or design cumbersome compensation strategy to reduce the mismatch error, exhibiting substantial time and computation cost. In this work, we present a nearly free-lunch method (named FreeInv) to address the issue more effectively and efficiently. In FreeInv, we randomly transform the latent representation and keep the transformation the same between the corresponding inversion and reconstruction time-step. It is motivated from a statistical perspective that an ensemble of DDIM inversion processes for multiple trajectories yields a smaller trajectory mismatch error on expectation. Moreover, through theoretical analysis and empirical study, we show that FreeInv performs an efficient ensemble of multiple trajectories. FreeInv can be freely integrated into existing inversion-based image and video editing techniques. Especially for inverting video sequences, it brings more significant fidelity and efficiency improvements. Comprehensive quantitative and qualitative evaluation on PIE benchmark and DAVIS dataset shows that FreeInv remarkably outperforms conventional DDIM inversion, and is competitive among previous state-of-the-art inversion methods, with superior computation efficiency.
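
The central bookkeeping in FreeInv, using the same random transform at matching inversion and reconstruction time-steps, can be sketched as below. This toy does not implement DDIM itself. The choice of 90-degree rotations as the transform family, the seeding scheme, and the name `step_transform` are all assumptions for illustration, since the abstract does not say which transforms are used.

```python
import numpy as np

def step_transform(latent, step, seed=0, inverse=False):
    """Rotate a latent by a random multiple of 90 degrees tied to the time-step.

    Seeding the generator with (seed, step) makes the inversion pass and the
    reconstruction pass draw the identical transform at the same time-step.
    """
    k = int(np.random.default_rng([seed, step]).integers(0, 4))
    return np.rot90(latent, k=-k if inverse else k, axes=(-2, -1))

rng = np.random.default_rng(42)
latent = rng.standard_normal((4, 8, 8))  # hypothetical (channels, height, width)

# The transform drawn at a given step is the same in both passes ...
inv_pass = [step_transform(latent, t) for t in range(10)]
rec_pass = [step_transform(latent, t) for t in range(10)]
# ... and it is exactly undone by its inverse.
restored = step_transform(step_transform(latent, 3), 3, inverse=True)
```

Deriving the transform from the step index avoids storing anything between the two passes, which fits the "nearly free" framing: the only extra work is a cheap rearrangement of the latent.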

[CV-183] Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning CVPR2025

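[Quick Read]: This paper addresses the need for consistent visual-semantic alignment in Generalized Zero-Shot Learning (GZSL). Existing methods fine-tune the visual backbone on seen-class data to obtain semantic-related visual features, which can overfit to seen classes when training images are limited. To solve this, the paper proposes a novel visual and semantic prompt collaboration framework whose key idea is to use prompt tuning for efficient feature adaptation. Specifically, a visual prompt integrates visual information for discriminative feature learning, and a semantic prompt integrates semantic information for visual-semantic alignment. To fuse prompt information effectively, the network uses a weak prompt fusion mechanism in its shallow layers and a strong prompt fusion mechanism in its deep layers. Through the collaboration of visual and semantic prompts, the framework produces discriminative, semantic-related features for generalized zero-shot image recognition. Experiments show that it outperforms other state-of-the-art methods on both conventional and generalized zero-shot learning benchmarks.

Link: https://arxiv.org/abs/2503.23030
Authors: Huajie Jiang,Zhengxian Li,Xiaohan Yu,Yongli Hu,Baocai Yin,Jian Yang,Yuankai Qi
Affiliations: Beijing University of Technology; Macquarie University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR2025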

Abstract:Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. It inevitably requires consistent visual-semantic alignment. Existing approaches fine-tune the visual backbone with seen-class data to obtain semantic-related visual features, which may cause overfitting on seen classes with a limited number of training images. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation. Specifically, we design a visual prompt to integrate the visual information for discriminative feature learning and a semantic prompt to integrate the semantic information for visual-semantic alignment. To achieve effective prompt information integration, we further design a weak prompt fusion mechanism for the shallow layers and a strong prompt fusion mechanism for the deep layers in the network. Through the collaboration of visual and semantic prompts, we can obtain discriminative semantic-related features for generalized zero-shot image recognition. Extensive experiments demonstrate that our framework consistently achieves favorable performance in both conventional zero-shot learning and generalized zero-shot learning benchmarks compared to other state-of-the-art methods.

[CV-184] Empowering Large Language Models with 3D Situation Awareness CVPR2025

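[Quick Read]: This paper addresses the fact that existing methods based on Large Language Models (LLMs) overlook the egocentric perspective when handling 3D scene understanding. Current methods typically rely on datasets built from a global viewpoint and do not account for how descriptions (such as "left" or "right") change with the position and orientation of the observer in a 3D scene. To overcome this limitation, the paper proposes two key components: automatically generating a situation-aware dataset from the scanning trajectory recorded during data collection, with Vision-Language Models (VLMs) producing high-quality captions and question-answer pairs; and a situation grounding module that explicitly predicts the position and orientation of the observer's viewpoint, so that LLMs can better ground situation descriptions in 3D scenes. The approach improves the 3D situational awareness of LLMs while significantly expanding existing datasets and reducing manual annotation effort.

Link: https://arxiv.org/abs/2503.23024
Authors: Zhihao Yuan,Yibo Peng,Jinke Ren,Yinghong Liao,Yatong Han,Chun-Mei Feng,Hengshuang Zhao,Guanbin Li,Shuguang Cui,Zhen Li
Affiliations: FNii-Shenzhen, CUHKSZ; SSE, CUHKSZ; IHPC, A*STAR, Singapore; HKU; SYSU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025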

Abstract:Driven by the great success of Large Language Models (LLMs) in the 2D image domain, their application in 3D scene understanding has emerged as a new trend. A key difference between 3D and 2D is that the situation of an egocentric observer in 3D scenes can change, resulting in different descriptions (e.g., "left" or "right"). However, current LLM-based methods overlook the egocentric perspective and simply use datasets from a global viewpoint. To address this issue, we propose a novel approach to automatically generate a situation-aware dataset by leveraging the scanning trajectory during data collection and utilizing Vision-Language Models (VLMs) to produce high-quality captions and question-answer pairs. Furthermore, we introduce a situation grounding module to explicitly predict the position and orientation of the observer’s viewpoint, thereby enabling LLMs to ground situation description in 3D scenes. We evaluate our approach on several benchmarks, demonstrating that our method effectively enhances the 3D situational awareness of LLMs while significantly expanding existing datasets and reducing manual effort.
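
The dependence of "left" and "right" on the observer's situation can be made concrete with a little planar geometry. The sketch below is an illustration only: it assumes a 2D floor plane with a right-handed coordinate system viewed from above, and the function name `relative_side` is hypothetical rather than part of the paper's method.

```python
import math

def relative_side(observer_xy, heading_rad, object_xy):
    """Classify an object as 'left' or 'right' of an observer on the floor plane.

    heading_rad is the direction the observer faces, measured counter-clockwise
    from the +x axis.
    """
    fx, fy = math.cos(heading_rad), math.sin(heading_rad)
    dx = object_xy[0] - observer_xy[0]
    dy = object_xy[1] - observer_xy[1]
    cross = fx * dy - fy * dx  # > 0 means counter-clockwise from the heading
    if abs(cross) < 1e-9:
        return "ahead or behind"
    return "left" if cross > 0 else "right"

chair = (1.0, 1.0)
facing_east = relative_side((0.0, 0.0), 0.0, chair)
facing_north = relative_side((0.0, 0.0), math.pi / 2, chair)
```

The same chair is on the left for an observer facing east and on the right for one facing north, which is why a description is ambiguous until the observer's position and orientation are known.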

[CV-185] MeshCraft: Exploring Efficient and Controllable Mesh Generation with Flow-based DiTs

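[Quick Read]: This paper addresses the problem of achieving optimal mesh topology with AI models in 3D content creation. Earlier methods such as MeshGPT generate ready-to-use 3D objects with mesh auto-regressive techniques, but their token-by-token prediction makes generation extremely slow and leaves the number of mesh faces uncontrollable. To overcome these limitations, the paper proposes MeshCraft, an efficient and controllable mesh generation framework that uses continuous spatial diffusion to generate discrete triangle faces. MeshCraft has two core components: 1) a transformer-based variational autoencoder (VAE) that encodes raw meshes into continuous face-level tokens and decodes them back into the original meshes, and 2) a flow-based diffusion transformer conditioned on a predefined number of faces, which enables high-quality 3D mesh generation. By generating the entire mesh topology simultaneously, MeshCraft is much faster than auto-regressive methods (an 800-face mesh takes just 3.2 seconds, 35 times faster than existing baselines). Experiments show that MeshCraft outperforms state-of-the-art techniques in both qualitative and quantitative evaluations on the ShapeNet dataset and performs strongly on the Objaverse dataset. It also integrates seamlessly with existing conditional guidance strategies, showing its potential to relieve artists of manual mesh creation.

Link: https://arxiv.org/abs/2503.23022
Authors: Xianglong He,Junyi Chen,Di Huang,Zexiang Liu,Xiaoshui Huang,Wanli Ouyang,Chun Yuan,Yangguang Li
Affiliations: Tsinghua University; Shanghai Jiaotong University; Shanghai AI Laboratory; The University of Sydney; VAST; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: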

Abstract:In the domain of 3D content creation, achieving optimal mesh topology through AI models has long been a pursuit for 3D artists. Previous methods, such as MeshGPT, have explored the generation of ready-to-use 3D objects via mesh auto-regressive techniques. While these methods produce visually impressive results, their reliance on token-by-token predictions in the auto-regressive process leads to several significant limitations. These include extremely slow generation speeds and an uncontrollable number of mesh faces. In this paper, we introduce MeshCraft, a novel framework for efficient and controllable mesh generation, which leverages continuous spatial diffusion to generate discrete triangle faces. Specifically, MeshCraft consists of two core components: 1) a transformer-based VAE that encodes raw meshes into continuous face-level tokens and decodes them back to the original meshes, and 2) a flow-based diffusion transformer conditioned on the number of faces, enabling the generation of high-quality 3D meshes with a predefined number of faces. By utilizing the diffusion model for the simultaneous generation of the entire mesh topology, MeshCraft achieves high-fidelity mesh generation at significantly faster speeds compared to auto-regressive methods. Specifically, MeshCraft can generate an 800-face mesh in just 3.2 seconds (35× faster than existing baselines). Extensive experiments demonstrate that MeshCraft outperforms state-of-the-art techniques in both qualitative and quantitative evaluations on ShapeNet dataset and demonstrates superior performance on Objaverse dataset. Moreover, it integrates seamlessly with existing conditional guidance strategies, showcasing its potential to relieve artists from the time-consuming manual work involved in mesh creation.

[CV-186] The impact of tissue detection on diagnostic artificial intelligence algorithms in digital pathology

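【Quick read】: This paper examines how tissue detection quality affects downstream task performance in digital pathology, and compares classical and AI-based tissue detection. It evaluates the effect of two tissue detection algorithms (thresholding vs. UNet++) on Gleason grading of prostate cancer, focusing on the clinical consequences when tissue detection fails or performs poorly. Switching from thresholding to the UNet++ model reduced the share of slides with fully undetected tissue from 0.43% to 0.08%, lowering the risk of total failures. On slides where both algorithms detected tissue, overall grading performance did not differ significantly, but clinically significant grading changes attributable to tissue detection occurred in 3.5% of malignant slides, underlining the importance of robust tissue detection for the clinical performance of diagnostic AI.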

Link: https://arxiv.org/abs/2503.23021
Authors: Sol Erika Boman, Nita Mulliqi, Anders Blilie, Xiaoyi Ji, Kelvin Szolnoky, Einar Gudlaugsson, Emiel A.M. Janssen, Svein R. Kjosavik, José Asenjo, Marcello Gambacorta, Paolo Libretti, Marcin Braun, Radzislaw Kordek, Roman Łowicki, Kristina Hotakainen, Päivi Väre, Bodil Ginnerup Pedersen, Karina Dalsgaard Sørensen, Benedicte Parm Ulhøi, Lars Egevad, Kimmo Kartasalo
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 2 tables, 3 figures, 1 supplementary figure

Abstract:Tissue detection is a crucial first step in most digital pathology applications. Details of the segmentation algorithm are rarely reported, and there is a lack of studies investigating the downstream effects of a poor segmentation algorithm. Disregarding tissue detection quality could create a bottleneck for downstream performance and jeopardize patient safety if diagnostically relevant parts of the specimen are excluded from analysis in clinical applications. This study aims to determine whether performance of downstream tasks is sensitive to the tissue detection method, and to compare performance of classical and AI-based tissue detection. To this end, we trained an AI model for Gleason grading of prostate cancer in whole slide images (WSIs) using two different tissue detection algorithms: thresholding (classical) and UNet++ (AI). A total of 33,823 WSIs scanned on five digital pathology scanners were used to train the tissue detection AI model. The downstream Gleason grading algorithm was trained and tested using 70,524 WSIs from 13 clinical sites scanned on 13 different scanners. There was a decrease from 116 (0.43%) to 22 (0.08%) fully undetected tissue samples when switching from thresholding-based tissue detection to AI-based, suggesting an AI model may be more reliable than a classical model for avoiding total failures on slides with unusual appearance. On the slides where tissue could be detected by both algorithms, no significant difference in overall Gleason grading performance was observed. However, tissue detection dependent clinically significant variations in AI grading were observed in 3.5% of malignant slides, highlighting the importance of robust tissue detection for optimal clinical performance of diagnostic AI. 
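The abstract names thresholding as the classical tissue detector but does not say which variant was used. As a rough illustration only, the sketch below assumes Otsu's method on 8-bit grayscale values, with tissue darker than the bright slide background; the function names `otsu_threshold` and `tissue_mask` are hypothetical and not taken from the paper.

```python
def otsu_threshold(gray_values):
    """Return the 8-bit threshold that maximises between-class variance."""
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def tissue_mask(gray_values):
    """Mark pixels at or below the Otsu threshold as tissue (assumed darker)."""
    t = otsu_threshold(gray_values)
    return [v <= t for v in gray_values]


# Toy "slide": four bright background pixels, four darker tissue pixels.
pixels = [240, 245, 250, 238, 90, 100, 110, 95]
mask = tissue_mask(pixels)
```

A global threshold like this has no notion of what tissue looks like, which is one plausible reason a learned segmenter could be more robust on slides with unusual appearance.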

[CV-187] Multi-label classification for multi-temporal multi-spatial coral reef condition monitoring using vision foundation model with adapter learning

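【Quick read】: This paper addresses the limited performance of automatic coral reef condition classification on complex underwater images and the weak generalization of conventional models across multi-temporal and multi-spatial settings. It combines the DINOv2 vision foundation model with Low-Rank Adaptation (LoRA) fine-tuning. LoRA cuts the trainable parameters from 1,100M to 5.91M while retaining high accuracy and cross-domain generalization, enabling efficient adaptation of the foundation model for multi-label classification. Experiments show that DINOv2-LoRA performs well across different temporal and spatial settings.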

Link: https://arxiv.org/abs/2503.23012
Authors: Xinlei Shao, Hongruixuan Chen, Fan Zhao, Kirsty Magson, Jundong Chen, Peiran Li, Jiaqi Wang, Jun Sasaki
Affiliations: Department of Socio-Cultural Environmental Studies, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, 277-8563, Japan; Department of Environment Systems, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, 277-8563, Japan; New Heaven Reef Conservation Program, Koh Tao, Surat Thani, 84360, Thailand; Data Science and AI Innovation Research Promotion Center, Shiga University, Hikone, Shiga, 522-8522, Japan; Interfaculty Initiative in Information Studies, The University of Tokyo, Tokyo, 113-0033, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Coral reef ecosystems provide essential ecosystem services, but face significant threats from climate change and human activities. Although advances in deep learning have enabled automatic classification of coral reef conditions, conventional deep models struggle to achieve high performance when processing complex underwater ecological images. Vision foundation models, known for their high accuracy and cross-domain generalizability, offer promising solutions. However, fine-tuning these models requires substantial computational resources and results in high carbon emissions. To address these challenges, adapter learning methods such as Low-Rank Adaptation (LoRA) have emerged as a solution. This study introduces an approach integrating the DINOv2 vision foundation model with the LoRA fine-tuning method. The approach leverages multi-temporal field images collected through underwater surveys at 15 dive sites at Koh Tao, Thailand, with all images labeled according to universal standards used in citizen science-based conservation programs. The experimental results demonstrate that the DINOv2-LoRA model achieved superior accuracy, with a match ratio of 64.77%, compared to 60.34% achieved by the best conventional model. Furthermore, incorporating LoRA reduced the trainable parameters from 1,100M to 5.91M. Transfer learning experiments conducted under different temporal and spatial settings highlight the exceptional generalizability of DINOv2-LoRA across different seasons and sites. This study is the first to explore the efficient adaptation of foundation models for multi-label classification of coral reef conditions under multi-temporal and multi-spatial settings. The proposed method advances the classification of coral reef conditions and provides a tool for monitoring, conserving, and managing coral reef ecosystems.
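To make the parameter reduction concrete, here is a minimal sketch of the general LoRA idea: the frozen weight `W` is left untouched and only two small matrices `A` (rank × d_in) and `B` (d_out × rank) are trained. The layer sizes, rank, and helper names are illustrative assumptions and do not reproduce the paper's DINOv2 configuration or its reported 5.91M figure.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of one adapted layer: A has rank*d_in, B has d_out*rank."""
    return rank * (d_in + d_out)


# Tiny example: a 2x2 identity layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # 1 x 2
B = [[1.0], [2.0]]    # 2 x 1
y = lora_forward(W, A, B, [3.0, 4.0])

# Illustrative size comparison for one hypothetical 1024x1024 layer at rank 8.
full = 1024 * 1024
adapted = lora_trainable_params(1024, 1024, 8)
```

For this hypothetical layer the adapter trains 16,384 values instead of 1,048,576, which is the kind of ratio that makes fine-tuning a large backbone affordable.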

[CV-188] On Geometrical Properties of Text Token Embeddings for Strong Semantic Binding in Text-to-Image Generation

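【Quick read】: This paper addresses the text-image misalignment that text-to-image (T2I) models often show in complex scenes with multiple objects and attributes. Semantic binding mitigates this by accurately associating generated attributes and objects with their corresponding noun phrases (NPs). Existing methods rely on text or latent optimization, but the factors that influence semantic binding remain underexplored.

The key contribution is showing that geometrical properties of text token embeddings, specifically angular distances and norms, play a crucial role in differentiating cross-attention (CA) maps, and building on this a training-free framework for stronger semantic binding, TeeMo. TeeMo consists of Causality-Aware Projection-Out (CAPO), which produces distinct CA maps across noun phrases, and Adaptive Token Mixing (ATM) with a dedicated loss, which increases separation between noun phrases while keeping each phrase internally cohesive. Extensive experiments confirm that TeeMo outperforms prior methods across diverse baselines and datasets.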

Link: https://arxiv.org/abs/2503.23011
Authors: Hoigi Seo, Junseo Bang, Haechang Lee, Joohoon Lee, Byung Hyun Lee, Se Young Chun
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-Image (T2I) models often suffer from text-image misalignment in complex scenes involving multiple objects and attributes. Semantic binding aims to mitigate this issue by accurately associating the generated attributes and objects with their corresponding noun phrases (NPs). Existing methods rely on text or latent optimizations, yet the factors influencing semantic binding remain underexplored. Here we investigate the geometrical properties of text token embeddings and their cross-attention (CA) maps. We show empirically and theoretically that the geometrical properties of token embeddings, specifically both angular distances and norms, play a crucial role in CA map differentiation. Then, we propose TeeMo, a training-free text embedding-aware T2I framework with strong semantic binding. TeeMo consists of Causality-Aware Projection-Out (CAPO) for distinct inter-NP CA maps and Adaptive Token Mixing (ATM) with our loss to enhance inter-NP separation while maintaining intra-NP cohesion in CA maps. Extensive experiments confirm TeeMo consistently outperforms prior art across diverse baselines and datasets.
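The two geometric quantities the abstract highlights, angular distance and norm, are easy to compute for any pair of embedding vectors. The sketch below also shows a generic orthogonal projection-out step; this is an assumption about what "projection-out" means in the simplest case and is not the paper's CAPO procedure, whose details the abstract does not give.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def angular_distance(u, v):
    """Angle in radians between two embedding vectors."""
    cos = dot(u, v) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error


def project_out(u, v):
    """Remove from u its component along v (plain orthogonal projection)."""
    coef = dot(u, v) / dot(v, v)
    return [a - coef * b for a, b in zip(u, v)]


# Two toy 2-D "token embeddings".
token_a = [1.0, 1.0]
token_b = [1.0, 0.0]
angle = angular_distance(token_a, token_b)
residual = project_out(token_a, token_b)
```

After projection the residual is orthogonal to `token_b`, so the angular distance between the two vectors becomes a right angle, the largest separation a single projection can give.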

[CV-189] FreeSplat++: Generalizable 3D Gaussian Splatting for Efficient Indoor Scene Reconstruction

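【Quick read】: This paper addresses the shortcomings of existing generalizable 3D Gaussian Splatting (3DGS) methods for whole-scene reconstruction of large indoor scenes, in particular their inability to deliver both quality and efficiency. It proposes FreeSplat++ with three main components: a Low-cost Cross-View Aggregation framework that efficiently processes extremely long input sequences; a pixel-wise triplet fusion method that adaptively reduces the redundancy of overlapping 3D Gaussian primitives from multiple views; and a weighted floater removal strategy that reduces floaters and acts as an explicit depth fusion step, which is crucial for whole-scene reconstruction. After the feed-forward reconstruction of 3DGS primitives, a depth-regularized per-scene fine-tuning stage further refines geometric accuracy and rendering quality. Together these improve the speed and geometric accuracy of large-scale indoor whole-scene reconstruction.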

Link: https://arxiv.org/abs/2503.22986
Authors: Yunsong Wang, Tianxin Huang, Hanlin Chen, Gim Hee Lee
Affiliations: School of Computing, National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, the integration of the efficient feed-forward scheme into 3D Gaussian Splatting (3DGS) has been actively explored. However, most existing methods focus on sparse view reconstruction of small regions and cannot produce eligible whole-scene reconstruction results in terms of either quality or efficiency. In this paper, we propose FreeSplat++, which focuses on extending the generalizable 3DGS to become an alternative approach to large-scale indoor whole-scene reconstruction, which has the potential of significantly accelerating the reconstruction speed and improving the geometric accuracy. To facilitate whole-scene reconstruction, we initially propose the Low-cost Cross-View Aggregation framework to efficiently process extremely long input sequences. Subsequently, we introduce a carefully designed pixel-wise triplet fusion method to incrementally aggregate the overlapping 3D Gaussian primitives from multiple views, adaptively reducing their redundancy. Furthermore, we propose a weighted floater removal strategy that can effectively reduce floaters, which serves as an explicit depth fusion approach that is crucial in whole-scene reconstruction. After the feed-forward reconstruction of 3DGS primitives, we investigate a depth-regularized per-scene fine-tuning process. Leveraging the dense, multi-view consistent depth maps obtained during the feed-forward prediction phase for an extra constraint, we refine the entire scene’s 3DGS primitive to enhance rendering quality while preserving geometric accuracy. Extensive experiments confirm that our FreeSplat++ significantly outperforms existing generalizable 3DGS methods, especially in whole-scene reconstructions. Compared to conventional per-scene optimized 3DGS approaches, our method with depth-regularized per-scene fine-tuning demonstrates substantial improvements in reconstruction accuracy and a notable reduction in training time.

[CV-190] Optimal Transport-Guided Source-Free Adaptation for Face Anti-Spoofing

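【Quick read】: This paper addresses the difficulty of building a face anti-spoofing model that meets the security requirements of clients worldwide, given the domain gap between training datasets and real test data. In addition, for security and privacy reasons clients are usually unwilling to share large amounts of face data with service providers. The proposed method lets the client adapt the pre-trained model to a target domain at test time using only a small sample of data, without access to the model parameters or training data. It relies on a prototype-based base model and an optimal transport-guided adaptor, which allow adaptation through either lightweight training or a training-free procedure without updating the base model's parameters. The paper also proposes geodesic mixup, a data augmentation method that synthesizes data along the geodesic path between source prototypes and the target data distribution, so that a lightweight classifier can adapt to target-specific characteristics while retaining the essential knowledge learned from the source domain. In cross-domain and cross-attack settings, the method achieves average relative improvements of 19.17% in HTER and 8.58% in AUC over recent methods.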

Link: https://arxiv.org/abs/2503.22984
Authors: Zhuowei Li, Tianchen Zhao, Xiang Xu, Zheng Zhang, Zhihua Li, Xuanbai Chen, Qin Zhang, Alessandro Bergamo, Anil K. Jain, Yifan Xing
Affiliations: Rutgers University; AWS AI Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 7 figures

Abstract:Developing a face anti-spoofing model that meets the security requirements of clients worldwide is challenging due to the domain gap between training datasets and diverse end-user test data. Moreover, for security and privacy reasons, it is undesirable for clients to share a large amount of their face data with service providers. In this work, we introduce a novel method in which the face anti-spoofing model can be adapted by the client itself to a target domain at test time using only a small sample of data while keeping model parameters and training data inaccessible to the client. Specifically, we develop a prototype-based base model and an optimal transport-guided adaptor that enables adaptation in either a lightweight training or training-free fashion, without updating base model’s parameters. Furthermore, we propose geodesic mixup, an optimal transport-based synthesis method that generates augmented training data along the geodesic path between source prototypes and target data distribution. This allows training a lightweight classifier to effectively adapt to target-specific characteristics while retaining essential knowledge learned from the source domain. In cross-domain and cross-attack settings, compared with recent methods, our method achieves average relative improvements of 19.17% in HTER and 8.58% in AUC, respectively.
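The reported gains are stated in HTER, the Half Total Error Rate, which averages the false acceptance rate (spoofs accepted as live) and the false rejection rate (live faces rejected). The sketch below computes it from scores at a fixed threshold. The label convention (1 = live, 0 = spoof), the threshold, and the `relative_improvement` formula are assumptions for illustration; the abstract does not state how its relative improvements were derived.

```python
def far_frr(labels, scores, threshold):
    """Return (FAR, FRR) with labels 1 = live, 0 = spoof; score >= threshold accepts."""
    spoof = [s for l, s in zip(labels, scores) if l == 0]
    live = [s for l, s in zip(labels, scores) if l == 1]
    far = sum(s >= threshold for s in spoof) / len(spoof)  # spoofs accepted
    frr = sum(s < threshold for s in live) / len(live)     # live faces rejected
    return far, frr


def hter(far, frr):
    """Half Total Error Rate: the mean of FAR and FRR."""
    return (far + frr) / 2.0


def relative_improvement(baseline_error, new_error):
    """Fractional reduction of an error metric (assumed definition)."""
    return (baseline_error - new_error) / baseline_error


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
far, frr = far_frr(labels, scores, threshold=0.5)
error = hter(far, frr)
gain = relative_improvement(10.0, 8.0)
```

Because HTER is an error rate, a relative improvement means a lower value, whereas for AUC it means a higher one.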

[CV-191] indiSplit: Bringing Severity Cognizance to Image Decomposition in Fluorescence Microscopy

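【Quick read】: This paper addresses the mixing of signals from multiple structures in fluorescence microscopy caused by technical limitations, that is, how to separate several cellular structures from a single superimposed image. Existing methods are trained with a fixed intensity ratio and cannot handle the arbitrary mixing ratios that occur in practice, which limits their applicability. The proposed indiSplit method introduces a suitably trained regressor network that predicts the degradation level (mixing asymmetry) of the input image, together with a degradation-specific normalization module, so that the model can perform inference under unknown and varying mixing ratios. The method handles both image splitting and bleedthrough removal, and its applicability is demonstrated empirically on five public datasets.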

Link: https://arxiv.org/abs/2503.22983
Authors: Ashesh Ashesh, Florian Jug
Affiliations: Human Technopole
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fluorescence microscopy, while being a key driver for progress in the life sciences, is also subject to technical limitations. To overcome them, computational multiplexing techniques have recently been proposed, which allow multiple cellular structures to be captured in a single image and later be unmixed. Existing image decomposition methods are trained on a set of superimposed input images and the respective unmixed target images. It is critical to note that the relative strength (mixing ratio) of the superimposed images for a given input is a priori unknown. However, existing methods are trained on a fixed intensity ratio of superimposed inputs, making them not cognizant to the range of relative intensities that can occur in fluorescence microscopy. In this work, we propose a novel method called indiSplit that is cognizant of the severity of the above mentioned mixing ratio. Our idea is based on InDI, a popular iterative method for image restoration, and an ideal starting point to embrace the unknown mixing ratio in any given input. We introduce (i) a suitably trained regressor network that predicts the degradation level (mixing asymmetry) of a given input image and (ii) a degradation-specific normalization module, enabling degradation-aware inference across all mixing ratios. We show that this method solves two relevant tasks in fluorescence microscopy, namely image splitting and bleedthrough removal, and empirically demonstrate the applicability of indiSplit on 5 public datasets. We will release all sources under a permissive license.
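To see why a fixed training ratio is limiting, it helps to write the superposition down. The sketch below assumes the simplest possible model, a convex combination `t * a + (1 - t) * b` of two structure images with mixing ratio `t`; the function name and the linear model are illustrative assumptions, not the paper's formulation.

```python
def superimpose(image_a, image_b, t):
    """Mix two same-sized images with ratio t in [0, 1]: t * a + (1 - t) * b."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("mixing ratio must lie in [0, 1]")
    return [
        [t * pa + (1.0 - t) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


# Two 1x2 toy "channels" mixed with an asymmetric ratio.
structure_a = [[0.0, 1.0]]
structure_b = [[1.0, 1.0]]
balanced = superimpose(structure_a, structure_b, 0.5)
asymmetric = superimpose(structure_a, structure_b, 0.25)
```

A network trained only on `t = 0.5` inputs never sees cases like the asymmetric one, where one structure dominates, which is the gap a predicted degradation level is meant to close.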

[CV-192] From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

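【Quick read】: This paper addresses the weak spatial perception of vision-language models (VLMs), which limits their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into the model, it aims to unlock the potential of VLMs using spatially relevant 2D image data. The key component is a novel 2D spatial data generation and annotation pipeline built on scene data with 3D ground truth, which produces a diverse set of spatial tasks ranging from basic perception to complex reasoning. Built with this pipeline, the large-scale SPAR-7M dataset and the SPAR-Bench benchmark offer a more comprehensive evaluation than existing spatial benchmarks and support both single-view and multi-view inputs. Training on SPAR-7M together with large-scale 2D datasets yields state-of-the-art performance on 2D spatial benchmarks, and further fine-tuning on 3D task-specific datasets demonstrates the dataset's value for spatial reasoning.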

Link: https://arxiv.org/abs/2503.22976
Authors: Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang
Affiliations: School of Data Science, Fudan University; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
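One way to picture how 3D ground truth can yield 2D spatial tasks is a pinhole projection followed by a question derived from known depths. The sketch below is purely hypothetical: the function names, the camera intrinsics, and the question template are assumptions for illustration and are not taken from the SPAR pipeline.

```python
def project_point(point_3d, fx, fy, cx, cy):
    """Project a camera-frame 3D point (X, Y, Z) to pixel coordinates (u, v)."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def closer_object_question(object_a, object_b):
    """Build a depth-comparison question from two (name, 3D point) pairs."""
    (name_a, pt_a), (name_b, pt_b) = object_a, object_b
    answer = name_a if pt_a[2] < pt_b[2] else name_b
    question = f"Which is closer to the camera, the {name_a} or the {name_b}?"
    return {"question": question, "answer": answer}


pixel = project_point((1.0, 2.0, 4.0), fx=100.0, fy=100.0, cx=50.0, cy=50.0)
qa = closer_object_question(("chair", (0.5, 0.0, 2.0)), ("table", (-0.3, 0.1, 3.5)))
```

The answer comes from the 3D annotation rather than from the image, so the resulting question-answer pair can supervise a model that only ever sees the 2D view.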

[CV-193] Pallet Detection And Localisation From Synthetic Data WWW

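【Quick read】: This paper addresses the high cost of manual data annotation in developing efficient pallet detection and localisation systems for the warehousing industry. Conventional approaches require extensive manual image annotation (35 seconds per image on average). The paper instead uses purely synthetic data together with geometric features derived from the side faces of pallets. A domain randomisation engine implemented in Unity removes the need for manual annotation while achieving high performance: 0.995 mAP50 for single pallets on a real-world dataset, with an average position accuracy of less than 4.2 cm and an average rotation accuracy of 8.2° for pallets within a 5-meter range, which lowers development cost.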

Link: https://arxiv.org/abs/2503.22965
Authors: Henri Mueller, Yechan Kim, Trevor Gee, Mahla Nejati
Affiliations: Centre for Automation and Robotic Engineering Science, The University of Auckland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 images, 4 tables, submitted and accepted to ACRA 2024 ( this https URL )

Abstract:The global warehousing industry is experiencing rapid growth, with the market size projected to grow at an annual rate of 8.1% from 2024 to 2030 [Grand View Research, 2021]. This expansion has led to a surge in demand for efficient pallet detection and localisation systems. While automation can significantly streamline warehouse operations, the development of such systems often requires extensive manual data annotation, with an average of 35 seconds per image, for a typical computer vision project. This paper presents a novel approach to enhance pallet detection and localisation using purely synthetic data and geometric features derived from their side faces. By implementing a domain randomisation engine in Unity, the need for time-consuming manual annotation is eliminated while achieving high-performance results. The proposed method demonstrates a pallet detection performance of 0.995 mAP50 for single pallets on a real-world dataset. Additionally, an average position accuracy of less than 4.2 cm and an average rotation accuracy of 8.2° were achieved for pallets within a 5-meter range, with the pallet positioned head-on.
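The detection score is reported as mAP50, meaning a predicted box counts as correct only when its Intersection over Union (IoU) with a ground-truth box is at least 0.5. The sketch below shows that matching criterion for axis-aligned boxes given as `(x1, y1, x2, y2)`; the box format and helper names are assumptions, and the full mAP computation over confidence-ranked detections is omitted.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(predicted, ground_truth, threshold=0.5):
    """A detection matches when its IoU reaches the threshold (0.5 for mAP50)."""
    return box_iou(predicted, ground_truth) >= threshold


half_shifted = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
identical = box_iou((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
```

A box shifted by half its width overlaps its target with an IoU of only one third, so it would not count as a hit at the 0.5 threshold.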

[CV-194] SuperEIO: Self-Supervised Event Feature Learning for Event Inertial Odometry

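【Quick read】: This paper addresses robust event feature detection and matching for event cameras under high-speed motion and challenging lighting conditions. Traditional handcrafted methods struggle with the persistent challenges caused by the motion-dependent nature of event cameras, especially under aggressive motion and high dynamic range (HDR) scenes. The proposed framework, SuperEIO, combines learning-based event-only feature detection with IMU measurements to achieve event-inertial odometry (EIO). It uses a convolutional neural network (CNN) on continuous event streams for event feature detection and a graph neural network for event descriptor matching in loop closure. TensorRT accelerates network inference, ensuring low-latency processing and real-time operation on resource-limited platforms. Experiments on multiple public datasets show that the method outperforms other event-based methods.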

Link: https://arxiv.org/abs/2503.22963
Authors: Peiyu Chen, Fuling Lin, Weipeng Guan, Peng Lu
Affiliations: Adaptive Robotic Controls Lab (ArcLab), Department of Mechanical Engineering, The University of Hong Kong, Hong Kong SAR, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Event cameras asynchronously output low-latency event streams, promising for state estimation in high-speed motion and challenging lighting conditions. As opposed to frame-based cameras, the motion-dependent nature of event cameras presents persistent challenges in achieving robust event feature detection and matching. In recent years, learning-based approaches have demonstrated superior robustness over traditional handcrafted methods in feature detection and matching, particularly under aggressive motion and HDR scenarios. In this paper, we propose SuperEIO, a novel framework that leverages the learning-based event-only detection and IMU measurements to achieve event-inertial odometry. Our event-only feature detection employs a convolutional neural network under continuous event streams. Moreover, our system adopts the graph neural network to achieve event descriptor matching for loop closure. The proposed system utilizes TensorRT to accelerate the inference speed of deep networks, which ensures low-latency processing and robust real-time operation on resource-limited platforms. Besides, we evaluate our method extensively on multiple public datasets, demonstrating its superior accuracy and robustness compared to other state-of-the-art event-based methods. We have also open-sourced our pipeline to facilitate research in the field: this https URL.

[CV-195] OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts CVPR2025

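【Quick read】: This paper addresses the challenge of evaluating the real-world interactive capabilities of Omni language models (OmniLLMs) in streaming video contexts. Existing video benchmarks fall short on streaming video understanding and proactive reasoning and lack a comprehensive evaluation of these abilities. The paper introduces OmniMMI, a comprehensive multi-modal interaction benchmark designed for streaming scenarios, with over 1,121 videos and 2,290 questions across six distinct subtasks. It also proposes a new framework, Multi-modal Multiplexing Modeling (M4), whose aim is an inference-efficient streaming model that can see and listen while generating, supporting effective interaction in streaming settings.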

Link: https://arxiv.org/abs/2503.22952
Authors: Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng
Affiliations: Beijing Institute for General Artificial Intelligence; State Key Laboratory of General Artificial Intelligence; Wangxuan Institute of Computer Technology, Peking University; X-LANCE Lab, Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at CVPR 2025

Abstract:The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see, listen while generating.

[CV-196] Towards Mobile Sensing with Event Cameras on High-mobility Resource-constrained Devices: A Survey

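【Quick read】: This survey addresses the difficulty of achieving both high accuracy and low latency with conventional visual sensing on mobile devices in high-mobility scenarios. It focuses on event cameras, an emerging paradigm with high temporal resolution, low latency, and low power consumption, which suits high-accuracy, low-latency sensing on resource-constrained mobile devices. Event data processing, however, faces substantial noisy events, a lack of semantic information, and large data volumes. The survey covers the full event-based vision stack, including fundamental principles, event abstraction methods, algorithmic advances, and hardware and software acceleration strategies, and discusses sensor fusion and real-time deployment. It also outlines future directions: improving event camera hardware with advanced optics, using neuromorphic computing for efficient processing, and integrating bio-inspired algorithms to enhance perception.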

Link: https://arxiv.org/abs/2503.22943
Authors: Haoyang Wang, Ruishan Guo, Pengtao Ma, Ciyu Ruan, Xinyu Luo, Wenhua Ding, Tianyang Zhong, Jingao Xu, Yunhao Liu, Xinlei Chen
Affiliations: unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 32 pages, 9 figures

Abstract:With the increasing complexity of mobile device applications, these devices are evolving toward high mobility. This shift imposes new demands on mobile sensing, particularly in terms of achieving high accuracy and low latency. Event-based vision has emerged as a disruptive paradigm, offering high temporal resolution, low latency, and energy efficiency, making it well-suited for high-accuracy and low-latency sensing tasks on high-mobility platforms. However, the presence of substantial noisy events, the lack of inherent semantic information, and the large data volume pose significant challenges for event-based data processing on resource-constrained mobile devices. This paper surveys the literature over the period 2014-2024 and provides a comprehensive overview of event-based mobile sensing systems, covering fundamental principles, event abstraction methods, algorithmic advancements, and hardware and software acceleration strategies. We also discuss key applications of event cameras in mobile sensing, including visual odometry, object tracking, optical flow estimation, and 3D reconstruction, while highlighting the challenges associated with event data processing, sensor fusion, and real-time deployment. Furthermore, we outline future research directions, such as improving event camera hardware with advanced optics, leveraging neuromorphic computing for efficient processing, and integrating bio-inspired algorithms to enhance perception. To support ongoing research, we provide an open-source Online Sheet with curated resources and recent developments. We hope this survey serves as a valuable reference, facilitating the adoption of event-based vision across diverse applications.

[CV-197] Enhancing Learnable Descriptive Convolutional Vision Transformer for Face Anti-Spoofing

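【Quick read】: This paper aims to improve how face anti-spoofing (FAS) models distinguish live from spoof facial features, particularly against face presentation attacks. It proposes three training strategies to strengthen LDCformer, a Vision Transformer based on Learnable Descriptive Convolution (LDC). Dual-attention supervision learns fine-grained liveness features guided by regional live/spoof attentions; self-challenging supervision improves feature discriminability by generating challenging training data; and a transitional triplet mining strategy narrows the cross-domain gap while preserving the transitional relationship between live and spoof features, which improves domain generalization. Together these strategies substantially improve the model's feature characterization capability.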

Link: https://arxiv.org/abs/2503.22936
Authors: Pei-Kai Huang, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Chiou-Ting Hsu
Affiliations: National Tsing Hua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face anti-spoofing (FAS) heavily relies on identifying live/spoof discriminative features to counter face presentation attacks. Recently, we proposed LDCformer to successfully incorporate the Learnable Descriptive Convolution (LDC) into ViT, to model long-range dependency of locally descriptive features for FAS. In this paper, we propose three novel training strategies to effectively enhance the training of LDCformer to largely boost its feature characterization capability. The first strategy, dual-attention supervision, is developed to learn fine-grained liveness features guided by regional live/spoof attentions. The second strategy, self-challenging supervision, is designed to enhance the discriminability of the features by generating challenging training data. In addition, we propose a third training strategy, transitional triplet mining strategy, through narrowing the cross-domain gap while maintaining the transitional relationship between live and spoof features, to enlarge the domain-generalization capability of LDCformer. Extensive experiments show that LDCformer under joint supervision of the three novel training strategies outperforms previous methods.

[CV-198] Bi-Level Multi-View fuzzy Clustering with Exponential Distance

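【Quick read】: This paper extends fuzzy clustering to multi-view settings. It first proposes an exponential multi-view fuzzy c-means algorithm (E-MVFCM), a centralized multi-view clustering method that incorporates heat-kernel coefficients (H-KC) and weight factors. It then proposes an exponential bi-level multi-view fuzzy c-means algorithm (EB-MVFCM), which computes feature and weight factors automatically and simultaneously, and gives explicit forms of the heat-kernel coefficients to simplify generating the heat kernel \mathcal{K}(t) in powers of the proper time t during clustering. These improvements are intended to address the manual parameter tuning and high complexity of conventional multi-view clustering, improving clustering efficiency and accuracy. All proposed tools and functions will be made openly available at the specified URL.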

Link: https://arxiv.org/abs/2503.22932
Authors: Kristina P. Sinaga
Affiliations: Istituto di Scienza e Tecnologie dell’Informazione, Italian National Research Council, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Abstract:In this study, we propose extensions of fuzzy c-means (FCM) clustering to multi-view environments. First, we introduce an exponential multi-view FCM (E-MVFCM). E-MVFCM is a centralized MVC that takes into account heat-kernel coefficients (H-KC) and weight factors. Second, we propose an exponential bi-level multi-view fuzzy c-means clustering (EB-MVFCM). Unlike E-MVFCM, EB-MVFCM computes feature and weight factors automatically and simultaneously. Like E-MVFCM, EB-MVFCM presents explicit forms of the H-KC to simplify the generation of the heat kernel \mathcal{K}(t) in powers of the proper time t during the clustering process. All the features used in this study, including tools and functions of the proposed algorithms, will be made available at this https URL.
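For readers unfamiliar with the baseline being extended, the sketch below shows the membership update of standard single-view fuzzy c-means with Euclidean distance and fuzzifier `m`. It deliberately omits the exponential distance, heat-kernel coefficients, and view weights that the paper adds, since the abstract does not give their formulas; the function name is hypothetical.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard FCM update: u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1))."""
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


# One point, two cluster centers at distances 1 and 3.
u = fcm_memberships((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0)])
u_equal = fcm_memberships((0.0, 0.0), [(1.0, 0.0), (-1.0, 0.0)])
```

Memberships always sum to one, and the closer center receives the larger share; a multi-view variant has to combine such memberships across views, which is where the weight factors come in.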

[CV-199] Unsupervised Feature Disentanglement and Augmentation Network for One-class Face Anti-spoofing

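【Quick read】: This paper addresses the limitations of the two mainstream approaches to face anti-spoofing (FAS): two-class FAS methods risk overfitting to the attacks seen in training, while one-class FAS methods generalize well to unseen attacks but are less robust to domain information entangled within the liveness features. It proposes the Unsupervised Feature Disentanglement and Augmentation Network (UFDANet). UFDANet separates liveness and domain features through unsupervised feature disentanglement, synthesizes new liveness features of unseen spoof classes with an out-of-distribution liveness feature augmentation scheme, and generates unseen domain features with a domain feature augmentation routine. This improves generalization and feature representation, so that the method outperforms previous one-class FAS methods and performs comparably to state-of-the-art two-class methods.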

Link: https://arxiv.org/abs/2503.22929
Authors: Pei-Kai Huang, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Hao-Chiang Shao, Chiou-Ting Hsu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face anti-spoofing (FAS) techniques aim to enhance the security of facial identity authentication by distinguishing authentic live faces from deceptive attempts. While two-class FAS methods risk overfitting to training attacks to achieve better performance, one-class FAS approaches handle unseen attacks well but are less robust to domain information entangled within the liveness features. To address this, we propose an Unsupervised Feature Disentanglement and Augmentation Network (UFDANet), a one-class FAS technique that enhances generalizability by augmenting face images via disentangled features. UFDANet employs a novel unsupervised feature disentangling method to separate the liveness and domain features, facilitating discriminative feature learning. It integrates an out-of-distribution liveness feature augmentation scheme to synthesize new liveness features of unseen spoof classes, which deviate from the live class, thus enhancing the representability and discriminability of liveness features. Additionally, UFDANet incorporates a domain feature augmentation routine to synthesize unseen domain features, thereby achieving better generalizability. Extensive experiments demonstrate that the proposed UFDANet outperforms previous one-class FAS methods and achieves comparable performance to state-of-the-art two-class FAS methods.

[CV-200] DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID CVPR2025

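【Quick read】: This paper addresses clothes-changing person re-identification (CC-ReID). Existing methods either model body shape with additional modalities (such as silhouette, pose, and body mesh), which may overlook other important biometric traits such as gender, age, and style, or introduce supervision through additional labels, which are discrete and cannot describe an individual comprehensively. The paper proposes DIFFER (Disentangle Identity Features From Entangled Representations), an adversarial learning method that uses textual descriptions to disentangle identity features. DIFFER introduces NBDetach, a mechanism that uses the separable nature of text descriptions as supervision, partitions the feature space into distinct subspaces, and separates identity-related features from non-biometric features through gradient reversal layers. DIFFER achieves state-of-the-art performance on four benchmark datasets (LTCC, PRCC, CelebReID-Light, and CCVID), with clear top-1 accuracy gains over the baseline (for example, 3.6% on LTCC).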

Link: https://arxiv.org/abs/2503.22912
Authors: Xin Liang, Yogesh S Rawat
Affiliations: Center for Research in Computer Vision, University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR 2025

Abstract:Clothes-changing person re-identification (CC-ReID) aims to recognize individuals under different clothing scenarios. Current CC-ReID approaches either concentrate on modeling body shape using additional modalities including silhouette, pose, and body mesh, potentially causing the model to overlook other critical biometric traits such as gender, age, and style, or they incorporate supervision through additional labels that the model tries to disregard or emphasize, such as clothing or personal attributes. However, these annotations are discrete in nature and do not capture comprehensive descriptions. In this work, we propose DIFFER: Disentangle Identity Features From Entangled Representations, a novel adversarial learning method that leverages textual descriptions to disentangle identity features. Recognizing that image features inherently mix inseparable information, DIFFER introduces NBDetach, a mechanism designed for feature disentanglement by leveraging the separable nature of text descriptions as supervision. It partitions the feature space into distinct subspaces and, through gradient reversal layers, effectively separates identity-related features from non-biometric features. We evaluate DIFFER on 4 different benchmark datasets (LTCC, PRCC, CelebReID-Light, and CCVID) to demonstrate its effectiveness and provide state-of-the-art performance across all the benchmarks. DIFFER consistently outperforms the baseline method, with improvements in top-1 accuracy of 3.6% on LTCC, 3.4% on PRCC, 2.5% on CelebReID-Light, and 1% on CCVID. Our code can be found here.
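The gradient reversal layer mentioned in the abstract is a small, well-known building block: it passes features through unchanged in the forward pass and flips the sign of the gradient (scaled by a factor) in the backward pass, so the feature extractor is pushed to remove whatever the downstream head can predict. The framework-free sketch below illustrates only that behaviour; the class name and the `lam` parameter are illustrative, and a real model would implement this through its autograd library rather than by hand.

```python
class GradientReversal:
    """Identity in the forward pass, negated and scaled gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # strength of the adversarial signal (assumed hyperparameter)

    def forward(self, features):
        # Features pass through unchanged.
        return list(features)

    def backward(self, grad_output):
        # The upstream extractor receives the reversed gradient.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
out = layer.forward([0.2, -1.0, 3.0])
grad_in = layer.backward([1.0, -2.0, 0.0])
```

With the sign flipped, minimising the head's loss downstream becomes maximising it upstream, which is what drives non-biometric information out of the identity subspace.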

[CV-201] Enhancing DeepLabV3+ to Fuse Aerial and Satellite Images for Semantic Segmentation

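[Quick read]: This paper addresses the fusion of multi-source remote sensing imagery (aerial and satellite) for land cover segmentation, with the goal of improving the robustness and performance of the DeepLabV3+ architecture on multimodal image segmentation. Its key contribution is a new block of transposed convolutional layers that upsamples a second input, derived from satellite images, and fuses it with high-level features, strengthening the model's integration of satellite information and enriching segmentation through fusion with aerial images. In experiments, the method reaches a mean Intersection over Union (mIoU) of 84.91% without data augmentation.

Link: https://arxiv.org/abs/2503.22909
Authors: Anas Berka,Mohamed El Hajji,Raphael Canals,Youssef Es-saady,Adel Hafiane
Affiliations: IRF-SIC Laboratory, Ibnou Zohr University; INSA CVL, University of Orleans, PRISME Laboratory, EA 4229; University of Orleans, Polytech, PRISME Laboratory EA 4229
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract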

Abstract:Aerial and satellite imagery are inherently complementary remote sensing sources, offering high-resolution detail alongside expansive spatial coverage. However, the use of these sources for land cover segmentation introduces several challenges, prompting the development of a variety of segmentation methods. Among these approaches, the DeepLabV3+ architecture is considered a promising approach in the field of single-source image segmentation. However, despite its reliable results for segmentation, there is still a need to increase its robustness and improve its performance. This is particularly crucial for multimodal image segmentation, where the fusion of diverse types of information is essential. An interesting approach involves enhancing this architectural framework through the integration of novel components and the modification of certain internal processes. In this paper, we enhance the DeepLabV3+ architecture by introducing a new transposed convolutional layers block for upsampling a second entry to fuse it with high level features. This block is designed to amplify and integrate information from satellite images, thereby enriching the segmentation process through fusion with aerial images. For experiments, we used the Land Cover from Aerial Imagery dataset (this http URL) for aerial images, alongside the corresponding dataset sourced from Sentinel 2 data. Through the fusion of both sources, the model achieved a mean Intersection over Union (mIoU) of 84.91% without data augmentation.
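
Since the headline result is an mIoU figure, a small worked example of the metric may help. The sketch below computes per-class IoU on flattened label maps and averages over the classes that occur. The function name `mean_iou` and the toy labels are assumptions for illustration; the paper's evaluation code may handle absent classes differently.

```python
# Minimal sketch of mean Intersection over Union (mIoU) on flattened
# label maps. Assumption: classes absent from both prediction and
# target are skipped; other conventions exist.

def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that never appear
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 6-pixel example with 3 classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Class 0: 1/3, class 1: 2/3, class 2: 1/2, so the mean is 0.5.
print(mean_iou(pred, target, num_classes=3))
```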

[CV-202] SocialGen: Modeling Multi-Human Social Interaction with Language Models

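[Quick read]: This paper tackles the important but difficult problem of modeling multi-human social interaction. Most existing methods are limited to two-person interactions. SocialGen, the first unified motion-language model for this setting, proposes a novel social motion representation that tokenizes the motions of an arbitrary number of individuals and aligns them with the language space. This alignment lets the model draw on rich pretrained linguistic knowledge to better understand and reason about human social behavior. To address data scarcity, the authors curate SocialX, a comprehensive multi-human interaction dataset with textual annotations, and use it to establish the first comprehensive benchmark for multi-human interaction tasks. The key idea is cross-modal alignment of motion and language, which improves both understanding and generation of multi-human interactions.

Link: https://arxiv.org/abs/2503.22906
Authors: Heng Yu,Juze Zhang,Changan Chen,Tiange Xiang,Yusu Fang,Juan Carlos Niebles,Ehsan Adeli
Affiliations: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract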

Abstract:Human interactions in everyday life are inherently social, involving engagements with diverse individuals across various contexts. Modeling these social interactions is fundamental to a wide range of real-world applications. In this paper, we introduce SocialGen, the first unified motion-language model capable of modeling interaction behaviors among varying numbers of individuals, to address this crucial yet challenging problem. Unlike prior methods that are limited to two-person interactions, we propose a novel social motion representation that supports tokenizing the motions of an arbitrary number of individuals and aligning them with the language space. This alignment enables the model to leverage rich, pretrained linguistic knowledge to better understand and reason about human social behaviors. To tackle the challenges of data scarcity, we curate a comprehensive multi-human interaction dataset, SocialX, enriched with textual annotations. Leveraging this dataset, we establish the first comprehensive benchmark for multi-human interaction tasks. Our method achieves state-of-the-art performance across motion-language tasks, setting a new standard for multi-human interaction modeling.

[CV-203] MedCL: Learning Consistent Anatomy Distribution for Scribble-supervised Medical Image Segmentation

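[Quick read]: This paper addresses the high cost and labor of building large-scale, fully annotated medical image datasets, and in particular the difficulty conventional weakly supervised methods have in segmenting from sparse annotations (such as scribbles) for diseases in the long tail of the distribution. It proposes MedCL, a clustering-based, scribble-supervised framework whose key idea is to exploit the anatomy distribution priors of medical labels: features are mixed with intra- and inter-image mix operations, then clustered, and the anatomy distribution is regularized at both local and global levels so that the model learns the inherent anatomy distribution of the labels. Combined with a small amount of weak supervision, the method can segment both regular organs and challenging irregular pathologies.

Link: https://arxiv.org/abs/2503.22890
Authors: Ke Zhang,Vishal M. Patel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract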

Abstract:Curating large-scale fully annotated datasets is expensive, laborious, and cumbersome, especially for medical images. Several methods have been proposed in the literature that make use of weak annotations in the form of scribbles. However, these approaches require large amounts of scribble annotations, and are only applied to the segmentation of regular organs, which are often unavailable for the disease species that fall in the long-tailed distribution. Motivated by the fact that the medical labels have anatomy distribution priors, we propose a scribble-supervised clustering-based framework, called MedCL, to learn the inherent anatomy distribution of medical labels. Our approach consists of two steps: i) Mix the features with intra- and inter-image mix operations, and ii) Perform feature clustering and regularize the anatomy distribution at both local and global levels. Combined with a small amount of weak supervision, the proposed MedCL is able to segment both regular organs and challenging irregular pathologies. We implement MedCL based on SAM and UNet backbones, and evaluate the performance on three open datasets of regular structure (MSCMRseg), multiple organs (BTCV) and irregular pathology (MyoPS). It is shown that even with less scribble supervision, MedCL substantially outperforms the conventional segmentation methods. Our code is available at this https URL.

[CV-204] AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs

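[Quick read]: This paper addresses the scarcity and inconsistency of pose transition annotations in human pose retrieval. Existing approaches rely on costly human annotation or heuristic rule-based generation, which limits dataset scalability and diversity. The paper proposes AutoComPose, a framework whose key innovation is using multimodal large language models (MLLMs) to automatically generate rich, structured pose transition descriptions. Annotation quality is improved by structuring transitions into fine-grained body part movements and adding mirrored and swapped variations, while a cyclic consistency constraint keeps forward and reverse transitions logically coherent. The authors also construct and release two dedicated benchmarks, AIST-CPR and PoseFixCPR, which supplement prior datasets with enhanced attributes. Experiments show that retrieval models trained with AutoComPose outperform those trained on human-annotated or heuristic-based data, reducing annotation cost while improving retrieval quality and providing a scalable foundation for future composed pose retrieval research.

Link: https://arxiv.org/abs/2503.22884
Authors: Yi-Ting Shen,Sungmin Eum,Doheon Lee,Rohit Shete,Chiao-Yi Wang,Heesung Kwon,Shuvra S. Bhattacharyya
Affiliations: University of Maryland, College Park; DEVCOM Army Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract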

Abstract:Composed pose retrieval (CPR) enables users to search for human poses by specifying a reference pose and a transition description, but progress in this field is hindered by the scarcity and inconsistency of annotated pose transitions. Existing CPR datasets rely on costly human annotations or heuristic-based rule generation, both of which limit scalability and diversity. In this work, we introduce AutoComPose, the first framework that leverages multimodal large language models (MLLMs) to automatically generate rich and structured pose transition descriptions. Our method enhances annotation quality by structuring transitions into fine-grained body part movements and introducing mirrored/swapped variations, while a cyclic consistency constraint ensures logical coherence between forward and reverse transitions. To advance CPR research, we construct and release two dedicated benchmarks, AIST-CPR and PoseFixCPR, supplementing prior datasets with enhanced attributes. Extensive experiments demonstrate that training retrieval models with AutoComPose yields superior performance over human-annotated and heuristic-based methods, significantly reducing annotation costs while improving retrieval quality. Our work pioneers the automatic annotation of pose transitions, establishing a scalable foundation for future CPR research.
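
The abstract mentions mirrored variations of transition descriptions. One plausible reading, sketched below, is a text-level augmentation that swaps "left" and "right". This is an assumption about what mirroring could look like, not the authors' pipeline, and `mirror_description` is a hypothetical helper. The final check only confirms that mirroring twice restores the input; it is not the paper's cyclic consistency constraint between forward and reverse transitions.

```python
import re

# Assumption: mirroring is approximated by swapping the words
# "left" and "right"; the real AutoComPose procedure may differ.
_SWAP = {"left": "right", "right": "left"}


def mirror_description(text):
    """Swap whole-word 'left'/'right', keeping initial capitalisation."""
    def repl(match):
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped

    return re.sub(r"\b(left|right)\b", repl, text, flags=re.IGNORECASE)


desc = "Raise your left arm and bend the right knee."
mirrored = mirror_description(desc)
print(mirrored)

# Sanity check: mirroring is its own inverse.
assert mirror_description(mirrored) == desc
```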

[CV-205] Pairwise Matching of Intermediate Representations for Fine-grained Explainability

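[Quick read]: This paper addresses the problem that, for images from fine-grained categories, existing explainability techniques for deep learning models produce explanations too diffuse to be useful or interpretable. It proposes a new explainability method, PAIR-X, which combines intermediate model activations with backpropagated relevance scores to generate fine-grained, highly localized pairwise visual explanations. The key to the approach is fusing these two sources of information to make explanations more focused and accurate, improving explainability for fine-grained image matching tasks.

Link: https://arxiv.org/abs/2503.22881
Authors: Lauren Shrack,Timm Haucke,Antoine Salaün,Arjun Subramonian,Sara Beery
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract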

Abstract:The differences between images belonging to fine-grained categories are often subtle and highly localized, and existing explainability techniques for deep learning models are often too diffuse to provide useful and interpretable explanations. We propose a new explainability method (PAIR-X) that leverages both intermediate model activations and backpropagated relevance scores to generate fine-grained, highly-localized pairwise visual explanations. We use animal and building re-identification (re-ID) as a primary case study of our method, and we demonstrate qualitatively improved results over a diverse set of explainability baselines on 35 public re-ID datasets. In interviews, animal re-ID experts were in unanimous agreement that PAIR-X was an improvement over existing baselines for deep model explainability, and suggested that its visualizations would be directly applicable to their work. We also propose a novel quantitative evaluation metric for our method, and demonstrate that PAIR-X visualizations appear more plausible for correct image matches than incorrect ones even when the model similarity score for the pairs is the same. By improving interpretability, PAIR-X enables humans to better distinguish correct and incorrect matches. Our code is available at: this https URL

[CV-206] The Marine Debris Forward-Looking Sonar Datasets

[Quick read]: This paper addresses the fact that sonar perception for underwater robotics is limited by AI systems' need for large training datasets, and fills a gap in public sonar-modality data. Its contribution is the Marine Debris Forward-Looking Sonar datasets, captured in three settings (water tank, turntable, and flooded quarry) to increase diversity, and supporting several computer vision tasks: object classification, object detection, semantic segmentation, patch matching, and unsupervised learning. The annotated, diverse sonar data is intended as a foundational resource for related research.

Link: https://arxiv.org/abs/2503.22880
Authors: Matias Valdenegro-Toro,Deepan Chakravarthi Padmanabhan,Deepak Singh,Bilal Wehbe,Yvan Petillot
Affiliations: Department of Artificial Intelligence, University of Groningen; Bonn-Rhein-Sieg University of Applied Sciences; Worcester Polytechnic Institute; Ocean Systems Lab, Heriot-Watt University; German Research Center for Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 12 figures, Oceans Brest 2025 camera ready

Click to view abstract

Abstract:Sonar sensing is fundamental for underwater robotics, but limited by capabilities of AI systems, which need large training datasets. Public data in sonar modalities is lacking. This paper presents the Marine Debris Forward-Looking Sonar datasets, with three different settings (watertank, turntable, flooded quarry) increasing dataset diversity and multiple computer vision tasks: object classification, object detection, semantic segmentation, patch matching, and unsupervised learning. We provide full dataset description, basic analysis and initial results for some tasks. We expect the research community will benefit from this dataset, which is publicly available at this https URL

[CV-207] VizFlyt: Perception-centric Pedagogical Framework For Autonomous Aerial Robots ICRA2025

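[Quick read]: This paper addresses the lack of a reliable testbed for courses on autonomous aerial robots. In conventional lab setups, robots can crash into obstacles and damage hardware, which slows teaching and algorithm development. The paper presents VizFlyt, an open-source, perception-centric Hardware-In-The-Loop (HITL) photorealistic testing framework. Its key idea is to take pose from an external localization system and use 3D Gaussian Splatting to render photorealistic visual sensor input in real time, so that autonomy algorithms can be tested on aerial robots without the risk of a crash. The system achieves an update rate of over 100 Hz. Building on VizFlyt, the authors also propose an open-source, open-hardware curriculum for future courses. Real-world HITL experiments on several course projects show the efficacy of the system and its range of potential use cases.

Link: https://arxiv.org/abs/2503.22876
Authors: Kushagra Srivastava,Rutwik Kulkarni,Manoj Velmurugan,Nitin J. Sanket
Affiliations: Worcester Polytechnic Institute; RTX Technology Research Center; National Science Foundation I/UCRC
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICRA 2025. Project Page: this https URL

Click to view abstract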

Abstract:Autonomous aerial robots are becoming commonplace in our lives. Hands-on aerial robotics courses are pivotal in training the next-generation workforce to meet the growing market demands. Such an efficient and compelling course depends on a reliable testbed. In this paper, we present VizFlyt, an open-source perception-centric Hardware-In-The-Loop (HITL) photorealistic testing framework for aerial robotics courses. We utilize pose from an external localization system to hallucinate real-time and photorealistic visual sensors using 3D Gaussian Splatting. This enables stress-free testing of autonomy algorithms on aerial robots without the risk of crashing into obstacles. We achieve over 100Hz of system update rate. Lastly, we build upon our past experiences of offering hands-on aerial robotics courses and propose a new open-source and open-hardware curriculum based on VizFlyt for the future. We test our framework on various course projects in real-world HITL experiments and present the results showing the efficacy of such a system and its large potential use cases. Code, datasets, hardware guides and demo videos are available at this https URL

[CV-208] SIGHT: Single-Image Conditioned Generation of Hand Trajectories for Hand-Object Interaction

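[Quick read]: This paper addresses the generation of realistic and diverse 3D hand trajectories from a single image of an object, either on its own or within a hand-object interaction scene. The task is highly ambiguous: it requires correctly identifying the object of interest and possibly the correct interaction among many alternatives. The proposed SIGHT-Fusion system consists of a curated pipeline that extracts visual features of hand-object interaction details from egocentric videos of object manipulation, and a diffusion-based conditional motion generation model that processes those features. The method is trained on video data with corresponding hand trajectory annotations, without supervision in the form of action labels.

Link: https://arxiv.org/abs/2503.22869
Authors: Alexey Gavryushin,Florian Redhardt,Gaia Di Lorenzo,Luc Van Gool,Marc Pollefeys,Kaichun Mo,Xi Wang
Affiliations: ETH Zürich; KU Leuven; INSAIT, Sofia University; Microsoft Research; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract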

Abstract:We introduce a novel task of generating realistic and diverse 3D hand trajectories given a single image of an object, which could be involved in a hand-object interaction scene or pictured by itself. When humans grasp an object, appropriate trajectories naturally form in our minds to use it for specific tasks. Hand-object interaction trajectory priors can greatly benefit applications in robotics, embodied AI, augmented reality and related fields. However, synthesizing realistic and appropriate hand trajectories given a single object or hand-object interaction image is a highly ambiguous task, requiring to correctly identify the object of interest and possibly even the correct interaction among many possible alternatives. To tackle this challenging problem, we propose the SIGHT-Fusion system, consisting of a curated pipeline for extracting visual features of hand-object interaction details from egocentric videos involving object manipulation, and a diffusion-based conditional motion generation model processing the extracted features. We train our method given video data with corresponding hand trajectory annotations, without supervision in the form of action labels. For the evaluation, we establish benchmarks utilizing the first-person FPHAB and HOI4D datasets, testing our method against various baselines and using multiple metrics. We also introduce task simulators for executing the generated hand trajectories and reporting task success rates as an additional metric. Experiments show that our method generates more appropriate and realistic hand trajectories than baselines and presents promising generalization capability on unseen objects. The accuracy of the generated hand trajectories is confirmed in a physics simulation setting, showcasing the authenticity of the created sequences and their applicability in downstream uses.

[CV-209] Zero-shot Domain Generalization of Foundational Models for 3D Medical Image Segmentation: An Experimental Study

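[Quick read]: This paper addresses domain shift caused by variations in imaging modalities and acquisition protocols, which limits the cross-domain generalization of medical image segmentation models. It examines whether promptable foundation models (FMs) can achieve domain generalization (DG) through smart prompting techniques. A comprehensive experimental study evaluates 6 medical segmentation FMs on 12 public datasets spanning multiple modalities and anatomies, showing the potential of promptable FMs to bridge the domain gap. By probing multiple facets of zero-shot DG, the study offers insights and directions for future research.

Link: https://arxiv.org/abs/2503.22862
Authors: Soumitri Chattopadhyay,Basar Demir,Marc Niethammer
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract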

Abstract:Domain shift, caused by variations in imaging modalities and acquisition protocols, limits model generalization in medical image segmentation. While foundation models (FMs) trained on diverse large-scale data hold promise for zero-shot generalization, their application to volumetric medical data remains underexplored. In this study, we examine their ability towards domain generalization (DG), by conducting a comprehensive experimental study encompassing 6 medical segmentation FMs and 12 public datasets spanning multiple modalities and anatomies. Our findings reveal the potential of promptable FMs in bridging the domain gap via smart prompting techniques. Additionally, by probing into multiple facets of zero-shot DG, we offer valuable insights into the viability of FMs for DG and identify promising avenues for future research.

[CV-210] GmNet: Revisiting Gating Mechanisms From A Frequency View

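[Quick read]: This paper addresses the lack of theoretical analysis of gating mechanisms in neural networks, in particular how they handle long-range dependencies and adaptively control information flow. It systematically studies, from a frequency perspective, how gating mechanisms affect training dynamics, and examines the interaction between the element-wise product and activation functions in managing responses to different frequency components. Building on these insights, it proposes the Gating Mechanism Network (GmNet), a lightweight model that makes efficient use of information across frequency components and reduces the low-frequency bias present in existing lightweight models, achieving a strong balance of effectiveness and efficiency on image classification.

Link: https://arxiv.org/abs/2503.22841
Authors: Yifan Wang,Xu Ma,Yitian Zhang,Zhongruo Wang,Sung-Cheol Kim,Vahid Mirjalili,Vidya Renganathan,Yun Fu
Affiliations: Northeastern University; UC Davis; FM-Global
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract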

Abstract:Gating mechanisms have emerged as an effective strategy integrated into model designs beyond recurrent neural networks for addressing long-range dependency problems. Broadly understood, they provide adaptive control over the information flow while maintaining computational efficiency. However, there is a lack of theoretical analysis on how the gating mechanism works in neural networks. In this paper, inspired by the convolution theorem, we systematically explore the effect of gating mechanisms on the training dynamics of neural networks from a frequency perspective. We investigate the interaction between the element-wise product and activation functions in managing the responses to different frequency components. Leveraging these insights, we propose a Gating Mechanism Network (GmNet), a lightweight model designed to efficiently utilize the information of various frequency components. It minimizes the low-frequency bias present in existing lightweight models. GmNet achieves impressive performance in terms of both effectiveness and efficiency in the image classification task.
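
The frequency view rests on the convolution theorem: an element-wise product in the signal domain corresponds to a (circular) convolution of the spectra, which is how a gate can mix frequency components. The sketch below checks that identity numerically for a short 1D signal using a hand-written DFT. The signals `x` and `g` are arbitrary toy values, and nothing here reproduces GmNet itself.

```python
import cmath

# Numerical check of the convolution theorem for the discrete Fourier
# transform: DFT(x * g)[k] == (1/N) * sum_m X[m] * G[(k - m) mod N].
# Assumption: `x` stands in for a feature signal and `g` for a gate.


def dft(x):
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_convolve(a, b):
    n = len(a)
    return [sum(a[m] * b[(k - m) % n] for m in range(n)) for k in range(n)]


x = [1.0, 2.0, 0.0, -1.0]
g = [0.5, 1.0, 0.5, 0.0]

# Spectrum of the element-wise (gated) product.
lhs = dft([xi * gi for xi, gi in zip(x, g)])
# Scaled circular convolution of the two spectra.
rhs = [v / len(x) for v in circular_convolve(dft(x), dft(g))]

max_err = max(abs(l - r) for l, r in zip(lhs, rhs))
print(max_err)  # close to zero, up to floating-point error
```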

[CV-211] DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

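[Quick read]: This paper addresses the significant computational bottleneck that attention imposes on Multimodal Diffusion Transformers (MMDiT) in text-to-image generation, which limits scalability and efficiency. It proposes DiTFastAttnV2, a post-training compression method that accelerates attention in MMDiT. Based on an in-depth analysis of MMDiT attention patterns, it introduces head-wise arrow attention and caching mechanisms that adjust attention heads dynamically, together with an efficient fused kernel for further acceleration. Local metric methods and optimization techniques cut the search for a good compression scheme to a few minutes while preserving generation quality. On 2K image generation, DiTFastAttnV2 reduces attention FLOPs by 68% and delivers a 1.5x end-to-end speedup without compromising visual fidelity.

Link: https://arxiv.org/abs/2503.22796
Authors: Hanling Zhang,Rundong Su,Zhihang Yuan,Pengtao Chen,Mingzhu Shen,Yibo Fan,Shengen Yan,Guohao Dai,Yu Wang
Affiliations: Tsinghua University; Infinigence AI; Fudan University; Imperial College London; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract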

Abstract:Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT’s attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.
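
To relate the two reported numbers (68% fewer attention FLOPs, 1.5x end-to-end speedup), a back-of-the-envelope Amdahl's-law estimate can be useful. The sketch assumes that runtime scales linearly with FLOPs and that only attention gets faster; both are simplifications, and real kernels rarely behave this cleanly. The function names are hypothetical.

```python
# Amdahl-style estimate linking an attention-only saving to end-to-end
# speedup. Assumptions: runtime is proportional to FLOPs, and all
# non-attention work is unchanged.


def end_to_end_speedup(attn_fraction, attn_reduction):
    """Speedup when a share `attn_fraction` of runtime shrinks by `attn_reduction`."""
    remaining = (1.0 - attn_fraction) + attn_fraction * (1.0 - attn_reduction)
    return 1.0 / remaining


def implied_attention_fraction(speedup, attn_reduction):
    """Attention share of runtime that would explain a given speedup."""
    return (1.0 - 1.0 / speedup) / attn_reduction


frac = implied_attention_fraction(1.5, 0.68)
check = end_to_end_speedup(frac, 0.68)
print(round(frac, 3), round(check, 3))
```

Under these assumptions, the reported figures would be consistent with attention taking roughly half of the original runtime.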

[CV-212] Patronus: Bringing Transparency to Diffusion Models with Prototypes

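[Quick read]: This paper addresses the opacity of diffusion-based generative models such as Denoising Diffusion Probabilistic Models (DDPMs): their step-by-step denoising process is hard to interpret, leaving key aspects of the generation mechanism unexplained. It proposes Patronus, an interpretable diffusion model inspired by ProtoPNet. The key idea is to integrate a prototypical network into DDPMs, extracting prototypes and conditioning generation on their prototype activation vector. This design shows the learned prototypes and how they influence generation, and supports downstream tasks such as image manipulation with more transparent and controlled modifications. Patronus can also reveal shortcut learning in the generation process by detecting unwanted correlations between learned prototypes. The method requires no annotations or text prompts. The code is available at the link provided by the authors.

Link: https://arxiv.org/abs/2503.22782
Authors: Nina Weng,Aasa Feragen,Siavash Bigdeli
Affiliations: Technical University of Denmark
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract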

Abstract:Diffusion-based generative models, such as Denoising Diffusion Probabilistic Models (DDPMs), have achieved remarkable success in image generation, but their step-by-step denoising process remains opaque, leaving critical aspects of the generation mechanism unexplained. To address this, we introduce Patronus, an interpretable diffusion model inspired by ProtoPNet. Patronus integrates a prototypical network into DDPMs, enabling the extraction of prototypes and conditioning of the generation process on their prototype activation vector. This design enhances interpretability by showing the learned prototypes and how they influence the generation process. Additionally, the model supports downstream tasks like image manipulation, enabling more transparent and controlled modifications. Moreover, Patronus could reveal shortcut learning in the generation process by detecting unwanted correlations between learned prototypes. Notably, Patronus operates entirely without any annotations or text prompts. This work opens new avenues for understanding and controlling diffusion models through prototype-based interpretability. Our code is available at this https URL.
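
The "prototype activation vector" can be pictured with a ProtoPNet-style computation: each prototype is compared against every patch feature and the best match is kept. The sketch below uses a log-ratio similarity on squared distance, which is one common ProtoPNet-style choice. The abstract does not specify Patronus's encoder or similarity function, so every name and value here is an assumption for illustration.

```python
import math

# ProtoPNet-style sketch: one activation per prototype, obtained by
# max-pooling a similarity score over all patch features.
# Assumption: the log-ratio similarity and `eps` are illustrative
# choices, not details confirmed for Patronus.


def prototype_activations(patches, prototypes, eps=1e-4):
    activations = []
    for proto in prototypes:
        best = -math.inf
        for patch in patches:
            d2 = sum((a - b) ** 2 for a, b in zip(patch, proto))
            similarity = math.log((d2 + 1.0) / (d2 + eps))
            best = max(best, similarity)  # keep the closest patch
        activations.append(best)
    return activations


# Three 2-D patch features and two prototypes.
patches = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
prototypes = [[1.0, 0.0], [5.0, 5.0]]

acts = prototype_activations(patches, prototypes)
print(acts)  # first prototype matches a patch exactly, so it scores highest
```

In a conditional diffusion model, a vector like `acts` would be fed to the denoiser as the conditioning signal; editing individual entries is what makes prototype-level manipulation possible.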

[CV-213] Ancestral Mamba: Enhancing Selective Discriminant Space Model with Online Visual Prototype Learning for Efficient and Robust Discriminant Approach

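[Quick read]: This paper addresses continual learning from non-stationary data streams in computer graphics, where a model must adapt to new visual patterns while mitigating catastrophic forgetting. Existing methods often fail to capture and represent the essential characteristics of evolving visual concepts, which limits their use in dynamic graphics tasks. The proposed Ancestral Mamba integrates online prototype learning into a selective discriminant space model for efficient and robust online continual learning. Its core components are Ancestral Prototype Adaptation (APA), which continuously refines and builds on learned visual prototypes, and Mamba Feedback (MF), which provides targeted feedback for challenging visual patterns. APA lets the model keep adapting its prototypes on top of ancestral knowledge, while MF focuses on difficult classes and refines their representations. Experiments on CIFAR-10 and CIFAR-100 show that Ancestral Mamba outperforms state-of-the-art baselines in accuracy and forgetting mitigation.

Link: https://arxiv.org/abs/2503.22729
Authors: Jiahao Qin,Feng Liu,Lu Zong
Affiliations: Xi’an Jiaotong-Liverpool University; Shanghai Jiao Tong University
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures

Click to view abstract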

Abstract:In the realm of computer graphics, the ability to learn continuously from non-stationary data streams while adapting to new visual patterns and mitigating catastrophic forgetting is of paramount importance. Existing approaches often struggle to capture and represent the essential characteristics of evolving visual concepts, hindering their applicability to dynamic graphics tasks. In this paper, we propose Ancestral Mamba, a novel approach that integrates online prototype learning into a selective discriminant space model for efficient and robust online continual learning. The key components of our approach include Ancestral Prototype Adaptation (APA), which continuously refines and builds upon learned visual prototypes, and Mamba Feedback (MF), which provides targeted feedback to adapt to challenging visual patterns. APA enables the model to continuously adapt its prototypes, building upon ancestral knowledge to tackle new challenges, while MF acts as a targeted feedback mechanism, focusing on challenging classes and refining their representations. Extensive experiments on graphics-oriented datasets, such as CIFAR-10 and CIFAR-100, demonstrate the superior performance of Ancestral Mamba compared to state-of-the-art baselines, achieving significant improvements in accuracy and forgetting mitigation.
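
For readers unfamiliar with online prototype learning, the generic idea is to keep one running-mean feature vector per class and classify by nearest prototype. The sketch below shows that baseline only. It is not the paper's Ancestral Prototype Adaptation or Mamba Feedback, and the class `OnlinePrototypes` is a hypothetical name.

```python
# Generic online prototype learning: running-mean prototypes updated
# one sample at a time, with nearest-prototype prediction.
# Assumption: this is a textbook baseline, not the APA/MF modules.


class OnlinePrototypes:
    def __init__(self):
        self.protos = {}  # label -> mean feature vector
        self.counts = {}  # label -> number of samples seen

    def update(self, feature, label):
        if label not in self.protos:
            self.protos[label] = list(feature)
            self.counts[label] = 1
            return
        self.counts[label] += 1
        n = self.counts[label]
        # Incremental mean: p <- p + (f - p) / n
        self.protos[label] = [
            p + (f - p) / n for p, f in zip(self.protos[label], feature)
        ]

    def predict(self, feature):
        def dist2(proto):
            return sum((a - b) ** 2 for a, b in zip(feature, proto))

        return min(self.protos, key=lambda lbl: dist2(self.protos[lbl]))


model = OnlinePrototypes()
stream = [([0.0, 0.0], "a"), ([2.0, 0.0], "a"), ([10.0, 10.0], "b")]
for feat, lbl in stream:
    model.update(feat, lbl)

print(model.protos["a"], model.predict([1.5, 0.5]))
```

Because each update touches only one prototype and needs no stored samples, this kind of scheme suits streaming data, which is presumably why prototype-based designs are popular in online continual learning.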

[CV-214] Dual Audio-Centric Modality Coupling for Talking Head Generation

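[Quick read]: This paper addresses lip synchronization and visual quality in audio-driven talking head video generation, where traditional methods struggle to capture the complex interaction between audio and facial dynamics. It proposes Dual Audio-Centric Modality Coupling (DAMC), a NeRF-based framework with a dual encoder structure: a Content-Aware Encoder extracts semantic content and a Dynamic-Sync Encoder extracts dynamic synchronization features. A Cross-Synchronized Fusion Module (CSFM) fuses the two, improving content representation and lip synchronization accuracy while preserving image quality. Experiments show that the method outperforms existing state-of-the-art approaches across a range of audio inputs, including synthetic speech from text-to-speech systems, indicating good generalization.

Link: https://arxiv.org/abs/2503.22728
Authors: Ao Fu,Ziqi Ni,Yi Zhou
Affiliations: School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, China
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 4 figures

Click to view abstract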

Abstract:The generation of audio-driven talking head videos is a key challenge in computer vision and graphics, with applications in virtual avatars and digital media. Traditional approaches often struggle with capturing the complex interaction between audio and facial dynamics, leading to lip synchronization and visual quality issues. In this paper, we propose a novel NeRF-based framework, Dual Audio-Centric Modality Coupling (DAMC), which effectively integrates content and dynamic features from audio inputs. By leveraging a dual encoder structure, DAMC captures semantic content through the Content-Aware Encoder and ensures precise visual synchronization through the Dynamic-Sync Encoder. These features are fused using a Cross-Synchronized Fusion Module (CSFM), enhancing content representation and lip synchronization. Extensive experiments show that our method outperforms existing state-of-the-art approaches in key metrics such as lip synchronization accuracy and image quality, demonstrating robust generalization across various audio inputs, including synthetic speech from text-to-speech (TTS) systems. Our results provide a promising solution for high-quality, audio-driven talking head generation and present a scalable approach for creating realistic talking heads.

[CV-215] Hierarchical Adaptive Expert for Multimodal Sentiment Analysis

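[Quick read]: This paper addresses the difficulty of distinguishing and integrating modality-shared and modality-specific information in multimodal sentiment analysis, which limits the performance of multimodal learning. It proposes the Hierarchical Adaptive Expert for Multimodal Sentiment Analysis (HAEMSA), a framework that combines evolutionary optimization, cross-modal knowledge transfer, and multi-task learning. HAEMSA uses a hierarchy of adaptive experts to capture both global and local modality representations, and uses evolutionary algorithms to dynamically optimize network architectures and modality combinations for both partial- and full-modality scenarios. Experiments show substantial performance gains on multiple benchmark datasets.

Link: https://arxiv.org/abs/2503.22715
Authors: Jiahao Qin,Feng Liu,Lu Zong
Affiliations: Xi’an Jiaotong-Liverpool University; Shanghai Jiao Tong University; Xi’an Jiaotong-Liverpool University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 11 pages, 3 figures

Click to view abstract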

Abstract:Multimodal sentiment analysis has emerged as a critical tool for understanding human emotions across diverse communication channels. While existing methods have made significant strides, they often struggle to effectively differentiate and integrate modality-shared and modality-specific information, limiting the performance of multimodal learning. To address this challenge, we propose the Hierarchical Adaptive Expert for Multimodal Sentiment Analysis (HAEMSA), a novel framework that synergistically combines evolutionary optimization, cross-modal knowledge transfer, and multi-task learning. HAEMSA employs a hierarchical structure of adaptive experts to capture both global and local modality representations, enabling more nuanced sentiment analysis. Our approach leverages evolutionary algorithms to dynamically optimize network architectures and modality combinations, adapting to both partial and full modality scenarios. Extensive experiments demonstrate HAEMSA’s superior performance across multiple benchmark datasets. On CMU-MOSEI, HAEMSA achieves a 2.6% increase in 7-class accuracy and a 0.059 decrease in MAE compared to the previous best method. For CMU-MOSI, we observe a 6.3% improvement in 7-class accuracy and a 0.058 reduction in MAE. On IEMOCAP, HAEMSA outperforms the state-of-the-art by 2.84% in weighted-F1 score for emotion recognition. These results underscore HAEMSA’s effectiveness in capturing complex multimodal interactions and generalizing across different emotional contexts.
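
The CMU-MOSI and CMU-MOSEI results are reported as 7-class accuracy and MAE on continuous sentiment scores. The sketch below follows a common convention for these benchmarks: scores lie in [-3, 3], and 7-class accuracy compares predictions and labels after clipping and rounding to the nearest integer. The paper's own evaluation script may differ, and the toy values are made up for illustration.

```python
# Common-convention sketch of MAE and 7-class accuracy for sentiment
# regression on a [-3, 3] scale. Assumption: Acc-7 is computed by
# clipping to [-3, 3] and rounding; the toy data is hypothetical.


def mae(preds, labels):
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def acc7(preds, labels):
    def to_class(value):
        return round(max(-3.0, min(3.0, value)))

    hits = sum(to_class(p) == to_class(y) for p, y in zip(preds, labels))
    return hits / len(labels)


preds = [2.6, -0.4, 0.2, -3.5]
labels = [3.0, -1.0, 0.0, -3.0]

print(mae(preds, labels), acc7(preds, labels))
```

The second sample shows why the two metrics can disagree: an error of 0.6 is modest for MAE but still lands in the wrong class after rounding.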

[CV-216] AI-Assisted Colonoscopy: Polyp Detection and Segmentation using Foundation Models

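[Quick read]: This paper addresses missed polyps in colonoscopy by evaluating foundation models on polyp segmentation and exploring their potential in medical imaging. Its central question is whether the zero-shot or few-shot capabilities of foundation models help them generalize to unseen data or tasks, particularly in medical imaging, where large annotated datasets are scarce. The results show that success in polyp characterization depends heavily on domain specialization: domain-specific models are needed for the best performance in medical applications, and generic models require fine-tuning to be effective. With such domain-specific adaptation, the foundation models outperform existing state-of-the-art detection and segmentation models, and some models in zero-shot evaluation even outperform fine-tuned models on unseen data.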

Link: https://arxiv.org/abs/2503.24138
Authors: Uxue Delaquintana-Aramendi, Leire Benito-del-Valle, Aitor Alvarez-Gila, Javier Pascau, Luisa F Sánchez-Peralta, Artzai Picón, J Blas Pagador, Cristina L Saratxaga
Affiliations: TECNALIA, Basque Research and Technology Alliance (BRTA), Derio, Bizkaia, 48160 Spain; Universidad Carlos III de Madrid, Madrid, 28911 Spain; Instituto de Investigación Sanitaria Gregorio Marañón, Madrid, 28009 Spain; Jesús Usón Minimally Invasive Surgery Centre, Cáceres, 10071 Spain; AI4polypNET Thematic Network, Barcelona, 08193 Spain; University of the Basque Country, Plaza Torres Quevedo, 48013 Bilbao, Spain; Artificial Intelligence in Science Institute (AISI) at University of California Irvine, CA 92617 USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE TMI for possible publication

Abstract: In colonoscopy, 80% of the missed polyps could be detected with the help of Deep Learning models. In the search for algorithms capable of addressing this challenge, foundation models emerge as promising candidates. Their zero-shot or few-shot learning capabilities facilitate generalization to new data or tasks without extensive fine-tuning, a property that is particularly advantageous in the medical imaging domain, where large annotated datasets for traditional training are scarce. In this context, a comprehensive evaluation of foundation models for polyp segmentation was conducted, assessing both detection and delimitation. For the study, three different colonoscopy datasets have been employed to compare the performance of five different foundation models, DINOv2, YOLO-World, GroundingDINO, SAM and MedSAM, against two benchmark networks, YOLOv8 and Mask R-CNN. Results show that the success of foundation models in polyp characterization is highly dependent on domain specialization. For optimal performance in medical applications, domain-specific models are essential, and generic models require fine-tuning to achieve effective results. Through this specialization, foundation models demonstrated superior performance compared to state-of-the-art detection and segmentation models, with some models even excelling in zero-shot evaluation, outperforming fine-tuned models on unseen data.

[CV-217] An Explainable Neural Radiomic Sequence Model with Spatiotemporal Continuity for Quantifying 4DCT-based Pulmonary Ventilation

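[Quick read]: This paper addresses the accurate identification of regions of compromised lung ventilation from four-dimensional computed tomography (4DCT). Nuclear-medicine ventilation scintigraphy is widely used in clinical practice but is time-consuming, costly, and involves additional radiation exposure. The study proposes an explainable neural radiomic sequence model: voxel-wise radiomic features (56-dimensional) are extracted from the 4DCT data to capture local intensity and texture dynamics across the respiratory cycle, forming temporal radiomic sequences. The key component is a temporal saliency-enhanced explainable long short-term memory (LSTM) network trained on these sequences to identify compromised ventilation regions. The temporal saliency maps it produces highlight the features that drive the predictions and show that, during exhalation, regions of compromised lung function typically exhibit increasing intensity and decreasing homogeneity.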

Link: https://arxiv.org/abs/2503.23898
Authors: Rihui Zhang, Haiming Zhu, Jingtong Zhao, Lei Zhang, Fang-Fang Yin, Chunhao Wang, Zhenyu Yang
Affiliations: not listed
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 43 pages, 13 figures

Abstract:Accurate evaluation of regional lung ventilation is essential for the management and treatment of lung cancer patients, supporting assessments of pulmonary function, optimization of therapeutic strategies, and monitoring of treatment response. Currently, ventilation scintigraphy using nuclear medicine techniques is widely employed in clinical practice; however, it is often time-consuming, costly, and entails additional radiation exposure. In this study, we propose an explainable neural radiomic sequence model to identify regions of compromised pulmonary ventilation based on four-dimensional computed tomography (4DCT). A cohort of 45 lung cancer patients from the VAMPIRE dataset was analyzed. For each patient, lung volumes were segmented from 4DCT, and voxel-wise radiomic features (56-dimensional) were extracted across the respiratory cycle to capture local intensity and texture dynamics, forming temporal radiomic sequences. Ground truth ventilation defects were delineated voxel-wise using Galligas-PET and DTPA-SPECT. To identify compromised regions, we developed a temporal saliency-enhanced explainable long short-term memory (LSTM) network trained on the radiomic sequences. Temporal saliency maps were generated to highlight key features contributing to the model’s predictions. The proposed model demonstrated robust performance, achieving average (range) Dice similarity coefficients of 0.78 (0.74-0.79) for 25 PET cases and 0.78 (0.74-0.82) for 20 SPECT cases. The temporal saliency map explained three key radiomic sequences in ventilation quantification: during lung exhalation, compromised pulmonary function region typically exhibits (1) an increasing trend of intensity and (2) a decreasing trend of homogeneity, in contrast to healthy lung tissue.
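
The Dice similarity coefficient reported above measures the overlap between a predicted defect mask and the reference mask. The following is a minimal sketch of that metric, assuming binary masks flattened to 0/1 sequences; the function name and the toy masks are illustrative and do not come from the paper.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels flagged in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total number of flagged voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 1-D "masks" of 10 voxels; 1 marks a voxel flagged as a ventilation defect.
predicted = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
reference = [1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
score = dice_coefficient(predicted, reference)
print(score)  # 0.75 (3 shared voxels, 4 + 4 flagged voxels)
```

A value of 0.78 therefore means that, per case, roughly four fifths of the flagged voxels are shared between prediction and reference once both mask sizes are taken into account.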

[CV-218] Optimal Invariant Bases for Atomistic Machine Learning

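[Quick read]: This paper addresses incompleteness and excessive functional dependence in existing atomic environment descriptors. Many descriptors cannot represent all meaningful changes in the atomic environment, while complete descriptor sets are often redundant, which raises computational cost without adding discriminative power. The key idea is to apply pattern recognition techniques to remove descriptors that can be written as functions of other descriptors, yielding the smallest complete descriptor set. The paper applies this in two ways: it refines the existing Atomistic Cluster Expansion descriptors into a more efficient subset, and it designs a new neural message-passing network architecture that uses an optimal set of Cartesian tensor invariants to recognize up to 5-body patterns at low computational cost. The results improve model performance and point toward classes of invariant bases that combine low cost with high expressivity for a range of applications.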

Link: https://arxiv.org/abs/2503.23515
Authors: Alice E. A. Allen, Emily Shinkle, Roxana Bujack, Nicholas Lubbers
Affiliations: Los Alamos National Laboratory; Max Planck Institute for Polymer Research
Categories: Chemical Physics (physics.chem-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Abstract:The representation of atomic configurations for machine learning models has led to the development of numerous descriptors, often to describe the local environment of atoms. However, many of these representations are incomplete and/or functionally dependent. Incomplete descriptor sets are unable to represent all meaningful changes in the atomic environment. Complete constructions of atomic environment descriptors, on the other hand, often suffer from a high degree of functional dependence, where some descriptors can be written as functions of the others. These redundant descriptors do not provide additional power to discriminate between different atomic environments and increase the computational burden. By employing techniques from the pattern recognition literature to existing atomistic representations, we remove descriptors that are functions of other descriptors to produce the smallest possible set that satisfies completeness. We apply this in two ways: first we refine an existing description, the Atomistic Cluster Expansion. We show that this yields a more efficient subset of descriptors. Second, we augment an incomplete construction based on a scalar neural network, yielding a new message-passing network architecture that can recognize up to 5-body patterns in each neuron by taking advantage of an optimal set of Cartesian tensor invariants. This architecture shows strong accuracy on state-of-the-art benchmarks while retaining low computational cost. Our results not only yield improved models, but point the way to classes of invariant bases that minimize cost while maximizing expressivity for a host of applications.

[CV-219] A Lightweight Image Super-Resolution Transformer Trained on Low-Resolution Images Only

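[Quick read]: This paper addresses single-image super-resolution (SISR) when only low-resolution (LR) images are available for training, a data-scarce setting that challenges data-hungry Transformer-based models. Transformers perform well on SISR, but their representational power normally depends on large amounts of high-resolution (HR) training data, which many real-world applications cannot supply. The study combines a lightweight vision Transformer with an LR-only training method transferred from microscopy image super-resolution, resulting in a multi-scale training method for bicubic degradation (MSTbic). The key is an efficient training scheme suited to LR-only conditions, which lets the lightweight Transformer outperform state-of-the-art CNN-based LR-only SISR methods using LR images alone. Experiments on the classic SR benchmark datasets confirm this.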

Link: https://arxiv.org/abs/2503.23265
Authors: Björn Möller, Lucas Görnhardt, Tim Fingscheidt
Affiliations: Technische Universität Braunschweig
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Transformer architectures prominently lead single-image super-resolution (SISR) benchmarks, reconstructing high-resolution (HR) images from their low-resolution (LR) counterparts. Their strong representative power, however, comes with a higher demand for training data compared to convolutional neural networks (CNNs). For many real-world SR applications, the availability of high-quality HR training images is not given, sparking interest in LR-only training methods. The LR-only SISR benchmark mimics this condition by allowing only low-resolution (LR) images for model training. For a 4x super-resolution, this effectively reduces the amount of available training data to 6.25% of the HR image pixels, which puts the employment of a data-hungry transformer model into question. In this work, we are the first to utilize a lightweight vision transformer model with LR-only training methods addressing the unsupervised SISR LR-only benchmark. We adopt and configure a recent LR-only training method from microscopy image super-resolution to macroscopic real-world data, resulting in our multi-scale training method for bicubic degradation (MSTbic). Furthermore, we compare it with reference methods and prove its effectiveness both for a transformer and a CNN model. We evaluate on the classic SR benchmark datasets Set5, Set14, BSD100, Urban100, and Manga109, and show superior performance over state-of-the-art (so far: CNN-based) LR-only SISR methods. The code is available on GitHub: this https URL.
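
The 6.25% figure in the abstract follows from simple pixel counting: downscaling by a factor s along both image axes leaves 1/s² of the HR pixels. A short worked check, with a hypothetical helper name:

```python
def lr_pixel_fraction(scale):
    """Fraction of HR pixels that remain after downscaling by `scale` along both axes."""
    return 1.0 / (scale * scale)


# Common super-resolution factors; 4x is the case quoted in the abstract.
fractions = {s: lr_pixel_fraction(s) for s in (2, 3, 4)}
print(f"{fractions[4]:.2%}")  # 6.25%
```

At 2x the LR-only setting still retains 25% of the HR pixels, so the data reduction grows quickly with the scale factor.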

[CV-220] Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs

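[Quick read]: This paper addresses the limitations of existing large language models (LLMs) on complex reasoning tasks in audio-visual (AV) scenarios. Recent advances in reasoning optimization do not adequately handle the complexity of multimodal input. The paper proposes AURELIA, an actor-critic based framework that distills structured, step-by-step reasoning into audio-visual LLMs (AVLLMs) at test time, improving their handling of complex multimodal inputs without additional training or fine-tuning. To further advance AVLLM reasoning, the paper also builds AVReasonBench, a benchmark of 4500 audio-visual questions spanning six distinct tasks. Experiments show that AURELIA yields up to a 100% relative improvement in AVLLM performance.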

Link: https://arxiv.org/abs/2503.23219
Authors: Sanjoy Chowdhury, Hanan Gani, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha
Affiliations: University of Maryland, College Park; MBZUAI; University of Toronto; KAUST
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract: Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released at: https://github.com/schowdhury671/aurelia.

[CV-221] OncoReg: Medical Image Registration for Oncological Challenges

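[Quick read]: This paper addresses the underuse of the large volume of medical data in modern cancer research caused by patient-privacy challenges. The OncoReg Challenge tackles this with a two-phase framework that protects patient privacy while fostering more generalisable AI models: phase one uses a publicly available dataset, and phase two trains models on a private dataset inside secure hospital networks. The challenge also introduces the registration of interventional cone-beam CT (CBCT) with standard planning fan-beam CT (FBCT) images, which matters for dynamic treatment adjustments in image-guided radiotherapy. The findings show that feature extraction plays a pivotal role in this task and that combining deep learning with classical methods, particularly in feature extraction, is most effective.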

Link: https://arxiv.org/abs/2503.23179
Authors: Wiebke Heyer, Yannic Elser, Lennart Berkel, Xinrui Song, Xuanang Xu, Pingkun Yan, Xi Jia, Zi Li, Tony C. W. Mok, BoWen LI, Christian Staackmann, Christoph Großbröhmer, Alessa Hering, Malte M. Sieren, Mattias P. Heinrich
Affiliations: Institute of Medical Informatics, University of Lübeck; Institute of Radiology and Nuclear Medicine, University Hospital Schleswig-Holstein, Lübeck, Germany; Department of Biomedical Engineering and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute, Troy, NY, USA; School of Computer Science, University of Birmingham, Birmingham, UK; DAMO Academy, Alibaba Group, Hangzhou, China; Hangzhou Shengshi Technology Co., Ltd, Hangzhou, China; Radboud University Medical Center, Nijmegen, Netherlands; Institute of Radiology and Nuclear Medicine & Institute of Interventional Radiology, University Hospital Schleswig-Holstein, Lübeck, Germany
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 6 figures

Abstract:In modern cancer research, the vast volume of medical data generated is often underutilised due to challenges related to patient privacy. The OncoReg Challenge addresses this issue by enabling researchers to develop and validate image registration methods through a two-phase framework that ensures patient privacy while fostering the development of more generalisable AI models. Phase one involves working with a publicly available dataset, while phase two focuses on training models on a private dataset within secure hospital networks. OncoReg builds upon the foundation established by the Learn2Reg Challenge by incorporating the registration of interventional cone-beam computed tomography (CBCT) with standard planning fan-beam CT (FBCT) images in radiotherapy. Accurate image registration is crucial in oncology, particularly for dynamic treatment adjustments in image-guided radiotherapy, where precise alignment is necessary to minimise radiation exposure to healthy tissues while effectively targeting tumours. This work details the methodology and data behind the OncoReg Challenge and provides a comprehensive analysis of the competition entries and results. Findings reveal that feature extraction plays a pivotal role in this registration task. A new method emerging from this challenge demonstrated its versatility, while established approaches continue to perform comparably to newer techniques. Both deep learning and classical approaches still play significant roles in image registration, with the combination of methods - particularly in feature extraction - proving most effective.

[CV-222] MIL vs. Aggregation: Evaluating Patient-Level Survival Prediction Strategies Using Graph-Based Learning

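[Quick read]: This paper addresses how to use whole-slide images (WSIs) effectively for predicting the survival of cancer patients. Tumor heterogeneity, intra-patient variability, and the sheer size and complexity of WSIs make direct processing computationally expensive and feature extraction difficult. In addition, multiple WSIs from the same patient may capture different tumor regions, some more informative than others, which raises the central question: should all WSIs be used to characterize the patient, or only the most representative slide? The paper compares several strategies, including the conventional approach of treating each WSI as an independent sample, aggregating the predictions of a patient's WSIs, and using multiple-instance learning (MIL) to identify the most relevant slide automatically, and evaluates them with different Graph Neural Network (GNN) architectures. Experiments on the MMIST-ccRCC dataset show that MIL-based selection improves prediction accuracy, suggesting that choosing the most representative slide benefits survival prediction.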

Link: https://arxiv.org/abs/2503.23042
Authors: M Rita Verdelho, Alexandre Bernardino, Catarina Barata
Affiliations: not listed
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Oncologists often rely on a multitude of data, including whole-slide images (WSIs), to guide therapeutic decisions, aiming for the best patient outcome. However, predicting the prognosis of cancer patients can be a challenging task due to tumor heterogeneity and intra-patient variability, and the complexity of analyzing WSIs. These images are extremely large, containing billions of pixels, making direct processing computationally expensive and requiring specialized methods to extract relevant information. Additionally, multiple WSIs from the same patient may capture different tumor regions, some being more informative than others. This raises a fundamental question: Should we use all WSIs to characterize the patient, or should we identify the most representative slide for prognosis? Our work seeks to answer this question by performing a comparison of various strategies for predicting survival at the WSI and patient level. The former treats each WSI as an independent sample, mimicking the strategy adopted in other works, while the latter comprises methods to either aggregate the predictions of the several WSIs or automatically identify the most relevant slide using multiple-instance learning (MIL). Additionally, we evaluate different Graph Neural Networks architectures under these strategies. We conduct our experiments using the MMIST-ccRCC dataset, which comprises patients with clear cell renal cell carcinoma (ccRCC). Our results show that MIL-based selection improves accuracy, suggesting that choosing the most representative slide benefits survival prediction.
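
The two patient-level strategies contrasted in the abstract can be sketched in a few lines. This is only an illustration under simplifying assumptions: the per-slide risk scores are given as plain numbers, aggregation is a simple mean, and the "relevance" used for selection is supplied by hand, whereas a MIL model would learn it (for example as an attention weight). The function names and values are hypothetical.

```python
from statistics import mean


def aggregate_patient_risk(slide_risks):
    """Aggregation strategy: average the per-slide risk scores of one patient."""
    return mean(slide_risks)


def select_patient_risk(slide_risks, slide_relevance):
    """Selection strategy: keep the risk of the slide with the highest relevance.

    A MIL model would learn the relevance scores; here they are an input.
    """
    best = max(range(len(slide_risks)), key=lambda i: slide_relevance[i])
    return slide_risks[best]


# Hypothetical patient with three whole-slide images.
risks = [0.2, 0.8, 0.5]
relevance = [0.1, 0.7, 0.2]
aggregated = aggregate_patient_risk(risks)        # all slides contribute equally
selected = select_patient_risk(risks, relevance)  # only the second slide counts
```

The difference between the two outputs shows why the choice matters: averaging dilutes an informative slide with less informative ones, while selection trusts a single slide entirely.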

[CV-223] Nonhuman Primate Brain Tissue Segmentation Using a Transfer Learning Approach

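[Quick read]: This paper addresses brain tissue segmentation in non-human primates (NHPs), which are important models for studying human brain function and neurological disorders because of their close evolutionary relationship with humans. The task is hampered by scarce annotated data, the small size of the NHP brain, the limited resolution of available imaging data, and anatomical differences between human and NHP brains. The paper proposes combining STU-Net with transfer learning, using knowledge transferred from human brain MRI data to improve segmentation accuracy on NHP brain MRI, particularly when training data is limited. The combination delineates complex tissue boundaries and captures fine anatomical details specific to NHP brains; it performs especially well on small subcortical structures such as the putamen and thalamus, achieving a Dice similarity coefficient (DSC) over 0.88, an intersection over union (IoU) over 0.8, and a 95% Hausdorff distance (HD95) under 7. The study provides a robust method for multi-class NHP brain tissue segmentation that may accelerate research in evolutionary neuroscience and preclinical studies of neurological disorders relevant to human health.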

Link: https://arxiv.org/abs/2503.22829
Authors: Zhen Lin, Hongyu Yuan, Richard Barcus, Qing Lyu, Sucheta Chakravarty, Megan E. Lipford, Carol A. Shively, Suzanne Craft, Mohammad Kawas, Jeongchul Kim, Christopher T. Whitlow
Affiliations: not listed
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract: Non-human primates (NHPs) serve as critical models for understanding human brain function and neurological disorders due to their close evolutionary relationship with humans. Accurate brain tissue segmentation in NHPs is critical for understanding neurological disorders, but challenging due to the scarcity of annotated NHP brain MRI datasets, the small size of the NHP brain, the limited resolution of available imaging data and the anatomical differences between human and NHP brains. To address these challenges, we propose a novel approach utilizing STU-Net with transfer learning to leverage knowledge transferred from human brain MRI data to enhance segmentation accuracy in the NHP brain MRI, particularly when training data is limited. The combination of STU-Net and transfer learning effectively delineates complex tissue boundaries and captures fine anatomical details specific to NHP brains. Notably, our method demonstrated improvement in segmenting small subcortical structures such as putamen and thalamus that are challenging to resolve with limited spatial resolution and tissue contrast, and achieved DSC of over 0.88, IoU over 0.8 and HD95 under 7. This study introduces a robust method for multi-class brain tissue segmentation in NHPs, potentially accelerating research in evolutionary neuroscience and preclinical studies of neurological disorders relevant to human health.
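
The reported DSC and IoU thresholds are closely related: for a single pair of masks, DSC = 2·IoU / (1 + IoU). The sketch below checks that relation numerically. It is a general identity for one mask pair and need not hold exactly for scores averaged over structures or subjects, which is how the paper's figures may have been computed; the function names are illustrative.

```python
def dice_from_iou(iou):
    """Dice coefficient implied by an IoU value for one pair of masks."""
    return 2.0 * iou / (1.0 + iou)


def iou_from_dice(dice):
    """IoU implied by a Dice coefficient for one pair of masks."""
    return dice / (2.0 - dice)


# An IoU of 0.8 corresponds to a Dice coefficient of 8/9, about 0.889.
dsc_at_reported_iou = dice_from_iou(0.8)
```

So an IoU above 0.8 and a DSC above 0.88 describe roughly the same level of overlap rather than two independent achievements.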

[CV-224] Chirp Localization via Fine-Tuned Transformer Model: A Proof-of-Concept Study

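[Quick read]: This paper addresses the lack of automated tools for detecting, localizing, and extracting features of chirp-like patterns in spectrograms of electrophysiological signals such as electroencephalograms (EEG). The approach fine-tunes a Vision Transformer (ViT) model and uses Low-Rank Adaptation (LoRA) to improve adaptability. The authors generate 100,000 synthetic spectrograms containing linear or exponential frequency sweeps, creating the first large-scale benchmark for chirp localization. The ViT is adapted for regression to predict the chirp parameters (start time, start frequency, and end frequency), while LoRA efficiently updates the attention layers of the pre-trained model. Training uses mean squared error (MSE) loss and the AdamW optimizer, with a learning rate scheduler and early stopping to curb overfitting. Predictions correlate strongly with the true labels (a Pearson correlation of 0.9841 for chirp start time), inference times are stable (137 to 140 s), and error distributions show minimal bias. The method fills a tooling gap for chirp analysis in EEG time-frequency representations.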

Link: https://arxiv.org/abs/2503.22713
Authors: Nooshin Bahador, Milad Lankarany
Affiliations: not listed
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 19 pages, 8 figures

Abstract:Spectrograms are pivotal in time-frequency signal analysis, widely used in audio processing and computational neuroscience. Chirp-like patterns in electroencephalogram (EEG) spectrograms (marked by linear or exponential frequency sweep) are key biomarkers for seizure dynamics, but automated tools for their detection, localization, and feature extraction are lacking. This study bridges this gap by fine-tuning a Vision Transformer (ViT) model on synthetic spectrograms, augmented with Low-Rank Adaptation (LoRA) to boost adaptability. We generated 100000 synthetic spectrograms with chirp parameters, creating the first large-scale benchmark for chirp localization. These spectrograms mimic neural chirps using linear or exponential frequency sweep, Gaussian noise, and smoothing. A ViT model, adapted for regression, predicted chirp parameters. LoRA fine-tuned the attention layers, enabling efficient updates to the pre-trained backbone. Training used MSE loss and the AdamW optimizer, with a learning rate scheduler and early stopping to curb overfitting. Only three features were targeted: Chirp Start Time (Onset Time), Chirp Start Frequency (Onset Frequency), and Chirp End Frequency (Offset Frequency). Performance was evaluated via Pearson correlation between predicted and actual labels. Results showed strong alignment: 0.9841 correlation for chirp start time, with stable inference times (137 to 140s) and minimal bias in error distributions. This approach offers a tool for chirp analysis in EEG time-frequency representation, filling a critical methodological void.
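
Two ingredients of the abstract lend themselves to a small illustration: the linear and exponential frequency sweeps that define a synthetic chirp, and the Pearson correlation used to score predicted against true parameters. The sketch below is not the authors' generator; the parameterisation (onset, duration, start and end frequency), the function names, and the toy onset values are assumptions made for illustration.

```python
import math


def chirp_frequency(t, onset, duration, f_start, f_end, kind="linear"):
    """Instantaneous frequency (Hz) of a chirp at time t, or None outside the chirp."""
    if not onset <= t <= onset + duration:
        return None
    u = (t - onset) / duration  # normalised position inside the chirp, 0..1
    if kind == "linear":
        return f_start + (f_end - f_start) * u
    if kind == "exponential":
        return f_start * (f_end / f_start) ** u
    raise ValueError("kind must be 'linear' or 'exponential'")


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Halfway through a 10 Hz -> 40 Hz sweep the two sweep types differ.
midpoint_linear = chirp_frequency(1.0, 0.5, 1.0, 10.0, 40.0, "linear")            # 25.0
midpoint_exponential = chirp_frequency(1.0, 0.5, 1.0, 10.0, 40.0, "exponential")  # 20.0

# Made-up true and predicted onset times (seconds), scored as in the abstract.
true_onsets = [0.5, 1.0, 1.5, 2.0, 2.5]
predicted_onsets = [0.55, 0.95, 1.60, 1.90, 2.55]
r = pearson(true_onsets, predicted_onsets)
```

Note that a high Pearson correlation measures how well predictions track the labels linearly; it does not by itself rule out a constant offset, which is why the abstract also comments on bias in the error distributions.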

[CV-225] From Eye to Mind: brain2text Decoding Reveals the Neural Mechanisms of Visual Semantic Processing

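[Quick read]: The core question of this paper is which neural mechanisms transform sensory experience into meaningful semantic representations, and in particular what neural coding format semantic content takes for complex, naturalistic stimuli. Traditional brain decoding based on visual reconstruction mainly captures low-level perceptual features and misses the deeper semantics that guide human cognition. The paper proposes a paradigm shift: decoding functional magnetic resonance imaging (fMRI) signals directly into textual descriptions of the natural images being viewed. The key is a new deep learning model, trained without visual input, that achieves state-of-the-art semantic decoding performance and captures the core semantic content of complex scenes. Neuroanatomical analysis highlights the role of higher-level visual regions (MT+, ventral stream visual cortex, and inferior parietal cortex) in this semantic transformation, and category-specific decoding reveals nuanced neural representations of semantic dimensions such as animacy and motion. Compared with visual reconstruction, this text-based decoding offers a more direct and interpretable window into the neural basis of complex semantic processing.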

Link: https://arxiv.org/abs/2503.22697
Authors: Feihan Feng, Jingxin Nie
Affiliations: not listed
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 7 figures

Abstract:Deciphering the neural mechanisms that transform sensory experiences into meaningful semantic representations is a fundamental challenge in cognitive neuroscience. While neuroimaging has mapped a distributed semantic network, the format and neural code of semantic content remain elusive, particularly for complex, naturalistic stimuli. Traditional brain decoding, focused on visual reconstruction, primarily captures low-level perceptual features, missing the deeper semantic essence guiding human cognition. Here, we introduce a paradigm shift by directly decoding fMRI signals into textual descriptions of viewed natural images. Our novel deep learning model, trained without visual input, achieves state-of-the-art semantic decoding performance, generating meaningful captions that capture the core semantic content of complex scenes. Neuroanatomical analysis reveals the critical role of higher-level visual regions, including MT+, ventral stream visual cortex, and inferior parietal cortex, in this semantic transformation. Category-specific decoding further demonstrates nuanced neural representations for semantic dimensions like animacy and motion. This text-based decoding approach provides a more direct and interpretable window into the brain’s semantic encoding than visual reconstruction, offering a powerful new methodology for probing the neural basis of complex semantic processing, refining our understanding of the distributed semantic network, and potentially inspiring brain-inspired language models.

Artificial Intelligence

[AI-0] ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning AAAI2025

Link: https://arxiv.org/abs/2503.24378
Authors: Harsha Kokel, Michael Katz, Kavitha Srinivas, Shirin Sohrabi
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to LM4Plan@AAAI 2025

Abstract:The ACPBench dataset provides atomic reasoning tasks required for efficient planning. The dataset is aimed at distilling the complex plan generation task into separate atomic reasoning tasks in their easiest possible form, boolean or multiple-choice questions, where the model has to choose the right answer from the provided options. While the aim of ACPBench is to test the simplest form of reasoning about action and change, when tasked with planning, a model does not typically have options to choose from and thus the reasoning required for planning dictates an open-ended, generative form for these tasks. To that end, we introduce ACPBench Hard, a generative version of ACPBench, with open-ended questions which the model needs to answer. Models that perform well on these tasks could in principle be integrated into a planner or be used directly as a policy. We discuss the complexity of these tasks as well as the complexity of validating the correctness of their answers and present validation algorithms for each task. Equipped with these validators, we test the performance of a variety of models on our tasks and find that for most of these tasks the performance of even the largest models is still subpar. Our experiments show that no model outperforms another in these tasks and with a few exceptions all tested language models score below 65%, indicating that even the current frontier language models have a long way to go before they can reliably reason about planning. In fact, even the so-called reasoning models struggle with solving these reasoning tasks. ACPBench Hard collection is available at the following link: this https URL

[AI-1] Which LIME should I trust? Concepts, Challenges, and Solutions

Link: https://arxiv.org/abs/2503.24365
Authors: Patrick Knab, Sascha Marton, Udo Schlegel, Christian Bartelt
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 3rd World Conference on eXplainable Artificial Intelligence (XAI 2025)

Abstract:As neural networks become dominant in essential systems, Explainable Artificial Intelligence (XAI) plays a crucial role in fostering trust and detecting potential misbehavior of opaque models. LIME (Local Interpretable Model-agnostic Explanations) is among the most prominent model-agnostic approaches, generating explanations by approximating the behavior of black-box models around specific instances. Despite its popularity, LIME faces challenges related to fidelity, stability, and applicability to domain-specific problems. Numerous adaptations and enhancements have been proposed to address these issues, but the growing number of developments can be overwhelming, complicating efforts to navigate LIME-related research. To the best of our knowledge, this is the first survey to comprehensively explore and collect LIME’s foundational concepts and known limitations. We categorize and compare its various enhancements, offering a structured taxonomy based on intermediate steps and key issues. Our analysis provides a holistic overview of advancements in LIME, guiding future research and helping practitioners identify suitable approaches. Additionally, we provide a continuously updated interactive website (this https URL), offering a concise and accessible overview of the survey.
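
The core idea the abstract attributes to LIME, approximating a black-box model around one instance, can be sketched for a one-dimensional input. This is a deliberately simplified illustration and not the LIME library or any variant surveyed in the paper: perturbations are taken on a fixed symmetric grid rather than sampled at random, the proximity kernel and its width are assumptions, and the surrogate is an ordinary weighted least-squares line.

```python
import math


def local_linear_explanation(black_box, x0, kernel_width=0.5, num_samples=201, radius=1.0):
    """Fit a proximity-weighted line to `black_box` around the instance x0."""
    # Perturb the instance on a symmetric grid (kept deterministic for this sketch).
    xs = [x0 - radius + 2.0 * radius * i / (num_samples - 1) for i in range(num_samples)]
    ys = [black_box(x) for x in xs]
    # Exponential proximity kernel: perturbations near x0 get more weight.
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]
    w_sum = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(ws, ys)) / w_sum
    # Closed-form weighted least squares for a single feature.
    slope = (sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
             / sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs)))
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Explaining f(x) = x**2 at x0 = 1: the local slope is close to the derivative, 2.
slope, intercept = local_linear_explanation(lambda x: x ** 2, x0=1.0)
```

The fidelity and stability issues named in the abstract show up even here: the fitted line depends on the kernel width and on how the neighbourhood is sampled, which is one reason so many LIME variants exist.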

[AI-2] Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation

Link: https://arxiv.org/abs/2503.24361
Authors: Abhiram Maddukuri, Zhenyu Jiang, Lawrence Yunliang Chen, Soroush Nasiriany, Yuqi Xie, Yu Fang, Wenqi Huang, Zu Wang, Zhenjia Xu, Nikita Chernyadev, Scott Reed, Ken Goldberg, Ajay Mandlekar, Linxi Fan, Yuke Zhu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project website: this https URL

Abstract:Large real-world robot datasets hold great potential to train generalist robot models, but scaling real-world human data collection is time-consuming and resource-intensive. Simulation has great potential in supplementing large-scale data, especially with recent advances in generative AI and automated data generation tools that enable scalable creation of robot behavior datasets. However, training a policy solely in simulation and transferring it to the real world often demands substantial human effort to bridge the reality gap. A compelling alternative is to co-train the policy on a mixture of simulation and real-world datasets. Preliminary studies have recently shown this strategy to substantially improve the performance of a policy over one trained on a limited amount of real-world data. Nonetheless, the community lacks a systematic understanding of sim-and-real co-training and what it takes to reap the benefits of simulation data for real-robot learning. This work presents a simple yet effective recipe for utilizing simulation data to solve vision-based robotic manipulation tasks. We derive this recipe from comprehensive experiments that validate the co-training strategy on various simulation and real-world datasets. Using two domains–a robot arm and a humanoid–across diverse tasks, we demonstrate that simulation data can enhance real-world task performance by an average of 38%, even with notable differences between the simulation and real-world data. Videos and additional results can be found at this https URL
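
Co-training on a mixture of simulation and real-world data can be pictured as drawing each training example from one of the two pools with a fixed probability. The abstract does not state how the paper mixes the datasets, so the sampler below is only an assumed, minimal scheme; `sim_ratio`, the function name, and the toy demonstration lists are hypothetical.

```python
import random


def sample_cotraining_batch(real_data, sim_data, batch_size, sim_ratio, rng):
    """Draw a batch where each item comes from sim_data with probability sim_ratio."""
    batch = []
    for _ in range(batch_size):
        # Pick the source pool first, then a uniformly random example from it.
        source = sim_data if rng.random() < sim_ratio else real_data
        batch.append(rng.choice(source))
    return batch


rng = random.Random(0)  # seeded so the sketch is reproducible
real_demos = ["real_0", "real_1"]             # scarce real-world demonstrations
sim_demos = [f"sim_{i}" for i in range(100)]  # plentiful simulated demonstrations
batch = sample_cotraining_batch(real_demos, sim_demos, 8, 0.5, rng)
```

Sampling the pool first keeps the real data from being swamped: with a ratio of 0.5, about half of each batch is real even though the simulated pool is fifty times larger.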

[AI-3] Contextual Preference Collaborative Measure Framework Based on Belief System

Link: https://arxiv.org/abs/2503.24328
Authors: Hang Yu, Wei Wei, Zheng Tan, Jing-lei Liu
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: in Chinese language

Abstract: To reduce human intervention in the preference measure process, this article proposes a preference collaborative measure framework based on an updated belief system, which is also capable of improving the accuracy and efficiency of preference measure [...], the distance of rules and the average internal distance of rulesets are proposed for specifying the relationship between the [...] discovering the most representative preferences that are common in all users, namely common preference, an algorithm based on average internal distance of ruleset, PRA algorithm, is proposed, which aims to finish the discovery process with minimum information loss [...], the concept of Common belief is proposed to update the belief system, and the common preferences are the evidences of updated belief [...], under the belief system, the proposed belief degree and deviation degree are used to determine whether a rule confirms the belief system or not and classify the preference rules into two kinds (generalized or personalized), and eventually filter out Top-K interesting rules relying on belief degree and deviation [...] on above, a scalable interestingness calculation framework that can apply various formulas is proposed for accurately calculating interestingness in different [...] last, IMCos algorithm and IMCov algorithm are proposed as exemplars to verify the accuracy and efficiency of the framework by using weighted cosine similarity and correlation coefficients as belief [...] experiments, the proposed algorithms are compared to two state-of-the-art algorithms and the results show that IMCos and IMCov outperform the other two in most aspects.

[AI-4] Pro-Routing: Proactive Routing of Autonomous Multi-Capacity Robots for Pickup-and-Delivery Tasks

Link: https://arxiv.org/abs/2503.24325
Authors: Daniel Garces,Stephanie Gil
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 25 pages, 7 figures, and 1 table

Abstract:We consider a multi-robot setting, where we have a fleet of multi-capacity autonomous robots that must service spatially distributed pickup-and-delivery requests with fixed maximum wait times. Requests can either be scheduled ahead of time or enter the system in real time. In this setting, stability for a routing policy is defined as the cost of the policy being uniformly bounded over time. Most previous work either solves the problem offline to theoretically maintain stability or considers dynamically arriving requests at the expense of the theoretical guarantees on stability. In this paper, we aim to bridge this gap by proposing a novel proactive rollout-based routing framework that adapts to real-time demand while still provably maintaining the stability of the learned routing policy. We derive provable stability guarantees for our method by proposing a fleet sizing algorithm that obtains a sufficiently large fleet that ensures stability by construction. To validate our theoretical results, we consider a case study on real ride requests for Harvard’s evening Van System. We also evaluate the performance of our framework using the currently deployed smaller fleet size. In this smaller setup, we compare against the currently deployed routing algorithm, greedy heuristics, and Monte-Carlo-Tree-Search-based algorithms. Our empirical results show that our framework maintains stability when we use the sufficiently large fleet size found in our theoretical results. For the smaller currently deployed fleet size, our method services 6% more requests than the closest baseline while reducing median passenger wait times by 33%.

[AI-5] Evaluating machine learning models for predicting pesticides toxicity to honey bees

Link: https://arxiv.org/abs/2503.24305
Authors: Jakub Adamczyk,Jakub Poziemski,Pawel Siedlecki
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Small molecules play a critical role in the biomedical, environmental, and agrochemical domains, each with distinct physicochemical requirements and success criteria. Although biomedical research benefits from extensive datasets and established benchmarks, agrochemical data remain scarce, particularly with respect to species-specific toxicity. This work focuses on ApisTox, the most comprehensive dataset of experimentally validated chemical toxicity to the honey bee (Apis mellifera), an ecologically vital pollinator. We evaluate ApisTox using a diverse suite of machine learning approaches, including molecular fingerprints, graph kernels, and graph neural networks, as well as pretrained models. Comparative analysis with medicinal datasets from the MoleculeNet benchmark reveals that ApisTox represents a distinct chemical space. Performance degradation on non-medicinal datasets, such as ApisTox, demonstrates the limited generalizability of current state-of-the-art algorithms trained solely on biomedical data. Our study highlights the need for more diverse datasets and for targeted model development geared toward the agrochemical domain.

[AI-6] Shape Expressions with Inheritance ESWC

Link: https://arxiv.org/abs/2503.24299
Authors: Iovka Boneva,Jose Emilio Labra Gayo,Eric Prud’hommeaux,Katherine Thornton,Andra Waagmeester
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: Accepted in Extended Semantic Web Conference, ESWC, 2025

Abstract:We formally introduce an inheritance mechanism for the Shape Expressions language (ShEx). It is inspired by inheritance in object-oriented programming languages, and provides similar advantages such as reuse, modularity, and more flexible data modelling. Using an example, we explain the main features of the inheritance mechanism. We present its syntax and formal semantics. The semantics is an extension of the semantics of ShEx 2.1. It also directly yields a validation algorithm as an extension of the previous ShEx validation algorithms, while maintaining the same algorithmic complexity.

[AI-7] Value of Information-based Deceptive Path Planning Under Adversarial Interventions

Link: https://arxiv.org/abs/2503.24284
Authors: Wesley A. Suttle,Jesse Milzman,Mustafa O. Karabag,Brian M. Sadler,Ufuk Topcu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 10 pages, 4 figures

Abstract:Existing methods for deceptive path planning (DPP) address the problem of designing paths that conceal their true goal from a passive, external observer. Such methods do not apply to problems where the observer has the ability to perform adversarial interventions to impede the path planning agent. In this paper, we propose a novel Markov decision process (MDP)-based model for the DPP problem under adversarial interventions and develop new value of information (VoI) objectives to guide the design of DPP policies. Using the VoI objectives we propose, path planning agents deceive the adversarial observer into choosing suboptimal interventions by selecting trajectories that are of low informational value to the observer. Leveraging connections to the linear programming theory for MDPs, we derive computationally efficient solution methods for synthesizing policies for performing DPP under adversarial interventions. In our experiments, we illustrate the effectiveness of the proposed solution method in achieving deceptiveness under adversarial interventions and demonstrate the superior performance of our approach to both existing DPP methods and conservative path planning approaches on illustrative gridworld problems.

[AI-8] AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World

Link: https://arxiv.org/abs/2503.24278
Authors: Zhiyuan Zhou,Pranav Atreya,You Liang Tan,Karl Pertsch,Sergey Levine
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scalable and reproducible policy evaluation has been a long-standing challenge in robot learning. Evaluations are critical to assess progress and build better policies, but evaluation in the real world, especially at a scale that would provide statistically reliable results, is costly in terms of human time and hard to obtain. Evaluation of increasingly generalist robot policies requires an increasingly diverse repertoire of evaluation environments, making the evaluation bottleneck even more pronounced. To make real-world evaluation of robotic policies more practical, we propose AutoEval, a system to autonomously evaluate generalist robot policies around the clock with minimal human intervention. Users interact with AutoEval by submitting evaluation jobs to the AutoEval queue, much like how software jobs are submitted with a cluster scheduling system, and AutoEval will schedule the policies for evaluation within a framework supplying automatic success detection and automatic scene resets. We show that AutoEval can nearly fully eliminate human involvement in the evaluation process, permitting around the clock evaluations, and the evaluation results correspond closely to ground truth evaluations conducted by hand. To facilitate the evaluation of generalist policies in the robotics community, we provide public access to multiple AutoEval scenes in the popular BridgeData robot setup with WidowX robot arms. In the future, we hope that AutoEval scenes can be set up across institutions to form a diverse and distributed evaluation network.

[AI-9] Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

Link: https://arxiv.org/abs/2503.24277
Authors: Sewoong Lee,Adam Davies,Marc E. Canby,Julia Hockenmaier
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sparse autoencoders (SAEs) have emerged as a workhorse of modern mechanistic interpretability, but leading SAE approaches with top-k style activation functions lack theoretical grounding for selecting the hyperparameter k. SAEs are based on the linear representation hypothesis (LRH), which assumes that the representations of large language models (LLMs) are linearly encoded, and the superposition hypothesis (SH), which states that there can be more features in the model than its dimensionality. We show that, based on the formal definitions of the LRH and SH, the magnitude of sparse feature vectors (the latent representations learned by SAEs of the dense embeddings of LLMs) can be approximated using their corresponding dense vector with a closed-form error bound. To visualize this, we propose the ZF plot, which reveals a previously unknown relationship between LLM hidden embeddings and SAE feature vectors, allowing us to make the first empirical measurement of the extent to which feature vectors of pre-trained SAEs are over- or under-activated for a given input. Correspondingly, we introduce Approximate Feature Activation (AFA), which approximates the magnitude of the ground-truth sparse feature vector, and propose a new evaluation metric derived from AFA to assess the alignment between inputs and activations. We also leverage AFA to introduce a novel SAE architecture, the top-AFA SAE, leading to SAEs that: (a) are more in line with theoretical justifications; and (b) obviate the need to tune SAE sparsity hyperparameters. Finally, we empirically demonstrate that top-AFA SAEs achieve reconstruction loss comparable to that of state-of-the-art top-k SAEs, without requiring the hyperparameter k to be tuned. Our code is available at: this https URL.
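
For readers unfamiliar with the top-k activation the abstract refers to, here is a minimal sketch of the baseline idea only: keep the k largest pre-activations and zero the rest. It does not implement the paper's top-AFA variant, whose selection rule is not described in the abstract. The function name `top_k_activation` and the tie-breaking rule are assumptions made for illustration.

```python
def top_k_activation(pre_activations, k):
    """Keep the k largest pre-activations and set all others to 0.0.

    Ties are broken in favour of the lower index (an arbitrary choice
    made for this sketch).
    """
    n = len(pre_activations)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of features")
    # sorted() is stable, so equal values keep their original index order.
    ranked = sorted(range(n), key=lambda i: -pre_activations[i])
    keep = set(ranked[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(pre_activations)]
```

The hyperparameter k fixes the number of surviving features per input in advance, which is the tuning burden the abstract says top-AFA SAEs avoid.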

[AI-10] New Statistical Framework for Extreme Error Probability in High-Stakes Domains for Reliable Machine Learning

Link: https://arxiv.org/abs/2503.24262
Authors: Umberto Michelucci,Francesca Venturini
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:Machine learning is vital in high-stakes domains, yet conventional validation methods rely on averaging metrics like mean squared error (MSE) or mean absolute error (MAE), which fail to quantify extreme errors. Worst-case prediction failures can have substantial consequences, but current frameworks lack statistical foundations for assessing their probability. In this work a new statistical framework, based on Extreme Value Theory (EVT), is presented that provides a rigorous approach to estimating worst-case failures. Applying EVT to synthetic and real-world datasets, this method is shown to enable robust estimation of catastrophic failure probabilities, overcoming the fundamental limitations of standard cross-validation. This work establishes EVT as a fundamental tool for assessing model reliability, ensuring safer AI deployment in new technologies where uncertainty quantification is central to decision-making or scientific analysis.
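
To make the contrast with averaged metrics concrete, the following is a rough peaks-over-threshold sketch of how a tail probability can be extrapolated beyond the observed errors. It assumes an exponential tail above the threshold, which is the special case of the generalized Pareto distribution with shape parameter zero. That assumption, the 0.9 quantile threshold, and the function name `tail_probability` are illustrative choices and not the paper's procedure.

```python
import math


def tail_probability(errors, x, quantile=0.9):
    """Estimate P(error > x) with a peaks-over-threshold exponential tail.

    Above the threshold u (the empirical `quantile` of the errors) the
    excesses are modelled as exponential with scale equal to their mean.
    At or below u the empirical exceedance frequency is returned.
    """
    ordered = sorted(errors)
    n = len(ordered)
    u = ordered[min(int(quantile * n), n - 1)]
    if x <= u:
        return sum(1 for e in ordered if e > x) / n
    excesses = [e - u for e in ordered if e > u]
    if not excesses:
        raise ValueError("no observations above the threshold")
    scale = sum(excesses) / len(excesses)  # maximum-likelihood scale
    return (len(excesses) / n) * math.exp(-(x - u) / scale)
```

A mean error summarises the bulk of the distribution, whereas this estimate depends only on the largest errors and can be queried at levels larger than any error seen so far.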

[AI-11] Spatio-temporal Prediction of Fine-Grained Origin-Destination Matrices with Applications in Ridesharing

Link: https://arxiv.org/abs/2503.24237
Authors: Run Yang,Runpeng Dai,Siran Gao,Xiaocheng Tang,Fan Zhou,Hongtu Zhu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Accurate spatial-temporal prediction of network-based travelers’ requests is crucial for the effective policy design of ridesharing platforms. Having knowledge of the total demand between various locations in the upcoming time slots enables platforms to proactively prepare adequate supplies, thereby increasing the likelihood of fulfilling travelers’ requests and redistributing idle drivers to areas with high potential demand to optimize the global supply-demand equilibrium. This paper delves into the prediction of Origin-Destination (OD) demands at a fine-grained spatial level, especially when confronted with an expansive set of local regions. While this task holds immense practical value, it remains relatively unexplored within the research community. To fill this gap, we introduce a novel prediction model called OD-CED, which comprises an unsupervised space coarsening technique to alleviate data sparsity and an encoder-decoder architecture to capture both semantic and geographic dependencies. Through practical experimentation, OD-CED has demonstrated remarkable results. It achieved a reduction of up to 45% in root-mean-square error and 60% in weighted mean absolute percentage error over traditional statistical methods when dealing with OD matrices exhibiting a sparsity exceeding 90%.
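
The two error measures quoted above can be sketched as follows. The weighted mean absolute percentage error is written in its common form, total absolute error divided by total actual demand, which stays defined when many cells of a sparse OD matrix are zero. The paper may use a different weighting, so treat the definitions and function names as assumptions.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error over paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def wmape(actual, predicted):
    """Weighted MAPE: sum of absolute errors divided by sum of |actual|."""
    total = sum(abs(a) for a in actual)
    if total == 0:
        raise ValueError("WMAPE is undefined when all actual values are zero")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / total


def relative_reduction(baseline_error, model_error):
    """Fraction by which model_error is lower than baseline_error."""
    return (baseline_error - model_error) / baseline_error
```

Under this reading, a 45% reduction means the model's error is 0.55 times the baseline's error.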

[AI-12] All You Need is Sally-Anne: ToM in AI Strongly Supported After Surpassing Tests for 3-Year-Olds

Link: https://arxiv.org/abs/2503.24215
Authors: Nitay Alon,Joseph Barnby,Reuth Mirsky,Stefan Sarkadi
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Theory of Mind (ToM) is a hallmark of human cognition, allowing individuals to reason about others’ beliefs and intentions. Engineers behind recent advances in Artificial Intelligence (AI) have claimed to demonstrate comparable capabilities. This paper presents a model that surpasses traditional ToM tests designed for 3-year-old children, providing strong support for the presence of ToM in AI systems.

[AI-13] Agent-Based Simulations of Online Political Discussions: A Case Study on Elections in Germany ESWC

Link: https://arxiv.org/abs/2503.24199
Authors: Abdul Sittar,Simon Münker,Fabio Sartori,Andreas Reitenbach,Achim Rettinger,Michael Mäs,Alenka Guček,Marko Grobelnik
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 15 pages, 3, ESWC, Workshop Paper

Abstract:User engagement on social media platforms is influenced by historical context, time constraints, and reward-driven interactions. This study presents an agent-based simulation approach that models user interactions, considering past conversation history, motivation, and resource constraints. Utilizing German Twitter data on political discourse, we fine-tune AI models to generate posts and replies, incorporating sentiment analysis, irony detection, and offensiveness classification. The simulation employs a myopic best-response model to govern agent behavior, accounting for decision-making based on expected rewards. Our results highlight the impact of historical context on AI-generated responses and demonstrate how engagement evolves under varying constraints.

[AI-14] Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms

Link: https://arxiv.org/abs/2503.24191
Authors: Shuoming Zhang,Jiacheng Zhao,Ruiyuan Xu,Xiaobing Feng,Huimin Cui
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 15 pages, 13 figures, 4 tables Work In Progress

Abstract:Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers. Large Language Models (LLMs) are extensively used as tooling platforms through structured output APIs that ensure syntax compliance, enabling robust integration with existing software such as agent systems. However, the very feature that enables grammar-guided structured output presents significant security vulnerabilities. In this work, we reveal a critical control-plane attack surface orthogonal to traditional data-plane vulnerabilities. We introduce Constrained Decoding Attack (CDA), a novel jailbreak class that weaponizes structured output constraints to bypass safety mechanisms. Unlike prior attacks focused on input prompts, CDA operates by embedding malicious intent in schema-level grammar rules (control-plane) while maintaining benign surface prompts (data-plane). We instantiate this with a proof-of-concept Chain Enum Attack, which achieves 96.2% attack success rates across proprietary and open-weight LLMs, including GPT-4o and Gemini-2.0-flash, on five safety benchmarks with a single query. Our findings identify a critical security blind spot in current LLM architectures and urge a paradigm shift in LLM safety to address control-plane vulnerabilities, as current mechanisms focused solely on data-plane threats leave critical systems exposed.

[AI-15] Predicting Targeted Therapy Resistance in Non-Small Cell Lung Cancer Using Multimodal Machine Learning

Link: https://arxiv.org/abs/2503.24165
Authors: Peiying Hua,Andrea Olofson,Faraz Farhadi,Liesbeth Hondelink,Gregory Tsongalis,Konstantin Dragnev,Dagmar Hoegemann Savellano,Arief Suriawinata,Laura Tafe,Saeed Hassanpour
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Lung cancer is the primary cause of cancer death globally, with non-small cell lung cancer (NSCLC) emerging as its most prevalent subtype. Among NSCLC patients, approximately 32.3% have mutations in the epidermal growth factor receptor (EGFR) gene. Osimertinib, a third-generation EGFR-tyrosine kinase inhibitor (TKI), has demonstrated remarkable efficacy in the treatment of NSCLC patients with activating and T790M resistance EGFR mutations. Despite its established efficacy, drug resistance poses a significant challenge for patients to fully benefit from osimertinib. The absence of a standard tool to accurately predict TKI resistance, including that of osimertinib, remains a critical obstacle. To bridge this gap, in this study, we developed an interpretable multimodal machine learning model designed to predict patient resistance to osimertinib among late-stage NSCLC patients with activating EGFR mutations, achieving a c-index of 0.82 on a multi-institutional dataset. This machine learning model harnesses readily available data routinely collected during patient visits and medical assessments to facilitate precision lung cancer management and informed treatment decisions. By integrating various data types such as histology images, next generation sequencing (NGS) data, demographics data, and clinical records, our multimodal model can generate well-informed recommendations. Our experiment results also demonstrated the superior performance of the multimodal model over single modality models (c-index 0.82 compared with 0.75 and 0.77), thus underscoring the benefit of combining multiple modalities in patient outcome prediction.
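
The c-index values quoted above (0.82 versus 0.75 and 0.77) measure how often a model ranks patients in the right order. A minimal sketch of Harrell's concordance index for right-censored data is given below, assuming that a higher predicted risk should correspond to an earlier event. The function name and the toy inputs in the usage note are hypothetical and unrelated to the paper's data.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an observed event
    and times[i] < times[j]. The pair is concordant when i also has the
    higher predicted risk; tied risks count as half.
    """
    if not (len(times) == len(events) == len(risks)):
        raise ValueError("times, events and risks must have the same length")
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects cannot be the earlier member of a pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering. For example, `concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.1, 0.4, 0.7])` has five comparable pairs, three of them concordant, giving 0.6.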

[AI-16] Learning a Canonical Basis of Human Preferences from Binary Ratings

Link: https://arxiv.org/abs/2503.24150
Authors: Kailas Vodrahalli,Wei Wei,James Zou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 25 pages, 11 figures

Abstract:Recent advances in generative AI have been driven by alignment techniques such as reinforcement learning from human feedback (RLHF). RLHF and related techniques typically involve constructing a dataset of binary or ranked choice human preferences and subsequently fine-tuning models to align with these preferences. This paper shifts the focus to understanding the preferences encoded in such datasets and identifying common human preferences. We find that a small subset of 21 preference categories (selected from a set of nearly 5,000 distinct preferences) captures 89% of preference variation across individuals. This small set of preferences is analogous to a canonical basis of human preferences, similar to established findings that characterize human variation in psychology or facial recognition studies. Through both synthetic and empirical evaluations, we confirm that our low-rank, canonical set of human preferences generalizes across the entire dataset and within specific topics. We further demonstrate our preference basis’ utility in model evaluation, where our preference categories offer deeper insights into model alignment, and in model training, where we show that fine-tuning on preference-defined subsets successfully aligns the model accordingly.

[AI-17] Resonance: Drawing from Memories to Imagine Positive Futures through AI-Augmented Journaling

Link: https://arxiv.org/abs/2503.24145
Authors: Wazeer Zulfikar,Treyden Chiaravalloti,Jocelyn Shen,Rosalind Picard,Pattie Maes
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 17 pages, 13 figures

Abstract:People inherently use experiences of their past while imagining their future, a capability that plays a crucial role in mental health. Resonance is an AI-powered journaling tool designed to augment this ability by offering AI-generated, action-oriented suggestions for future activities based on the user’s own past memories. Suggestions are offered when a new memory is logged and are followed by a prompt for the user to imagine carrying out the suggestion. In a two-week randomized controlled study (N=55), we found that using Resonance significantly improved mental health outcomes, reducing the users’ PHQ8 scores, a measure of current depression, and increasing their daily positive affect, particularly when they would likely act on the suggestion. Notably, the effectiveness of the suggestions was higher when they were personal, novel, and referenced the user’s logged memories. Finally, through open-ended feedback, we discuss the factors that encouraged or hindered the use of the tool.

[AI-18] Graph Neural Network-Based Predictive Modeling for Robotic Plaster Printing

Link: https://arxiv.org/abs/2503.24130
Authors: Diego Machain Rivera,Selen Ercan Jenny,Ping Hsun Tsai,Ena Lloret-Fritschi,Luis Salamanca,Fernando Perez-Cruz,Konstantinos E. Tatsis
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:This work proposes a Graph Neural Network (GNN) modeling approach to predict the resulting surface from a particle-based fabrication process. The latter consists of spray-based printing of cementitious plaster on a wall and is facilitated with the use of a robotic arm. The predictions are computed using the robotic arm trajectory features, such as position, velocity and direction, as well as the printing process parameters. The proposed approach, based on a particle representation of the wall domain and the end effector, allows for the adoption of a graph-based solution. The GNN model consists of an encoder-processor-decoder architecture and is trained using data from laboratory tests, while the hyperparameters are optimized by means of a Bayesian scheme. The aim of this model is to act as a simulator of the printing process, and ultimately to be used for the generation of the robotic arm trajectory and the optimization of the printing parameters, towards the materialization of an autonomous plastering process. The performance of the proposed model is assessed in terms of the prediction error against unseen ground truth data, which shows its generality in varied scenarios, as well as in comparison with the performance of an existing benchmark model. The results demonstrate a significant improvement over the benchmark model, with notably better performance and enhanced error scaling across prediction steps.

[AI-19] Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents

Link: https://arxiv.org/abs/2503.24047
Authors: Shuo Ren,Pu Jian,Zhenjiang Ren,Chunlin Leng,Can Xie,Jiajun Zhang
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 34 pages, 10 figures

Abstract:As scientific research becomes increasingly complex, innovative tools are needed to manage vast data, facilitate interdisciplinary collaboration, and accelerate discovery. Large language models (LLMs) are now evolving into LLM-based scientific agents that automate critical tasks, ranging from hypothesis generation and experiment design to data analysis and simulation. Unlike general-purpose LLMs, these specialized agents integrate domain-specific knowledge, advanced tool sets, and robust validation mechanisms, enabling them to handle complex data types, ensure reproducibility, and drive scientific breakthroughs. This survey provides a focused review of the architectures, design, benchmarks, applications, and ethical considerations surrounding LLM-based scientific agents. We highlight why they differ from general agents and the ways in which they advance research across various scientific fields. By examining their development and challenges, this survey offers a comprehensive roadmap for researchers and practitioners to harness these agents for more efficient, reliable, and ethically sound scientific discovery.

[AI-20] Pay More Attention to the Robustness of Prompt for Instruction Data Mining

Link: https://arxiv.org/abs/2503.24028
Authors: Qiang Wang,Dawei Feng,Xu Zhang,Ao Shen,Yang Xu,Bo Ding,Huaimin Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Instruction tuning has emerged as a paramount method for tailoring the behaviors of LLMs. Recent work has unveiled the potential for LLMs to achieve high performance through fine-tuning with a limited quantity of high-quality instruction data. Building upon this approach, we further explore the impact of prompt robustness on the selection of high-quality instruction data. This paper proposes a pioneering framework of high-quality online instruction data mining for instruction tuning, focusing on the impact of prompt robustness on the data mining process. Our notable innovation is to generate adversarial instruction data by attacking the prompts of online instruction data. We then introduce an Adversarial Instruction-Following Difficulty metric to measure how much help the adversarial instruction data can provide to the generation of the corresponding response. In addition, we propose a novel Adversarial Instruction Output Embedding Consistency approach to select high-quality online instruction data. We conduct extensive experiments on two benchmark datasets to assess the performance. The experimental results underscore the effectiveness of our two proposed methods, as well as the practical significance of considering prompt robustness.

[AI-21] Bayesian Predictive Coding

Link: https://arxiv.org/abs/2503.24016
Authors: Alexander Tschantz,Magnus Koudahl,Hampus Linander,Lancelot Da Costa,Conor Heins,Jeff Beck,Christopher Buckley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Predictive coding (PC) is an influential theory of information processing in the brain, providing a biologically plausible alternative to backpropagation. It is motivated in terms of Bayesian inference, as hidden states and parameters are optimised via gradient descent on variational free energy. However, implementations of PC rely on maximum a posteriori (MAP) estimates of hidden states and maximum likelihood (ML) estimates of parameters, limiting their ability to quantify epistemic uncertainty. In this work, we investigate a Bayesian extension to PC that estimates a posterior distribution over network parameters. This approach, termed Bayesian Predictive Coding (BPC), preserves the locality of PC and results in closed-form Hebbian weight updates. Compared to PC, our BPC algorithm converges in fewer epochs in the full-batch setting and remains competitive in the mini-batch setting. Additionally, we demonstrate that BPC offers uncertainty quantification comparable to existing methods in Bayesian deep learning, while also improving convergence properties. Together, these results suggest that BPC provides a biologically plausible method for Bayesian learning in the brain, as well as an attractive approach to uncertainty quantification in deep learning.

[AI-22] CITRAS: Covariate-Informed Transformer for Time Series Forecasting

Link: https://arxiv.org/abs/2503.24007
Authors: Yosuke Yamaguchi,Issei Suemitsu,Wenpeng Wei
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Covariates play an indispensable role in practical time series forecasting, offering rich context from the past and sometimes extending into the future. However, their availability varies depending on the scenario, and situations often involve multiple target variables simultaneously. Moreover, the cross-variate dependencies between them are multi-granular, with some covariates having a short-term impact on target variables and others showing long-term correlations. This heterogeneity and the intricate dependencies arising in covariate-informed forecasting present significant challenges to existing deep models. To address these issues, we propose CITRAS, a patch-based Transformer that flexibly leverages multiple targets and covariates covering both the past and the future forecasting horizon. While preserving the strong autoregressive capabilities of the canonical Transformer, CITRAS introduces two novel mechanisms in patch-wise cross-variate attention: Key-Value (KV) Shift and Attention Score Smoothing. KV Shift seamlessly incorporates future known covariates into the forecasting of target variables based on their concurrent dependencies. Additionally, Attention Score Smoothing transforms locally accurate patch-wise cross-variate dependencies into global variate-level dependencies by smoothing the past series of attention scores. Experimentally, CITRAS achieves state-of-the-art performance in both covariate-informed and multivariate forecasting, demonstrating its versatile ability to leverage cross-variate dependency for improved forecasting accuracy.

[AI-23] Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Link: https://arxiv.org/abs/2503.24000
Authors: Wei Gao,Xinyu Zhou,Peng Sun,Tianwei Zhang,Yonggang Wen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 18 figures, published to MLSys2025

Abstract:Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the computation cost. Despite the development of many compression algorithms, their applications in production environments are still not prevalent. In this paper, we revisit mainstream KV cache compression solutions from a practical perspective. Our contributions are three-fold. First, we comprehensively review existing algorithmic designs and benchmark studies for KV cache compression and identify missing pieces in their performance measurement, which could hinder their adoption in practice. Second, we empirically evaluate representative KV cache compression methods to uncover two key issues that affect the computational efficiency: (1) while compressing KV cache can reduce memory consumption, current implementations (e.g., FlashAttention, PagedAttention) do not optimize for production-level LLM serving, resulting in suboptimal throughput performance; (2) compressing KV cache may lead to longer outputs, resulting in increased end-to-end latency. We further investigate the accuracy performance of individual samples rather than the overall performance, revealing the intrinsic limitations in KV cache compression when handling specific LLM tasks. Third, we provide tools to shed light on future KV cache compression studies and facilitate their practical deployment in production. They are open-sourced in this https URL.
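
As background for why KV cache memory matters, a back-of-the-envelope size estimate for a standard decoder is sketched below: one key and one value vector per layer, per KV head, per token. The model dimensions in the usage note are hypothetical round numbers and are not taken from the paper, and the function name is an assumption.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit floating point storage.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size
```

For a hypothetical model with 32 layers, 32 KV heads, and a head dimension of 128 in 16-bit precision, each token costs 512 KiB, so a 4096-token sequence needs 2 GiB. Keeping only a quarter of the tokens would cut that to 512 MiB, although the abstract notes that such memory savings do not automatically translate into higher throughput or lower latency.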

[AI-24] Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics

Link: https://arxiv.org/abs/2503.23989
Authors: Aditya Pathak,Rachit Gandhi,Vaibhav Uttam,Devansh,Yashwanth Nakka,Aaryan Raj Jindal,Pratyush Ghosh,Arnav Ramamoorthy,Shreyash Verma,Aditya Mittal,Aashna Ased,Chirag Khatri,Jagat Sesh Challa,Dhruv Kumar
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Since the disruption in LLM technology brought about by the release of GPT-3 and ChatGPT, LLMs have shown remarkable promise in programming-related tasks. While code generation remains a popular field of research, code evaluation using LLMs remains a problem with no conclusive solution. In this paper, we focus on LLM-based code evaluation and attempt to fill in the existing gaps. We propose novel multi-agentic approaches using question-specific rubrics tailored to the problem statement, arguing that these perform better for logical assessment than the existing approaches that use question-agnostic rubrics. To address the lack of suitable evaluation datasets, we introduce two datasets: a Data Structures and Algorithms dataset containing 150 student submissions from a popular Data Structures and Algorithms practice website, and an Object Oriented Programming dataset comprising 80 student submissions from undergraduate computer science courses. Beyond standard metrics (Spearman Correlation, Cohen’s Kappa), we propose a new metric called Leniency, which quantifies evaluation strictness relative to expert assessment. Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance logical assessment of code in educational settings, providing better feedback aligned with instructional goals beyond mere syntactic correctness.
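
Of the standard metrics mentioned, Cohen's Kappa measures agreement between two graders after correcting for agreement expected by chance. A minimal sketch for categorical grades follows; the function name and the toy labels in the usage note are illustrative, and the paper's Leniency metric is not reproduced because the abstract does not define it.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected if the raters labelled independently
    with their own marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum((counts_a[label] / n) * (counts_b[label] / n)
                   for label in counts_a)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1.0 - expected)
```

For example, with expert grades `[1, 1, 0, 0]` and model grades `[1, 0, 0, 0]`, the observed agreement is 0.75 and the chance agreement is 0.5, giving a kappa of 0.5.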

[AI-25] Deep Learning Model Deployment in Multiple Cloud Providers: an Exploratory Study Using Low Computing Power Environments

Link: https://arxiv.org/abs/2503.23988
Authors: Elayne Lemos,Rodrigo Oliveira,Jairson Rodrigues,Rosalvo F. Oliveira Neto
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 15 pages, 7 figures

Abstract:Tech companies increasingly deploy Machine Learning models in the cloud. Hardware requirements are higher when these models involve Deep Learning (DL) techniques, and the cloud providers’ costs may be a barrier. We explore deploying DL models across three of the major cloud platforms (AWS, Google Cloud, Azure), using the GECToR model, a DL solution for Grammatical Error Correction, for our experiments. We evaluate real-time latency, hardware usage and cost at each cloud provider across 7 execution environments with 10 reproduced experiments. We found that while GPUs excel in performance, they had an average cost 300% higher than solutions without GPU. Our analysis also identifies that processor cache size is crucial for cost-effective CPU deployments, enabling over 50% of cost reduction compared to GPUs. This study demonstrates the feasibility and affordability of cloud-based DL inference solutions without GPUs, benefiting resource-constrained users like startups.

[AI-26] Noise-based reward-modulated learning

Link: https://arxiv.org/abs/2503.23972
Authors: Jesús García Fernández,Nasir Ahmad,Marcel van Gerven
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advances in reinforcement learning (RL) have led to significant improvements in task performance. However, training neural networks in an RL regime is typically achieved in combination with backpropagation, limiting their applicability in resource-constrained environments or when using non-differentiable neural networks. While noise-based alternatives like reward-modulated Hebbian learning (RMHL) have been proposed, their performance has remained limited, especially in scenarios with delayed rewards, which require retrospective credit assignment over time. Here, we derive a novel noise-based learning rule that addresses these challenges. Our approach combines directional derivative theory with Hebbian-like updates to enable efficient, gradient-free learning in RL. It features stochastic noisy neurons which can approximate gradients, and produces local synaptic updates modulated by a global reward signal. Drawing on concepts from neuroscience, our method uses reward prediction error as its optimization target to generate increasingly advantageous behavior, and incorporates an eligibility trace to facilitate temporal credit assignment in environments with delayed rewards. Its formulation relies on local information alone, making it compatible with implementations in neuromorphic hardware. Experimental validation shows that our approach significantly outperforms RMHL and is competitive with BP-based baselines, highlighting the promise of noise-based, biologically inspired learning for low-power and real-time applications.
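
To make the idea of gradient-free, noise-driven learning more concrete, here is a minimal sketch of a generic reward-modulated perturbation rule for a single linear neuron. It is not the rule derived in the paper: the task, the noise-free baseline pass used as the reward prediction, and all names and hyperparameters are assumptions chosen for illustration, and the eligibility trace for delayed rewards is left out.

```python
import random


def train_noisy_neuron(target_w, steps=5000, lr=0.02, sigma=0.1, seed=0):
    """Learn weights of a linear neuron from a scalar reward, without backprop.

    Assumed toy task: the reward is the negative squared distance between the
    neuron's output and the output of a hidden target neuron `target_w`.
    """
    rng = random.Random(seed)
    w = [0.0] * len(target_w)
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0) for _ in target_w]
        desired = sum(wi * xi for wi, xi in zip(target_w, x))
        y_clean = sum(wi * xi for wi, xi in zip(w, x))
        # Stochastic neuron: Gaussian noise is injected into the output.
        noise = rng.gauss(0.0, sigma)
        y_noisy = y_clean + noise
        # Reward prediction error: noisy reward minus the noise-free baseline.
        delta = -(y_noisy - desired) ** 2 + (y_clean - desired) ** 2
        # Local update: presynaptic input times injected noise, modulated by
        # the global scalar `delta`. In expectation this follows the reward
        # gradient, because E[delta * noise] = -2 * sigma**2 * (y_clean - desired).
        scale = lr * delta * noise / sigma ** 2
        w = [wi + scale * xi for wi, xi in zip(w, x)]
    return w
```

Calling `train_noisy_neuron([0.5, -1.0])` should return weights close to the target, which shows that correlating injected noise with a scalar reward signal can stand in for an explicit gradient in this simple setting.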

[AI-27] AI2Agent: An End-to-End Framework for Deploying AI Projects as Autonomous Agents

Link: https://arxiv.org/abs/2503.23948
Authors: Jiaxiang Chen,Jingwei Shi,Lei Gan,Jiale Zhang,Qingyu Zhang,Dongqian Zhang,Xin Pang,Zhucong Li,Yinghui Xu
Categories: Artificial Intelligence (cs.AI)

Abstract:As AI technology advances, it is driving innovation across industries, increasing the demand for scalable AI project deployment. However, deployment remains a critical challenge due to complex environment configurations, dependency conflicts, cross-platform adaptation, and debugging difficulties, which hinder automation and adoption. This paper introduces AI2Agent, an end-to-end framework that automates AI project deployment through guideline-driven execution, self-adaptive debugging, and case & solution accumulation. AI2Agent dynamically analyzes deployment challenges, learns from past cases, and iteratively refines its approach, significantly reducing human intervention. To evaluate its effectiveness, we conducted experiments on 30 AI deployment cases, covering TTS, text-to-image generation, image editing, and other AI applications. Results show that AI2Agent significantly reduces deployment time and improves success rates. The code and demo video are now publicly accessible.

[AI-28] Green MLOps to Green GenOps: An Empirical Study of Energy Consumption in Discriminative and Generative AI Operations

Link: https://arxiv.org/abs/2503.23934
Authors: Adrián Sánchez-Mompó,Ioannis Mavromatis,Peizheng Li,Konstantinos Katsaros,Aftab Khan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in MDPI Information - Artificial Intelligence Section

Abstract:This study presents an empirical investigation into the energy consumption of Discriminative and Generative AI models within real-world MLOps pipelines. For Discriminative models, we examine various architectures and hyperparameters during training and inference and identify energy-efficient practices. For Generative AI, Large Language Models (LLMs) are assessed, focusing primarily on energy consumption across different model sizes and varying service requests. Our study employs software-based power measurements, ensuring ease of replication across diverse configurations, models, and datasets. We analyse multiple models and hardware setups to uncover correlations among various metrics, identifying key contributors to energy consumption. The results indicate that for Discriminative models, optimising architectures, hyperparameters, and hardware can significantly reduce energy consumption without sacrificing performance. For LLMs, energy efficiency depends on balancing model size, reasoning complexity, and request-handling capacity, as larger models do not necessarily consume more energy when utilisation remains low. This analysis provides practical guidelines for designing green and sustainable ML operations, emphasising energy consumption and carbon footprint reductions while maintaining performance. This paper can serve as a benchmark for accurately estimating total energy use across different types of AI models.

[AI-29] What the F*ck Is Artificial General Intelligence?

Link: https://arxiv.org/abs/2503.23923
Authors: Michael Timothy Bennett
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint; 10 pages

Abstract:Artificial general intelligence (AGI) is an established field of research. Yet Melanie Mitchell and others have questioned if the term still has meaning. AGI has been subject to so much hype and speculation it has become something of a Rorschach test. Mitchell points out that the debate will only be settled through long term, scientific investigation. To that end here is a short, accessible and provocative overview of AGI. I compare definitions of intelligence, settling on intelligence in terms of adaptation and AGI as an artificial scientist. Taking my cue from Sutton’s Bitter Lesson I describe two foundational tools used to build adaptive systems: search and approximation. I compare pros, cons, hybrids and architectures like o3, AlphaGo, AERA, NARS and Hyperon. I then discuss overall meta-approaches to making systems behave more intelligently. I divide them into scale-maxing, simp-maxing, w-maxing based on the Bitter Lesson, Ockham’s and Bennett’s Razors. These maximise resources, simplicity of form, and the weakness of constraints on functionality. I discuss examples including AIXI, the free energy principle and The Embiggening of language models. I conclude that though scale-maxed approximation dominates, AGI will be a fusion of tools and meta-approaches. The Embiggening was enabled by improvements in hardware. Now the bottlenecks are sample and energy efficiency.

[AI-30] SchemaAgent: A Multi-Agents Framework for Generating Relational Database Schema

Link: https://arxiv.org/abs/2503.23886
Authors: Qin Wang,Youhuan Li,Yansong Feng,Si Chen,Ziming Li,Pan Zhang,Zhichao Shi,Yuequn Dou,chuchu Gao,Zebin Huang,Zihui Si,Yixuan Chen,Zhaohai Sun,Ke Tang,Wenqiang Jin
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: 19 pages, 16 figures

Abstract:Relational database design outputs a schema based on the user’s requirements, which defines table structures and the relations between them. Translating requirements into an accurate schema involves several non-trivial subtasks demanding both database expertise and domain-specific knowledge. This poses unique challenges for automated design of relational databases. Existing efforts are mostly based on customized rules or conventional deep learning models, often producing suboptimal schema. Recently, large language models (LLMs) have significantly advanced intelligent application development across various domains. In this paper, we propose SchemaAgent, a unified LLM-based multi-agent framework for the automated generation of high-quality database schema. SchemaAgent is the first to apply LLMs for schema generation, which emulates the workflow of manual schema design by assigning specialized roles to agents and enabling effective collaboration to refine their respective subtasks. Schema generation is a streamlined workflow, where directly applying the multi-agent framework may cause errors to compound. To address this, we incorporate dedicated roles for reflection and inspection, alongside an innovative error detection and correction mechanism to identify and rectify issues across various phases. For evaluation, we present a benchmark named RSchema, which contains more than 500 pairs of requirement description and schema. Experimental results on this benchmark demonstrate the superiority of our approach over mainstream LLMs for relational database schema generation.

[AI-31] GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models

Link: https://arxiv.org/abs/2503.23875
Authors: Wenkang Ji,Huaben Chen,Mingyang Chen,Guobin Zhu,Lufeng Xu,Roderich Groß,Rui Zhou,Ming Cao,Shiyu Zhao
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:The development of control policies for multi-robot systems traditionally follows a complex and labor-intensive process, often lacking the flexibility to adapt to dynamic tasks. This has motivated research on methods to automatically create control policies. However, these methods require iterative processes of manually crafting and refining objective functions, thereby prolonging the development cycle. This work introduces GenSwarm, an end-to-end system that leverages large language models to automatically generate and deploy control policies for multi-robot tasks based on simple user instructions in natural language. As a multi-language-agent system, GenSwarm achieves zero-shot learning, enabling rapid adaptation to altered or unseen tasks. The white-box nature of the code policies ensures strong reproducibility and interpretability. With its scalable software and hardware architectures, GenSwarm supports efficient policy deployment on both simulated and real-world multi-robot systems, realizing an instruction-to-execution end-to-end functionality that could prove valuable for robotics specialists and non-specialists alike. The code of the proposed GenSwarm system is available online: this https URL.

[AI-32] OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training

Link: https://arxiv.org/abs/2503.23830
Authors: Yijie Zheng,Bangjun Xiao,Lei Shi,Xiaoyang Li,Faming Wu,Tianyu Li,Xuefeng Xiao,Yang Zhang,Yuxuan Wang,Shouda Liu
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)

Abstract:Multimodal large language models (MLLMs), such as GPT-4o, are garnering significant attention. During the exploration of MLLM training, we identified Modality Composition Incoherence, a phenomenon in which the proportion of a certain modality varies dramatically across different examples. It exacerbates the challenges of addressing mini-batch imbalances, which lead to uneven GPU utilization between Data Parallel (DP) instances and severely degrade the efficiency and scalability of MLLM training, ultimately affecting training speed and hindering further research on MLLMs. To address these challenges, we introduce OrchMLLM, a comprehensive framework designed to mitigate the inefficiencies in MLLM training caused by Modality Composition Incoherence. First, we propose Batch Post-Balancing Dispatcher, a technique that efficiently eliminates mini-batch imbalances in sequential data. Additionally, we integrate MLLM Global Orchestrator into the training framework to orchestrate multimodal data and tackle the issues arising from Modality Composition Incoherence. We evaluate OrchMLLM across various MLLM sizes, demonstrating its efficiency and scalability. Experimental results reveal that OrchMLLM achieves a Model FLOPs Utilization (MFU) of 41.6% when training an 84B MLLM with three modalities on 2560 H100 GPUs, outperforming Megatron-LM by up to 3.1× in throughput.
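
The mini-batch imbalance problem described here can be illustrated with a generic load-balancing heuristic. The sketch below is not the paper's Batch Post-Balancing Dispatcher; it is the classic longest-processing-time-first greedy rule, applied to hypothetical per-example sequence lengths, and the function names are made up for this example.

```python
import heapq


def post_balance(seq_lengths, num_ranks):
    """Assign examples to data-parallel ranks so token loads are close to even.

    Greedy rule: take examples from longest to shortest and give each one to
    the rank with the smallest load so far.
    Returns one list of example indices per rank.
    """
    heap = [(0, rank) for rank in range(num_ranks)]  # (current load, rank id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    order = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
    for idx in order:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + seq_lengths[idx], rank))
    return assignment


def rank_loads(seq_lengths, assignment):
    """Total number of tokens each rank would process."""
    return [sum(seq_lengths[i] for i in idxs) for idxs in assignment]
```

For lengths `[900, 100, 120, 80, 500, 300]` on two ranks, a contiguous split gives loads of 1120 and 880 tokens, whereas this greedy assignment gives 1000 and 1000, so neither rank waits on the other.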

[AI-33] When Counterfactual Reasoning Fails: Chaos and Real-World Complexity

Link: https://arxiv.org/abs/2503.23820
Authors: Yahya Aalaila,Gerrit Großmann,Sumantrak Mukherjee,Jonas Wahl,Sebastian Vollmer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Counterfactual reasoning, a cornerstone of human cognition and decision-making, is often seen as the ‘holy grail’ of causal learning, with applications ranging from interpreting machine learning models to promoting algorithmic fairness. While counterfactual reasoning has been extensively studied in contexts where the underlying causal model is well-defined, real-world causal modeling is often hindered by model and parameter uncertainty, observational noise, and chaotic behavior. The reliability of counterfactual analysis in such settings remains largely unexplored. In this work, we investigate the limitations of counterfactual reasoning within the framework of Structural Causal Models. Specifically, we empirically investigate counterfactual sequence estimation and highlight cases where it becomes increasingly unreliable. We find that realistic assumptions, such as low degrees of model uncertainty or chaotic dynamics, can result in counterintuitive outcomes, including dramatic deviations between predicted and true counterfactual trajectories. This work urges caution when applying counterfactual reasoning in settings characterized by chaos and uncertainty. Furthermore, it raises the question of whether certain systems may pose fundamental limitations on the ability to answer counterfactual questions about their behavior.
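
A textbook way to see how a small amount of parameter uncertainty can derail a predicted trajectory is the logistic map. The sketch below is only an illustration of that general effect and is not taken from the paper's experiments; the map, the parameter values and the function names are assumptions.

```python
def logistic_trajectory(r, x0, steps):
    """Iterate the logistic map x -> r * x * (1 - x) and return all states."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def trajectory_gap(r_true=3.9, r_model=3.9 + 1e-6, x0=0.2, steps=200):
    """Per-step absolute gap between the 'true' system and a model whose
    parameter is off by one part in a million."""
    true_xs = logistic_trajectory(r_true, x0, steps)
    model_xs = logistic_trajectory(r_model, x0, steps)
    return [abs(a - b) for a, b in zip(true_xs, model_xs)]
```

The first few entries of `trajectory_gap()` stay tiny, but the gap grows roughly exponentially and later reaches the same order of magnitude as the states themselves, so a trajectory predicted with the slightly wrong parameter says little about the true one over a long horizon.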

[AI-34] Thinking Longer Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

Link: https://arxiv.org/abs/2503.23803
Authors: Yingwei Ma,Binhua Li,Yihong Dong,Xue Jiang,Rongyu Cao,Jue Chen,Fei Huang,Yongbin Li
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment challenges in private environments, prompting a critical question: How can personally deployable open-source LLMs achieve comparable code reasoning performance? To this end, we propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. Internally, we introduce a development-contextualized trajectory synthesis method leveraging real-world software repositories to bootstrap multi-stage reasoning processes, such as fault localization and patch generation. We further enhance trajectory quality through rejection sampling, rigorously evaluating trajectories along accuracy and complexity. Externally, we propose a novel development-process-based search strategy guided by reward models and execution verification. This approach enables targeted computational allocation at critical development decision points, overcoming limitations of existing “end-point only” verification methods. Evaluations on SWE-bench Verified demonstrate our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1. Additionally, we provide the empirical validation of the test-time scaling phenomenon within SWE agents, revealing that models dynamically allocate more tokens to increasingly challenging problems, effectively enhancing reasoning capabilities. We publicly release all training data, models, and code to facilitate future research: this https URL

[AI-35] DebFlow: Automating Agent Creation via Agent Debate

Link: https://arxiv.org/abs/2503.23781
Authors: Jinwei Su,Yinghui Xia,Ronghua Shi,Jianhui Wang,Jianuo Huang,Yijin Wang,Tianyu Shi,Yang Jingsong,Lewei He
Categories: Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have demonstrated strong potential and impressive performance in automating the generation and optimization of workflows. However, existing approaches are marked by limited reasoning capabilities, high computational demands, and significant resource requirements. To address these issues, we propose DebFlow, a framework that employs a debate mechanism to optimize workflows and integrates reflexion to improve based on previous experiences. We evaluated our method across six benchmark datasets, including HotpotQA, MATH, and ALFWorld. Our approach achieved a 3% average performance improvement over the latest baselines, demonstrating its effectiveness in diverse problem domains. In particular, during training, our framework reduces resource consumption by 37% compared to the state-of-the-art baselines. Additionally, we performed ablation studies. Removing the Debate component resulted in a 4% performance drop across two benchmark datasets, significantly greater than the 2% drop observed when the Reflection component was removed. These findings strongly demonstrate the critical role of Debate in enhancing framework performance, while also highlighting the auxiliary contribution of reflexion to overall optimization.

[AI-36] Unimodal-driven Distillation in Multimodal Emotion Recognition with Dynamic Fusion

Link: https://arxiv.org/abs/2503.23721
Authors: Jiagen Li,Rui Yu,Huihao Huang,Huaicheng Yan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Multimodal Emotion Recognition in Conversations (MERC) identifies emotional states across text, audio and video, which is essential for intelligent dialogue systems and opinion analysis. Existing methods emphasize heterogeneous modal fusion directly for cross-modal integration, but often suffer from disorientation in multimodal learning due to modal heterogeneity and lack of instructive guidance. In this work, we propose SUMMER, a novel heterogeneous multimodal integration framework leveraging Mixture of Experts with Hierarchical Cross-modal Fusion and Interactive Knowledge Distillation. Key components include a Sparse Dynamic Mixture of Experts (SDMoE) for capturing dynamic token-wise interactions, a Hierarchical Cross-Modal Fusion (HCMF) for effective fusion of heterogeneous modalities, and Interactive Knowledge Distillation (IKD), which uses a pre-trained unimodal teacher to guide multimodal fusion in latent and logit spaces. Experiments on IEMOCAP and MELD show SUMMER outperforms state-of-the-art methods, particularly in recognizing minority and semantically similar emotions.

[AI-37] GNN-Based Candidate Node Predictor for Influence Maximization in Temporal Graphs

Link: https://arxiv.org/abs/2503.23713
Authors: Priyanka Gautam,Balasubramaniam Natarajan,Sai Munikoti,S M Ferdous,Mahantesh Halappanavar
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, accepted to the AI4TS Workshop @ AAAI 2025

Abstract:In an age where information spreads rapidly across social media, effectively identifying influential nodes in dynamic networks is critical. Traditional influence maximization strategies often fail to keep up with rapidly evolving relationships and structures, leading to missed opportunities and inefficiencies. To address this, we propose a novel learning-based approach integrating Graph Neural Networks (GNNs) with Bidirectional Long Short-Term Memory (BiLSTM) models. This hybrid framework captures both structural and temporal dynamics, enabling accurate prediction of candidate nodes for seed set selection. The bidirectional nature of BiLSTM allows our model to analyze patterns from both past and future network states, ensuring adaptability to changes over time. By dynamically adapting to graph evolution at each time snapshot, our approach improves seed set calculation efficiency, achieving an average of 90% accuracy in predicting potential seed nodes across diverse networks. This significantly reduces computational overhead by optimizing the number of nodes evaluated for seed selection. Our method is particularly effective in fields like viral marketing and social network analysis, where understanding temporal dynamics is crucial.

[AI-38] Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios

Link: https://arxiv.org/abs/2503.23708
Authors: Jingzheng Li,Xianglong Liu,Shikui Wei,Zhijun Chen,Bing Li,Qing Guo,Xianqi Yang,Yanjun Pu,Jiakai Wang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Autonomous driving has made significant progress in both academia and industry, including performance improvements in perception tasks and the development of end-to-end autonomous driving systems. However, the safety and robustness assessment of autonomous driving has not received sufficient attention. Current evaluations of autonomous driving are typically conducted in natural driving scenarios, yet many accidents occur in edge cases, also known as safety-critical scenarios. These safety-critical scenarios are difficult to collect, and there is currently no clear definition of what constitutes a safety-critical scenario. In this work, we explore the safety and robustness of autonomous driving in safety-critical scenarios. First, we provide a definition of safety-critical scenarios, including static traffic scenarios such as adversarial attack scenarios and natural distribution shifts, as well as dynamic traffic scenarios such as accident scenarios. Then, we develop an autonomous driving safety testing platform to comprehensively evaluate autonomous driving systems, encompassing not only the assessment of perception modules but also system-level evaluations. Our work systematically constructs a safety verification process for autonomous driving, providing technical support for the industry to establish a standardized test framework and reduce risks in real-world road deployment.

[AI-39] MolGround: A Benchmark for Molecular Grounding

Link: https://arxiv.org/abs/2503.23668
Authors: Jiaxin Wu,Ting Zhang,Rubing Chen,Wengyu Zhang,Chen Jason Zhang,Xiaoyong Wei,Li Qing
Categories: Artificial Intelligence (cs.AI)

Abstract:Current molecular understanding approaches predominantly focus on the descriptive aspect of human perception, providing broad, topic-level insights. However, the referential aspect – linking molecular concepts to specific structural components – remains largely unexplored. To address this gap, we propose a molecular grounding benchmark designed to evaluate a model’s referential abilities. We align molecular grounding with established conventions in NLP, cheminformatics, and molecular science, showcasing the potential of NLP techniques to advance molecular understanding within the AI for Science movement. Furthermore, we constructed the largest molecular understanding benchmark to date, comprising 79k QA pairs, and developed a multi-agent grounding prototype as proof of concept. This system outperforms existing models, including GPT-4o, and its grounding outputs have been integrated to enhance traditional tasks such as molecular captioning and ATC (Anatomical, Therapeutic, Chemical) classification.

[AI-40] GIScience in the Era of Artificial Intelligence: A Research Agenda Towards Autonomous GIS

Link: https://arxiv.org/abs/2503.23633
Authors: Zhenlong Li,Huan Ning,Song Gao,Krzysztof Janowicz,Wenwen Li,Samantha T. Arundel,Chaowei Yang,Budhendra Bhaduri,Shaowen Wang,A-Xing Zhu,Mark Gahegan,Shashi Shekhar,Xinyue Ye,Grant McKenzie,Guido Cervone,Michael E. Hodgson
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Software Engineering (cs.SE)

Abstract:The advent of generative AI exemplified by large language models (LLMs) opens new ways to represent and compute geographic information and transcend the process of geographic knowledge production, driving geographic information systems (GIS) towards autonomous GIS. Leveraging LLMs as the decision core, autonomous GIS can independently generate and execute geoprocessing workflows to perform spatial analysis. In this vision paper, we elaborate on the concept of autonomous GIS and present a framework that defines its five autonomous goals, five levels of autonomy, five core functions, and three operational scales. We demonstrate how autonomous GIS could perform geospatial data retrieval, spatial analysis, and map making with four proof-of-concept GIS agents. We conclude by identifying critical challenges and future research directions, including fine-tuning and self-growing decision cores, autonomous modeling, and examining the ethical and practical implications of autonomous GIS. By establishing the groundwork for a paradigm shift in GIScience, this paper envisions a future where GIS moves beyond traditional workflows to autonomously reason, derive, innovate, and advance solutions to pressing global challenges.

[AI-41] Intrinsically-Motivated Humans and Agents in Open-World Exploration

Link: https://arxiv.org/abs/2503.23631
Authors: Aly Lidayan,Yuqing Du,Eliza Kosoy,Maria Rufova,Pieter Abbeel,Alison Gopnik
Categories: Artificial Intelligence (cs.AI)

Abstract:What drives exploration? Understanding intrinsic motivation is a long-standing challenge in both cognitive science and artificial intelligence; numerous objectives have been proposed and used to train agents, yet there remains a gap between human and agent exploration. We directly compare adults, children, and AI agents in a complex open-ended environment, Crafter, and study how common intrinsic objectives: Entropy, Information Gain, and Empowerment, relate to their behavior. We find that only Entropy and Empowerment are consistently positively correlated with human exploration progress, indicating that these objectives may better inform intrinsic reward design for agents. Furthermore, across agents and humans we observe that Entropy initially increases rapidly, then plateaus, while Empowerment increases continuously, suggesting that state diversity may provide more signal in early exploration, while advanced exploration should prioritize control. Finally, we find preliminary evidence that private speech utterances, and particularly goal verbalizations, may aid exploration in children.
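
As a rough illustration of an entropy-style exploration objective, the sketch below computes the Shannon entropy of an agent's empirical state-visitation distribution. How the paper actually defines and estimates Entropy in Crafter is not stated in the abstract, so this formulation and the function name are assumptions.

```python
import math
from collections import Counter


def visitation_entropy(visited_states):
    """Shannon entropy (in bits) of the empirical distribution over the
    hashable states an agent has visited."""
    counts = Counter(visited_states)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

An agent that keeps returning to one state scores 0 bits, while one that spreads its visits evenly over four states scores 2 bits, which matches the intuition that this quantity rewards state diversity.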

[AI-42] Finding Interest Needle in Popularity Haystack: Improving Retrieval by Modeling Item Exposure

Link: https://arxiv.org/abs/2503.23630
Authors: Amit Jaspal,Rahul Agarwal
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 2 pages

Abstract:Recommender systems operate in closed feedback loops, where user interactions reinforce popularity bias, leading to over-recommendation of already popular items while under-exposing niche or novel content. Existing bias mitigation methods, such as Inverse Propensity Scoring (IPS) and Off-Policy Correction (OPC), primarily operate at the ranking stage or during training, lacking explicit real-time control over exposure dynamics. In this work, we introduce an exposure-aware retrieval scoring approach, which explicitly models item exposure probability and adjusts retrieval-stage ranking at inference time. Unlike prior work, this method decouples exposure effects from engagement likelihood, enabling controlled trade-offs between fairness and engagement in large-scale recommendation platforms. We validate our approach through online A/B experiments in a real-world video recommendation system, demonstrating a 25% increase in uniquely retrieved items and a 40% reduction in the dominance of over-popular content, all while maintaining overall user engagement levels. Our results establish a scalable, deployable solution for mitigating popularity bias at the retrieval stage, offering a new paradigm for bias-aware personalization.
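
One way to picture an inference-time exposure adjustment is to subtract a scaled log-exposure term from each candidate's engagement score. The abstract does not give the actual scoring function, so the formula, the `strength` knob and the tuple layout below are hypothetical and serve only to show how such a trade-off could be controlled.

```python
import math


def exposure_aware_rank(candidates, strength=1.0, floor=1e-6):
    """Rank retrieval candidates with a penalty on already-exposed items.

    candidates: list of (item_id, engagement_score, exposure_probability).
    strength:   hypothetical knob; 0 recovers plain engagement ranking.
    floor:      lower bound on exposure to keep the logarithm finite.
    """
    def adjusted(item):
        _, engagement, exposure = item
        # Rarely exposed items (small probability) receive a larger boost.
        return engagement - strength * math.log(max(exposure, floor))

    ranked = sorted(candidates, key=adjusted, reverse=True)
    return [item_id for item_id, _, _ in ranked]
```

With `strength=0` a popular item with a slightly higher engagement score stays on top; raising `strength` lets a niche item with low exposure overtake it, which is the kind of fairness versus engagement trade-off the abstract describes.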

[AI-43] Beyond Detection: Designing AI-Resilient Assessments with Automated Feedback Tool to Foster Critical Thinking

Link: https://arxiv.org/abs/2503.23622
Authors: Muhammad Sajjad Akbar
Categories: Artificial Intelligence (cs.AI)

Abstract:The growing use of generative AI tools like ChatGPT has raised urgent concerns about their impact on student learning, particularly the potential erosion of critical thinking and creativity. As students increasingly turn to these tools to complete assessments, foundational cognitive skills are at risk of being bypassed, challenging the integrity of higher education and the authenticity of student work. Existing AI-generated text detection tools are inadequate; they produce unreliable outputs and are prone to both false positives and false negatives, especially when students apply paraphrasing, translation, or rewording. These systems rely on shallow statistical patterns rather than true contextual or semantic understanding, making them unsuitable as definitive indicators of AI misuse. In response, this research proposes a proactive, AI-resilient solution based on assessment design rather than detection. It introduces a web-based Python tool that integrates Bloom’s Taxonomy with advanced natural language processing techniques including GPT-3.5 Turbo, BERT-based semantic similarity, and TF-IDF metrics to evaluate the AI-solvability of assessment tasks. By analyzing surface-level and semantic features, the tool helps educators determine whether a task targets lower-order thinking such as recall and summarization or higher-order skills such as analysis, evaluation, and creation, which are more resistant to AI automation. This framework empowers educators to design cognitively demanding, AI-resistant assessments that promote originality, critical thinking, and fairness. It offers a sustainable, pedagogically sound strategy to foster authentic learning and uphold academic standards in the age of AI.

[AI-44] Graph-Eq: Discovering Mathematical Equations using Graph Generative Models

Link: https://arxiv.org/abs/2503.23617
Authors: Nisal Ranasinghe,Damith Senanayake,Saman Halgamuge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures

Abstract:The ability to discover meaningful, accurate, and concise mathematical equations that describe datasets is valuable across various domains. Equations offer explicit relationships between variables, enabling deeper insights into underlying data patterns. Most existing equation discovery methods rely on genetic programming, which iteratively searches the equation space but is often slow and prone to overfitting. By representing equations as directed acyclic graphs, we use graph neural networks to learn the underlying semantics of equations and generate new, previously unseen equations. Although graph generative models have been shown to be successful in discovering new types of graphs in many fields, their application in discovering equations remains largely unexplored. In this work, we propose Graph-Eq, a deep graph generative model designed for efficient equation discovery. Graph-Eq uses a conditional variational autoencoder (CVAE) to learn a rich latent representation of the equation space by training it on a large corpus of equations in an unsupervised manner. Instead of directly searching the equation space, we employ Bayesian optimization to efficiently explore this learned latent space. We show that the encoder-decoder architecture of Graph-Eq is able to accurately reconstruct input equations. Moreover, we show that the learned latent representation can be sampled and decoded into valid equations, including new equations not seen in the training data. Finally, we assess Graph-Eq’s ability to discover equations that best fit a dataset by exploring the latent space using Bayesian optimization. Latent space exploration is done on 20 datasets with known ground-truth equations, and Graph-Eq is shown to successfully discover the ground-truth equation in the majority of datasets.
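
The idea of representing an equation as a directed acyclic graph can be sketched in a few lines. The node encoding, operator set and function names below are assumptions made for this illustration and are not Graph-Eq's actual data format; the sketch only shows how a shared subexpression becomes a single node with several parents.

```python
import math

# Hypothetical operator vocabulary for the illustration.
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sin": lambda a: math.sin(a),
}


def evaluate_dag(nodes, output, inputs):
    """Evaluate an equation stored as a DAG.

    nodes maps a node name to ("var", name), ("const", value) or
    (op_name, *parent_node_names). Each node is computed once and cached.
    """
    cache = {}

    def value(name):
        if name not in cache:
            kind, *args = nodes[name]
            if kind == "var":
                cache[name] = inputs[args[0]]
            elif kind == "const":
                cache[name] = args[0]
            else:
                cache[name] = OPS[kind](*(value(parent) for parent in args))
        return cache[name]

    return value(output)


# y = x * x + sin(x); the node "x" feeds both the product and the sine.
EQUATION = {
    "x": ("var", "x"),
    "sq": ("mul", "x", "x"),
    "s": ("sin", "x"),
    "y": ("add", "sq", "s"),
}
```

Evaluating `evaluate_dag(EQUATION, "y", {"x": 2.0})` gives `4 + sin(2)`. A generative model over such graphs would propose new node and edge sets, which could then be scored against a dataset by evaluating them in this way.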

[AI-45] An Organizationally-Oriented Approach to Enhancing Explainability and Control in Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2503.23615
Authors: Julien Soulé, Jean-Paul Jamont, Michel Occello, Louis-Marie Traonouez, Paul Théron
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multi-Agent Reinforcement Learning can lead to the development of collaborative agent behaviors that show similarities with organizational concepts. Pushing forward this perspective, we introduce a novel framework that explicitly incorporates organizational roles and goals from the $\mathcal{M}OISE^+$ model into the MARL process, guiding agents to satisfy corresponding organizational constraints. By structuring training with roles and goals, we aim to enhance both the explainability and control of agent behaviors at the organizational level, whereas much of the literature primarily focuses on individual agents. Additionally, our framework includes a post-training analysis method to infer implicit roles and goals, offering insights into emergent agent behaviors. This framework has been applied across various MARL environments and algorithms, demonstrating coherence between predefined organizational specifications and those inferred from trained agents.

[AI-46] Partial Transportability for Domain Generalization

Link: https://arxiv.org/abs/2503.23605
Authors: Kasra Jalaldoust, Alexis Bellot, Elias Bareinboim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: this http URL

Abstract:A fundamental task in AI is providing performance guarantees for predictions made in unseen domains. In practice, there can be substantial uncertainty about the distribution of new data, and corresponding variability in the performance of existing predictors. Building on the theory of partial identification and transportability, this paper introduces new results for bounding the value of a functional of the target distribution, such as the generalization error of a classifier, given data from source domains and assumptions about the data generating mechanisms, encoded in causal diagrams. Our contribution is to provide the first general estimation technique for transportability problems, adapting existing parameterization schemes such as Neural Causal Models to encode the structural constraints necessary for cross-population inference. We demonstrate the expressiveness and consistency of this procedure and further propose a gradient-based optimization scheme for making scalable inferences in practice. Our results are corroborated with experiments.

[AI-47] A Survey on Unlearnable Data

Link: https://arxiv.org/abs/2503.23536
Authors: Jiahao Li, Yiqiang Chen, Yunbing Xing, Yang Gu, Xiangyuan Lan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 31 pages, 3 figures

Abstract:Unlearnable data (ULD) has emerged as an innovative defense technique to prevent machine learning models from learning meaningful patterns from specific data, thus protecting data privacy and security. By introducing perturbations to the training data, ULD degrades model performance, making it difficult for unauthorized models to extract useful representations. Despite the growing significance of ULD, existing surveys predominantly focus on related fields, such as adversarial attacks and machine unlearning, with little attention given to ULD as an independent area of study. This survey fills that gap by offering a comprehensive review of ULD, examining unlearnable data generation methods, public benchmarks, evaluation metrics, theoretical foundations and practical applications. We compare and contrast different ULD approaches, analyzing their strengths, limitations, and trade-offs related to unlearnability, imperceptibility, efficiency and robustness. Moreover, we discuss key challenges, such as balancing perturbation imperceptibility with model degradation and the computational complexity of ULD generation. Finally, we highlight promising future research directions to advance the effectiveness and applicability of ULD, underscoring its potential to become a crucial tool in the evolving landscape of data protection in machine learning.

[AI-48] Buffer is All You Need: Defending Federated Learning against Backdoor Attacks under Non-iids via Buffering

Link: https://arxiv.org/abs/2503.23511
Authors: Xingyu Lyu, Ning Wang, Yang Xiao, Shixiong Li, Tao Li, Danjue Chen, Yimin Chen
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated Learning (FL) is a popular paradigm enabling clients to jointly train a global model without sharing raw data. However, FL is known to be vulnerable to backdoor attacks due to its distributed nature. As participants, attackers can upload model updates that effectively compromise FL. What’s worse, existing defenses are mostly designed under independent-and-identically-distributed (iid) settings, hence neglecting the fundamental non-iid characteristic of FL. Here we propose FLBuff for tackling backdoor attacks even under non-iids. The main challenge for such defenses is that non-iids bring benign and malicious updates closer, making them harder to separate. FLBuff is inspired by our insight that non-iids can be modeled as omni-directional expansion in representation space while backdoor attacks as uni-directional. This leads to the key design of FLBuff, i.e., a supervised-contrastive-learning model extracting penultimate-layer representations to create a large in-between buffer layer. Comprehensive evaluations demonstrate that FLBuff consistently outperforms state-of-the-art defenses.

[AI-49] A Systematic Decade Review of Trip Route Planning with Travel Time Estimation based on User Preferences and Behavior

Link: https://arxiv.org/abs/2503.23486
Authors: Nikil Jayasuriya, Deshan Sumanathilaka
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages, 2 figures, 1 table

Abstract:This paper systematically explores the advancements in adaptive trip route planning and travel time estimation (TTE) through Artificial Intelligence (AI). With the increasing complexity of urban transportation systems, traditional navigation methods often struggle to accommodate dynamic user preferences, real-time traffic conditions, and scalability requirements. This study explores the contributions of established AI techniques, including Machine Learning (ML), Reinforcement Learning (RL), and Graph Neural Networks (GNNs), alongside emerging methodologies like Meta-Learning, Explainable AI (XAI), Generative AI, and Federated Learning. In addition to highlighting these innovations, the paper identifies critical challenges such as ethical concerns, computational scalability, and effective data integration, which must be addressed to advance the field. The paper concludes with recommendations for leveraging AI to build efficient, transparent, and sustainable navigation systems.

[AI-50] Handling Delay in Real-Time Reinforcement Learning ICLR2025 ALT

Link: https://arxiv.org/abs/2503.23478
Authors: Ivan Anokhin, Rishav Rishav, Matthew Riemer, Stephen Chung, Irina Rish, Samira Ebrahimi Kahou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted at ICLR 2025. Code available at this https URL

Abstract:Real-time reinforcement learning (RL) introduces several challenges. First, policies are constrained to a fixed number of actions per second due to hardware limitations. Second, the environment may change while the network is still computing an action, leading to observational delay. The first issue can partly be addressed with pipelining, leading to higher throughput and potentially better policies. However, the second issue remains: if each neuron operates in parallel with an execution time of $\tau$, an $N$-layer feed-forward network experiences observation delay of $\tau N$. Reducing the number of layers can decrease this delay, but at the cost of the network’s expressivity. In this work, we explore the trade-off between minimizing delay and the network’s expressivity. We present a theoretically motivated solution that leverages temporal skip connections combined with history-augmented observations. We evaluate several architectures and show that those incorporating temporal skip connections achieve strong performance across various neuron execution times, reinforcement learning algorithms, and environments, including four Mujoco tasks and all MinAtar games. Moreover, we demonstrate that parallel neuron computation can accelerate inference by 6-350% on standard hardware. Our investigation into temporal skip connections and parallel computations paves the way for more efficient RL agents in real-time settings.
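
The delay argument in the abstract can be checked with a toy simulation. The sketch below is an assumption-laden illustration, not the authors' architecture: time advances in ticks of length τ, every layer reads only what the layer below produced on the previous tick, and the "skip connection" variant simply lets each layer also read the raw observation from the previous tick. The function name `freshest_observation_age` is hypothetical.

```python
def freshest_observation_age(num_layers, num_ticks, skip_connections):
    """Age, in ticks, of the newest observation that influences the final layer."""
    if num_ticks <= num_layers:
        raise ValueError("run for more ticks than layers so the pipeline fills up")
    # newest[i] is the timestamp of the newest observation reflected in layer i's
    # output; index 0 is the raw observation, None means nothing has arrived yet.
    newest = [None] * (num_layers + 1)
    for tick in range(num_ticks):
        previous = list(newest)  # snapshot, because all layers update in parallel
        newest[0] = tick         # a fresh observation arrives on every tick
        for layer in range(1, num_layers + 1):
            candidates = [previous[layer - 1]]
            if skip_connections:
                candidates.append(previous[0])  # direct path from the observation
            seen = [c for c in candidates if c is not None]
            newest[layer] = max(seen) if seen else None
    return (num_ticks - 1) - newest[num_layers]

print(freshest_observation_age(4, 10, skip_connections=False))
print(freshest_observation_age(4, 10, skip_connections=True))
```

Without skips, the newest observation visible at the output of a 4-layer network is 4 ticks old, matching the τN delay stated in the abstract; with the direct paths assumed here it is 1 tick old.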

[AI-51] From Content Creation to Citation Inflation: A GenAI Case Study

Link: https://arxiv.org/abs/2503.23414
Authors: Haitham S. Al-Sinani, Chris J. Mitchell
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 20 pages

Abstract:This paper investigates the presence and impact of questionable, AI-generated academic papers on widely used preprint repositories, with a focus on their role in citation manipulation. Motivated by suspicious patterns observed in publications related to our ongoing research on GenAI-enhanced cybersecurity, we identify clusters of questionable papers and profiles. These papers frequently exhibit minimal technical content, repetitive structure, unverifiable authorship, and mutually reinforcing citation patterns among a recurring set of authors. To assess the feasibility and implications of such practices, we conduct a controlled experiment: generating a fake paper using GenAI, embedding citations to suspected questionable publications, and uploading it to one such repository (ResearchGate). Our findings demonstrate that such papers can bypass platform checks, remain publicly accessible, and contribute to inflating citation metrics like the H-index and i10-index. We present a detailed analysis of the mechanisms involved, highlight systemic weaknesses in content moderation, and offer recommendations for improving platform accountability and preserving academic integrity in the age of GenAI.
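
To make the citation metrics named in the abstract concrete, here is a minimal sketch of how the h-index and i10-index are computed from a list of per-paper citation counts, using their standard definitions. The citation counts are made up for illustration and are not data from the paper.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)

before = [12, 9, 9, 4, 4]        # invented citation counts for one profile
after = [c + 1 for c in before]  # a single new paper that cites every entry once

print(h_index(before), i10_index(before))
print(h_index(after), i10_index(after))
```

In this invented profile, one extra citing paper moves the h-index from 4 to 5 and the i10-index from 1 to 3, which shows why papers sitting just below a threshold make these metrics sensitive to a small number of added citations.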

[AI-52] Scaling Auditory Cognition via Test-Time Compute in Audio Language Models

Link: https://arxiv.org/abs/2503.23395
Authors: Ting Dang, Yan Gao, Hong Jia
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Large language models (LLMs) have shown exceptional versatility in natural language processing, prompting recent efforts to extend their multimodal capabilities to speech processing through the development of audio large language models (Audio LLMs). While Audio LLMs excel in tasks such as speech recognition and synthesis, it remains unclear how they perform when faced with the auditory cognitive challenges posed by real-world environments, such as audio comprehension and listening recall, particularly in the presence of background noise or overlapping speech. Unlike text-based LLMs, which have access to vast amounts of text data for pre-training, retraining Audio LLMs with diverse auditory cognitive scenes is difficult due to the limited datasets that simulate real-world auditory cognitive scenarios and the challenge of acquiring auditory cognitive labels for training. While test-time compute (TTC) methods have been shown to enhance the capabilities of text-based LLMs during inference, a key challenge lies in designing these TTC methods to improve the auditory capabilities of Audio LLMs. This study aims to address these two research gaps by: i) exploring the auditory cognitive capabilities of Audio LLMs, and ii) enhancing their capabilities using TTC approaches. We have investigated five different Audio LLMs for auditory cognition using a self-collected database and have proposed five TTC approaches to enhance auditory cognitive capabilities during inference. Our findings reveal that the performance of Audio LLMs decreases in more challenging auditory cognitive tasks. The proposed TTC approaches significantly enhance cognitive auditory capabilities, advancing the development of more adaptable and resilient Audio LLMs for practical applications such as assistive listening devices, voice-based AI assistants, and communication technologies.

[AI-53] Pareto Continual Learning: Preference-Conditioned Learning and Adaption for Dynamic Stability-Plasticity Trade-off

Link: https://arxiv.org/abs/2503.23390
Authors: Song Lai, Zhe Zhao, Fei Zhu, Xi Lin, Qingfu Zhang, Gaofeng Meng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Continual learning aims to learn multiple tasks sequentially. A key challenge in continual learning is balancing between two objectives: retaining knowledge from old tasks (stability) and adapting to new tasks (plasticity). Experience replay methods, which store and replay past data alongside new data, have become a widely adopted approach to mitigate catastrophic forgetting. However, these methods neglect the dynamic nature of the stability-plasticity trade-off and aim to find a fixed and unchanging balance, resulting in suboptimal adaptation during training and inference. In this paper, we propose Pareto Continual Learning (ParetoCL), a novel framework that reformulates the stability-plasticity trade-off in continual learning as a multi-objective optimization (MOO) problem. ParetoCL introduces a preference-conditioned model to efficiently learn a set of Pareto optimal solutions representing different trade-offs and enables dynamic adaptation during inference. From a generalization perspective, ParetoCL can be seen as an objective augmentation approach that learns from different objective combinations of stability and plasticity. Extensive experiments across multiple datasets and settings demonstrate that ParetoCL outperforms state-of-the-art methods and adapts to diverse continual learning scenarios.
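
The notion of Pareto-optimal stability-plasticity trade-offs can be illustrated with a toy example. The sketch below is not ParetoCL itself: it assumes each candidate solution is summarized by a made-up pair (stability loss, plasticity loss), filters the non-dominated pairs, and picks one by a linear weighting of the two losses. The linear weighting and both function names are assumptions for illustration.

```python
def pareto_front(points):
    """Keep the points that no other point beats on both losses (lower is better)."""
    front = []
    for p in points:
        dominated = any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)
        if not dominated:
            front.append(p)
    return front

def pick_for_preference(points, preference):
    """Choose the point minimizing a weighted sum; preference=1 favors stability."""
    return min(points, key=lambda p: preference * p[0] + (1.0 - preference) * p[1])

# Invented (stability_loss, plasticity_loss) pairs for four candidate solutions.
candidates = [(0.2, 0.9), (0.4, 0.5), (0.8, 0.3), (0.6, 0.6)]

front = pareto_front(candidates)
print(front)
print(pick_for_preference(front, preference=0.9))
print(pick_for_preference(front, preference=0.1))
```

The pair (0.6, 0.6) is dropped because (0.4, 0.5) is better on both losses; changing the preference then selects a different point on the remaining front, which is the kind of inference-time adaptation the abstract describes.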

[AI-54] A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models

Link: https://arxiv.org/abs/2503.23350
Authors: Liangbo Ning, Ziran Liang, Zhuohang Jiang, Haohao Qu, Yujuan Ding, Wenqi Fan, Xiao-yong Wei, Shanru Lin, Hui Liu, Philip S. Yu, Qing Li
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Advances in web techniques have significantly revolutionized various aspects of people’s lives. Despite the importance of the web, many tasks performed on it are repetitive and time-consuming, negatively impacting overall quality of life. To efficiently handle these tedious daily tasks, one of the most promising approaches is to advance autonomous agents based on Artificial Intelligence (AI) techniques, referred to as AI Agents, as they can operate continuously without fatigue or performance degradation. In the context of the web, leveraging AI Agents, termed WebAgents, to automatically assist people in handling tedious daily tasks can dramatically enhance productivity and efficiency. Recently, Large Foundation Models (LFMs) containing billions of parameters have exhibited human-like language understanding and reasoning capabilities, showing proficiency in performing various complex tasks. This naturally raises the question: “Can LFMs be utilized to develop powerful AI Agents that automatically handle web tasks, providing significant convenience to users?” To fully explore the potential of LFMs, extensive research has emerged on WebAgents designed to complete daily web tasks according to user instructions, significantly enhancing the convenience of daily human life. In this survey, we comprehensively review existing research studies on WebAgents across three key aspects: architectures, training, and trustworthiness. Additionally, several promising directions for future research are explored to provide deeper insights.

[AI-55] A Multi-Agent Framework with Automated Decision Rule Optimization for Cross-Domain Misinformation Detection

Link: https://arxiv.org/abs/2503.23329
Authors: Hui Li, Ante Wang, Kunquan Li, Zhihao Wang, Liang Zhang, Delai Qiu, Qingsong Liu, Jinsong Su
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Misinformation spans various domains, but detection methods trained on specific domains often perform poorly when applied to others. With the rapid development of Large Language Models (LLMs), researchers have begun to utilize LLMs for cross-domain misinformation detection. However, existing LLM-based methods often fail to adequately analyze news in the target domain, limiting their detection capabilities. More importantly, these methods typically rely on manually designed decision rules, which are limited by domain knowledge and expert experience, thus limiting the generalizability of decision rules to different domains. To address these issues, we propose a Multi-Agent Framework for cross-domain misinformation detection with Automated Decision Rule Optimization (MARO). Under this framework, we first employ multiple expert agents to analyze target-domain news. Subsequently, we introduce a question-reflection mechanism that guides expert agents to facilitate higher-quality analysis. Furthermore, we propose a decision rule optimization approach based on carefully designed cross-domain validation tasks to iteratively enhance the effectiveness of decision rules in different domains. Experimental results and in-depth analysis on commonly used datasets demonstrate that MARO achieves significant improvements over existing methods.

[AI-56] Exploring Explainable Multi-player MCTS-minimax Hybrids in Board Game Using Process Mining AAAI2025

Link: https://arxiv.org/abs/2503.23326
Authors: Yiyu Qian, Tim Miller, Zheng Qian, Liyuan Zhao
Categories: Artificial Intelligence (cs.AI)
Comments: 36 pages, AAAI 2025 PRL

Abstract:Monte-Carlo Tree Search (MCTS) is a family of sampling-based search algorithms widely used for online planning in sequential decision-making domains and at the heart of many recent advances in artificial intelligence. Understanding the behavior of MCTS agents is difficult for developers and users due to the frequently large and complex search trees that result from the simulation of many possible futures, their evaluations, and their relationships. This paper presents our ongoing investigation into potential explanations for the decision-making and behavior of MCTS. A weakness of MCTS is that it constructs a highly selective tree and, as a result, can miss crucial moves and fall into tactical traps. Full-width minimax search constitutes the solution. We integrate shallow minimax search into the rollout phase of multi-player MCTS and use process mining techniques to explain agents’ strategies in 3v3 checkers.

[AI-57] AI Agents in Engineering Design: A Multi-Agent Framework for Aesthetic and Aerodynamic Car Design

Link: https://arxiv.org/abs/2503.23315
Authors: Mohamed Elrefaie, Janet Qian, Raina Wu, Qian Chen, Angela Dai, Faez Ahmed
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:

Abstract:We introduce the concept of “Design Agents” for engineering applications, particularly focusing on the automotive design process, while emphasizing that our approach can be readily extended to other engineering and design domains. Our framework integrates AI-driven design agents into the traditional engineering workflow, demonstrating how these specialized computational agents interact seamlessly with engineers and designers to augment creativity, enhance efficiency, and significantly accelerate the overall design cycle. By automating and streamlining tasks traditionally performed manually, such as conceptual sketching, styling enhancements, 3D shape retrieval and generative modeling, computational fluid dynamics (CFD) meshing, and aerodynamic simulations, our approach reduces certain aspects of the conventional workflow from weeks and days down to minutes. These agents leverage state-of-the-art vision-language models (VLMs), large language models (LLMs), and geometric deep learning techniques, providing rapid iteration and comprehensive design exploration capabilities. We ground our methodology in industry-standard benchmarks, encompassing a wide variety of conventional automotive designs, and utilize high-fidelity aerodynamic simulations to ensure practical and applicable outcomes. Furthermore, we present design agents that can swiftly and accurately predict simulation outcomes, empowering engineers and designers to engage in more informed design optimization and exploration. This research underscores the transformative potential of integrating advanced generative AI techniques into complex engineering tasks, paving the way for broader adoption and innovation across multiple engineering disciplines.

[AI-58] SalesRLAgent: A Reinforcement Learning Approach for Real-Time Sales Conversion Prediction and Optimization

Link: https://arxiv.org/abs/2503.23303
Authors: Nandakishor M
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Current approaches to sales conversation analysis and conversion prediction typically rely on Large Language Models (LLMs) combined with basic retrieval augmented generation (RAG). These systems, while capable of answering questions, fail to accurately predict conversion probability or provide strategic guidance in real time. In this paper, we present SalesRLAgent, a novel framework leveraging specialized reinforcement learning to predict conversion probability throughout sales conversations. Unlike systems from this http URL, Mendable, Inkeep, and others that primarily use off-the-shelf LLMs for content generation, our approach treats conversion prediction as a sequential decision problem, training on synthetic data generated using GPT-4O to develop a specialized probability estimation model. Our system incorporates Azure OpenAI embeddings (3072 dimensions), turn-by-turn state tracking, and meta-learning capabilities to understand its own knowledge boundaries. Evaluations demonstrate that SalesRLAgent achieves 96.7% accuracy in conversion prediction, outperforming LLM-only approaches by 34.7% while offering significantly faster inference (85ms vs 3450ms for GPT-4). Furthermore, integration with existing sales platforms shows a 43.2% increase in conversion rates when representatives utilize our system’s real-time guidance. SalesRLAgent represents a fundamental shift from content generation to strategic sales intelligence, providing moment-by-moment conversion probability estimation with actionable insights for sales professionals.

[AI-59] GRASP: Municipal Budget AI Chatbots for Enhancing Civic Engagement

Link: https://arxiv.org/abs/2503.23299
Authors: Jerry Xu, Justin Wang, Joley Leung, Jasmine Gu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:There is a growing number of AI applications, but none tailored specifically to help residents answer their questions about municipal budgets, a topic most are interested in but few have a solid comprehension of. In this research paper, we propose GRASP, a custom AI chatbot framework which stands for Generation with Retrieval and Action System for Prompts. GRASP provides more truthful and grounded responses to user budget queries than traditional information retrieval systems like general Large Language Models (LLMs) or web searches. These improvements come from the novel combination of a Retrieval-Augmented Generation (RAG) framework (“Generation with Retrieval”) and an agentic workflow (“Action System”), as well as prompt engineering techniques, the incorporation of municipal budget domain knowledge, and collaboration with local town officials to ensure response truthfulness. During testing, we found that our GRASP chatbot provided precise and accurate responses for local municipal budget queries 78% of the time, while GPT-4o and Gemini were only accurate 60% and 35% of the time, respectively. GRASP chatbots greatly reduce the time and effort needed for the general public to get an intuitive and correct understanding of their town’s budget, thus fostering greater communal discourse, improving government transparency, and allowing citizens to make more informed decisions.

[AI-60] Two Heads Are Better than One: Model-Weight and Latent-Space Analysis for Federated Learning on Non-iid Data against Poisoning Attacks

Link: https://arxiv.org/abs/2503.23288
Authors: Xingyu Lyu, Ning Wang, Yang Xiao, Shixiong Li, Tao Li, Danjue Chen, Yimin Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated Learning (FL) is a popular paradigm that enables remote clients to jointly train a global model without sharing their raw data. However, FL has been shown to be vulnerable to model poisoning attacks (MPAs) due to its distributed nature. Particularly, attackers acting as participants can upload arbitrary model updates that effectively compromise the global model of FL. While extensive research has focused on fighting these attacks, we find that most existing defenses assume that data at remote clients are iid, while in practice they are inevitably non-iid. Our benchmark evaluations reveal that existing defenses generally fail to live up to their reputation when applied to various non-iid scenarios. In this paper, we propose a novel approach, GeminiGuard, that aims to address such a significant gap. We design GeminiGuard to be lightweight, versatile, and unsupervised so that it aligns well with the practical requirements of deploying such defenses. The key challenge from non-iids is that they make benign model updates look more similar to malicious ones. GeminiGuard is mainly built on two fundamental observations: (1) existing defenses based on either model-weight analysis or latent-space analysis face limitations in covering different MPAs and non-iid scenarios, and (2) model-weight and latent-space analysis are sufficiently different yet potentially complementary methods as MPA defenses. We hence incorporate a novel model-weight analysis component as well as a custom latent-space analysis component in GeminiGuard, aiming to further enhance its defense performance. We conduct extensive experiments to evaluate our defense across various settings, demonstrating its effectiveness in countering multiple types of untargeted and targeted MPAs, including adaptive ones. Our comprehensive evaluations show that GeminiGuard consistently outperforms SOTA defenses under various settings.

[AI-61] Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

Link: https://arxiv.org/abs/2503.23278
Authors: Xinyi Hou, Yanjie Zhao, Shenao Wang, Haoyu Wang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Model Context Protocol (MCP) is a standardized interface designed to enable seamless interaction between AI models and external tools and resources, breaking down data silos and facilitating interoperability across diverse systems. This paper provides a comprehensive overview of MCP, focusing on its core components, workflow, and the lifecycle of MCP servers, which consists of three key phases: creation, operation, and update. We analyze the security and privacy risks associated with each phase and propose strategies to mitigate potential threats. The paper also examines the current MCP landscape, including its adoption by industry leaders and various use cases, as well as the tools and platforms supporting its integration. We explore future directions for MCP, highlighting the challenges and opportunities that will influence its adoption and evolution within the broader AI ecosystem. Finally, we offer recommendations for MCP stakeholders to ensure its secure and sustainable development as the AI landscape continues to evolve.

[AI-62] Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models ICRA2025

Link: https://arxiv.org/abs/2503.23271
Authors: Haonan Chen, Jiaming Xu, Lily Sheng, Tianchen Ji, Shuijing Liu, Yunzhu Li, Katherine Driggs-Campbell
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL. 12 pages, 12 figures, Accepted at ICRA 2025

Abstract:When performing tasks like laundry, humans naturally coordinate both hands to manipulate objects and anticipate how their actions will change the state of the clothes. However, achieving such coordination in robotics remains challenging due to the need to model object movement, predict future states, and generate precise bimanual actions. In this work, we address these challenges by infusing the predictive nature of human manipulation strategies into robot imitation learning. Specifically, we disentangle task-related state transitions from agent-specific inverse dynamics modeling to enable effective bimanual coordination. Using a demonstration dataset, we train a diffusion model to predict future states given historical observations, envisioning how the scene evolves. Then, we use an inverse dynamics model to compute robot actions that achieve the predicted states. Our key insight is that modeling object movement can help learn policies for bimanual coordination manipulation tasks. Evaluating our framework across diverse simulation and real-world manipulation setups, including multimodal goal configurations, bimanual manipulation, deformable objects, and multi-object setups, we find that it consistently outperforms state-of-the-art state-to-action mapping policies. Our method demonstrates a remarkable capacity to navigate multimodal goal configurations and action distributions, maintain stability across different control modes, and synthesize a broader range of behaviors than those present in the demonstration dataset.

[AI-63] Localized Graph-Based Neural Dynamics Models for Terrain Manipulation

Link: https://arxiv.org/abs/2503.23270
Authors: Chaoqi Liu, Yunzhu Li, Kris Hauser
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Predictive models can be particularly helpful for robots to effectively manipulate terrains in construction sites and extraterrestrial surfaces. However, terrain state representations become extremely high-dimensional, especially when capturing fine-resolution details and when depth is unknown or unbounded. This paper introduces a learning-based approach for terrain dynamics modeling and manipulation, leveraging the Graph-based Neural Dynamics (GBND) framework to represent terrain deformation as motion of a graph of particles. Based on the principle that the moving portion of a terrain is usually localized, our approach builds a large terrain graph (potentially millions of particles) but only identifies a very small active subgraph (hundreds of particles) for predicting the outcomes of robot-terrain interaction. To minimize the size of the active subgraph, we introduce a learning-based approach that identifies a small region of interest (RoI) based on the robot’s control inputs and the current scene. We also introduce a novel domain boundary feature encoding that allows GBNDs to perform accurate dynamics prediction in the RoI interior while avoiding particle penetration through RoI boundaries. Our proposed method is orders of magnitude faster than naive GBND and achieves better overall prediction accuracy. We further evaluated our framework on excavation and shaping tasks on terrain with different granularity.
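
The idea of predicting dynamics only on a small active subgraph can be sketched as follows. This is an illustration under stated assumptions: the paper learns its region of interest from control inputs and the scene, whereas the sketch substitutes a fixed-radius disc around a hypothetical tool position; the function name `active_subgraph` and the grid terrain are also made up.

```python
def active_subgraph(particles, edges, tool_xy, radius):
    """Keep particles within `radius` of the tool (in the xy-plane) and the edges among them."""
    active = {
        idx
        for idx, (x, y, _z) in enumerate(particles)
        if (x - tool_xy[0]) ** 2 + (y - tool_xy[1]) ** 2 <= radius ** 2
    }
    kept_edges = [(a, b) for a, b in edges if a in active and b in active]
    return active, kept_edges

def index(i, j, width=10):
    """Flatten 2D grid coordinates into a particle id."""
    return i * width + j

# A 10 x 10 grid of terrain particles, each linked to its right and upper neighbour.
particles = [(float(i), float(j), 0.0) for i in range(10) for j in range(10)]
edges = [(index(i, j), index(i + 1, j)) for i in range(9) for j in range(10)]
edges += [(index(i, j), index(i, j + 1)) for i in range(10) for j in range(9)]

active, kept = active_subgraph(particles, edges, tool_xy=(4.5, 4.5), radius=1.0)
print(len(particles), len(active), len(kept))
```

Only the particles inside the disc would be passed to the dynamics model; the rest of the terrain graph stays untouched, which is where the speed-up described in the abstract would come from.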

[AI-64] Encrypted Prompt: Securing LLM Applications Against Unauthorized Actions

Link: https://arxiv.org/abs/2503.23250
Authors: Shih-Han Chan
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Security threats like prompt injection attacks pose significant risks to applications that integrate Large Language Models (LLMs), potentially leading to unauthorized actions such as API misuse. Unlike previous approaches that aim to detect these attacks on a best-effort basis, this paper introduces a novel method that appends an Encrypted Prompt to each user prompt, embedding current permissions. These permissions are verified before executing any actions (such as API calls) generated by the LLM. If the permissions are insufficient, the LLM’s actions will not be executed, ensuring safety. This approach guarantees that only actions within the scope of the current permissions from the LLM can proceed. In scenarios where adversarial prompts are introduced to mislead the LLM, this method ensures that any unauthorized actions from the LLM are not executed, by verifying permissions in the Encrypted Prompt. Thus, threats like prompt injection attacks that trigger the LLM to generate harmful actions can be effectively mitigated.
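
The verify-before-execute step described in the abstract can be sketched with the Python standard library. Note the substitution: the abstract speaks of an Encrypted Prompt, while this sketch only authenticates the permission list with an HMAC tag instead of encrypting it. The key, the permission names, and both function names are hypothetical.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key-known-only-to-the-application"  # hypothetical key

def seal_permissions(permissions):
    """Serialize the permission list and attach an HMAC-SHA256 tag."""
    payload = json.dumps(sorted(permissions))
    tag = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}

def is_action_allowed(sealed, action):
    """Verify the tag first, then check the requested action against the permissions."""
    expected = hmac.new(SECRET_KEY, sealed["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, sealed["tag"]):
        return False  # the permission block was altered after sealing
    return action in json.loads(sealed["payload"])

sealed = seal_permissions(["read_calendar"])
print(is_action_allowed(sealed, "read_calendar"))  # an action within the permissions
print(is_action_allowed(sealed, "send_email"))     # an action an injected prompt might request

# Tampering with the payload invalidates the tag, so nothing is allowed.
forged = {"payload": json.dumps(["read_calendar", "send_email"]), "tag": sealed["tag"]}
print(is_action_allowed(forged, "send_email"))
```

The check runs outside the model, on the action the LLM proposes, so a misleading prompt can change what the model asks for but not what the application is willing to execute.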

[AI-65] CCCI: Code Completion with Contextual Information for Complex Data Transfer Tasks Using Large Language Models

Link: https://arxiv.org/abs/2503.23231
Authors: Hangzhan Jin, Mohammad Hamdaqa
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: The 29th International Conference on Evaluation and Assessment in Software Engineering

Abstract:Unlike code generation, which involves creating code from scratch, code completion focuses on integrating new lines or blocks of code into an existing codebase. This process requires a deep understanding of the surrounding context, such as variable scope, object models, API calls, and database relations, to produce accurate results. These complex contextual dependencies make code completion a particularly challenging problem. Current models and approaches often fail to effectively incorporate such context, leading to inaccurate completions with low acceptance rates (around 30%). For tasks like data transfer, which rely heavily on specific relationships and data structures, acceptance rates drop even further. This study introduces CCCI, a novel method for generating context-aware code completions specifically designed to address data transfer tasks. By integrating contextual information, such as database table relationships, object models, and library details into Large Language Models (LLMs), CCCI improves the accuracy of code completions. We evaluate CCCI using 289 Java snippets, extracted from over 819 operational scripts in an industrial setting. The results demonstrate that CCCI achieved a 49.1% Build Pass rate and a 41.0% CodeBLEU score, comparable to state-of-the-art methods that often struggle with complex task completion.

[AI-66] Incorporating GNSS Information with LIDAR-Inertial Odometry for Accurate Land-Vehicle Localization

Link: https://arxiv.org/abs/2503.23199
Authors: Jintao Cheng, Bohuan Xue, Shiyang Chen, Qiuchi Xiang, Xiaoyu Tang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Currently, visual odometry and LIDAR odometry are performing well in pose estimation in some typical environments, but they still cannot recover the localization state at high speed or reduce accumulated drifts. In order to solve these problems, we propose a novel LIDAR-based localization framework, which achieves high accuracy and provides robust localization in 3D pointcloud maps with information of multi-sensors. The system integrates global information with LIDAR-based odometry to optimize the localization state. To improve robustness and enable fast resumption of localization, this paper uses offline pointcloud maps for prior knowledge and presents a novel registration method to speed up the convergence rate. The algorithm is tested on various maps of different data sets and has higher robustness and accuracy than other localization algorithms.

[AI-67] Ethereum Price Prediction Employing Large Language Models for Short-term and Few-shot Forecasting

Link: https://arxiv.org/abs/2503.23190
Authors: Eftychia Makri, Georgios Palaiokrassas, Sarah Bouraga, Antigoni Polychroniadou, Leandros Tassiulas
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Cryptocurrencies have transformed financial markets with their innovative blockchain technology and volatile price movements, presenting both challenges and opportunities for predictive analytics. Ethereum, being one of the leading cryptocurrencies, has experienced significant market fluctuations, making its price prediction an attractive yet complex problem. This paper presents a comprehensive study on the effectiveness of Large Language Models (LLMs) in predicting Ethereum prices for short-term and few-shot forecasting scenarios. The main challenge in training models for time series analysis is the lack of data. We address this by leveraging a novel approach that adapts existing pre-trained LLMs on natural language or images from billions of tokens to the unique characteristics of Ethereum price time series data. Through thorough experimentation and comparison with traditional and contemporary models, our results demonstrate that selectively freezing certain layers of pre-trained LLMs achieves state-of-the-art performance in this domain. This approach consistently surpasses benchmarks across multiple metrics, including Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE), demonstrating its effectiveness and robustness. Our research not only contributes to the existing body of knowledge on LLMs but also provides practical insights in the cryptocurrency prediction domain. The adaptability of pre-trained LLMs to handle the nature of Ethereum prices suggests a promising direction for future research, potentially including the integration of sentiment analysis to further refine forecasting accuracy.
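For reference, the three error metrics named in this abstract can be computed as in the following minimal Python sketch. The function name `forecast_errors` and the toy price values are illustrative assumptions and are not taken from the paper.

```python
import math


def forecast_errors(actual, predicted):
    """Return MSE, MAE, and RMSE for two equal-length sequences of prices."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [a - p for a, p in zip(actual, predicted)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    # RMSE is the square root of MSE, so it is in the same unit as the prices.
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse)}


# Toy values for illustration only.
errors = forecast_errors([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because MSE squares each error, a single large miss dominates it, while MAE weights all misses linearly; reporting both, as the abstract does, gives a fuller picture.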

[AI-68] Large Language Models are Unreliable for Cyber Threat Intelligence

Link: https://arxiv.org/abs/2503.23175
Authors: Emanuele Mezzi, Fabio Massacci, Katja Tuma
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Several recent works have argued that Large Language Models (LLMs) can be used to tame the data deluge in the cybersecurity field by improving the automation of Cyber Threat Intelligence (CTI) tasks. This work presents an evaluation methodology that, besides allowing LLMs to be tested on CTI tasks under zero-shot learning, few-shot learning, and fine-tuning, also quantifies their consistency and confidence level. We run experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports and present new evidence of potential security risks in relying on LLMs for CTI. We show that LLMs cannot guarantee sufficient performance on real-size reports while also being inconsistent and overconfident. Few-shot learning and fine-tuning only partially improve the results, casting doubt on the possibility of using LLMs for CTI scenarios, where labelled datasets are lacking and where confidence is a fundamental factor.

[AI-69] AstroAgents: A Multi-Agent AI for Hypothesis Generation from Mass Spectrometry Data

Link: https://arxiv.org/abs/2503.23170
Authors: Daniel Saeedi, Denise Buckner, Jose C. Aponte, Amirali Aghazadeh
Subjects: Artificial Intelligence (cs.AI)

Abstract:With upcoming sample return missions across the solar system and the increasing availability of mass spectrometry data, there is an urgent need for methods that analyze such data within the context of existing astrobiology literature and generate plausible hypotheses regarding the emergence of life on Earth. Hypothesis generation from mass spectrometry data is challenging due to factors such as environmental contaminants, the complexity of spectral peaks, and difficulties in cross-matching these peaks with prior studies. To address these challenges, we introduce AstroAgents, a large language model-based, multi-agent AI system for hypothesis generation from mass spectrometry data. AstroAgents is structured around eight collaborative agents: a data analyst, a planner, three domain scientists, an accumulator, a literature reviewer, and a critic. The system processes mass spectrometry data alongside user-provided research papers. The data analyst interprets the data, and the planner delegates specific segments to the scientist agents for in-depth exploration. The accumulator then collects and deduplicates the generated hypotheses, and the literature reviewer identifies relevant literature using Semantic Scholar. The critic evaluates the hypotheses, offering rigorous suggestions for improvement. To assess AstroAgents, an astrobiology expert evaluated the novelty and plausibility of more than a hundred hypotheses generated from data obtained from eight meteorites and ten soil samples. Of these hypotheses, 36% were identified as plausible, and among those, 66% were novel. Project website: this https URL

[AI-70] Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL

Link: https://arxiv.org/abs/2503.23157
Authors: Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, Sercan Ö. Arik
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Programming Languages (cs.PL)

Abstract:Text-to-SQL is a challenging task involving multiple reasoning-intensive subtasks, including natural language understanding, database schema comprehension, and precise SQL query formulation. Existing approaches often rely on handcrafted reasoning paths with inductive biases that can limit their overall effectiveness. Motivated by the recent success of reasoning-enhanced models such as DeepSeek R1 and OpenAI o1, which effectively leverage reward-driven self-exploration to enhance reasoning capabilities and generalization, we propose a novel set of partial rewards tailored specifically for the Text-to-SQL task. Our reward set includes schema-linking, AI feedback, n-gram similarity, and syntax check, explicitly designed to address the reward sparsity issue prevalent in reinforcement learning (RL). Leveraging group relative policy optimization (GRPO), our approach explicitly encourages large language models (LLMs) to develop intrinsic reasoning skills necessary for accurate SQL query generation. With models of different sizes, we demonstrate that RL-only training with our proposed rewards consistently achieves higher accuracy and superior generalization compared to supervised fine-tuning (SFT). Remarkably, our RL-trained 14B-parameter model significantly outperforms larger proprietary models, e.g. o3-mini by 4% and Gemini-1.5-Pro-002 by 3% on the BIRD benchmark. These highlight the efficacy of our proposed RL-training framework with partial rewards for enhancing both accuracy and reasoning capabilities in Text-to-SQL tasks.
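Two of the partial rewards listed in this abstract, the syntax check and n-gram similarity, can be sketched in Python as below. This is a loose illustration under stated assumptions, not the paper's reward definition: the bigram Jaccard measure, the SQLite-based `EXPLAIN` check, the equal weights, and all function names are hypothetical choices.

```python
import sqlite3


def ngrams(text, n=2):
    """Set of whitespace-token n-grams, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(candidate, reference, n=2):
    """Jaccard overlap of n-gram sets (an assumed similarity measure)."""
    a, b = ngrams(candidate, n), ngrams(reference, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def syntax_ok(sql, schema):
    """True if SQLite can compile the query against an in-memory copy of the schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        conn.execute("EXPLAIN " + sql)  # compiles the statement without fetching rows
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def partial_reward(candidate, reference, schema, w_syntax=0.5, w_ngram=0.5):
    """Weighted sum of the two partial signals; the weights are hypothetical."""
    return (w_syntax * float(syntax_ok(candidate, schema))
            + w_ngram * ngram_similarity(candidate, reference))
```

A query that fails to compile still earns credit for overlapping with the reference, which is the kind of denser signal the abstract says partial rewards are meant to provide when exact-match rewards are sparse.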

[AI-71] Conversational Agents for Older Adults Health: A Systematic Literature Review

Link: https://arxiv.org/abs/2503.23153
Authors: Jiaxin An, Siqi Yi, Yao Lyu, Houjiang Liu, Yan Zhang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 31 pages, 4 figures

Abstract:A vast body of literature studies Conversational Agents (CAs) in facilitating older adults’ health. These vast and diverse studies warrant a comprehensive review that summarizes the main findings and proposes directions for future research, yet few literature reviews have done so from a human-computer interaction (HCI) perspective. In this study, we present a survey of existing studies on CAs for older adults’ health. Through a systematic review of 72 papers, this work reviewed previously studied older adults’ characteristics and analyzed participants’ experiences and expectations of CAs for health. We found that (1) past research shows an increasing interest in chatbots and voice assistants and has applied CAs in multiple roles in older adults’ health; (2) older adults mainly showed low acceptance of CAs for health due to various reasons, such as unstable effects, harm to independence, and privacy concerns; and (3) older adults expect CAs to support multiple functions, to communicate using natural language, to be personalized, and to allow users full control. We also discuss the implications based on the findings.

[AI-72] Agent-Based Modeling and Deep Neural Networks for Establishing Digital Twins of Secure Facilities under Sensing Restrictions

Link: https://arxiv.org/abs/2503.23147
Authors: Chathika Gunaratne, Mason Stott, Debraj De, Gautam Malviya Thakur, Chris Young
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: This paper has already been published in the 2024 Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC’24): this https URL The authors have obtained permission from I/ITSEC’24 organizers to release this paper on arXiv. Appropriate licensing is also applied.

Abstract:Digital twin technologies help practitioners simulate, monitor, and predict undesirable outcomes in-silico, while avoiding the cost and risks of conducting live simulation exercises. Virtual reality (VR) based digital twin technologies are especially useful when monitoring human Patterns of Life (POL) in secure nuclear facilities, where live simulation exercises are too dangerous and costly to ever perform. However, the high-security status of such facilities may restrict modelers from deploying human activity sensors for data collection. This problem was encountered when deploying MetaPOL, a digital twin system to prevent insider threat or sabotage of secure facilities, at a secure nuclear reactor facility at Oak Ridge National Laboratory (ORNL). This challenge was addressed using an agent-based model (ABM), driven by anecdotal evidence of facility personnel POL, to generate synthetic movement trajectories. These synthetic trajectories were then used to train deep neural network surrogates for next location and stay duration prediction to drive NPCs in the VR environment. In this study, we evaluate the efficacy of this technique for establishing NPC movement within MetaPOL and the ability to distinguish NPC movement during normal operations from that during a simulated emergency response. Our results demonstrate the success of using a multi-layer perceptron for next location prediction and a mixture density network for stay duration prediction to predict the ABM-generated trajectories. We also find that NPC movement in the VR environment driven by the deep neural networks under normal operations remains significantly different from that seen when simulating responses to an emergency scenario.

[AI-73] CrossMuSim: A Cross-Modal Framework for Music Similarity Retrieval with LLM-Powered Text Description Sourcing and Mining ICME2025

Link: https://arxiv.org/abs/2503.23128
Authors: Tristan Tsoi, Jiajun Deng, Yaolong Ju, Benno Weck, Holger Kirchhoff, Simon Lui
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by ICME2025

Abstract:Music similarity retrieval is fundamental for managing and exploring relevant content from large collections in streaming platforms. This paper presents a novel cross-modal contrastive learning framework that leverages the open-ended nature of text descriptions to guide music similarity modeling, addressing the limitations of traditional uni-modal approaches in capturing complex musical relationships. To overcome the scarcity of high-quality text-music paired data, this paper introduces a dual-source data acquisition approach combining online scraping and LLM-based prompting, where carefully designed prompts leverage LLMs’ comprehensive music knowledge to generate contextually rich descriptions. Extensive experiments demonstrate that the proposed framework achieves significant performance improvements over existing benchmarks through objective metrics, subjective evaluations, and real-world A/B testing on the Huawei Music streaming platform.

[AI-74] How to safely discard features based on aggregate SHAP values

Link: https://arxiv.org/abs/2503.23111
Authors: Robi Bhattacharjee, Karolin Frohnapfel, Ulrike von Luxburg
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:SHAP is one of the most popular local feature-attribution methods. Given a function f and an input x, it quantifies each feature’s contribution to f(x). Recently, SHAP has been increasingly used for global insights: practitioners average the absolute SHAP values over many data points to compute global feature importance scores, which are then used to discard unimportant features. In this work, we investigate the soundness of this practice by asking whether small aggregate SHAP values necessarily imply that the corresponding feature does not affect the function. Unfortunately, the answer is no: even if the i-th SHAP value is 0 on the entire data support, there exist functions that clearly depend on Feature i. The issue is that computing SHAP values involves evaluating f on points outside of the data support, where f can be strategically designed to mask its dependence on Feature i. To address this, we propose to aggregate SHAP values over the extended support, which is the product of the marginals of the underlying distribution. With this modification, we show that a small aggregate SHAP value implies that we can safely discard the corresponding feature. We then extend our results to KernelSHAP, the most popular method to approximate SHAP values in practice. We show that if KernelSHAP is computed over the extended distribution, a small aggregate value justifies feature removal. This result holds independently of whether KernelSHAP accurately approximates true SHAP values, making it one of the first theoretical results to characterize the KernelSHAP algorithm itself. Our findings have both theoretical and practical implications. We introduce the Shapley Lie algebra, which offers algebraic insights that may enable a deeper investigation of SHAP and we show that randomly permuting each column of the data matrix enables safely discarding features based on aggregate SHAP and KernelSHAP values.
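The practical recipe at the end of this abstract, randomly permuting each column of the data matrix, can be sketched in plain Python as follows. Shuffling each column independently breaks the dependence between features while keeping every marginal intact, so the permuted rows approximate samples from the product of the marginals (the extended support). The function name `permute_columns` and the toy data are illustrative assumptions; computing SHAP values on the permuted rows is left to whichever SHAP implementation is in use.

```python
import random


def permute_columns(rows, seed=0):
    """Shuffle each column independently, returning new rows.

    Each column keeps its own multiset of values (its marginal), but the
    pairing of values across columns is randomized.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]  # copies, so the input is untouched
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


# Toy data matrix: 4 samples, 2 features.
data = [[1, "a"], [2, "b"], [3, "c"], [4, "d"]]
extended = permute_columns(data, seed=1)
```

Aggregate importance scores would then be averaged over `extended` rather than over `data`, following the abstract's claim that small aggregate values on the extended support justify discarding a feature.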

[AI-75] Fast Training of Recurrent Neural Networks with Stationary State Feedbacks

Link: https://arxiv.org/abs/2503.23104
Authors: Paul Caillon (1), Erwan Fagnou (1), Alexandre Allauzen (1 and 2) ((1) Miles Team, LAMSADE, Université Paris Dauphine - PSL, Paris, France, (2) ESPCI PSL, Paris, France)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages (including additional contents), 3 figures, 5 tables, code available at this https URL

Abstract:Recurrent neural networks (RNNs) have recently demonstrated strong performance and faster inference than Transformers at comparable parameter budgets. However, the recursive gradient computation with the backpropagation through time (or BPTT) algorithm remains the major computational bottleneck. In this work, we propose a novel method that replaces BPTT with a fixed gradient feedback mechanism, yielding an efficient approximation of the exact gradient propagation based on the assumption of time stationarity. Our approach leverages state-space model (SSM) principles to define a structured feedback matrix that directly propagates gradients from future time steps. This formulation bypasses the need for recursive gradient backpropagation, significantly reducing training overhead while preserving the network’s ability to capture long-term dependencies. The experiments on language modeling benchmarks exhibit competitive perplexity scores, while significantly reducing the training costs. These promising results suggest that designing a feedback method like an SSM can fully exploit the efficiency advantages of RNNs for many practical applications.

[AI-76] RL2Grid: Benchmarking Reinforcement Learning in Power Grid Operations

Link: https://arxiv.org/abs/2503.23101
Authors: Enrico Marchesini, Benjamin Donnot, Constance Crozier, Ian Dytham, Christian Merz, Lars Schewe, Nico Westerbeck, Cathy Wu, Antoine Marot, Priya L. Donti
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Reinforcement learning (RL) can transform power grid operations by providing adaptive and scalable controllers essential for grid decarbonization. However, existing methods struggle with the complex dynamics, aleatoric uncertainty, long-horizon goals, and hard physical constraints that occur in real-world systems. This paper presents RL2Grid, a benchmark designed in collaboration with power system operators to accelerate progress in grid control and foster RL maturity. Built on a power simulation framework developed by RTE France, RL2Grid standardizes tasks, state and action spaces, and reward structures within a unified interface for a systematic evaluation and comparison of RL approaches. Moreover, we integrate real control heuristics and safety constraints informed by the operators’ expertise to ensure RL2Grid aligns with grid operation requirements. We benchmark popular RL baselines on the grid control tasks represented within RL2Grid, establishing reference performance metrics. Our results and discussion highlight the challenges that power grids pose for RL methods, emphasizing the need for novel algorithms capable of handling real-world physical systems.

[AI-77] Reproducibility Companion Paper: Making Users Indistinguishable: Attribute-wise Unlearning in Recommender Systems

Link: https://arxiv.org/abs/2503.23032
Authors: Yuyuan Li, Junjie Fang, Chaochao Chen, Xiaolin Zheng, Yizhao Zhang, Zhongxuan Han
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:In this paper, we reproduce the experimental results presented in our previous work titled “Making Users Indistinguishable: Attribute-wise Unlearning in Recommender Systems,” which was published in the proceedings of the 31st ACM International Conference on Multimedia. This paper aims to validate the effectiveness of our proposed method and help others reproduce our experimental results. We provide detailed descriptions of our preprocessed datasets, source code structure, configuration file settings, experimental environment, and reproduced experimental results.

[AI-78] Towards Understanding the Optimization Mechanisms in Deep Learning

Link: https://arxiv.org/abs/2503.23016
Authors: Binchuan Qi, Wei Gong, Li Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we adopt a probability distribution estimation perspective to explore the optimization mechanisms of supervised classification using deep neural networks. We demonstrate that, when employing the Fenchel-Young loss, despite the non-convex nature of the fitting error with respect to the model’s parameters, global optimal solutions can be approximated by simultaneously minimizing both the gradient norm and the structural error. The former can be controlled through gradient descent algorithms. For the latter, we prove that it can be managed by increasing the number of parameters and ensuring parameter independence, thereby providing theoretical insights into mechanisms such as over-parameterization and random initialization. Ultimately, the paper validates the key conclusions of the proposed method through empirical results, illustrating its practical effectiveness.

[AI-79] MSNGO: multi-species protein function annotation based on 3D protein structure and network propagation

Link: https://arxiv.org/abs/2503.23014
Authors: Beibei Wang, Boyue Cui, Shiqu Chen, Xuan Wang, Yadong Wang, Junyi Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures

Abstract:Motivation: In recent years, protein function prediction has broken through the bottleneck of sequence features, significantly improving prediction accuracy using high-precision protein structures predicted by AlphaFold2. While single-species protein function prediction methods have achieved remarkable success, multi-species protein function prediction methods are still in the stage of using PPI networks and sequence features. Providing effective cross-species label propagation for species with sparse protein annotations remains a challenging issue. To address this problem, we propose the MSNGO model, which integrates structural features and network propagation methods. Our validation shows that using structural features can significantly improve the accuracy of multi-species protein function prediction. Results: We employ graph representation learning techniques to extract amino acid representations from protein structure contact maps and train a structural model using a graph convolution pooling module to derive protein-level structural features. After incorporating the sequence features from ESM-2, we apply a network propagation algorithm to aggregate information and update node representations within a heterogeneous network. The results demonstrate that MSNGO outperforms previous multi-species protein function prediction methods that rely on sequence features and PPI networks. Availability: this https URL.

[AI-80] Learning Structure-enhanced Temporal Point Processes with Gromov-Wasserstein Regularization

Link: https://arxiv.org/abs/2503.23002
Authors: Qingmei Wang, Fanmeng Wang, Bing Su, Hongteng Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the Web Conference workshop 2025

Abstract:Real-world event sequences are often generated by different temporal point processes (TPPs) and thus have clustering structures. Nonetheless, in the modeling and prediction of event sequences, most existing TPPs ignore the inherent clustering structures of the event sequences, leading to models with unsatisfactory interpretability. In this study, we learn structure-enhanced TPPs with the help of Gromov-Wasserstein (GW) regularization, which imposes clustering structures on the sequence-level embeddings of the TPPs in the maximum likelihood estimation framework. In the training phase, the proposed method leverages a nonparametric TPP kernel to regularize the similarity matrix derived from the sequence embeddings. In large-scale applications, we sample the kernel matrix and implement the regularization as a Gromov-Wasserstein (GW) discrepancy term, which achieves a trade-off between regularity and computational efficiency. The TPPs learned through this method result in clustered sequence embeddings and demonstrate competitive predictive and clustering performance, significantly improving the model interpretability without compromising prediction accuracy.

[AI-81] AuditVotes: A Framework Towards More Deployable Certified Robustness for Graph Neural Networks

Link: https://arxiv.org/abs/2503.22998
Authors: Yuni Lai, Yulin Zhu, Yixuan Sun, Yulun Wu, Bin Xiao, Gaolei Li, Jianhua Li, Kai Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 20 pages

Abstract:Despite advancements in Graph Neural Networks (GNNs), adaptive attacks continue to challenge their robustness. Certified robustness based on randomized smoothing has emerged as a promising solution, offering provable guarantees that a model’s predictions remain stable under adversarial perturbations within a specified range. However, existing methods face a critical trade-off between accuracy and robustness, as achieving stronger robustness requires introducing greater noise into the input graph. This excessive randomization degrades data quality and disrupts prediction consistency, limiting the practical deployment of certifiably robust GNNs in real-world scenarios where both accuracy and robustness are essential. To address this challenge, we propose AuditVotes, the first framework to achieve both high clean accuracy and certifiably robust accuracy for GNNs. It integrates randomized smoothing with two key components, augmentation and conditional smoothing, aiming to improve data quality and prediction consistency. The augmentation, acting as a pre-processing step, de-noises the randomized graph, significantly improving data quality and clean accuracy. The conditional smoothing, serving as a post-processing step, employs a filtering function to selectively count votes, thereby filtering low-quality predictions and improving voting consistency. Extensive experimental results demonstrate that AuditVotes significantly enhances clean accuracy, certified robustness, and empirical robustness while maintaining high computational efficiency. Notably, compared to baseline randomized smoothing, AuditVotes improves clean accuracy by 437.1% and certified accuracy by 409.3% when the attacker can arbitrarily insert 20 edges on the Cora-ML dataset, representing a substantial step toward deploying certifiably robust GNNs in real-world applications.
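The "selectively count votes" step described in this abstract can be illustrated with a small Python sketch. It only shows filtered majority voting over predictions from randomized copies of the input; it does not compute any certificate. The abstract does not specify the filtering function, so the confidence threshold used here, and the name `conditional_vote`, are assumptions.

```python
from collections import Counter


def conditional_vote(predictions, confidences, threshold=0.5):
    """Majority vote over predictions whose confidence passes a filter.

    predictions: labels predicted on independently randomized inputs.
    confidences: one score per prediction (assumed filter signal).
    Returns (winning_label, votes_for_winner), or (None, 0) if every
    prediction is filtered out.
    """
    kept = [label for label, conf in zip(predictions, confidences)
            if conf >= threshold]
    if not kept:
        return None, 0
    label, votes = Counter(kept).most_common(1)[0]
    return label, votes


# Toy example: low-confidence "B" votes are discarded before counting.
labels = ["A", "B", "B", "B", "A"]
scores = [0.9, 0.2, 0.3, 0.4, 0.8]
filtered_result = conditional_vote(labels, scores, threshold=0.5)
unfiltered_result = conditional_vote(labels, scores, threshold=0.0)
```

In the toy example the unfiltered vote is won by the noisy label, while the filtered vote is not, which mirrors the abstract's claim that filtering low-quality predictions improves voting consistency.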

[AI-82] DC-SGD: Differentially Private SGD with Dynamic Clipping through Gradient Norm Distribution Estimation

Link: https://arxiv.org/abs/2503.22988
Authors: Chengkun Wei, Weixian Li, Gong Chen, Wenzhi Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted at IEEE Transactions on Information Forensics and Security

Abstract:Differentially Private Stochastic Gradient Descent (DP-SGD) is a widely adopted technique for privacy-preserving deep learning. A critical challenge in DP-SGD is selecting the optimal clipping threshold C, which involves balancing the trade-off between clipping bias and noise magnitude, incurring substantial privacy and computing overhead during hyperparameter tuning. In this paper, we propose Dynamic Clipping DP-SGD (DC-SGD), a framework that leverages differentially private histograms to estimate gradient norm distributions and dynamically adjust the clipping threshold C. Our framework includes two novel mechanisms: DC-SGD-P and DC-SGD-E. DC-SGD-P adjusts the clipping threshold based on a percentile of gradient norms, while DC-SGD-E minimizes the expected squared error of gradients to optimize C. These dynamic adjustments significantly reduce the burden of tuning the hyperparameter C. Extensive experiments on various deep learning tasks, including image classification and natural language processing, show that our proposed dynamic algorithms achieve up to 9 times acceleration on hyperparameter tuning compared with DP-SGD. DC-SGD-E also achieves an accuracy improvement of 10.62% on CIFAR10 over DP-SGD under the same privacy budget for hyperparameter tuning. We conduct rigorous theoretical privacy and convergence analyses, showing that our methods seamlessly integrate with the Adam optimizer. Our results highlight the robust performance and efficiency of DC-SGD, offering a practical solution for differentially private deep learning with reduced computational overhead and enhanced privacy guarantees.
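The percentile-based variant described in this abstract (DC-SGD-P) can be approximated by the following Python sketch: build a histogram of per-sample gradient norms, add Laplace noise to the counts, and take the bin edge at the requested percentile as the clipping threshold C. This is a simplified illustration, not the paper's algorithm. The bin edges, the Laplace scale of 1/epsilon (which assumes each sample changes exactly one count by 1), and all function names are assumptions.

```python
import math
import random


def laplace_noise(rng, scale):
    # The difference of two exponential draws with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_percentile_threshold(norms, bin_edges, percentile, epsilon, seed=None):
    """Pick a clipping threshold from a noisy histogram of gradient norms."""
    rng = random.Random(seed)
    counts = [0] * (len(bin_edges) - 1)
    for norm in norms:
        idx = len(counts) - 1  # norms at or above the last edge land in the last bin
        for i in range(len(counts)):
            if norm < bin_edges[i + 1]:
                idx = i
                break
        counts[idx] += 1
    # Assumed sensitivity of 1 per count, hence Laplace scale 1 / epsilon.
    noisy = [max(0.0, c + laplace_noise(rng, 1.0 / epsilon)) for c in counts]
    target = percentile / 100.0 * sum(noisy)
    running = 0.0
    for i, count in enumerate(noisy):
        running += count
        if running >= target:
            return bin_edges[i + 1]  # upper edge of the bin reaching the percentile
    return bin_edges[-1]


def clip_gradient(grad, threshold):
    """Rescale a gradient so its L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]
```

Only the noisy counts are used to choose C, so the raw gradient norms never influence the threshold directly; a smaller `epsilon` adds more noise and makes the chosen threshold less stable.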

[AI-83] PartialLoading: User Scheduling and Bandwidth Allocation for Parameter-sharing Edge Inference

Link: https://arxiv.org/abs/2503.22982
Authors: Guanqiao Qu, Qian Chen, Xianhao Chen, Kaibin Huang, Yuguang Fang
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures

Abstract:By provisioning inference offloading services, edge inference drives the rapid growth of AI applications at the network edge. However, achieving high task throughput with stringent latency requirements remains a significant challenge. To address this issue, we develop a parameter-sharing AI model loading (PartialLoading) framework for multi-user edge inference, which exploits two key insights: 1) the majority of latency arises from loading AI models into server GPU memory, and 2) different AI models can share a significant number of parameters, for which redundant loading should be avoided. Towards this end, we formulate a joint multi-user scheduling and spectrum bandwidth allocation problem to maximize task throughput by exploiting shared parameter blocks across models. The intuition is to judiciously schedule user requests to reuse the shared parameter blocks between consecutively loaded models, thereby reducing model loading time substantially. To facilitate solution finding, we decouple the problem into two sub-problems, i.e., user scheduling and bandwidth allocation, showing that solving them sequentially is equivalent to solving the original problem. Due to the NP-hardness of the problem, we first study an important special case called the “bottom-layer-sharing” case, where AI models share some bottom layers within clusters, and design a dynamic programming-based algorithm to obtain the optimal solution in polynomial time. For the general case, where shared parameter blocks appear at arbitrary positions within AI models, we propose a greedy heuristic to obtain the sub-optimal solution efficiently. Simulation results demonstrate that the proposed framework significantly improves task throughput under deadline constraints compared with user scheduling without exploiting parameter sharing.

[AI-84] Enhancing Federated Learning Through Secure Cluster-Weighted Client Aggregation

Link: https://arxiv.org/abs/2503.22971
Authors: Kanishka Ranaweera, Azadeh Ghari Neiat, Xiao Liu, Bipasha Kashyap, Pubudu N. Pathirana
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Federated learning (FL) has emerged as a promising paradigm in machine learning, enabling collaborative model training across decentralized devices without the need for raw data sharing. In FL, a global model is trained iteratively on local datasets residing on individual devices, each contributing to the model’s improvement. However, the heterogeneous nature of these local datasets, stemming from diverse user behaviours, device capabilities, and data distributions, poses a significant challenge. The inherent heterogeneity in federated learning gives rise to various issues, including model performance discrepancies, convergence challenges, and potential privacy concerns. As the global model progresses through rounds of training, the disparities in local data quality and quantity can impede the overall effectiveness of federated learning systems. Moreover, maintaining fairness and privacy across diverse user groups becomes a paramount concern. To address this issue, this paper introduces a novel FL framework, ClusterGuardFL, that employs dissimilarity scores, k-means clustering, and reconciliation confidence scores to dynamically assign weights to client updates. The dissimilarity scores between global and local models guide the formation of clusters, with cluster size influencing the weight allocation. Within each cluster, a reconciliation confidence score is calculated for individual data points, and a softmax layer generates customized weights for clients. These weights are utilized in the aggregation process, enhancing the model’s robustness and privacy. Experimental results demonstrate the efficacy of the proposed approach in achieving improved model performance in diverse datasets.
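The final weighting step described above can be sketched as follows. This is only an illustration under stated assumptions: the per-client scores are taken as given (the abstract does not specify how dissimilarity, cluster size, and reconciliation confidence are combined), and `softmax` and `weighted_aggregate` are hypothetical helper names rather than part of ClusterGuardFL.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_aggregate(updates: List[List[float]], scores: List[float]) -> List[float]:
    """Average client update vectors using softmax(scores) as weights.

    `scores` stands in for whatever per-client confidence the framework
    computes; here it is simply an input (an assumption of this sketch).
    """
    weights = softmax(scores)
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(dim)]
```

When all clients receive the same score, the result reduces to a plain average; a client with a higher score pulls the aggregate toward its own update.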

[AI-85] Student-Powered Digital Scholarship CoLab Project in the HKUST Library: Develop a Chinese Named-Entity Recognition (NER) Tool within One Semester from the Ground Up

Link: https://arxiv.org/abs/2503.22967
Authors: Sherry S.L. Yip, Berry L. Han, Holly H.Y. Chan
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 47 pages. Presented and submitted to DADH2024 conference ( this https URL )

Abstract:Starting in February 2024, the HKUST Library further extended the scope of AI literacy to AI utilization, which focuses on fostering student involvement in utilizing state-of-the-art technologies in projects initiated by the Library, named “Digital Scholarship (DS) CoLab”. A key focus of the DS CoLab scheme has been on cultivating talent and enabling students to utilize advanced technologies in practical contexts. It aims to reinforce the library’s role as a catalyst and hub for fostering multidisciplinary collaboration and to cultivate the “can do spirit” among university members. The Library offers 1-2 projects per year for students to engage with advanced technologies in practical contexts while supporting the Library in tackling challenges and streamlining operational tasks. The tool introduced in this paper was mainly developed by two of the authors, Sherry Yip Sau Lai and Berry Han Liuruo, as part-time student helpers under our DS CoLab scheme in the 2024 Spring Semester (February to May 2024). This paper details the complete journey of developing a Chinese Named-Entity Recognition (NER) Tool from the ground up within one semester, from the initial research and planning stages to execution and delivery of a viable product. The collaborative spirit fostered by this project, with students playing a central role, exemplifies the power and potential of innovative educational models that prioritize hands-on learning with student involvement.

[AI-86] Late Breaking Results: Breaking Symmetry- Unconventional Placement of Analog Circuits using Multi-Level Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2503.22958
Authors: Supriyo Maji, Linran Zhao, Souradip Poddar, David Z. Pan
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 2 pages, 3 figures, Proceedings of the 62nd ACM/IEEE Design Automation Conference (DAC), 2025

Abstract:Layout-dependent effects (LDEs) significantly impact analog circuit performance. Traditionally, designers have relied on symmetric placement of circuit components to mitigate variations caused by LDEs. However, due to the non-linear nature of these effects, conventional methods often fall short. We propose an objective-driven, multi-level, multi-agent Q-learning framework to explore the unconventional design space of analog layout, opening new avenues for optimizing analog circuit performance. Our approach achieves better variation performance than state-of-the-art layout techniques. Notably, this is the first application of multi-agent RL in analog layout automation. The proposed approach is compared with a non-ML approach based on simulated annealing.

[AI-87] DATAWEAVER: Authoring Data-Driven Narratives through the Integrated Composition of Visualization and Text

Link: https://arxiv.org/abs/2503.22946
Authors: Yu Fu, Dennis Bromley, Vidya Setlur
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to EuroVis 2025. Published in Computer Graphics Forum. DOI: https://doi.org/10.1111/cgf.70098

Abstract:Data-driven storytelling has gained prominence in journalism and other data reporting fields. However, the process of creating these stories remains challenging, often requiring the integration of effective visualizations with compelling narratives to form a cohesive, interactive presentation. To help streamline this process, we present an integrated authoring framework and system, DataWeaver, that supports both visualization-to-text and text-to-visualization composition. DataWeaver enables users to create data narratives anchored to data facts derived from “call-out” interactions, i.e., user-initiated highlights of visualization elements that prompt relevant narrative content. In addition to this “vis-to-text” composition, DataWeaver also supports a “text-initiated” approach, generating relevant interactive visualizations from existing narratives. Key findings from an evaluation with 13 participants highlighted the utility and usability of DataWeaver and the effectiveness of its integrated authoring framework. The evaluation also revealed opportunities to enhance the framework by refining filtering mechanisms and visualization recommendations, and to better support authoring creativity by introducing advanced customization options.

[AI-88] Adaptive Interactive Navigation of Quadruped Robots using Large Language Models

Link: https://arxiv.org/abs/2503.22942
Authors: Kangjie Zhou, Yao Mu, Haoyang Song, Yi Zeng, Pengying Wu, Han Gao, Chang Liu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures

Abstract:Robotic navigation in complex environments remains a critical research challenge. Traditional navigation methods focus on optimal trajectory generation within free space, struggling in environments lacking viable paths to the goal, such as disaster zones or cluttered warehouses. To address this gap, we propose an adaptive interactive navigation approach that proactively interacts with environments to create feasible paths to reach originally unavailable goals. Specifically, we present a primitive tree for task planning with large language models (LLMs), facilitating effective reasoning to determine interaction objects and sequences. To ensure robust subtask execution, we adopt reinforcement learning to pre-train a comprehensive skill library containing versatile locomotion and interaction behaviors for motion planning. Furthermore, we introduce an adaptive replanning method featuring two LLM-based modules: an advisor serving as a flexible replanning trigger and an arborist for autonomous plan adjustment. Integrated with the tree structure, the replanning mechanism allows for convenient node addition and pruning, enabling rapid plan modification in unknown environments. Comprehensive simulations and experiments have demonstrated our method’s effectiveness and adaptivity in diverse scenarios. The supplementary video is available at page: this https URL.

[AI-89] Identifying Multi-modal Knowledge Neurons in Pretrained Transformers via Two-stage Filtering

Link: https://arxiv.org/abs/2503.22941
Authors: Yugen Sato, Tomohiro Takagi
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:Recent advances in large language models (LLMs) have led to the development of multimodal LLMs (MLLMs) in the fields of natural language processing (NLP) and computer vision. Although these models allow for integrated visual and language understanding, they present challenges such as opaque internal processing and the generation of hallucinations and misinformation. Therefore, there is a need for a method to clarify the location of knowledge in MLLMs. In this study, we propose a method to identify neurons associated with specific knowledge using MiniGPT-4, a Transformer-based MLLM. Specifically, we extract knowledge neurons through two stages: activation differences filtering using inpainting and gradient-based filtering using GradCAM. Experiments on the image caption generation task using the MS COCO 2017 dataset, BLEU, ROUGE, and BERTScore quantitative evaluation, and qualitative evaluation using an activation heatmap showed that our method is able to locate knowledge with higher accuracy than existing methods. This study contributes to the visualization and explainability of knowledge in MLLMs and shows the potential for future knowledge editing and control.

[AI-90] FairSAM: Fair Classification on Corrupted Data Through Sharpness-Aware Minimization

Link: https://arxiv.org/abs/2503.22934
Authors: Yucong Dai, Jie Ji, Xiaolong Ma, Yongkai Wu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Image classification models trained on clean data often suffer from significant performance degradation when exposed to corrupted test data, such as images with impulse noise, Gaussian noise, or environmental noise. This degradation not only impacts overall performance but also disproportionately affects various demographic subgroups, raising critical algorithmic bias concerns. Although robust learning algorithms like Sharpness-Aware Minimization (SAM) have shown promise in improving overall model robustness and generalization, they fall short in addressing the biased performance degradation across demographic subgroups. Existing fairness-aware machine learning methods - such as fairness constraints and reweighing strategies - aim to reduce performance disparities but hardly maintain robust and equitable accuracy across demographic subgroups when faced with data corruption. This reveals an inherent tension between robustness and fairness when dealing with corrupted data. To address these challenges, we introduce a novel metric specifically designed to assess performance degradation across subgroups under data corruption. Additionally, we propose FairSAM, a new framework that integrates Fairness-oriented strategies into SAM to deliver equalized performance across demographic groups under corrupted conditions. Our experiments on multiple real-world datasets and various predictive tasks show that FairSAM successfully reconciles robustness and fairness, offering a structured solution for equitable and resilient image classification in the presence of data corruption.
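For readers unfamiliar with the baseline that FairSAM builds on, the standard SAM update can be written as below. This is the generic formulation of Sharpness-Aware Minimization, not FairSAM's fairness-oriented modification, which the abstract does not spell out; the symbols (loss L, weights w, neighborhood radius rho, learning rate eta) are the usual notation and are assumed here.

```latex
% SAM objective: minimize the worst-case loss in a rho-ball around w
\min_{w} \; \max_{\lVert \epsilon \rVert_2 \le \rho} L(w + \epsilon)

% First-order approximation of the inner maximizer, then the weight update
\begin{aligned}
\hat{\epsilon}(w_t) &= \rho \, \frac{\nabla_w L(w_t)}{\lVert \nabla_w L(w_t) \rVert_2}, \\
w_{t+1} &= w_t - \eta \, \nabla_w L(w) \big|_{w = w_t + \hat{\epsilon}(w_t)} .
\end{aligned}
```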

[AI-91] Factored Agents: Decoupling In-Context Learning and Memorization for Robust Tool Use

Link: https://arxiv.org/abs/2503.22931
Authors: Nicholas Roth, Christopher Hidey, Lucas Spangher, William F. Arnold, Chang Ye, Nick Masiewicki, Jinoo Baek, Peter Grabowski, Eugene Ie
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we propose a novel factored agent architecture designed to overcome the limitations of traditional single-agent systems in agentic AI. Our approach decomposes the agent into two specialized components: (1) a large language model (LLM) that serves as a high-level planner and in-context learner, which may use dynamically available information in user prompts, and (2) a smaller language model that acts as a memorizer of tool format and output. This decoupling addresses prevalent issues in monolithic designs, including malformed, missing, and hallucinated API fields, as well as suboptimal planning in dynamic environments. Empirical evaluations demonstrate that our factored architecture significantly improves planning accuracy and error resilience, while elucidating the inherent trade-off between in-context learning and static memorization. These findings suggest that a factored approach is a promising pathway for developing more robust and adaptable agentic AI systems.

[AI-92] Predictive Traffic Rule Compliance using Reinforcement Learning ITSC2025

Link: https://arxiv.org/abs/2503.22925
Authors: Yanliang Huang, Sebastian Mair, Zhuoqi Zeng, Amr Alanwar, Matthias Althoff
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures. Preprint submitted to IEEE ITSC 2025

Abstract:Autonomous vehicle path planning has reached a stage where safety and regulatory compliance are crucial. This paper presents a new approach that integrates a motion planner with a deep reinforcement learning model to predict potential traffic rule violations. In this setup, the predictions of the critic directly affect the cost function of the motion planner, guiding the choices of the trajectory. We incorporate key interstate rules from the German Road Traffic Regulation into a rule book and use a graph-based state representation to handle complex traffic information. Our main innovation is replacing the standard actor network in an actor-critic setup with a motion planning module, which ensures both predictable trajectory generation and prevention of long-term rule violations. Experiments on an open German highway dataset show that the model can predict and prevent traffic rule violations beyond the planning horizon, significantly increasing safety in challenging traffic conditions.
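One simple way to picture how a critic's prediction could enter a planner's cost function is an additive penalty on each candidate trajectory. The sketch below is a hypothetical illustration only: the tuple layout, the additive form, the default `violation_weight`, and the function name `select_trajectory` are assumptions, and the paper's actual cost formulation may differ.

```python
from typing import List, Tuple


def select_trajectory(
    candidates: List[Tuple[str, float, float]],
    violation_weight: float = 10.0,
) -> str:
    """Return the id of the candidate with the lowest combined cost.

    Each candidate is (trajectory_id, planner_cost, predicted_violation),
    where predicted_violation stands in for the critic's estimate of future
    rule violations (an assumption of this sketch).
    """
    best = min(candidates, key=lambda c: c[1] + violation_weight * c[2])
    return best[0]
```

With a zero weight the planner's own cost decides; with a large weight a slightly costlier trajectory wins if the critic predicts fewer violations for it.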

[AI-93] Teaching LLMs Music Theory with In-Context Learning and Chain-of-Thought Prompting: Pedagogical Strategies for Machines

Link: https://arxiv.org/abs/2503.22853
Authors: Liam Pond, Ichiro Fujinaga
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, 3 tables. Published in Volume 1 of the Proceedings of the 17th International Conference on Computer Supported Music Education (CSME 2025). Presented on 3 April 2025 in Porto, Portugal

Abstract:This study evaluates the baseline capabilities of Large Language Models (LLMs) like ChatGPT, Claude, and Gemini to learn concepts in music theory through in-context learning and chain-of-thought prompting. Using carefully designed prompts (in-context learning) and step-by-step worked examples (chain-of-thought prompting), we explore how LLMs can be taught increasingly complex material and how pedagogical strategies for human learners translate to educating machines. Performance is evaluated using questions from an official Canadian Royal Conservatory of Music (RCM) Level 6 examination, which covers a comprehensive range of topics, including interval and chord identification, key detection, cadence classification, and metrical analysis. Additionally, we evaluate the suitability of various music encoding formats for these tasks (ABC, Humdrum, MEI, MusicXML). All experiments were run both with and without contextual prompts. Results indicate that without context, ChatGPT with MEI performs the best at 52%, while with context, Claude with MEI performs the best at 75%. Future work will further refine prompts and expand to cover more advanced music theory concepts. This research contributes to the broader understanding of teaching LLMs and has applications for educators, students, and developers of AI music tools alike.

[AI-94] RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation

Link: https://arxiv.org/abs/2503.22851
Authors: Feng Lin, Dong Jae Kim, Zhenhao Li, Jinqiu Yang, Tse-Husn (Peter) Chen
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:When using LLMs to address Non-Functional Requirements (NFRs), developers may behave differently (e.g., expressing the same NFR in different words). Robust LLMs should output consistent results across these variations; however, this aspect remains underexplored. We propose RobuNFR for evaluating the robustness of LLMs in NFR-aware code generation across four NFR dimensions: design, readability, reliability, and performance, using three methodologies: prompt variation, regression testing, and diverse workflows. Our experiments show that RobuNFR reveals robustness issues in the tested LLMs when considering NFRs in code generation. Specifically, under prompt variation, including NFRs leads to a decrease in Pass@1 by up to 39 percent and an increase in the standard deviation from 0.48 to 2.48 compared to the baseline without NFRs (i.e., Function-Only). While incorporating NFRs generally improves overall NFR metrics, it also results in higher prompt sensitivity. In regression settings, some LLMs exhibit differences across versions, with improvements in one aspect (e.g., reduced code smells) often accompanied by regressions in another (e.g., decreased correctness), revealing inconsistencies that challenge their robustness. When varying workflows, the tested LLMs show significantly different NFR-aware code generation capabilities between two workflows: (1) integrating NFRs and functional requirements into the initial prompt and (2) enhancing Function-Only-generated code with the same NFR.
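The two quantities reported under prompt variation, Pass@1 and its standard deviation across prompt variants, can be computed along the following lines. This is a minimal sketch assuming one generated sample per problem and the sample (n-1) standard deviation; the abstract does not state which estimator the authors use, and `pass_at_1` and `prompt_sensitivity` are hypothetical names.

```python
import statistics
from typing import List


def pass_at_1(first_sample_passed: List[bool]) -> float:
    """Pass@1 in percent: share of problems whose single generated sample passes."""
    return 100.0 * sum(first_sample_passed) / len(first_sample_passed)


def prompt_sensitivity(per_variant_results: List[List[bool]]) -> float:
    """Sample standard deviation of Pass@1 across prompt variants.

    Each inner list holds pass/fail outcomes for one phrasing of the same
    requirement; a larger value means results depend more on the wording.
    """
    scores = [pass_at_1(results) for results in per_variant_results]
    return statistics.stdev(scores)
```

A model whose Pass@1 barely moves when the same NFR is rephrased yields a value near zero, which is the behavior the abstract describes as robust.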

[AI-95] Data-driven worker activity recognition and picking efficiency estimation in manual strawberry harvesting

Link: https://arxiv.org/abs/2503.22809
Authors: Uddhav Bhattarai, Rajkishan Arikapudi, Steven A. Fennimore, Frank N Martin, Stavros G. Vougioukas
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Manual fruit harvesting is common in agriculture, but the amount of time that pickers spend on nonproductive activities can make it very inefficient. Accurately identifying picking vs. non-picking activity is crucial for estimating picker efficiency and optimizing labor management and the harvest process. In this study, a practical system was developed to calculate the efficiency of pickers in commercial strawberry harvesting. Instrumented picking carts were used to record in real-time the harvested fruit weight, geo-location, and cart movement. A fleet of these carts was deployed during the commercial strawberry harvest season in Santa Maria, CA. The collected data was then used to train a CNN-LSTM-based deep neural network to classify a picker’s activity into “Pick” and “NoPick” classes. Experimental evaluations showed that the CNN-LSTM model achieved promising activity recognition performance, with an F1 score of up to 0.974. The classification results were then used to compute two worker efficiency metrics: the percentage of time spent actively picking, and the time required to fill a tray. Analysis of the season-long harvest data showed that the pickers spent an average of 73.56% of their total harvest time actively picking strawberries, with an average tray fill time of 6.22 minutes. The mean accuracies of these metrics were 96.29% and 95.42%, respectively. When integrated on a commercial scale, the proposed technology could aid growers in automated worker activity monitoring and harvest optimization, ultimately helping to reduce non-productive time and enhance overall harvest efficiency.
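The two efficiency metrics follow directly from the classifier output and can be sketched as below. The sketch assumes equally spaced samples labeled "Pick" or "NoPick" and tray intervals given as (start, end) times in minutes; these input formats and the function names are illustrative assumptions, not the authors' data layout.

```python
from typing import List, Tuple


def percent_time_picking(labels: List[str]) -> float:
    """Percentage of equally spaced samples classified as "Pick"."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return 100.0 * sum(1 for lab in labels if lab == "Pick") / len(labels)


def mean_tray_fill_minutes(tray_intervals: List[Tuple[float, float]]) -> float:
    """Mean duration of (start_minute, end_minute) tray-filling intervals."""
    if not tray_intervals:
        raise ValueError("tray_intervals must be non-empty")
    return sum(end - start for start, end in tray_intervals) / len(tray_intervals)
```

For example, three "Pick" samples followed by one "NoPick" sample correspond to 75% of the time spent picking.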

[AI-96] Post-Incorporating Code Structural Knowledge into LLMs via In-Context Learning for Code Translation

Link: https://arxiv.org/abs/2503.22776
Authors: Yali Du, Hui Sun, Ming Li
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Code translation migrates codebases across programming languages. Recently, large language models (LLMs) have achieved significant advancements in software mining. However, handling the syntactic structure of source code remains a challenge. Classic syntax-aware methods depend on intricate model architectures and loss functions, rendering their integration into LLM training resource-intensive. This paper employs in-context learning (ICL), which directly integrates task exemplars into the input context, to post-incorporate code structural knowledge into pre-trained LLMs. We revisit exemplar selection in ICL from an information-theoretic perspective, proposing that list-wise selection based on information coverage is a more precise and general objective than traditional methods based on combining similarity and diversity. To address the challenges of quantifying information coverage, we introduce a surrogate measure, Coverage of Abstract Syntax Tree (CAST). Furthermore, we formulate the NP-hard CAST maximization for exemplar selection and prove that it is a standard submodular maximization problem. Therefore, we propose a greedy algorithm for CAST submodular maximization, which theoretically guarantees a (1-1/e)-approximate solution in polynomial time. Our method is the first training-free and model-agnostic approach to post-incorporate code structural knowledge into existing LLMs at test time. Experimental results show that our method significantly improves LLM performance and reveals two meaningful insights: 1) Code structural knowledge can be effectively post-incorporated into pre-trained LLMs during inference, despite being overlooked during training; 2) Scaling up model size or training data does not lead to the emergence of code structural knowledge, underscoring the necessity of explicitly considering code syntactic structure.
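The greedy strategy for monotone submodular maximization mentioned above can be illustrated with a plain set-coverage objective. This is a sketch under simplifying assumptions, not the paper's CAST measure: each exemplar is reduced to a set of AST node-type names, coverage is the count of target node types hit, and `greedy_cover` and the toy data in the usage note are hypothetical.

```python
from typing import Dict, List, Set


def greedy_cover(candidates: Dict[str, Set[str]], target: Set[str], k: int) -> List[str]:
    """Greedily pick up to k candidates maximizing coverage of `target`.

    Set coverage is monotone submodular, so under a cardinality constraint
    the greedy choice reaches at least (1 - 1/e) of the optimal coverage.
    """
    covered: Set[str] = set()
    chosen: List[str] = []
    for _ in range(k):
        best_name, best_gain = None, 0
        for name in sorted(candidates):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            # Marginal gain: target items this candidate adds beyond what is covered.
            gain = len((candidates[name] & target) - covered)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining candidate adds coverage
            break
        chosen.append(best_name)
        covered |= candidates[best_name] & target
    return chosen
```

Given a query whose AST contains `For`, `If`, `Call`, `Return`, and `While` nodes, the function first selects the exemplar covering the most of these, then the one that adds the most not yet covered, and stops early once nothing new can be added.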

[AI-97] GroundHog: Revolutionizing GLDAS Groundwater Storage Downscaling for Enhanced Recharge Estimation in Bangladesh

Link: https://arxiv.org/abs/2503.22771
Authors: Saleh Sakib Ahmed, Rashed Uz Zzaman, Saifur Rahman Jony, Faizur Rahman Himel, Afroza Sharmin, A.H.M. Khalequr Rahman, M. Sohel Rahman, Sara Nowreen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-term groundwater level (GWL) measurement is vital for effective policymaking and recharge estimation using annual maxima and minima. However, current methods prioritize short-term predictions and lack multi-year applicability, limiting their utility. Moreover, sparse in-situ measurements lead to reliance on low-resolution satellite data like GLDAS as the ground truth for Machine Learning models, further constraining accuracy. To overcome these challenges, we first develop an ML model to mitigate data gaps, achieving R^2 scores of 0.855 and 0.963 for maximum and minimum GWL predictions, respectively. Subsequently, using these predictions and well observations as ground truth, we train an Upsampling Model that uses low-resolution (25 km) GLDAS data as input to produce high-resolution (2 km) GWLs, achieving an excellent R^2 score of 0.96. Our approach successfully upscales GLDAS data for 2003-2024, allowing high-resolution recharge estimations and revealing critical trends for proactive resource management. Our method allows upsampling of groundwater storage (GWS) from GLDAS to high-resolution GWLs for any points independently of officially curated piezometer data, making it a valuable tool for decision-making.

[AI-98] MediTools – Medical Education Powered by LLMs

Link: https://arxiv.org/abs/2503.22769
Authors: Amr Alshatnawi, Remi Sampaleanu, David Liebovitz
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 19 pages, 17 figures, 2 tables. Code available at this https URL

Abstract:Artificial Intelligence (AI) has been advancing rapidly and with the advent of large language models (LLMs) in late 2022, numerous opportunities have emerged for adopting this technology across various domains, including medicine. These innovations hold immense potential to revolutionize and modernize medical education. Our research project leverages large language models to enhance medical education and address workflow challenges through the development of MediTools - AI Medical Education. This prototype application focuses on developing interactive tools that simulate real-life clinical scenarios, provide access to medical literature, and keep users updated with the latest medical news. Our first tool is a dermatology case simulation tool that uses real patient images depicting various dermatological conditions and enables interaction with LLMs acting as virtual patients. This platform allows users to practice their diagnostic skills and enhance their clinical decision-making abilities. The application also features two additional tools: an AI-enhanced PubMed tool for engaging with LLMs to gain deeper insights into research papers, and a Google News tool that offers LLM generated summaries of articles for various medical specialties. A comprehensive survey has been conducted among medical professionals and students to gather initial feedback on the effectiveness and user satisfaction of MediTools, providing insights for further development and refinement of the application. This research demonstrates the potential of AI-driven tools in transforming and revolutionizing medical education, offering a scalable and interactive platform for continuous learning and skill development.

[AI-99] The Cost of Local and Global Fairness in Federated Learning

Link: https://arxiv.org/abs/2503.22762
Authors: Yuying Duan, Gelei Xu, Yiyu Shi, Michael Lemmon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:With the emerging application of Federated Learning (FL) in finance, hiring and healthcare, FL models are regulated to be fair, preventing disparities with respect to legally protected attributes such as race or gender. Two concepts of fairness are important in FL: global and local fairness. Global fairness addresses the disparity across the entire population and local fairness is concerned with the disparity within each client. Prior fair FL frameworks have improved either global or local fairness without considering both. Furthermore, while the majority of studies on fair FL focus on binary settings, many real-world applications are multi-class problems. This paper proposes a framework that investigates the minimum accuracy lost for enforcing a specified level of global and local fairness in multi-class FL settings. Our framework leads to a simple post-processing algorithm that derives fair outcome predictors from the Bayesian optimal score functions. Experimental results show that our algorithm outperforms the current state of the art (SOTA) with regard to accuracy-fairness trade-offs and computational and communication costs. Codes are available at: this https URL .

[AI-100] Data Poisoning in Deep Learning: A Survey

Link: https://arxiv.org/abs/2503.22759
Authors: Pinlong Zhao, Weiyao Zhu, Pengfei Jiao, Di Gao, Ou Wu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning has become a cornerstone of modern artificial intelligence, enabling transformative applications across a wide range of domains. As the core element of deep learning, the quality and security of training data critically influence model performance and reliability. However, during the training process, deep learning models face the significant threat of data poisoning, where attackers introduce maliciously manipulated training data to degrade model accuracy or lead to anomalous behavior. While existing surveys provide valuable insights into data poisoning, they generally adopt a broad perspective, encompassing both attacks and defenses, but lack a dedicated, in-depth analysis of poisoning attacks specifically in deep learning. In this survey, we bridge this gap by presenting a comprehensive and targeted review of data poisoning in deep learning. First, this survey categorizes data poisoning attacks across multiple perspectives, providing an in-depth analysis of their characteristics and underlying design principles. Second, the discussion is extended to the emerging area of data poisoning in large language models (LLMs). Finally, we explore critical open challenges in the field and propose potential research directions to advance the field further. To support further exploration, an up-to-date repository of resources on data poisoning in deep learning is available at this https URL.

[AI-101] Towards an intelligent assessment system for evaluating the development of algorithmic thinking skills: An exploratory study in Swiss compulsory schools

Link: https://arxiv.org/abs/2503.22756
Authors: Giorgia Adorni
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The rapid digitalisation of contemporary society has profoundly impacted various facets of our lives, including healthcare, communication, business, and education. The ability to engage with new technologies and solve problems has become crucial, making computational thinking (CT) skills, such as pattern recognition, decomposition, and algorithm design, essential competencies. In response, Switzerland is conducting research and initiatives to integrate CT into its educational system. This study aims to develop a comprehensive framework for large-scale assessment of CT skills, particularly focusing on algorithmic thinking (AT), the ability to design algorithms. To achieve this, we first developed a competence model capturing the situated and developmental nature of CT, guiding the design of activities tailored to cognitive abilities, age, and context. This framework clarifies how activity characteristics influence CT development and how to assess these competencies. Additionally, we developed an activity for large-scale assessment of AT skills, offered in two variants: one based on non-digital artefacts (unplugged) and manual expert assessment, and the other based on digital artefacts (virtual) and automatic assessment. To provide a more comprehensive evaluation of students’ competencies, we developed an intelligent assessment system (IAS) based on Bayesian networks (BNs) with noisy gates, which offers real-time probabilistic assessment for each skill rather than a single overall score. The results indicate that the proposed instrument can measure AT competencies across different age groups and educational contexts in Switzerland, demonstrating its applicability for large-scale use. AT competencies exhibit a progressive development, with no overall gender differences, though variations are observed at the school level, significantly influenced by the artefact-based environment and its context, underscoring the importance of creating accessible and adaptable assessment tools.
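As background on the "noisy gates" mentioned above, the sketch below shows the most common such gate, the noisy-OR, as used in Bayesian networks. It assumes that "BNs" denotes Bayesian networks and that a noisy-OR is a fair stand-in for the gates in the study; the abstract does not say which gate type is used, and `noisy_or`, `link_probs`, and `leak` are illustrative names.

```python
from typing import Sequence


def noisy_or(active: Sequence[bool], link_probs: Sequence[float], leak: float = 0.0) -> float:
    """P(effect = 1) under a noisy-OR gate.

    Each active cause i independently triggers the effect with probability
    link_probs[i]; `leak` is the probability that the effect occurs even
    when no cause is active.
    """
    if len(active) != len(link_probs):
        raise ValueError("active and link_probs must have the same length")
    p_not = 1.0 - leak  # probability that nothing has triggered the effect so far
    for is_on, p in zip(active, link_probs):
        if is_on:
            p_not *= 1.0 - p
    return 1.0 - p_not
```

The appeal of such gates is that a node with n parents needs only n link probabilities (plus an optional leak) instead of a full table with 2^n rows, which keeps per-skill probabilistic assessment tractable.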

[AI-102] Reasoning Under Threat: Symbolic and Neural Techniques for Cybersecurity Verification

Link: https://arxiv.org/abs/2503.22755
Authors: Sarah Veronica
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cybersecurity demands rigorous and scalable techniques to ensure system correctness, robustness, and resilience against evolving threats. Automated reasoning, encompassing formal logic, theorem proving, model checking, and symbolic analysis, provides a foundational framework for verifying security properties across diverse domains such as access control, protocol design, vulnerability detection, and adversarial modeling. This survey presents a comprehensive overview of the role of automated reasoning in cybersecurity, analyzing how logical systems, including temporal, deontic, and epistemic logics, are employed to formalize and verify security guarantees. We examine state-of-the-art tools and frameworks, explore integrations with AI for neural-symbolic reasoning, and highlight critical research gaps, particularly in scalability, compositionality, and multi-layered security modeling. The paper concludes with a set of well-grounded future research directions, aiming to foster the development of secure systems through formal, automated, and explainable reasoning techniques.

[AI-103] Model Lake: a New Alternative for Machine Learning Models Management and Governance

Link: https://arxiv.org/abs/2503.22754
Authors: Moncef Garouani, Franck Ravat, Nathalie Valles-Parlangeau
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract: The rise of artificial intelligence and data science across industries underscores the pressing need for effective management and governance of machine learning (ML) models. Traditional approaches to ML model management often involve disparate storage systems and lack standardized methodologies for versioning, audit, and re-use. Inspired by data lake concepts, this paper develops the concept of the ML Model Lake as a centralized management framework for datasets, code, and models within organizational environments. We provide an in-depth exploration of the Model Lake concept, delineating its architectural foundations, key components, operational benefits, and practical challenges. We discuss the transformative potential of adopting a Model Lake approach, such as enhanced model lifecycle management, discovery, audit, and reusability. Furthermore, we illustrate a real-world application of Model Lake and its transformative impact on data, code, and model management practices.

[AI-104] From Individual to Group: Developing a Context-Aware Multi-Criteria Group Recommender System

Link: https://arxiv.org/abs/2503.22752
Authors: Ngoc Luyen Le (Heudiasyc), Marie-Hélène Abel (Heudiasyc)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: The 16th International Conference on Management of Digital EcoSystems, Nov 2024, Naples, Italy

Abstract:Group decision-making is becoming increasingly common in areas such as education, dining, travel, and finance, where collaborative choices must balance diverse individual preferences. While conventional recommender systems are effective in personalization, they fall short in group settings due to their inability to manage conflicting preferences, contextual factors, and multiple evaluation criteria. This study presents the development of a Context-Aware Multi-Criteria Group Recommender System (CA-MCGRS) designed to address these challenges by integrating contextual factors and multiple criteria to enhance recommendation accuracy. By leveraging a Multi-Head Attention mechanism, our model dynamically weighs the importance of different features. Experiments conducted on an educational dataset with varied ratings and contextual variables demonstrate that CA-MCGRS consistently outperforms other approaches across four scenarios. Our findings underscore the importance of incorporating context and multi-criteria evaluations to improve group recommendations, offering valuable insights for developing more effective group recommender systems.

[AI-105] Advancing Spatiotemporal Prediction using Artificial Intelligence: Extending the Framework of Geographically and Temporally Weighted Neural Network (GTWNN) for Differing Geographical and Temporal Contexts

Link: https://arxiv.org/abs/2503.22751
Authors: Nicholas Robert Fisk, Matthew Ng Kok Ming, Zahratu Shabrina
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: This paper aims to improve predictive crime models by extending the mathematical framework of Artificial Neural Networks (ANNs) tailored to general spatiotemporal problems and applying them appropriately. Recent advancements in the geospatial-temporal modelling field have focused on the inclusion of geographical weighting in deep learning models to account for spatial non-stationarity, which is often apparent in spatial data. We formulate a novel semi-analytical approach to solving Geographically and Temporally Weighted Regression (GTWR) and apply it to London crime data. The results produce high-accuracy predictive evaluation scores that affirm the validity of the assumptions and approximations in the approach. This paper presents mathematical advances to the Geographically and Temporally Weighted Neural Network (GTWNN) framework, which offers a novel contribution to the field. Insights from past literature are combined with the assumptions and approximations to generate three mathematical extensions to GTWNN’s framework. Combinations of these extensions produce five novel ANNs, applied to the London and Detroit datasets. The results suggest that one of the extensions is redundant and is generally surpassed by another extension, which we term the history-dependent module. The remaining extensions form three novel ANN designs that pose potential GTWNN improvements. We evaluated the efficacy of various models on both the London and Detroit crime datasets, highlighting the importance of accounting for specific geographic and temporal characteristics when selecting modelling strategies to improve model suitability. In general, the proposed methods provide the foundations for a more context-aware, accurate, and robust ANN approach in spatiotemporal modelling.

[AI-106] Adaptive Clipping for Privacy-Preserving Few-Shot Learning: Enhancing Generalization with Limited Data

Link: https://arxiv.org/abs/2503.22749
Authors: Kanishka Ranaweera, Dinh C. Nguyen, Pubudu N. Pathirana, David Smith, Ming Ding, Thierry Rakotoarivelo, Aruna Seneviratne
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: In the era of data-driven machine-learning applications, privacy concerns and the scarcity of labeled data have become paramount challenges. These challenges are particularly pronounced in the domain of few-shot learning, where the ability to learn from limited labeled data is crucial. Privacy-preserving few-shot learning algorithms have emerged as a promising solution to address these challenges. However, it is well known that privacy-preserving techniques often lead to a drop in utility due to the fundamental trade-off between data privacy and model performance. To enhance the utility of privacy-preserving few-shot learning methods, we introduce a novel approach called Meta-Clip. This technique is specifically designed for meta-learning algorithms, including Differentially Private (DP) model-agnostic meta-learning, DP-Reptile, and DP-MetaSGD algorithms, with the objective of balancing data privacy preservation with learning capacity maximization. By dynamically adjusting clipping thresholds during the training process, our Adaptive Clipping method provides fine-grained control over the disclosure of sensitive information, mitigating overfitting on small datasets and significantly improving the generalization performance of meta-learning models. Through comprehensive experiments on diverse benchmark datasets, we demonstrate the effectiveness of our approach in minimizing utility degradation, showcasing a superior privacy-utility trade-off compared to existing privacy-preserving techniques. The adoption of Adaptive Clipping represents a substantial step forward in the field of privacy-preserving few-shot learning, empowering the development of secure and accurate models for real-world applications, especially in scenarios where data availability is limited.
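
The abstract does not spell out how Meta-Clip adjusts its thresholds, so the following is only a minimal sketch of the general idea of adaptive clipping in differentially private training, not the paper's algorithm. It assumes a quantile-tracking geometric update of the threshold; all function names and parameters (`target_quantile`, `step`, `noise_multiplier`) are illustrative, and a real DP implementation would also add noise to the quantile estimate, which is omitted here.

```python
import math
import random


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def clip_to_norm(grad, threshold):
    """Scale grad down so that its L2 norm is at most threshold."""
    norm = l2_norm(grad)
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_example_grads, threshold, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_to_norm(g, threshold) for g in per_example_grads]
    count = len(clipped)
    dim = len(clipped[0])
    sigma = noise_multiplier * threshold  # noise scales with the clipping bound
    return [
        (sum(g[i] for g in clipped) + rng.gauss(0.0, sigma)) / count
        for i in range(dim)
    ]


def update_threshold(threshold, per_example_grads, target_quantile=0.5, step=0.2):
    """Assumed update rule: grow the threshold when too few gradient norms fall
    below it, shrink it when too many do (the fraction is NOT noised here)."""
    below = sum(l2_norm(g) <= threshold for g in per_example_grads)
    fraction_below = below / len(per_example_grads)
    return threshold * math.exp(-step * (fraction_below - target_quantile))


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4], [1.0, 0.0]]
    threshold = 1.0
    for _ in range(3):
        update = noisy_mean_gradient(grads, threshold, noise_multiplier=1.0, rng=rng)
        threshold = update_threshold(threshold, grads)
    print(update, threshold)
```

A fixed threshold that is too small biases every gradient, while one that is too large inflates the noise; tracking a quantile of the observed norms is one common way to trade these off.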

[AI-107] Ignite Forecasting with SPARK: An Efficient Generative Framework for Refining LLMs in Temporal Knowledge Graph Forecasting DASFAA2025

Link: https://arxiv.org/abs/2503.22748
Authors: Gongzhu Yin, Hongli Zhang, Yi Luo, Yuchen Yang, Kun Lu, Chao Meng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To be published in the 30th International Conference on Database Systems for Advanced Applications (DASFAA 2025)

Abstract:Temporal Knowledge Graph (TKG) forecasting is crucial for predicting future events using historical data. With the surge of Large Language Models (LLMs), recent studies have begun exploring their integration into TKG forecasting and achieved some success. However, they still face limitations such as limited input length, inefficient output generation, and resource-intensive refinement, which undermine their performance and practical applicability. To address these limitations, we introduce SPARK, a Sequence-level Proxy-Adapting framework for Refining LLMs in TKG forecasting. Inspired by inference-time algorithms adopted in controlling generation, SPARK offers a cost-effective, plug-and-play solution through two key innovations: (1) Beam Sequence-Level Generation, which reframes TKG forecasting as a top-K sequence-level generation task, using beam search for efficiently generating next-entity distribution in a single forward pass. (2) TKG Adapter for Refinement, which employs traditional TKG models as trainable proxy adapters to leverage global graph information and refine LLM outputs, overcoming both the input length and the resource-intensive fine-tuning problems. Experiments across diverse datasets validate SPARK’s forecasting performance, robust generalization capabilities, and high efficiency. We release source codes at this https URL.

[AI-108] LeForecast: Enterprise Hybrid Forecast by Time Series Intelligence

Link: https://arxiv.org/abs/2503.22747
Authors: Zheng Tan, Yiwen Nie, Wenfa Wu, Guanyu Zhang, Yanze Liu, Xinyuan Tian, Kailin Gao, Mengya Liu, Qijiang Cheng, Haipeng Jiang, Yingzheng Ma, Wei Zheng, Yuci Zhu, Yuanyuan Sun, Xiangyu Lei, Xiyu Guan, Wanqing Huang, Shouming Liu, Xiangquan Meng, Pengzhan Qu, Chao Yang, Jiaxuan Fan, Yuan He, Hongsheng Qi, Yangzhou Du
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)

Abstract: Demand is spiking in industrial fields for multidisciplinary forecasting, where a broad spectrum of sectors needs planning and forecasts to streamline intelligent business management, such as demand forecasting, product planning, and inventory optimization. Specifically, these tasks expect intelligent approaches to learn from sequentially collected historical data and then foresee the most likely trend, i.e., time series forecasting. The challenge lies in interpreting complex business contexts and in the efficiency and generalisation of modelling. Motivated by the remarkable success of large pre-trained foundation models across many tasks, we present LeForecast, an enterprise intelligence platform tailored for time series tasks. It integrates advanced interpretations of time series data and multi-source information, and a three-pillar modelling engine combining a large foundation model (Le-TSFM), a multimodal model, and a hybrid model to derive insights, predict or infer futures, and then drive optimisation across multiple sectors in enterprise operations. The framework is composed of a model pool, a model profiling module, and two different fusion approaches regarding original model architectures. Experimental results verify the efficiency of our trial fusion concepts, a router-based fusion network and the coordination of large and small models, which target the high costs of redundant model development and maintenance. This work reviews the deployment of LeForecast and its performance in three industrial use cases. Our comprehensive experiments indicate that LeForecast is a practical platform with efficient and competitive performance. We hope that this work can inform the research and grounding of time series techniques in accelerating enterprise.

[AI-109] CSPO: Cross-Market Synergistic Stock Price Movement Forecasting with Pseudo-volatility Optimization

Link: https://arxiv.org/abs/2503.22740
Authors: Sida Lin, Yankai Chen, Yiyan Qi, Chenhao Ma, Bokai Cao, Yifei Zhang, Xue Liu, Jian Guo
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The stock market, as a cornerstone of the financial markets, places forecasting stock price movements at the forefront of challenges in quantitative finance. Emerging learning-based approaches have made significant progress in capturing the intricate and ever-evolving data patterns of modern markets. With the rapid expansion of the stock market, it presents two characteristics, i.e., stock exogeneity and volatility heterogeneity, that heighten the complexity of price forecasting. Specifically, while stock exogeneity reflects the influence of external market factors on price movements, volatility heterogeneity showcases the varying difficulty in movement forecasting against price fluctuations. In this work, we introduce the framework of Cross-market Synergy with Pseudo-volatility Optimization (CSPO). Specifically, CSPO implements an effective deep neural architecture to leverage external futures knowledge. This enriches stock embeddings with cross-market insights and thus enhances the CSPO’s predictive capability. Furthermore, CSPO incorporates pseudo-volatility to model stock-specific forecasting confidence, enabling a dynamic adaptation of its optimization process to improve accuracy and robustness. Our extensive experiments, encompassing industrial evaluation and public benchmarking, highlight CSPO’s superior performance over existing methods and effectiveness of all proposed modules contained therein.

[AI-110] Cyborg Data: Merging Human with AI Generated Training Data

Link: https://arxiv.org/abs/2503.22736
Authors: Kai North, Christopher Ormerod
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Automated scoring (AS) systems used in large-scale assessment have traditionally used small statistical models that require a large quantity of hand-scored data to make accurate predictions, which can be time-consuming and costly. Generative Large Language Models are trained on many tasks and have shown impressive abilities to generalize to new tasks with little to no data. While these models require substantially more computational power to make predictions, they still require some fine-tuning to meet operational standards. Evidence suggests that these models can exceed human-human levels of agreement even when fine-tuned on small amounts of data. With this in mind, we propose a model distillation pipeline in which a large generative model, a Teacher, teaches a much smaller model, a Student. The Teacher, trained on a small subset of the training data, is used to provide scores on the remaining training data, which is then used to train the Student. We call the resulting dataset “Cyborg Data”, as it combines human and machine-scored responses. Our findings show that Student models trained on “Cyborg Data” show performance comparable to training on the entire dataset, while only requiring 10% of the original hand-scored data.
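
The data flow of the distillation pipeline described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `train_teacher` is a hypothetical callable that fine-tunes a Teacher on the hand-scored subset and returns a scoring function, and `hand_fraction=0.1` mirrors the 10% figure quoted in the abstract.

```python
import random


def build_cyborg_dataset(responses, human_scores, train_teacher,
                         hand_fraction=0.1, seed=0):
    """Combine a small hand-scored subset with Teacher-scored remaining data.

    responses:     list of response texts
    human_scores:  list of human scores (only read for the sampled subset)
    train_teacher: hypothetical callable (texts, scores) -> scoring function
    """
    indices = list(range(len(responses)))
    random.Random(seed).shuffle(indices)
    n_hand = max(1, round(hand_fraction * len(indices)))
    hand, rest = indices[:n_hand], indices[n_hand:]

    # The Teacher only ever sees the hand-scored subset.
    teacher = train_teacher([responses[i] for i in hand],
                            [human_scores[i] for i in hand])

    # Human-scored rows keep their labels; the rest get machine scores.
    dataset = [(responses[i], human_scores[i], "human") for i in hand]
    dataset += [(responses[i], teacher(responses[i]), "teacher") for i in rest]
    return dataset


if __name__ == "__main__":
    texts = [f"response number {i} " + "word " * i for i in range(20)]
    scores = [i % 4 for i in range(20)]

    def toy_teacher(train_texts, train_scores):
        # Stand-in for fine-tuning a large model: score by length bucket.
        return lambda text: min(3, len(text.split()) // 6)

    cyborg = build_cyborg_dataset(texts, scores, toy_teacher)
    print(sum(1 for _, _, src in cyborg if src == "human"), "human-scored rows")
```

The Student would then be trained on the returned list exactly as it would be on a fully hand-scored dataset.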

[AI-111] Zero-Shot LLMs in Human-in-the-Loop RL: Replacing Human Feedback for Reward Shaping

Link: https://arxiv.org/abs/2503.22723
Authors: Mohammad Saif Nazir, Chayan Banerjee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 2 figures, 5 tables

Abstract: Reinforcement learning often faces challenges with reward misalignment, where agents optimize for given rewards but fail to exhibit the desired behaviors. This occurs when the reward function incentivizes proxy behaviors that diverge from the true objective. While human-in-the-loop (HIL) methods can help, they may exacerbate the problem, as humans are prone to biases that lead to inconsistent, subjective, or misaligned feedback, complicating the learning process. To address these issues, we propose two key contributions. First, we extend the use of zero-shot, off-the-shelf large language models (LLMs) for reward shaping beyond natural language processing (NLP) to continuous control tasks. By leveraging LLMs as direct feedback providers, we replace surrogate models trained on human feedback, which often suffer from the bias inherent in the feedback data they are trained on. Second, we introduce a hybrid framework (LLM-HFBF) that enables LLMs to identify and correct biases in human feedback while incorporating this feedback into the reward shaping process. The LLM-HFBF framework creates a more balanced and reliable system by addressing both the limitations of LLMs (e.g., lack of domain-specific knowledge) and human supervision (e.g., inherent biases). By enabling human feedback bias flagging and correction, our approach improves reinforcement learning performance and reduces reliance on potentially biased human guidance. Empirical experiments show that biased human feedback significantly reduces performance, with average episodic reward (AER) dropping from 28.472 (unbiased approaches) to 7.039 (biased with conservative bias). In contrast, LLM-based approaches maintain an AER matching that of unbiased feedback, even in custom edge-case scenarios.

[AI-112] Why Representation Engineering Works: A Theoretical and Empirical Study in Vision-Language Models

Link: https://arxiv.org/abs/2503.22720
Authors: Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, Ang Li
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Representation Engineering (RepE) has emerged as a powerful paradigm for enhancing AI transparency by focusing on high-level representations rather than individual neurons or circuits. It has proven effective in improving interpretability and control, showing that representations can emerge, propagate, and shape final model outputs in large language models (LLMs). However, in Vision-Language Models (VLMs), visual input can override factual linguistic knowledge, leading to hallucinated responses that contradict reality. To address this challenge, we make the first attempt to extend RepE to VLMs, analyzing how multimodal representations are preserved and transformed. Building on our findings and drawing inspiration from successful RepE applications, we develop a theoretical framework that explains the stability of neural activity across layers using the principal eigenvector, uncovering the underlying mechanism of RepE. We empirically validate these intrinsic properties, demonstrating their broad applicability and significance. By bridging theoretical insights with empirical validation, this work transforms RepE from a descriptive tool into a structured theoretical framework, opening new directions for improving AI robustness, fairness, and transparency.

[AI-113] LLM-based Agent Simulation for Maternal Health Interventions: Uncertainty Estimation and Decision-focused Evaluation

Link: https://arxiv.org/abs/2503.22719
Authors: Sarah Martinson, Lingkai Kong, Cheol Woo Kim, Aparna Taneja, Milind Tambe
Subjects: Artificial Intelligence (cs.AI)

Abstract:Agent-based simulation is crucial for modeling complex human behavior, yet traditional approaches require extensive domain knowledge and large datasets. In data-scarce healthcare settings where historic and counterfactual data are limited, large language models (LLMs) offer a promising alternative by leveraging broad world knowledge. This study examines an LLM-driven simulation of a maternal mobile health program, predicting beneficiaries’ listening behavior when they receive health information via automated messages (control) or live representatives (intervention). Since uncertainty quantification is critical for decision-making in health interventions, we propose an LLM epistemic uncertainty estimation method based on binary entropy across multiple samples. We enhance model robustness through ensemble approaches, improving F1 score and model calibration compared to individual models. Beyond direct evaluation, we take a decision-focused approach, demonstrating how LLM predictions inform intervention feasibility and trial implementation in data-limited settings. The proposed method extends to public health, disaster response, and other domains requiring rapid intervention assessment under severe data constraints. All code and prompts used for this work can be found at this https URL.
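
As a rough illustration of an uncertainty score "based on binary entropy across multiple samples", the sketch below computes the entropy of the fraction of positive predictions among repeated LLM samples for one beneficiary. The abstract does not give the exact estimator, so treating this plain entropy as the uncertainty score is an assumption, and the function names are ours.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) variable; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def sample_uncertainty(binary_predictions):
    """Uncertainty of repeated 0/1 predictions for the same input.

    binary_predictions: e.g. [1, 0, 1, 1] from four sampled LLM answers to
    "will this beneficiary listen to the message?" (1 = yes, 0 = no).
    """
    p_yes = sum(binary_predictions) / len(binary_predictions)
    return binary_entropy(p_yes)


if __name__ == "__main__":
    print(sample_uncertainty([1, 1, 1, 1]))  # full agreement
    print(sample_uncertainty([1, 0, 1, 0]))  # maximal disagreement
```

Full agreement among samples gives a score of 0 and an even split gives the maximum of 1 bit, so the score can be used to flag predictions that should not drive an intervention decision on their own.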

[AI-114] Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions

Link: https://arxiv.org/abs/2503.22711
Authors: Vikramjit Mitra, Amrit Romana, Dung T. Tran, Erdrin Azemi
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 11 pages, 5 figures

Abstract: Spontaneous speech emotion data usually contain perceptual grades, where graders assign emotion scores after listening to the speech files. Such perceptual grades introduce uncertainty in labels due to variation in grader opinion. Grader variation is usually addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected. Consensus grades fail to consider ambiguous instances where a speech sample may contain multiple emotions, as captured through grader opinion uncertainty. We demonstrate that using the probability density function of the emotion grades as targets, instead of the commonly used consensus grades, provides better performance on benchmark evaluation sets compared to results reported in the literature. We show that a saliency-driven foundation model (FM) representation selection helps to train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition. Comparing representations obtained from different FMs, we observed that focusing on overall test-set performance can be deceiving, as it fails to reveal the model's generalization capacity across speakers and gender. We demonstrate that performance evaluation across multiple test sets and performance analysis across gender and speakers are useful in assessing the usefulness of emotion models. Finally, we demonstrate that label uncertainty and data skew pose a challenge to model evaluation, where instead of using the best hypothesis, it is useful to consider the 2- or 3-best hypotheses.
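
The difference between consensus targets and distribution targets can be made concrete with a small sketch. The label set, function names, and the cross-entropy loss are illustrative assumptions; the abstract does not state which loss or label inventory the authors use.

```python
import math
from collections import Counter

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # hypothetical label set


def consensus_label(votes):
    """Majority-vote target: keeps only the most frequent grade."""
    return Counter(votes).most_common(1)[0][0]


def vote_distribution(votes, labels=EMOTIONS):
    """Soft target: the empirical distribution of grader votes."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in labels]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def top_k(predicted, labels=EMOTIONS, k=2):
    """The k most probable labels, for 2- or 3-best evaluation."""
    ranked = sorted(zip(predicted, labels), key=lambda pair: pair[0], reverse=True)
    return [label for _, label in ranked[:k]]


if __name__ == "__main__":
    votes = ["happy", "happy", "neutral"]
    print(consensus_label(votes), vote_distribution(votes))
    print(top_k([0.1, 0.5, 0.3, 0.1], k=2))
```

With the soft target, a model that splits its probability between "happy" and "neutral" for this sample is penalised less than one that commits fully to "happy", which is the ambiguity that a consensus label discards.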

[AI-115] Validating Emergency Department Admission Predictions Based on Local Data Through MIMIC-IV

Link: https://arxiv.org/abs/2503.22706
Authors: Francesca Meimeti, Loukas Triantafyllopoulos, Aikaterini Sakagianni, Vasileios Kaldis, Lazaros Tzelves, Nikolaos Theodorakis, Evgenia Paxinou, Georgios Feretzakis, Dimitris Kalles, Vassilios S. Verykios
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 36 pages, 3 figures, 6 tables

Abstract:The effective management of Emergency Department (ED) overcrowding is essential for improving patient outcomes and optimizing healthcare resource allocation. This study validates hospital admission prediction models initially developed using a small local dataset from a Greek hospital by leveraging the comprehensive MIMIC-IV dataset. After preprocessing the MIMIC-IV data, five algorithms were evaluated: Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Random Forest (RF), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM Radial). Among these, RF demonstrated superior performance, achieving an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.9999, sensitivity of 0.9997, and specificity of 0.9999 when applied to the MIMIC-IV data. These findings highlight the robustness of RF in handling complex datasets for admission prediction, establish MIMIC-IV as a valuable benchmark for validating models based on smaller local datasets, and provide actionable insights for improving ED management strategies.

[AI-116] Binary and Multi-Class Intrusion Detection in IoT Using Standalone and Hybrid Machine and Deep Learning Models

Link: https://arxiv.org/abs/2503.22684
Authors: Md Ahnaf Akif
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Master’s thesis, 80 pages, 18 figures, 4 tables

Abstract: Maintaining security in IoT systems depends on intrusion detection, since these networks’ susceptibility to cyber-attacks is growing. Based on the IoT23 dataset, this study explores the use of several Machine Learning (ML) and Deep Learning (DL) models, along with hybrid models, for binary and multi-class intrusion detection. The standalone ML and DL models used were Random Forest (RF), Extreme Gradient Boosting (XGBoost), Artificial Neural Network (ANN), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Convolutional Neural Network (CNN). Furthermore, two hybrid models were created by combining machine learning techniques (RF, XGBoost, AdaBoost, KNN, and SVM) into voting-based hybrid classifiers, one for binary and the other for multi-class classification. These models were tested using precision, recall, accuracy, and F1-score, and the performance of each model was compared. This work thoroughly explains how hybrid and standalone ML and DL techniques could improve Intrusion Detection Systems (IDS) in terms of accuracy and scalability in the Internet of Things (IoT).
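
A voting-based hybrid classifier can be sketched in a few lines. The abstract does not say whether the thesis uses hard (majority) or soft (probability-averaged) voting, so the hard-voting version below is an assumption, and the labels and function names are illustrative.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority label among the base classifiers' predictions for one sample.

    Ties are broken in favour of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]


def vote_batch(per_model_predictions):
    """per_model_predictions: one list of labels per model, aligned by sample."""
    return [hard_vote(sample_preds) for sample_preds in zip(*per_model_predictions)]


if __name__ == "__main__":
    rf = ["attack", "benign", "attack"]
    knn = ["attack", "attack", "benign"]
    svm = ["benign", "benign", "attack"]
    print(vote_batch([rf, knn, svm]))
```

The same function covers the binary and multi-class cases, since it only counts labels.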

[AI-117] Prompt Divide and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing

Link: https://arxiv.org/abs/2503.21598
Authors: Johan Wahréus, Ahmed Hussain, Panos Papadimitratos
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages; 26 figures

Abstract:Large Language Models (LLMs) have transformed task automation and content generation across various domains while incorporating safety filters to prevent misuse. We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass these safety measures, particularly in generating malicious code. Our architecture consists of four key modules: prompt segmentation, parallel processing, response aggregation, and LLM-based jury evaluation. Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2% Success Rate (SR) in generating malicious code. Notably, our comparative analysis reveals that traditional single-LLM judge evaluation overestimates SRs (93.8%) compared to our LLM jury system (73.2%), with manual verification confirming that single-judge assessments often accept incomplete implementations. Moreover, we demonstrate that our distributed architecture improves SRs by 12% over the non-distributed approach in an ablation study, highlighting both the effectiveness of distributed prompt processing and the importance of robust evaluation methodologies in assessing jailbreak attempts.

[AI-118] SPDZCoder: Combining Expert Knowledge with LLMs for Generating Privacy-Computing Code

Link: https://arxiv.org/abs/2501.00363
Authors: Xiaoning Dong, Peilin Xin, Jia Li, Wei Xu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract: Privacy computing receives increasing attention, but writing privacy computing code remains challenging for developers due to limited library functions, which necessitate implementing functions from scratch, and data-oblivious requirements, which contradict the intuitive thinking and usual practices of programmers. Automating the generation of privacy computing code with Large Language Models (LLMs) can streamline development effort and lower the barrier to using privacy computing frameworks. However, existing LLMs still encounter challenges in code translation for privacy-preserving computation, such as translating Python to MP-SPDZ, due to the scarcity of MP-SPDZ data required for effective pre-training or fine-tuning. Moreover, the lack of a benchmark further complicates the evaluation of translation quality. To address these limitations, this work proposes SPDZCoder, a rule-based framework that combines LLMs with expert knowledge for generating privacy-computing code without requiring additional training data. Specifically, SPDZCoder employs a rigorous procedure for collecting high-quality expert knowledge to represent the semantic-expressing differences between Python and MP-SPDZ, and to derive transformation rules for translating Python to MP-SPDZ based on this knowledge. Then, SPDZCoder progressively converts Python code into MP-SPDZ code using the transformation rules in a three-stage pipeline. To evaluate SPDZCoder, we manually constructed a benchmark dataset, SPDZEval, which comprises six data splits, each representing a distinct class of challenging tasks in MP-SPDZ implementation. Extensive experiments show that SPDZCoder achieves superior performance, significantly surpassing baselines in pass@1 and pass@2. Specifically, SPDZCoder attains an overall correctness of 85.94% and 92.01% in pass@1 and pass@2, respectively, whereas the best-performing baseline achieves 63.58% and 76.36%, respectively.
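
For readers unfamiliar with the pass@k metric quoted above, here is a minimal sketch of the widely used unbiased estimator: given n generated samples for a task, of which c pass the tests, it estimates the probability that at least one of k randomly drawn samples passes. Whether SPDZCoder's evaluation uses this estimator or simply generates exactly k samples per task is not stated in the abstract, so this is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that all
    k samples drawn without replacement are incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    per_task = [(10, 3), (10, 0), (10, 10)]  # made-up (n, c) counts
    print(mean_pass_at_k(per_task, 1), mean_pass_at_k(per_task, 2))
```

By construction pass@2 is never lower than pass@1 for the same samples, which matches the ordering of the figures reported in the abstract.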

[AI-119] Deep Nets as Hamiltonians

Link: https://arxiv.org/abs/2503.23982
Authors: Mike Winer, Boris Hanin
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR)
Comments: 19+7 pages

Abstract: Neural networks are complex functions of both their inputs and parameters. Much prior work in deep learning theory analyzes the distribution of network outputs at a fixed set of inputs (e.g. a training dataset) over random initializations of the network parameters. The purpose of this article is to consider the opposite situation: we view a randomly initialized Multi-Layer Perceptron (MLP) as a Hamiltonian over its inputs. For typical realizations of the network parameters, we study the properties of the energy landscape induced by this Hamiltonian, focusing on the structure of near-global minima in the limit of infinite width. Specifically, we use the replica trick to perform an exact analytic calculation giving the entropy (log volume of space) at a given energy. We further derive saddle point equations that describe the overlaps between inputs sampled iid from the Gibbs distribution induced by the random MLP. For linear activations we solve these saddle point equations exactly. We also solve them numerically for a variety of depths and activation functions, including tanh, sin, ReLU, and shaped non-linearities. We find a rich range of behaviors even at infinite width. For some non-linearities, such as sin, we find that the landscapes of random MLPs exhibit full replica symmetry breaking, while shallow tanh and ReLU networks or deep shaped MLPs are instead replica symmetric.

[AI-120] Remarks on the Polyak-Lojasiewicz inequality and the convergence of gradient systems

Link: https://arxiv.org/abs/2503.23641
Authors: Arthur Castello B. de Oliveira, Leilei Cui, Eduardo D. Sontag
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

Abstract:This work explores generalizations of the Polyak-Lojasiewicz inequality (PLI) and their implications for the convergence behavior of gradient flows in optimization problems. Motivated by the continuous-time linear quadratic regulator (CT-LQR) policy optimization problem – where only a weaker version of the PLI is characterized in the literature – this work shows that while weaker conditions are sufficient for global convergence to, and optimality of the set of critical points of the cost function, the “profile” of the gradient flow solution can change significantly depending on which “flavor” of inequality the cost satisfies. After a general theoretical analysis, we focus on fitting the CT-LQR policy optimization problem to the proposed framework, showing that, in fact, it can never satisfy a PLI in its strongest form. We follow up our analysis with a brief discussion on the difference between continuous- and discrete-time LQR policy optimization, and end the paper with some intuition on the extension of this framework to optimization problems with L1 regularization and solved through proximal gradient flows.
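
For context, the standard (strongest) form of the PLI and the exponential convergence it implies for gradient flow can be written out as below. The symbols $f$, $f^*$, $\mu$, and $x(t)$ are our notation and not necessarily the paper's; the weaker variants the abstract refers to are not specified there, so they are not reproduced.

```latex
% Standard PL inequality for a differentiable cost f with infimum f^*:
\[
  \tfrac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr),
  \qquad \mu > 0 .
\]
% Along the gradient flow \dot{x}(t) = -\nabla f(x(t)), the chain rule gives
\[
  \frac{d}{dt}\bigl(f(x(t)) - f^{*}\bigr)
  = -\,\|\nabla f(x(t))\|^{2}
  \;\le\; -2\mu\,\bigl(f(x(t)) - f^{*}\bigr),
\]
% so, by Gronwall's inequality, the cost gap decays exponentially:
\[
  f(x(t)) - f^{*} \;\le\; e^{-2\mu t}\,\bigl(f(x(0)) - f^{*}\bigr).
\]
```

When only a weaker inequality holds, the last step no longer yields a uniform exponential rate, which is one way the "profile" of the solution can change even though convergence is retained.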

[AI-121] Interpretable Machine Learning in Physics: A Review

Link: https://arxiv.org/abs/2503.23616
Authors: Sebastian Johann Wetzel, Seungwoong Ha, Raban Iten, Miriam Klopotek, Ziming Liu
Subjects: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Machine learning is increasingly transforming various scientific fields, enabled by advancements in computational power and access to large data sets from experiments and simulations. As artificial intelligence (AI) continues to grow in capability, these algorithms will enable many scientific discoveries beyond human capabilities. Since the primary goal of science is to understand the world around us, fully leveraging machine learning in scientific discovery requires models that are interpretable – allowing experts to comprehend the concepts underlying machine-learned predictions. Successful interpretations increase trust in black-box methods, help reduce errors, allow for the improvement of the underlying models, enhance human-AI collaboration, and ultimately enable fully automated scientific discoveries that remain understandable to human scientists. This review examines the role of interpretability in machine learning applied to physics. We categorize different aspects of interpretability, discuss machine learning models in terms of both interpretability and performance, and explore the philosophical implications of interpretability in scientific inquiry. Additionally, we highlight recent advances in interpretable machine learning across many subfields of physics. By bridging boundaries between disciplines – each with its own unique insights and challenges – we aim to establish interpretable machine learning as a core research focus in science.

[AI-122] Addressing Model Overcomplexity in Drug-Drug Interaction Prediction With Molecular Fingerprints ICLR2025

Link: https://arxiv.org/abs/2503.23550
Authors: Manel Gil-Sorribes, Alexis Molina
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the GEM Workshop at ICLR 2025

Abstract:Accurately predicting drug-drug interactions (DDIs) is crucial for pharmaceutical research and clinical safety. Recent deep learning models often suffer from high computational costs and limited generalization across datasets. In this study, we investigate a simpler yet effective approach using molecular representations such as Morgan fingerprints (MFPS), graph-based embeddings from graph convolutional networks (GCNs), and transformer-derived embeddings from MoLFormer integrated into a straightforward neural network. We benchmark our implementation on DrugBank DDI splits and a drug-drug affinity (DDA) dataset from the Food and Drug Administration. MFPS along with MoLFormer and GCN representations achieve competitive performance across tasks, even in the more challenging leak-proof split, highlighting the sufficiency of simple molecular representations. Moreover, we are able to identify key molecular motifs and structural patterns relevant to drug interactions via gradient-based analyses using the representations under study. Despite these results, dataset limitations such as insufficient chemical diversity, limited dataset size, and inconsistent labeling impact robust evaluation and challenge the need for more complex approaches. Our work provides a meaningful baseline and emphasizes the need for better dataset curation and progressive complexity scaling.

[AI-123] POINT2: A Polymer Informatics Training and Testing Database

Link: https://arxiv.org/abs/2503.23491
Authors: Jiaxin Xu, Gang Liu, Ruilan Guo, Meng Jiang, Tengfei Luo
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The advancement of polymer informatics has been significantly propelled by the integration of machine learning (ML) techniques, enabling the rapid prediction of polymer properties and expediting the discovery of high-performance polymeric materials. However, the field lacks a standardized workflow that encompasses prediction accuracy, uncertainty quantification, ML interpretability, and polymer synthesizability. In this study, we introduce POINT^2 (POlymer INformatics Training and Testing), a comprehensive benchmark database and protocol designed to address these critical challenges. Leveraging the existing labeled datasets and the unlabeled PI1M dataset, a collection of approximately one million virtual polymers generated via a recurrent neural network trained on the realistic polymers, we develop an ensemble of ML models, including Quantile Random Forests, Multilayer Perceptrons with dropout, Graph Neural Networks, and pretrained large language models. These models are coupled with diverse polymer representations such as Morgan, MACCS, RDKit, Topological, Atom Pair fingerprints, and graph-based descriptors to achieve property predictions, uncertainty estimations, model interpretability, and template-based polymerization synthesizability across a spectrum of properties, including gas permeability, thermal conductivity, glass transition temperature, melting temperature, fractional free volume, and density. The POINT^2 database can serve as a valuable resource for the polymer informatics community for polymer discovery and optimization.

[AI-124] Spatiotemporal Learning of Brain Dynamics from fMRI Using Frequency-Specific Multi-Band Attention for Cognitive and Psychiatric Applications

Link: https://arxiv.org/abs/2503.23394
Authors: Sangyoon Bae, Junbeom Kwon, Shinjae Yoo, Jiook Cha
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)

Abstract:Understanding how the brain’s complex nonlinear dynamics give rise to adaptive cognition and behavior is a central challenge in neuroscience. These dynamics exhibit scale-free and multifractal properties, influencing the reconfiguration of neural networks. However, conventional neuroimaging models are constrained by linear and stationary assumptions, limiting their ability to capture these processes. Transformer-based architectures, known for capturing long-range dependencies, align well with the brain’s hierarchical and temporal organization. We introduce Multi-Band Brain Net (MBBN), a transformer-based framework that models frequency-specific spatiotemporal brain dynamics from fMRI by integrating scale-free network principles with frequency-resolved multi-band self-attention. Trained on three large-scale neuroimaging cohorts (UK Biobank, ABCD, ABIDE) totaling 45,951 individuals, MBBN reveals previously undetectable frequency-dependent network interactions, shedding light on connectivity disruptions in psychiatric conditions (ADHD, ASD, depression). This validation shows robust generalizability and highlights core neural principles conserved across populations. MBBN achieves up to 30.59% higher predictive accuracy than state-of-the-art methods, demonstrating the advantage of frequency-informed spatiotemporal modeling in capturing latent neural computations. MBBN’s interpretability uncovers novel frequency-specific biomarkers for neurodevelopmental disorders, providing insights into the hierarchical organization of brain function. By offering an interpretable framework for spatiotemporal learning, MBBN provides insights into how neural computations underpin cognitive function and psychiatric vulnerability, with implications for brain decoding, cognitive neuroscience, and precision psychiatry.

[AI-125] Simulation of Non-Ordinary Consciousness

Link: https://arxiv.org/abs/2503.23245
Authors: Khalid M. Saqr
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 16 pages, 9 figures, 1 table

Abstract:The symbolic architecture of non-ordinary consciousness remains largely unmapped in cognitive science and artificial intelligence. While conventional models prioritize rational coherence, altered states such as those induced by psychedelics reveal distinct symbolic regimes characterized by recursive metaphor, ego dissolution, and semantic destabilization. We present Glyph, a generative symbolic interface designed to simulate psilocybin-like symbolic cognition in large language models. Rather than modeling perception or mood, Glyph enacts symbolic transformation through recursive reentry, metaphoric modulation, and entropy-scaled destabilization – a triadic operator formalized within a tensorial linguistic framework. Experimental comparison with baseline GPT-4o reveals that Glyph consistently generates high-entropy, metaphor-saturated, and ego-dissolving language across diverse symbolic prompt categories. These results validate the emergence of non-ordinary cognitive patterns and support a new paradigm for simulating altered consciousness through language. Glyph opens novel pathways for modeling symbolic cognition, exploring metaphor theory, and encoding knowledge in recursively altered semantic spaces.

[AI-126] Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations

Link: https://arxiv.org/abs/2503.22687
Authors: Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)

Abstract:Emotion recognition plays a pivotal role in intelligent human-machine interaction systems. Multimodal approaches benefit from the fusion of diverse modalities, thereby improving the recognition accuracy. However, the lack of high-quality multimodal data and the challenge of achieving optimal alignment between different modalities significantly limit the potential for improvement in multimodal approaches. In this paper, the proposed Qieemo framework effectively utilizes the pretrained automatic speech recognition (ASR) model backbone which contains naturally frame aligned textual and emotional features, to achieve precise emotion classification solely based on the audio modality. Furthermore, we design the multimodal fusion (MMF) module and cross-modal attention (CMA) module in order to fuse the phonetic posteriorgram (PPG) and emotional features extracted by the ASR encoder for improving recognition accuracy. The experimental results on the IEMOCAP dataset demonstrate that Qieemo outperforms the benchmark unimodal, multimodal, and self-supervised models with absolute improvements of 3.0%, 1.2%, and 1.9% respectively.

Machine Learning

[LG-0] Policy Gradient for LQR with Domain Randomization

Link: https://arxiv.org/abs/2503.24371
Authors: Tesshu Fujinami, Bruce D. Lee, Nikolai Matni, George J. Pappas
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Domain randomization (DR) enables sim-to-real transfer by training controllers on a distribution of simulated environments, with the goal of achieving robust performance in the real world. Although DR is widely used in practice and is often solved using simple policy gradient (PG) methods, understanding of its theoretical guarantees remains limited. Toward addressing this gap, we provide the first convergence analysis of PG methods for domain-randomized linear quadratic regulation (LQR). We show that PG converges globally to the minimizer of a finite-sample approximation of the DR objective under suitable bounds on the heterogeneity of the sampled systems. We also quantify the sample-complexity associated with achieving a small performance gap between the sample-average and population-level objectives. Additionally, we propose and analyze a discount-factor annealing algorithm that obviates the need for an initial jointly stabilizing controller, which may be challenging to find. Empirical results support our theoretical findings and highlight promising directions for future work, including risk-sensitive DR formulations and stochastic PG algorithms.
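
To make the finite-sample DR objective concrete, here is a minimal sketch for scalar systems x_{t+1} = a x_t + b u_t with a static gain u_t = -k x_t. Everything in it is an illustrative assumption rather than the paper's algorithm: the function names, the unit weights q and r, the unit initial state, the three sampled (a, b) pairs, and the use of an exact gradient with step halving in place of a sampled policy gradient.

```python
import math


def lqr_cost(k, a, b, q=1.0, r=1.0):
    """Infinite-horizon cost from x0 = 1 for x+ = (a - b*k) x; inf if unstable."""
    c = a - b * k
    if abs(c) >= 1.0:
        return math.inf
    return (q + r * k * k) / (1.0 - c * c)


def lqr_grad(k, a, b, q=1.0, r=1.0):
    """Derivative of lqr_cost with respect to k (valid where the loop is stable)."""
    c = a - b * k
    d = 1.0 - c * c
    return (2.0 * r * k * d - (q + r * k * k) * 2.0 * c * b) / (d * d)


def dr_cost(k, systems):
    """Sample-average (domain-randomized) cost over the sampled systems."""
    return sum(lqr_cost(k, a, b) for a, b in systems) / len(systems)


def dr_policy_gradient(systems, k0, step=0.05, iters=200):
    """Gradient descent on the sample-average cost, starting from a gain k0
    that stabilizes every sampled system. The step is halved whenever it
    would increase the cost (this also rejects steps that destabilize a system)."""
    k = k0
    for _ in range(iters):
        g = sum(lqr_grad(k, a, b) for a, b in systems) / len(systems)
        s = step
        while s > 1e-12 and dr_cost(k - s * g, systems) > dr_cost(k, systems):
            s *= 0.5
        if s <= 1e-12:
            break
        k -= s * g
    return k


# Hypothetical sampled systems (a, b) and an initial jointly stabilizing gain.
systems = [(0.9, 1.0), (1.0, 0.9), (1.1, 1.1)]
k_final = dr_policy_gradient(systems, k0=0.5)
```

The sketch needs a jointly stabilizing initial gain, which is the requirement the discount-factor annealing algorithm in the abstract is designed to remove.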

[LG-1] Faster Rates for No-Regret Learning in General Games via Cautious Optimism STOC2025

Link: https://arxiv.org/abs/2503.24340
Authors: Ashkan Soleymani, Georgios Piliouras, Gabriele Farina
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Appeared at STOC 2025

Abstract:We establish the first uncoupled learning algorithm that attains O(n \log^2 d \log T) per-player regret in multi-player general-sum games, where n is the number of players, d is the number of actions available to each player, and T is the number of repetitions of the game. Our results exponentially improve the dependence on d compared to the O(n d \log T) regret attainable by Log-Regularized Lifted Optimistic FTRL [Far+22c], and also reduce the dependence on the number of iterations T from \log^4 T to \log T compared to Optimistic Hedge, the previously well-studied algorithm with O(n \log d \log^4 T) regret [DFG21]. Our algorithm is obtained by combining the classic Optimistic Multiplicative Weights Update (OMWU) with an adaptive, non-monotonic learning rate that paces the learning process of the players, making them more cautious when their regret becomes too negative.
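
For readers unfamiliar with the baseline, the classic OMWU update can be sketched as follows. This uses a fixed learning rate eta, so it is not the adaptive, non-monotonic schedule proposed in the paper; the function names, the two-player zero-sum self-play setup, and the matching-pennies payoff matrix are assumptions made for illustration.

```python
import math


def omwu_step(x, u_now, u_prev, eta):
    """One OMWU update: x_new(a) is proportional to
    x(a) * exp(eta * (2 * u_now(a) - u_prev(a))),
    i.e. the latest utility vector is counted twice as a prediction of the next one."""
    logits = [math.log(p) + eta * (2.0 * a - b) for p, a, b in zip(x, u_now, u_prev)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def self_play(payoff, x, y, eta=0.1, steps=200):
    """Both players of a zero-sum matrix game run OMWU simultaneously.
    The row player receives payoff[i][j]; the column player receives its negative."""
    ux_prev = [0.0] * len(x)
    uy_prev = [0.0] * len(y)
    for _ in range(steps):
        ux = [sum(payoff[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
        uy = [-sum(payoff[i][j] * x[i] for i in range(len(x))) for j in range(len(y))]
        x, y = omwu_step(x, ux, ux_prev, eta), omwu_step(y, uy, uy_prev, eta)
        ux_prev, uy_prev = ux, uy
    return x, y


# Matching pennies, started away from the uniform equilibrium.
x_final, y_final = self_play([[1, -1], [-1, 1]], [0.8, 0.2], [0.3, 0.7])
```

The update is uncoupled in the sense used in the abstract: each player only needs its own utility vectors, never the other player's strategy.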

[LG-2] NoProp: Training Neural Networks without Back-propagation or Forward-propagation

Link: https://arxiv.org/abs/2503.24322
Authors: Qinyu Li, Yee Whye Teh, Razvan Pascanu
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The canonical deep learning approach for learning requires computing a gradient term at each layer by back-propagating the error signal from the output towards each learnable parameter. Given the stacked structure of neural networks, where each layer builds on the representation of the layer below, this approach leads to hierarchical representations. More abstract features live on the top layers of the model, while features on lower layers are expected to be less abstract. In contrast to this, we introduce a new learning method named NoProp, which does not rely on either forward or backwards propagation. Instead, NoProp takes inspiration from diffusion and flow matching methods, where each layer independently learns to denoise a noisy target. We believe this work takes a first step towards introducing a new family of gradient-free learning methods, that does not learn hierarchical representations – at least not in the usual sense. NoProp needs to fix the representation at each layer beforehand to a noised version of the target, learning a local denoising process that can then be exploited at inference. We demonstrate the effectiveness of our method on MNIST, CIFAR-10, and CIFAR-100 image classification benchmarks. Our results show that NoProp is a viable learning algorithm which achieves superior accuracy, is easier to use and computationally more efficient compared to other existing back-propagation-free methods. By departing from the traditional gradient based learning paradigm, NoProp alters how credit assignment is done within the network, enabling more efficient distributed learning as well as potentially impacting other characteristics of the learning process.

[LG-3] Sample-Optimal Private Regression in Polynomial Time

Link: https://arxiv.org/abs/2503.24321
Authors: Prashanti Anderson, Ainesh Bakshi, Mahbod Majid, Stefan Tiegel
Categories: Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We consider the task of privately obtaining prediction error guarantees in ordinary least-squares regression problems with Gaussian covariates (with unknown covariance structure). We provide the first sample-optimal polynomial time algorithm for this task under both pure and approximate differential privacy. We show that any improvement to the sample complexity of our algorithm would violate either statistical-query or information-theoretic lower bounds. Additionally, our algorithm is robust to a small fraction of arbitrary outliers and achieves optimal error rates as a function of the fraction of outliers. In contrast, all prior efficient algorithms either incurred sample complexities with sub-optimal dimension dependence, scaling with the condition number of the covariates, or obtained a polynomially worse dependence on the privacy parameters. Our technical contributions are two-fold: first, we leverage resilience guarantees of Gaussians within the sum-of-squares framework. As a consequence, we obtain efficient sum-of-squares algorithms for regression with optimal robustness rates and sample complexity. Second, we generalize the recent robustness-to-privacy framework [HKMN23, (arXiv:2212.05015)] to account for the geometry induced by the covariance of the input samples. This framework crucially relies on the robust estimators to be sum-of-squares algorithms, and combining the two steps yields a sample-optimal private regression algorithm. We believe our techniques are of independent interest, and we demonstrate this by obtaining an efficient algorithm for covariance-aware mean estimation, with an optimal dependence on the privacy parameters. 

[LG-4] Fair Dynamic Spectrum Access via Fully Decentralized Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2503.24296
Authors: Yubo Zhang, Pedro Botelho, Trevor Gordon, Gil Zussman, Igor Kadota
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: To appear in WiOpt 2025

Abstract:We consider a decentralized wireless network with several source-destination pairs sharing a limited number of orthogonal frequency bands. Sources learn to adapt their transmissions (specifically, their band selection strategy) over time, in a decentralized manner, without sharing information with each other. Sources can only observe the outcome of their own transmissions (i.e., success or collision), having no prior knowledge of the network size or of the transmission strategy of other sources. The goal of each source is to maximize its own throughput while striving for network-wide fairness. We propose a novel fully decentralized Reinforcement Learning (RL)-based solution that achieves fairness without coordination. The proposed Fair Share RL (FSRL) solution combines: (i) state augmentation with a semi-adaptive time reference; (ii) an architecture that leverages risk control and time difference likelihood; and (iii) a fairness-driven reward structure. We evaluate FSRL in more than 50 network settings with different numbers of agents, different amounts of available spectrum, in the presence of jammers, and in an ad-hoc setting. Simulation results suggest that, when we compare FSRL with a common baseline RL algorithm from the literature, FSRL can be up to 89.0% fairer (as measured by Jain’s fairness index) in stringent settings with several sources and a single frequency band, and 48.1% fairer on average.
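
The fairness metric named in the abstract has a simple closed form. Below is a minimal sketch of Jain's fairness index; the function name and the throughput values in the usage note are illustrative and not taken from the paper.

```python
def jain_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).
    Equals 1 when all sources get the same throughput and 1/n when a single
    source gets everything."""
    n = len(throughputs)
    total = sum(throughputs)
    squares = sum(x * x for x in throughputs)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero throughput")
    return total * total / (n * squares)
```

For example, `jain_index([1, 1, 1, 1])` evaluates to 1.0 and `jain_index([1, 0, 0, 0])` to 0.25, the two extremes for four sources.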

[LG-5] Advances in Continual Graph Learning for Anti-Money Laundering Systems: A Comprehensive Review

Link: https://arxiv.org/abs/2503.24259
Authors: Bruno Deprez, Wei Wei, Wouter Verbeke, Bart Baesens, Kevin Mets, Tim Verdonck
Categories: Machine Learning (cs.LG)

Abstract:Financial institutions are required by regulation to report suspicious financial transactions related to money laundering. Therefore, they need to constantly monitor vast amounts of incoming and outgoing transactions. A particular challenge in detecting money laundering is that money launderers continuously adapt their tactics to evade detection. Hence, detection methods need constant fine-tuning. Traditional machine learning models suffer from catastrophic forgetting when fine-tuning the model on new data, thereby limiting their effectiveness in dynamic environments. Continual learning methods may address this issue and enhance current anti-money laundering (AML) practices, by allowing models to incorporate new information while retaining prior knowledge. Research on continual graph learning for AML, however, is still scarce. In this review, we critically evaluate state-of-the-art continual graph learning approaches for AML applications. We categorise methods into replay-based, regularization-based, and architecture-based strategies within the graph neural network (GNN) framework, and we provide in-depth experimental evaluations on both synthetic and real-world AML data sets that showcase the effect of the different hyperparameters. Our analysis demonstrates that continual learning improves model adaptability and robustness in the face of extreme class imbalances and evolving fraud patterns. Finally, we outline key challenges and propose directions for future research.

[LG-6] GPU-centric Communication Schemes for HPC and ML Applications

Link: https://arxiv.org/abs/2503.24230
Authors: Naveen Namashivayam
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: A survey on Communication Schemes for Distributed HPC and ML Applications. Article in consideration for journal publication

Abstract:Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is identified as a technique for effectively utilizing these systems to execute scalable simulation and deep learning workloads. The resulting inter-process communication from the distributed execution of these parallel workloads is one of the key factors contributing to its performance bottleneck. Most programming models and runtime systems enabling the communication requirements on these systems support GPU-aware communication schemes that move the GPU-attached communication buffers in the application directly from the GPU to the NIC without staging through the host memory. A CPU thread is required to orchestrate the communication operations even with support for such GPU-awareness. This survey discusses various available GPU-centric communication schemes that move the control path of the communication operations from the CPU to the GPU. This work presents the need for the new communication schemes, various GPU and NIC capabilities required to implement the schemes, and the potential use-cases addressed. Based on these discussions, challenges involved in supporting the exhibited GPU-centric communication schemes are discussed.

[LG-7] Many-to-Many Matching via Sparsity Controlled Optimal Transport

Link: https://arxiv.org/abs/2503.24204
Authors: Weijie Liu, Han Bao, Makoto Yamada, Zenan Huang, Nenggan Zheng, Hui Qian
Categories: Machine Learning (cs.LG)

Abstract:Many-to-many matching seeks to match multiple points in one set and multiple points in another set, which is a basis for a wide range of data mining problems. It can be naturally recast in the framework of Optimal Transport (OT). However, existing OT methods either lack the ability to accomplish many-to-many matching or necessitate careful tuning of a regularization parameter to achieve satisfactory results. This paper proposes a novel many-to-many matching method to explicitly encode many-to-many constraints while preventing the degeneration into one-to-one matching. The proposed method consists of the following two components. The first component is the matching budget constraints on each row and column of a transport plan, which specify how many points can be matched to a point at most. The second component is the deformed q-entropy regularization, which encourages a point to meet the matching budget maximally. While the deformed q-entropy was initially proposed to sparsify a transport plan, we employ it to avoid the degeneration into one-to-one matching. We optimize the objective via a penalty algorithm, which is efficient and theoretically guaranteed to converge. Experimental results on various tasks demonstrate that the proposed method achieves good performance by gleaning meaningful many-to-many matchings.

[LG-8] Traffic Engineering in Large-scale Networks with Generalizable Graph Neural Networks

Link: https://arxiv.org/abs/2503.24203
Authors: Fangtong Zhou, Xiaorui Liu, Ruozhou Yu, Guoliang Xue
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)

Abstract:Traffic engineering (TE) in large-scale computer networks has become a fundamental yet challenging problem, owing to the swift growth of global-scale cloud wide-area networks or backbone low-Earth-orbit satellite constellations. To address the scalability issue of traditional TE algorithms, learning-based approaches have been proposed, showing potential for significant efficiency improvement over state-of-the-art methods. Nevertheless, the intrinsic limitations of existing learning-based methods hinder their practical application: they are not generalizable across diverse topologies and network conditions, incur excessive training overhead, and do not respect link capacities by default. This paper proposes TELGEN, a novel TE algorithm that learns to solve TE problems efficiently in large-scale networks, while achieving superior generalizability across diverse network conditions. TELGEN is based on the novel idea of transforming the problem of “predicting the optimal TE solution” into “predicting the optimal TE algorithm”, which enables TELGEN to learn and efficiently approximate the end-to-end solving process of classical optimal TE algorithms. The learned algorithm is agnostic to the exact network topology or traffic patterns, and can efficiently solve TE problems given arbitrary inputs and generalize well to unseen topologies and demands. We trained and evaluated TELGEN on random and real-world networks with up to 5000 nodes and 10^6 links. TELGEN achieved less than 3% optimality gap while ensuring feasibility in all cases, even when the test network had up to 20x more nodes than the largest in training. It also reduced solving time by up to 84% compared with a classical optimal solver, and could reduce training time per epoch and solving time by 2-4 orders of magnitude compared with the latest learning algorithms on the largest networks.

[LG-9] NeuRaLaTeX: A machine learning library written in pure LaTeX

Link: https://arxiv.org/abs/2503.24187
Authors: James A. D. Gardner, Will Rowan, William A. P. Smith
Categories: Machine Learning (cs.LG)

Abstract:In this paper, we introduce NeuRaLaTeX, which we believe to be the first deep learning library written entirely in LaTeX. As part of your LaTeX document you can specify the architecture of a neural network and its loss functions, define how to generate or load training data, and specify training hyperparameters and experiments. When the document is compiled, the LaTeX compiler will generate or load training data, train the network, run experiments, and generate figures. This paper generates a random 100 point spiral dataset, trains a two layer MLP on it, evaluates on a different random spiral dataset, produces plots and tables of results. The paper took 48 hours to compile and the entire source code for NeuRaLaTeX is contained within the source code of the paper. We propose two new metrics: the Written In Latex (WIL) metric measures the proportion of a machine learning library that is written in pure LaTeX, while the Source Code Of Method in Source Code of Paper (SCOMISCOP) metric measures the proportion of a paper’s implementation that is contained within the paper source. We are state-of-the-art for both metrics, outperforming the ResNet and Transformer papers, as well as the PyTorch and Tensorflow libraries. Source code, documentation, videos, crypto scams and an invitation to invest in the commercialisation of NeuRaLaTeX are available at this https URL

[LG-10] Ride-Sourcing Vehicle Rebalancing with Service Accessibility Guarantees via Constrained Mean-Field Reinforcement Learning

Link: https://arxiv.org/abs/2503.24183
Authors: Matej Jusup, Kenan Zhang, Zhiyuan Hu, Barna Pásztor, Andreas Krause, Francesco Corman
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 30 pages, 12 figures

Abstract:The rapid expansion of ride-sourcing services such as Uber, Lyft, and Didi Chuxing has fundamentally reshaped urban transportation by offering flexible, on-demand mobility via mobile applications. Despite their convenience, these platforms confront significant operational challenges, particularly vehicle rebalancing - the strategic repositioning of thousands of vehicles to address spatiotemporal mismatches in supply and demand. Inadequate rebalancing results in prolonged rider waiting times, inefficient vehicle utilization, and inequitable distribution of services, leading to disparities in driver availability and income. To tackle these complexities, we introduce scalable continuous-state mean-field control (MFC) and reinforcement learning (MFRL) models that explicitly represent each vehicle’s precise location and employ continuous repositioning actions guided by the distribution of other vehicles. To ensure equitable service distribution, an accessibility constraint is integrated within our optimal control formulation, balancing operational efficiency with equitable access to the service across geographic regions. Our approach acknowledges realistic conditions, including inherent stochasticity in transitions, the simultaneous occurrence of vehicle-rider matching, vehicles’ rebalancing and cruising, and variability in rider behaviors. Crucially, we relax the traditional mean-field assumption of equal supply-demand volume, better reflecting practical scenarios. Extensive empirical evaluation using real-world data-driven simulation of Shenzhen demonstrates the real-time efficiency and robustness of our approach at the scale of tens of thousands of vehicles. The code is available at this https URL. 

[LG-11] LLM4FS: Leveraging Large Language Models for Feature Selection and How to Improve It

Link: https://arxiv.org/abs/2503.24157
Authors: Jianhao Li, Xianchao Xiu
Categories: Machine Learning (cs.LG)

Abstract:Recent advances in large language models (LLMs) have provided new opportunities for decision-making, particularly in the task of automated feature selection. In this paper, we first comprehensively evaluate LLM-based feature selection methods, covering the state-of-the-art DeepSeek-R1, GPT-o3-mini, and GPT-4.5. Then, we propose a novel hybrid strategy called LLM4FS that integrates LLMs with traditional data-driven methods. Specifically, the strategy inputs data samples into LLMs and directly calls traditional data-driven techniques such as random forest and forward sequential selection. Notably, our analysis reveals that the hybrid strategy leverages the contextual understanding of LLMs and the high statistical reliability of traditional data-driven methods to achieve excellent feature selection performance, even surpassing LLMs and traditional data-driven methods. Finally, we point out the limitations of its application in decision-making.

[LG-12] Reinforcement Learning for Safe Autonomous Two Device Navigation of Cerebral Vessels in Mechanical Thrombectomy

Link: https://arxiv.org/abs/2503.24140
Authors: Harry Robertshaw, Benjamin Jackson, Jiaheng Wang, Hadi Sadati, Lennart Karstensen, Alejandro Granados, Thomas C Booth
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Purpose: Autonomous systems in mechanical thrombectomy (MT) hold promise for reducing procedure times, minimizing radiation exposure, and enhancing patient safety. However, current reinforcement learning (RL) methods only reach the carotid arteries, are not generalizable to other patient vasculatures, and do not consider safety. We propose a safe dual-device RL algorithm that can navigate beyond the carotid arteries to cerebral vessels. Methods: We used the Simulation Open Framework Architecture to represent the intricacies of cerebral vessels, and a modified Soft Actor-Critic RL algorithm to learn, for the first time, the navigation of micro-catheters and micro-guidewires. We incorporate patient safety metrics into our reward function by integrating guidewire tip forces. Inverse RL is used with demonstrator data on 12 patient-specific vascular cases. Results: Our simulation demonstrates successful autonomous navigation within unseen cerebral vessels, achieving a 96% success rate, 7.0s procedure time, and 0.24 N mean forces, well below the proposed 1.5 N vessel rupture threshold. Conclusion: To the best of our knowledge, our proposed autonomous system for MT two-device navigation reaches cerebral vessels, considers safety, and is generalizable to unseen patient-specific cases for the first time. We envisage future work will extend the validation to vasculatures of different complexity and on in vitro models. While our contributions pave the way towards deploying agents in clinical settings, safety and trustworthiness will be crucial elements to consider when proposing new methodology. 
Journal reference: Int J CARS (2025). Related DOI: https://doi.org/10.1007/s11548-025-03339-8

[LG-13] CTSketch: Compositional Tensor Sketching for Scalable Neurosymbolic Learning

Link: https://arxiv.org/abs/2503.24123
Authors: Seewon Choi,Alaia Solko-Breslin,Rajeev Alur,Eric Wong
Categories: Machine Learning (cs.LG)
*Comments: 15 pages, 6 figures

Click to view abstract

Abstract:Many computational tasks benefit from being formulated as the composition of neural networks followed by a discrete symbolic program. The goal of neurosymbolic learning is to train the neural networks using only end-to-end input-output labels of the composite. We introduce CTSketch, a novel, scalable neurosymbolic learning algorithm. CTSketch uses two techniques to improve the scalability of neurosymbolic inference: decompose the symbolic program into sub-programs and summarize each sub-program with a sketched tensor. This strategy allows us to approximate the output distribution of the program with simple tensor operations over the input distributions and summaries. We provide theoretical insight into the maximum error of the approximation. Furthermore, we evaluate CTSketch on many benchmarks from the neurosymbolic literature, including some designed for evaluating scalability. Our results show that CTSketch pushes neurosymbolic learning to new scales that have previously been unattainable by obtaining high accuracy on tasks involving over one thousand inputs.

[LG-14] Level the Level: Balancing Game Levels for Asymmetric Player Archetypes With Reinforcement Learning

Link: https://arxiv.org/abs/2503.24099
Authors: Florian Rupp,Kai Eckert
Categories: Machine Learning (cs.LG)
*Comments: Accepted at the ACM International Conference on the Foundations of Digital Games (FDG) 2025

Click to view abstract

Abstract:Balancing games, especially those with asymmetric multiplayer content, requires significant manual effort and extensive human playtesting during development. For this reason, this work focuses on generating balanced levels tailored to asymmetric player archetypes, where the disparity in abilities is balanced entirely through the level design. For instance, while one archetype may have an advantage over another, both should have an equal chance of winning. We therefore conceptualize game balancing as a procedural content generation problem and build on and extend a recently introduced method that uses reinforcement learning to balance tile-based game levels. We evaluate the method on four different player archetypes and demonstrate its ability to balance a larger proportion of levels compared to two baseline approaches. Furthermore, our results indicate that as the disparity between player archetypes increases, the required number of training steps grows, while the model’s accuracy in achieving balance decreases.

[LG-15] HACTS: a Human-As-Copilot Teleoperation System for Robot Learning

Link: https://arxiv.org/abs/2503.24070
Authors: Zhiyuan Xu,Yinuo Zhao,Kun Wu,Ning Liu,Junjie Ji,Zhengping Che,Chi Harold Liu,Jian Tang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Teleoperation is essential for autonomous robot learning, especially in manipulation tasks that require human demonstrations or corrections. However, most existing systems only offer unilateral robot control and lack the ability to synchronize the robot’s status with the teleoperation hardware, preventing real-time, flexible intervention. In this work, we introduce HACTS (Human-As-Copilot Teleoperation System), a novel system that establishes bilateral, real-time joint synchronization between a robot arm and teleoperation hardware. This simple yet effective feedback mechanism, akin to a steering wheel in autonomous vehicles, enables the human copilot to intervene seamlessly while collecting action-correction data for future learning. Implemented using 3D-printed components and low-cost, off-the-shelf motors, HACTS is both accessible and scalable. Our experiments show that HACTS significantly enhances performance in imitation learning (IL) and reinforcement learning (RL) tasks, boosting IL recovery capabilities and data efficiency, and facilitating human-in-the-loop RL. HACTS paves the way for more effective and interactive human-robot collaboration and data-collection, advancing the capabilities of robot manipulation.

[LG-16] TransMamba: Flexibly Switching between Transformer and Mamba

Link: https://arxiv.org/abs/2503.24067
Authors: Yixing Li,Ruobing Xie,Zhen Yang,Xingwu Sun,Shuaipeng Li,Weidong Han,Zhanhui Kang,Yu Cheng,Chengzhong Xu,Di Wang,Jie Jiang
Categories: Machine Learning (cs.LG)
*Comments: Preprint. Under review

Click to view abstract

Abstract:Transformers are the cornerstone of modern large language models, but their quadratic computational complexity limits efficiency in long-sequence processing. Recent advancements in Mamba, a state space model (SSM) with linear complexity, offer promising efficiency gains but suffer from unstable contextual learning and multitask generalization. This paper proposes TransMamba, a novel framework that unifies Transformer and Mamba through shared parameter matrices (e.g., QKV and CBx), and thus could dynamically switch between attention and SSM mechanisms at different token lengths and layers. We design the Memory converter to bridge Transformer and Mamba by converting attention outputs into SSM-compatible states, ensuring seamless information flow at TransPoints where the transformation happens. The TransPoint scheduling is also thoroughly explored for further improvements. We conducted extensive experiments demonstrating that TransMamba achieves superior training efficiency and performance compared to baselines, and validated the deeper consistency between Transformer and Mamba paradigms, offering a scalable solution for next-generation sequence modeling.

[LG-17] Accelerated Airfoil Design Using Neural Network Approaches

Link: https://arxiv.org/abs/2503.24052
Authors: Anantram Patel,Nikhil Mogre,Mandar Mane,Jayavardhan Reddy Enumula,Vijay Kumar Sutrakar
Categories: Machine Learning (cs.LG); Mathematical Physics (math-ph); Applied Physics (physics.app-ph); Fluid Dynamics (physics.flu-dyn); Space Physics (physics.space-ph)
*Comments:

Click to view abstract

Abstract:In this paper, prediction of airfoil shape from targeted pressure distribution (suction and pressure sides) and vice versa is demonstrated using both Convolutional Neural Network (CNN) and Deep Neural Network (DNN) techniques. The dataset is generated for 1600 airfoil shapes, with simulations carried out at Reynolds numbers (Re) ranging from 10,000 to 9,000,000 and angles of attack (AoA) ranging from 0 to 15 degrees, ensuring that the dataset captures diverse aerodynamic conditions. Five different CNN and DNN models are developed depending on the input/output parameters. Results demonstrate that the refined models exhibit improved efficiency, with the DNN model achieving a multi-fold reduction in training time compared to the CNN model for complex datasets consisting of varying airfoil, Re, and AoA. The predicted airfoil shapes and pressure distributions closely match the targeted values, validating the effectiveness of deep learning frameworks. However, the performance of the CNN models is found to be better than that of the DNN models. Lastly, a flying wing aircraft model with a wingspan of 10 m is considered for the prediction of pressure distribution along the chordwise direction. The proposed CNN and DNN models show promising results. This research underscores the potential of deep learning models in accelerating aerodynamic optimization and advancing the design of high-performance airfoils.

[LG-18] Frequency-Aware Attention-LSTM for PM$_{2.5}$ Time Series Forecasting

Link: https://arxiv.org/abs/2503.24043
Authors: Jiahui LU,Shuang Wu,Zhenkai Qin,Dongze Wu,Guifang Yang
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:To enhance the accuracy and robustness of PM$_{2.5}$ concentration forecasting, this paper introduces FALNet, a Frequency-Aware LSTM Network that integrates frequency-domain decomposition, temporal modeling, and attention-based refinement. The model first applies STL and FFT to extract trend, seasonal, and denoised residual components, effectively filtering out high-frequency noise. The filtered residuals are then fed into a stacked LSTM to capture long-term dependencies, followed by a multi-head attention mechanism that dynamically focuses on key time steps. Experiments conducted on real-world urban air quality datasets demonstrate that FALNet consistently outperforms conventional models across standard metrics such as MAE, RMSE, and $R^2$. The model shows strong adaptability in capturing sharp fluctuations during pollution peaks and non-stationary conditions. These results validate the effectiveness and generalizability of FALNet for real-time air pollution prediction, environmental risk assessment, and decision-making support.
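
The FFT-based denoising step mentioned in the abstract can be illustrated with a minimal sketch. This is not FALNet's actual pipeline: the function name `fft_lowpass`, the `keep_fraction` cutoff, and the synthetic signal are all assumptions made for illustration, and the STL step is omitted.

```python
import numpy as np

def fft_lowpass(signal, keep_fraction=0.1):
    """Zero out the highest-frequency FFT coefficients of a 1-D signal.

    keep_fraction is a hypothetical knob: the share of low-frequency
    coefficients that is kept.
    """
    spectrum = np.fft.rfft(signal)
    cutoff = max(1, int(len(spectrum) * keep_fraction))
    spectrum[cutoff:] = 0.0  # discard high-frequency components
    return np.fft.irfft(spectrum, n=len(signal))

# Synthetic example: a slow sine wave corrupted by white noise.
rng = np.random.default_rng(0)
t = np.arange(512)
clean = np.sin(2 * np.pi * t / 128)
noisy = clean + 0.5 * rng.standard_normal(t.size)

denoised = fft_lowpass(noisy, keep_fraction=0.05)
err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(f"MSE before: {err_noisy:.4f}, after: {err_denoised:.4f}")
```

In a forecasting setting the filtered series would then be passed to the sequence model; the right cutoff depends on the data and has to be tuned.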

[LG-19] Tree-Guided $L_1$-Convex Clustering

Link: https://arxiv.org/abs/2503.24012
Authors: Bingyuan Zhang,Yoshikazu Terada
Categories: Machine Learning (cs.LG); Computation (stat.CO)
*Comments:

Click to view abstract

Abstract:Convex clustering is a modern clustering framework that guarantees globally optimal solutions and performs comparably to other advanced clustering methods. However, obtaining a complete dendrogram (clusterpath) for large-scale datasets remains computationally challenging due to the extensive costs associated with iterative optimization approaches. To address this limitation, we develop a novel convex clustering algorithm called Tree-Guided $L_1$-Convex Clustering (TGCC). We first focus on the fact that the loss function of $L_1$-convex clustering with tree-structured weights can be efficiently optimized using a dynamic programming approach. We then develop an efficient cluster fusion algorithm that utilizes the tree structure of the weights to accelerate the optimization process and eliminate the issue of cluster splits commonly observed in convex clustering. By combining the dynamic programming approach with the cluster fusion algorithm, the TGCC algorithm achieves superior computational efficiency without sacrificing clustering performance. Remarkably, our TGCC algorithm can construct a complete clusterpath for $10^6$ points in $\mathbb{R}^2$ within 15 seconds on a standard laptop without the need for parallel or distributed computing frameworks. Moreover, we extend the TGCC algorithm to develop biclustering and sparse convex clustering algorithms.

[LG-20] Federated Structured Sparse PCA for Anomaly Detection in IoT Networks

Link: https://arxiv.org/abs/2503.23981
Authors: Chenyi Huang,Xinrong Li,Xianchao Xiu
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Click to view abstract

Abstract:Although federated learning has gained prominence as a privacy-preserving framework tailored for distributed Internet of Things (IoT) environments, current federated principal component analysis (PCA) methods lack integration of sparsity, a critical feature for robust anomaly detection. To address this limitation, we propose a novel federated structured sparse PCA (FedSSP) approach for anomaly detection in IoT networks. The proposed model uniquely integrates double sparsity regularization: (1) row-wise sparsity governed by the $\ell_{2,p}$-norm with $p\in[0,1)$ to eliminate redundant feature dimensions, and (2) element-wise sparsity via the $\ell_q$-norm with $q\in[0,1)$ to suppress noise-sensitive components. To efficiently solve this non-convex optimization problem in a distributed setting, we devise a proximal alternating minimization (PAM) algorithm with rigorous theoretical proofs establishing its convergence guarantees. Experiments on real datasets validate that incorporating structured sparsity enhances both model interpretability and detection accuracy.
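
To make the two sparsity penalties concrete, here is a small sketch that evaluates them on a toy matrix. It assumes the common convention of raising the norms to the powers $p$ and $q$ (that is, $\sum_i \|w_i\|_2^p$ over rows and $\sum_{ij} |w_{ij}|^q$ over entries), with the value 0 interpreted as a count of nonzeros. The abstract does not state the exact form used in FedSSP, and the function names are hypothetical.

```python
import numpy as np

def l2p_penalty(W, p):
    """Row-wise penalty: sum of (Euclidean norm of each row) ** p.

    For p == 0 this counts the nonzero rows (assumed convention).
    """
    row_norms = np.linalg.norm(W, axis=1)
    if p == 0:
        return float(np.count_nonzero(row_norms))
    return float(np.sum(row_norms ** p))

def lq_penalty(W, q):
    """Element-wise penalty: sum of |w_ij| ** q.

    For q == 0 this counts the nonzero entries (assumed convention).
    """
    abs_vals = np.abs(W)
    if q == 0:
        return float(np.count_nonzero(abs_vals))
    return float(np.sum(abs_vals ** q))

# Toy matrix with one all-zero row (a "redundant feature dimension").
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 1.0]])

print(l2p_penalty(W, 0.5))  # row norms are 5, 0, 1
print(l2p_penalty(W, 0))    # number of nonzero rows
print(lq_penalty(W, 0.5))
print(lq_penalty(W, 0))     # number of nonzero entries
```

The row-wise term only cares whether a whole row is active, while the element-wise term penalizes each entry separately, which is why the two are combined.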

[LG-21] Machine Learning-assisted High-speed Combinatorial Optimization with Ising Machines for Dynamically Changing Problems

Link: https://arxiv.org/abs/2503.23966
Authors: Yohei Hamakawa,Tomoya Kashimata,Masaya Yamasaki,Kosuke Tatsumura
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Quantum or quantum-inspired Ising machines have recently shown promise in solving combinatorial optimization problems in a short time. Real-world applications, such as time division multiple access (TDMA) scheduling for wireless multi-hop networks and financial trading, require solving those problems sequentially while their size and characteristics change dynamically. However, using Ising machines poses two challenges: shortening the system-wide latency caused by transferring a large Ising model or by cloud access, and determining the parameters for each problem. Here we show a combinatorial optimization method using embedded Ising machines, which enables solving diverse problems at high speed without runtime parameter tuning. We customize the algorithm and circuit architecture of the simulated bifurcation-based Ising machine to compress the Ising model and accelerate computation, and then build a machine learning model to estimate appropriate parameters using extensive training data. In TDMA scheduling for wireless multi-hop networks, our demonstration shows that the system can adapt to changes in the problem and has a speed advantage over conventional methods.

[LG-22] Certified Approximate Reachability (CARe): Formal Error Bounds on Deep Learning of Reachable Sets

Link: https://arxiv.org/abs/2503.23912
Authors: Prashant Solanki,Nikolaus Vertovec,Yannik Schnitzer,Jasper Van Beers,Coen de Visser,Alessandro Abate
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Click to view abstract

Abstract:Recent approaches to leveraging deep learning for computing reachable sets of continuous-time dynamical systems have gained popularity over traditional level-set methods, as they overcome the curse of dimensionality. However, as with level-set methods, considerable care needs to be taken in limiting approximation errors, particularly since no guarantees are provided during training on the accuracy of the learned reachable set. To address this limitation, we introduce an epsilon-approximate Hamilton-Jacobi Partial Differential Equation (HJ-PDE), which establishes a relationship between training loss and accuracy of the true reachable set. To formally certify this approximation, we leverage Satisfiability Modulo Theories (SMT) solvers to bound the residual error of the HJ-based loss function across the domain of interest. Leveraging Counter Example Guided Inductive Synthesis (CEGIS), we close the loop around learning and verification, by fine-tuning the neural network on counterexamples found by the SMT solver, thus improving the accuracy of the learned reachable set. To the best of our knowledge, Certified Approximate Reachability (CARe) is the first approach to provide soundness guarantees on learned reachable sets of continuous dynamical systems.

[LG-23] An End-to-End Comprehensive Gear Fault Diagnosis Method Based on Multi-Scale Feature-Level Fusion Strategy

Link: https://arxiv.org/abs/2503.23887
Authors: Bowei Qiao,Hongwei Wang
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:To satisfy the requirements of end-to-end fault diagnosis of gears, this paper proposes an integrated intelligent fault diagnosis method for gears using acceleration signals, based on the Gabor-based Adaptive Short-Time Fourier Transform (Gabor-ASTFT) and Dual-Tree Complex Wavelet Transform (DTCWT) algorithms, a dilated residual structure, and a feature fusion layer. Initially, the raw one-dimensional acceleration signals collected from the gearbox base using vibration sensors undergo pre-segmentation processing. The Gabor-ASTFT and DTCWT are then applied to convert the original one-dimensional time-domain signals into two-dimensional time-frequency representations, facilitating the preliminary extraction of fault features and obtaining weak feature maps. Next, a dual-channel structure is established using deconvolution and dilated convolution to perform upsampling and downsampling on the feature maps, adjusting their sizes accordingly. A feature fusion layer is then constructed to integrate the dual-channel features, enabling multi-scale analysis of the extracted fault features. Finally, a convolutional neural network (CNN) model incorporating a residual structure is developed to conduct deep feature extraction from the fused feature maps. The extracted features are subsequently fed into a Global Average Pooling (GAP) layer and a classification function for fault classification. Comparative experiments on different datasets demonstrate that the proposed method effectively meets the requirements of end-to-end fault diagnosis for gears.

[LG-24] Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation

Link: https://arxiv.org/abs/2503.23869
Authors: Yongle Li,Bo Liu,Sheng Huang,ZHeng ZHang,Xiaotong Yuan,Richang Hong
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In federated learning, fine-tuning pre-trained foundation models poses significant challenges, particularly regarding high communication cost and suboptimal model performance due to data heterogeneity between the clients. To address these issues, this paper introduces communication-efficient federated LoRA adaptation (CE-LoRA), a method that employs a tri-factorization low-rank adaptation approach with personalized model parameter aggregation. We first present a novel LoRA parameter factorization that introduces a small dense matrix, which significantly reduces the communication cost while achieving empirical performance comparable to transferring the low-rank parameter matrices used by existing methods. Without violating data privacy, the server considers client similarity in both the training dataset and the model parameter space, and learns personalized weights for model aggregation. Our experiments on various LLM and VLM fine-tuning tasks demonstrate that CE-LoRA not only significantly reduces communication overhead but also improves performance under non-IID (not independently and identically distributed) data conditions. In addition, CE-LoRA improves data privacy protection, effectively mitigating gradient-based data reconstruction attacks.

[LG-25] A Channel-Triggered Backdoor Attack on Wireless Semantic Image Reconstruction

Link: https://arxiv.org/abs/2503.23866
Authors: Jialin Wan,Nan Cheng,Jinglong Shen
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Despite the transformative impact of deep learning (DL) on wireless communication systems through data-driven end-to-end (E2E) learning, the security vulnerabilities of these systems have been largely overlooked. Unlike the extensively studied image domain, limited research has explored the threat of backdoor attacks on the reconstruction of symbols in semantic communication (SemCom) systems. Previous work has investigated such backdoor attacks at the input level, but these approaches are infeasible in applications with strict input control. In this paper, we propose a novel attack paradigm, termed Channel-Triggered Backdoor Attack (CT-BA), where the backdoor trigger is a specific wireless channel. This attack leverages fundamental physical layer characteristics, making it more covert and potentially more threatening compared to previous input-level attacks. Specifically, we utilize channel gain with different fading distributions or channel noise with different power spectral densities as potential triggers. This approach establishes unprecedented attack flexibility as the adversary can select backdoor triggers from both fading characteristics and noise variations in diverse channel environments. Moreover, during the testing phase, CT-BA enables automatic trigger activation through natural channel variations without requiring active adversary participation. We evaluate the robustness of CT-BA on a ViT-based Joint Source-Channel Coding (JSCC) model across three datasets: MNIST, CIFAR-10, and ImageNet. Furthermore, we apply CT-BA to three typical E2E SemCom systems: BDJSCC, ADJSCC, and JSCCOFDM. Experimental results demonstrate that our attack achieves near-perfect attack success rate (ASR) while maintaining effective stealth. Finally, we discuss potential defense mechanisms against such attacks.

[LG-26] An extrapolated and provably convergent algorithm for nonlinear matrix decomposition with the ReLU function

Link: https://arxiv.org/abs/2503.23832
Authors: Nicolas Gillis,Margherita Porcelli,Giovanni Seraghiti
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Optimization and Control (math.OC); Machine Learning (stat.ML)
*Comments: 27 pages. Codes and data available from this https URL

Click to view abstract

Abstract:Nonlinear matrix decomposition (NMD) with the ReLU function, denoted ReLU-NMD, is the following problem: given a sparse, nonnegative matrix $X$ and a factorization rank $r$, identify a rank-$r$ matrix $\Theta$ such that $X \approx \max(0,\Theta)$. This decomposition finds application in data compression, matrix completion with entries missing not at random, and manifold learning. The standard ReLU-NMD model minimizes the least squares error, that is, $\|X - \max(0,\Theta)\|_F^2$. The corresponding optimization problem is nondifferentiable and highly nonconvex. This motivated Saul to propose an alternative model, Latent-ReLU-NMD, where a latent variable $Z$ is introduced and satisfies $\max(0,Z)=X$ while minimizing $\|Z - \Theta\|_F^2$ ("A nonlinear matrix decomposition for mining the zeros of sparse data", SIAM J. Math. Data Sci., 2022). Our first contribution is to show that the two formulations may yield different low-rank solutions $\Theta$; in particular, we show that Latent-ReLU-NMD can be ill-posed when ReLU-NMD is not, meaning that there are instances in which the infimum of Latent-ReLU-NMD is not attained while that of ReLU-NMD is. We also consider another alternative model, called 3B-ReLU-NMD, which parameterizes $\Theta=WH$, where $W$ has $r$ columns and $H$ has $r$ rows, allowing one to get rid of the rank constraint in Latent-ReLU-NMD. Our second contribution is to prove the convergence of a block coordinate descent (BCD) applied to 3B-ReLU-NMD and referred to as BCD-NMD. Our third contribution is a novel extrapolated variant of BCD-NMD, dubbed eBCD-NMD, which we prove is also convergent under mild assumptions. We illustrate the significant acceleration effect of eBCD-NMD compared to BCD-NMD, and also show that eBCD-NMD performs well against the state of the art on synthetic and real-world data sets.
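
The two objectives in the abstract can be made concrete with a short sketch. The code below is a plain alternating minimization for the Latent-ReLU-NMD model (update $Z$ in closed form, then project onto rank-$r$ matrices with a truncated SVD). It is an illustrative baseline under stated assumptions, not the paper's BCD-NMD or eBCD-NMD, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def relu_nmd_error(X, Theta):
    """ReLU-NMD objective: squared Frobenius norm of X - max(0, Theta)."""
    return float(np.linalg.norm(X - np.maximum(0.0, Theta), "fro") ** 2)

def truncated_svd(Z, r):
    """Best rank-r approximation of Z in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def latent_relu_nmd(X, r, n_iter=50):
    """Alternate between the latent variable Z and the rank-r matrix Theta."""
    mask = X > 0
    Theta = truncated_svd(X, r)
    latent_errors = []
    for _ in range(n_iter):
        # Z-step: max(0, Z) = X forces Z = X where X > 0; where X == 0 any
        # nonpositive value is feasible, and min(0, Theta) is closest to Theta.
        Z = np.where(mask, X, np.minimum(0.0, Theta))
        latent_errors.append(float(np.linalg.norm(Z - Theta, "fro") ** 2))
        # Theta-step: closest rank-r matrix to Z.
        Theta = truncated_svd(Z, r)
    return Theta, latent_errors

# Synthetic sparse nonnegative data generated as X = max(0, W H).
rng = np.random.default_rng(0)
r = 3
X = np.maximum(0.0, rng.standard_normal((20, r)) @ rng.standard_normal((r, 15)))

Theta, latent_errors = latent_relu_nmd(X, r)
print("latent objective:", latent_errors[0], "->", latent_errors[-1])
print("ReLU-NMD error of final Theta:", relu_nmd_error(X, Theta))
```

Because each step exactly minimizes the latent objective over one block, the recorded latent errors never increase; the ReLU-NMD error of the returned $\Theta$ is bounded above by the latent objective at that $\Theta$.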

[LG-27] Node Embeddings via Neighbor Embeddings

Link: https://arxiv.org/abs/2503.23822
Authors: Jan Niklas Böhm,Marius Keute,Alica Guzmán,Sebastian Damrich,Andrew Draganov,Dmitry Kobak
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Graph layouts and node embeddings are two distinct paradigms for non-parametric graph representation learning. In the former, nodes are embedded into 2D space for visualization purposes. In the latter, nodes are embedded into a high-dimensional vector space for downstream processing. State-of-the-art algorithms for these two paradigms, force-directed layouts and random-walk-based contrastive learning (such as DeepWalk and node2vec), have little in common. In this work, we show that both paradigms can be approached with a single coherent framework based on established neighbor embedding methods. Specifically, we introduce graph t-SNE, a neighbor embedding method for two-dimensional graph layouts, and graph CNE, a contrastive neighbor embedding method that produces high-dimensional node representations by optimizing the InfoNCE objective. We show that both graph t-SNE and graph CNE strongly outperform state-of-the-art algorithms in terms of local structure preservation, while being conceptually simpler.

[LG-28] Free Parametrization of L2-bounded State Space Models

Link: https://arxiv.org/abs/2503.23818
Authors: Leonardo Massai,Giancarlo Ferrari-Trecate
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
*Comments: 8 pages

Click to view abstract

Abstract:Structured state-space models (SSMs) have emerged as a powerful architecture in machine learning and control, featuring stacked layers where each consists of a linear time-invariant (LTI) discrete-time system followed by a nonlinearity. While SSMs offer computational efficiency and excel in long-sequence predictions, their widespread adoption in applications like system identification and optimal control is hindered by the challenge of ensuring their stability and robustness properties. We introduce L2RU, a novel parametrization of SSMs that guarantees input-output stability and robustness by enforcing a prescribed L2-bound for all parameter values. This design eliminates the need for complex constraints, allowing unconstrained optimization over L2RUs by using standard methods such as gradient descent. Leveraging tools from system theory and convex optimization, we derive a non-conservative parametrization of square discrete-time LTI systems with a specified L2-bound, forming the foundation of the L2RU architecture. Additionally, we enhance its performance with a bespoke initialization strategy optimized for long input sequences. Through a system identification task, we validate L2RU’s superior performance, showcasing its potential in learning and control applications.

[LG-29] An extension of linear self-attention for in-context learning

Link: https://arxiv.org/abs/2503.23814
Authors: Katsuyuki Hagiwara
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In-context learning is a remarkable property of transformers and has been the focus of recent research. An attention mechanism is a key component in transformers, in which an attention matrix encodes relationships between words in a sentence and is used as weights for words in a sentence. This mechanism is effective for capturing language representations. However, it is questionable whether naive self-attention is suitable for in-context learning in general tasks, since the computation implemented by self-attention is somewhat restrictive in terms of matrix multiplication. In fact, we may need appropriate input form designs when considering heuristic implementations of computational algorithms. In this paper, in the case of linear self-attention, we extend it by introducing a bias matrix in addition to a weight matrix for an input. Despite the simplicity of the extension, the extended linear self-attention can output any constant matrix, the input matrix, and products of two or three matrices in the input. Note that the second property implies that it can act as a skip connection. Therefore, flexible matrix manipulations can be implemented by connecting extended linear self-attention components. As an example of implementation using the extended linear self-attention, we show a heuristic construction of a batch-type gradient descent of ridge regression under a reasonable input form.
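
The following sketch shows why adding bias matrices makes such outputs reachable. The abstract does not give the exact parametrization, so the form used here, $(W_V Z + B_V)(W_K Z + B_K)^\top (W_Q Z + B_Q)$ with square matrices, is purely an assumption for illustration, as are all names in the code.

```python
import numpy as np

def extended_lsa(Z, Wv, Bv, Wk, Bk, Wq, Bq):
    """Assumed form of linear self-attention with bias matrices (softmax-free).

    Each of value, key, and query is an affine map of the input Z.
    """
    V = Wv @ Z + Bv
    K = Wk @ Z + Bk
    Q = Wq @ Z + Bq
    return V @ K.T @ Q

n = 4
rng = np.random.default_rng(0)
Z = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))
I, O = np.eye(n), np.zeros((n, n))

# Constant output: all weights are zero, the biases carry the constant.
out_const = extended_lsa(Z, O, C, O, I, O, I)
# Identity output (acts like a skip connection): only the value path sees Z.
out_identity = extended_lsa(Z, I, O, O, I, O, I)
# Product of two matrices: value and query paths see Z, key path is a bias.
out_two = extended_lsa(Z, I, O, O, I, I, O)
# Product of three matrices: ordinary linear self-attention without biases.
out_three = extended_lsa(Z, I, O, I, O, I, O)

print(np.allclose(out_const, C), np.allclose(out_identity, Z))
```

Without the bias terms, every output of this form is cubic in $Z$, so neither a constant nor $Z$ itself could be produced; the biases let individual factors be replaced by fixed matrices.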

[LG-30] Accelerating High-Efficiency Organic Photovoltaic Discovery via Pretrained Graph Neural Networks and Generative Reinforcement Learning ICLR2025

Link: https://arxiv.org/abs/2503.23766
Authors: Jiangjie Qiu,Hou Hei Lam,Xiuyuan Hu,Wentao Li,Siwei Fu,Fankun Zeng,Hao Zhang,Xiaonan Wang
Categories: Machine Learning (cs.LG)
*Comments: AI for Accelerated Materials Design - ICLR 2025

Click to view abstract

Abstract:Organic photovoltaic (OPV) materials offer a promising avenue toward cost-effective solar energy utilization. However, optimizing donor-acceptor (D-A) combinations to achieve high power conversion efficiency (PCE) remains a significant challenge. In this work, we propose a framework that integrates large-scale pretraining of graph neural networks (GNNs) with a GPT-2 (Generative Pretrained Transformer 2)-based reinforcement learning (RL) strategy to design OPV molecules with potentially high PCE. This approach produces candidate molecules with predicted efficiencies approaching 21%, although further experimental validation is required. Moreover, we conducted a preliminary fragment-level analysis to identify structural motifs recognized by the RL model that may contribute to enhanced PCE, thus providing design guidelines for the broader research community. To facilitate continued discovery, we are building the largest open-source OPV dataset to date, expected to include nearly 3,000 donor-acceptor pairs. Finally, we discuss plans to collaborate with experimental teams on synthesizing and characterizing AI-designed molecules, which will provide new data to refine and improve our predictive and generative models.

[LG-31] Time-Series Forecasting via Topological Information Supervised Framework with Efficient Topological Feature Learning

Link: https://arxiv.org/abs/2503.23757
Authors: ZiXin Lin,Nur Fariha Syaqina Zulkepli
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Topological Data Analysis (TDA) has emerged as a powerful tool for extracting meaningful features from complex data structures, driving significant advancements in fields such as neuroscience, biology, machine learning, and financial modeling. Despite its success, the integration of TDA with time-series prediction remains underexplored due to three primary challenges: the limited utilization of temporal dependencies within topological features, computational bottlenecks associated with persistent homology, and the deterministic nature of TDA pipelines restricting generalized feature learning. This study addresses these challenges by proposing the Topological Information Supervised (TIS) Prediction framework, which leverages neural networks and Conditional Generative Adversarial Networks (CGANs) to generate synthetic topological features, preserving their distribution while significantly reducing computational time. We propose a novel training strategy that integrates topological consistency loss to improve the predictive accuracy of deep learning models. Specifically, we introduce two state-of-the-art models, TIS-BiGRU and TIS-Informer, designed to capture short-term and long-term temporal dependencies, respectively. Comparative experimental results demonstrate the superior performance of TIS models over conventional predictors, validating the effectiveness of integrating topological information. This work not only advances TDA-based time-series prediction but also opens new avenues for utilizing topological features in deep learning architectures.

[LG-32] THEMIS: Towards Practical Intellectual Property Protection for Post-Deployment On-Device Deep Learning Models USENIX-SECURITY

Link: https://arxiv.org/abs/2503.23748
Authors: Yujin Huang,Zhi Zhang,Qingchuan Zhao,Xingliang Yuan,Chunyang Chen
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments: To Appear in the 34th USENIX Security Symposium, August 13-15, 2025

Click to view abstract

Abstract:On-device deep learning (DL) has rapidly gained adoption in mobile apps, offering the benefits of offline model inference and user privacy preservation over cloud-based approaches. However, it inevitably stores models on user devices, introducing new vulnerabilities, particularly model-stealing attacks and intellectual property infringement. While system-level protections like Trusted Execution Environments (TEEs) provide a robust solution, practical challenges remain in achieving scalable on-device DL model protection, including complexities in supporting third-party models and limited adoption in current mobile solutions. Advancements in TEE-enabled hardware, such as NVIDIA’s GPU-based TEEs, may address these obstacles in the future. Currently, watermarking serves as a common defense against model theft but also faces challenges here as many mobile app developers lack corresponding machine learning expertise and the inherent read-only and inference-only nature of on-device DL models prevents third parties like app stores from implementing existing watermarking techniques in post-deployment models. To protect the intellectual property of on-device DL models, in this paper, we propose THEMIS, an automatic tool that lifts the read-only restriction of on-device DL models by reconstructing their writable counterparts and leverages the untrainable nature of on-device DL models to solve watermark parameters and protect the model owner’s intellectual property. Extensive experimental results across various datasets and model structures show the superiority of THEMIS in terms of different metrics. Further, an empirical investigation of 403 real-world DL mobile apps from Google Play is performed with a success rate of 81.14%, showing the practicality of THEMIS. 

[LG-33] Integral regularization PINNs for evolution equations

Link: https://arxiv.org/abs/2503.23729
Authors: Xiaodong Feng, Haojiong Shangguan, Tao Tang, Xiaoliang Wan
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:Evolution equations, including both ordinary differential equations (ODEs) and partial differential equations (PDEs), play a pivotal role in modeling dynamic systems. However, achieving accurate long-time integration for these equations remains a significant challenge. While physics-informed neural networks (PINNs) provide a mesh-free framework for solving PDEs, they often suffer from temporal error accumulation, which limits their effectiveness in capturing long-time behaviors. To alleviate this issue, we propose integral regularization PINNs (IR-PINNs), a novel approach that enhances temporal accuracy by incorporating an integral-based residual term into the loss function. This method divides the entire time interval into smaller sub-intervals and enforces constraints over these sub-intervals, thereby improving the resolution and correlation of temporal dynamics. Furthermore, IR-PINNs leverage adaptive sampling to dynamically refine the distribution of collocation points based on the evolving solution, ensuring higher accuracy in regions with sharp gradients or rapid variations. Numerical experiments on benchmark problems demonstrate that IR-PINNs outperform original PINNs and other state-of-the-art methods in capturing long-time behaviors, offering a robust and accurate solution for evolution equations.

[LG-34] PDSL: Privacy-Preserved Decentralized Stochastic Learning with Heterogeneous Data Distribution

Link: https://arxiv.org/abs/2503.23726
Authors: Lina Wang, Yunsheng Yuan, Chunxiao Wang, Feng Li
Categories: Machine Learning (cs.LG)

Abstract:In the paradigm of decentralized learning, a group of agents collaborates to learn a global model using distributed datasets without a central server. However, due to the heterogeneity of the local data across the different agents, learning a robust global model is rather challenging. Moreover, the collaboration of the agents relies on their gradient information exchange, which poses a risk of privacy leakage. In this paper, to address these issues, we propose PDSL, a novel privacy-preserved decentralized stochastic learning algorithm with heterogeneous data distribution. On one hand, we innovate in utilizing the notion of Shapley values such that each agent can precisely measure the contributions of its heterogeneous neighbors to the global learning goal; on the other hand, we leverage the notion of differential privacy to prevent each agent from suffering privacy leakage when it contributes gradient information to its neighbors. We conduct both solid theoretical analysis and extensive experiments to demonstrate the efficacy of our PDSL algorithm in terms of privacy preservation and convergence.
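
As an illustration of the differential-privacy step mentioned in the abstract, the sketch below applies the standard Gaussian mechanism to a gradient before it is shared with neighbors. It is a generic example, not the PDSL algorithm itself: the function names, the clipping bound `max_norm`, and `noise_multiplier` are assumptions introduced here.

```python
import math
import random


def clip_l2(grad, max_norm):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def privatize(grad, max_norm, noise_multiplier, rng):
    """Clip the gradient, then add Gaussian noise with
    standard deviation noise_multiplier * max_norm to each coordinate."""
    clipped = clip_l2(grad, max_norm)
    sigma = noise_multiplier * max_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]


# Hypothetical usage: an agent perturbs its gradient before sending it out.
rng = random.Random(0)
shared = privatize([3.0, 4.0], max_norm=1.0, noise_multiplier=0.5, rng=rng)
```

Clipping bounds each agent's sensitivity, which is what lets the noise scale be tied to a privacy budget; how that budget is accounted for is left out of this sketch.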

[LG-35] Steering Large Agent Populations using Mean-Field Schrodinger Bridges with Gaussian Mixture Models

Link: https://arxiv.org/abs/2503.23705
Authors: George Rapakoulias, Ali Reza Pedram, Panagiotis Tsiotras
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The Mean-Field Schrodinger Bridge (MFSB) problem is an optimization problem aiming to find the minimum effort control policy to drive a McKean-Vlasov stochastic differential equation from one probability measure to another. In the context of multiagent control, the objective is to control the configuration of a swarm of identical, interacting cooperative agents, as captured by the time-varying probability measure of their state. Available methods for solving this problem for distributions with continuous support rely either on spatial discretizations of the problem’s domain or on approximating optimal solutions using neural networks trained through stochastic optimization schemes. For agents following Linear Time-Varying dynamics, and for Gaussian Mixture Model boundary distributions, we propose a highly efficient parameterization to approximate the solutions of the corresponding MFSB in closed form, without any learning steps. Our proposed approach consists of a mixture of elementary policies, each solving a Gaussian-to-Gaussian Covariance Steering problem from the components of the initial to the components of the terminal mixture. Leveraging the semidefinite formulation of the Covariance Steering problem, our proposed solver can handle probabilistic hard constraints on the system’s state, while maintaining numerical tractability. We illustrate our approach on a variety of numerical examples.

[LG-36] A Low-complexity Structured Neural Network to Realize States of Dynamical Systems

Link: https://arxiv.org/abs/2503.23697
Authors: Hansaka Aluvihare, Levi Lingsch, Xianqi Li, Sirani M. Perera
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 20 pages, 6 figures

Abstract:Data-driven learning is rapidly evolving and offers a new perspective on realizing state-space dynamical systems. However, dynamical systems derived from nonlinear ordinary differential equations (ODEs) suffer from limitations in computational efficiency. Thus, this paper stems from data-driven learning to advance states of dynamical systems utilizing a structured neural network (StNN). The proposed learning technique also seeks to identify an optimal, low-complexity operator to solve dynamical systems, the so-called Hankel operator, derived from time-delay measurements. Thus, we utilize the StNN based on the Hankel operator to solve dynamical systems as an alternative to existing data-driven techniques. We show that the proposed StNN reduces the number of parameters and computational complexity compared with the conventional neural networks and also with the classical data-driven techniques, such as Sparse Identification of Nonlinear Dynamics (SINDy) and Hankel Alternative view of Koopman (HAVOK), which is commonly known as delay-Dynamic Mode Decomposition (DMD) or Hankel-DMD. More specifically, we present numerical simulations to solve dynamical systems utilizing the StNN based on the Hankel operator beginning from the fundamental Lotka-Volterra model, where we compare the StNN with the LEarning Across Dynamical Systems (LEADS), and extend our analysis to highly nonlinear and chaotic Lorenz systems, comparing the StNN with conventional neural networks, SINDy, and HAVOK. Hence, we show that the proposed StNN paves the way for realizing state-space dynamical systems with a low-complexity learning algorithm, enabling prediction and understanding of future states.
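
To make the "Hankel operator derived from time-delay measurements" concrete, here is a minimal sketch of how a Hankel matrix is usually built from a scalar time series. The helper name `hankel_from_series` and its `rows` parameter are assumptions for illustration; the structured network described in the paper is not reproduced here.

```python
def hankel_from_series(x, rows):
    """Return the Hankel matrix H with H[i][j] = x[i + j].

    The result has `rows` rows and len(x) - rows + 1 columns, so each
    column is a window of `rows` consecutive (time-delayed) measurements.
    """
    if not 1 <= rows <= len(x):
        raise ValueError("rows must be between 1 and len(x)")
    cols = len(x) - rows + 1
    return [[x[i + j] for j in range(cols)] for i in range(rows)]


# Hypothetical usage on a short measurement sequence.
H = hankel_from_series([1, 2, 3, 4, 5], rows=3)
```

Every anti-diagonal of such a matrix is constant, which is the structure that low-complexity methods can exploit instead of storing a dense operator.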

[LG-37] Data-Driven Forecasting of High-Dimensional Transient and Stationary Processes via Space-Time Projection

Link: https://arxiv.org/abs/2503.23686
Authors: Oliver T. Schmidt
Categories: Machine Learning (cs.LG); Astrophysics of Galaxies (astro-ph.GA); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an); Fluid Dynamics (physics.flu-dyn)

Abstract:Space-Time Projection (STP) is introduced as a data-driven forecasting approach for high-dimensional and time-resolved data. The method computes extended space-time proper orthogonal modes from training data spanning a prediction horizon comprising both hindcast and forecast intervals. Forecasts are then generated by projecting the hindcast portion of these modes onto new data, simultaneously leveraging their orthogonality and optimal correlation with the forecast extension. Rooted in Proper Orthogonal Decomposition (POD) theory, dimensionality reduction and time-delay embedding are intrinsic to the approach. For a given ensemble and fixed prediction horizon, the only tunable parameter is the truncation rank; no additional hyperparameters are required. The hindcast accuracy serves as a reliable indicator for short-term forecast accuracy and establishes a lower bound on forecast errors. The efficacy of the method is demonstrated using two datasets: transient, highly anisotropic simulations of supernova explosions in a turbulent interstellar medium, and experimental velocity fields of a turbulent high-subsonic engineering flow. In a comparative study with standard Long Short-Term Memory (LSTM) neural networks (acknowledging that alternative architectures or training strategies may yield different outcomes), the method consistently provided more accurate forecasts. Considering its simplicity and robust performance, STP offers an interpretable and competitive benchmark for forecasting high-dimensional transient and chaotic processes, relying purely on spatiotemporal correlation information.

[LG-38] Dynamic Operating System Scheduling Using Double DQN: A Reinforcement Learning Approach to Task Optimization

Link: https://arxiv.org/abs/2503.23659
Authors: Xiaoxuan Sun, Yifei Duan, Yingnan Deng, Fan Guo, Guohui Cai, Yuting Peng
Categories: Machine Learning (cs.LG)

Abstract:In this paper, an operating system scheduling algorithm based on Double DQN (Double Deep Q network) is proposed, and its performance under different task types and system loads is verified by experiments. Compared with the traditional scheduling algorithm, the algorithm based on Double DQN can dynamically adjust the task priority and resource allocation strategy, thus improving the task completion efficiency, system throughput, and response speed. The experimental results show that the Double DQN algorithm has high scheduling performance under light load, medium load and heavy load scenarios, especially when dealing with I/O intensive tasks, and can effectively reduce task completion time and system response time. In addition, the algorithm also shows high optimization ability in resource utilization and can intelligently adjust resource allocation according to the system state, avoiding resource waste and excessive load. Future studies will further explore the application of the algorithm in more complex systems, especially scheduling optimization in cloud computing and large-scale distributed environments, combining factors such as network latency and energy efficiency to improve the overall performance and adaptability of the algorithm.
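
For readers unfamiliar with Double DQN, the sketch below shows the standard target computation that distinguishes it from plain DQN: the online network selects the next action and the target network evaluates it. The function name and argument layout are assumptions made here; the scheduler's state, action, and reward design are not described in the abstract and are not modeled.

```python
def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Compute the Double DQN regression target for one transition.

    next_q_online / next_q_target are lists of Q-values for the next state,
    one entry per action, from the online and target networks respectively.
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]


# Hypothetical usage with three candidate scheduling actions.
y = double_dqn_target(1.0, False, [0.2, 0.9, 0.1], [5.0, 2.0, 7.0], gamma=0.5)
```

Decoupling selection from evaluation is what reduces the overestimation bias of the single-network max operator.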

[LG-39] A Survey of Reinforcement Learning-Based Motion Planning for Autonomous Driving: Lessons Learned from a Driving Task Perspective

Link: https://arxiv.org/abs/2503.23650
Authors: Zhuoren Li, Guizhe Jin, Ran Yu, Zhiwen Chen, Nan Li, Wei Han, Lu Xiong, Bo Leng, Jia Hu, Ilya Kolmanovsky, Dimitar Filev
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 21 pages, 5 figures

Abstract:Reinforcement learning (RL), with its ability to explore and optimize policies in complex, dynamic decision-making tasks, has emerged as a promising approach to addressing motion planning (MoP) challenges in autonomous driving (AD). Despite rapid advancements in RL and AD, a systematic description and interpretation of the RL design process tailored to diverse driving tasks remains underdeveloped. This survey provides a comprehensive review of RL-based MoP for AD, focusing on lessons from task-specific perspectives. We first outline the fundamentals of RL methodologies, and then survey their applications in MoP, analyzing scenario-specific features and task requirements to shed light on their influence on RL design choices. Building on this analysis, we summarize key design experiences, extract insights from various driving task applications, and provide guidance for future implementations. Additionally, we examine the frontier challenges in RL-based MoP, review recent efforts to address these challenges, and propose strategies for overcoming unresolved issues.

[LG-40] A Constrained Multi-Agent Reinforcement Learning Approach to Autonomous Traffic Signal Control

Link: https://arxiv.org/abs/2503.23626
Authors: Anirudh Satheesh, Keenan Powell
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: Submitted to ACM Journal for Autonomous Transportation Systems

Abstract:Traffic congestion in modern cities is exacerbated by the limitations of traditional fixed-time traffic signal systems, which fail to adapt to dynamic traffic patterns. Adaptive Traffic Signal Control (ATSC) algorithms have emerged as a solution by dynamically adjusting signal timing based on real-time traffic conditions. However, the main limitation of such methods is that they are not transferable to environments under real-world constraints, such as balancing efficiency, minimizing collisions, and ensuring fairness across intersections. In this paper, we view the ATSC problem as a constrained multi-agent reinforcement learning (MARL) problem and propose a novel algorithm named Multi-Agent Proximal Policy Optimization with Lagrange Cost Estimator (MAPPO-LCE) to produce effective traffic signal control policies. Our approach integrates the Lagrange multipliers method to balance rewards and constraints, with a cost estimator for stable adjustment. We also introduce three constraints on the traffic network: GreenTime, GreenSkip, and PhaseSkip, which penalize traffic policies that do not conform to real-world scenarios. Our experimental results on three real-world datasets demonstrate that MAPPO-LCE outperforms three baseline MARL algorithms across all environments and traffic constraints (improving on MAPPO by 12.60%, IPPO by 10.29%, and QTRAN by 13.10%). Our results show that constrained MARL is a valuable tool for traffic planners to deploy scalable and efficient ATSC methods in real-world traffic networks. We provide code at this https URL.
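
The Lagrange multipliers method mentioned above is commonly implemented as dual ascent on a non-negative multiplier. The sketch below shows that generic update under stated assumptions: `cost_limit`, the learning rate `lr`, and both function names are hypothetical, and the paper's learned cost estimator is replaced here by a plain average cost.

```python
def update_multiplier(lmbda, avg_cost, cost_limit, lr):
    """Dual ascent step: raise the multiplier when the constraint is
    violated (avg_cost > cost_limit), lower it otherwise, and keep it >= 0."""
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))


def penalized_reward(reward, cost, lmbda):
    """Lagrangian-relaxed objective that the policy update would maximize."""
    return reward - lmbda * cost


# Hypothetical usage: the constraint is violated, so the multiplier grows.
lmbda = update_multiplier(0.5, avg_cost=3.0, cost_limit=1.0, lr=0.1)
```

A larger multiplier makes constraint violations (for example, skipped phases) more expensive relative to the traffic-efficiency reward in subsequent policy updates.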

[LG-41] Simple Feedfoward Neural Networks are Almost All You Need for Time Series Forecasting

Link: https://arxiv.org/abs/2503.23621
Authors: Fan-Keng Sun, Yu-Cheng Wu, Duane S. Boning
Categories: Machine Learning (cs.LG)

Abstract:Time series data are everywhere – from finance to healthcare – and each domain brings its own unique complexities and structures. While advanced models like Transformers and graph neural networks (GNNs) have gained popularity in time series forecasting, largely due to their success in tasks like language modeling, their added complexity is not always necessary. In our work, we show that simple feedforward neural networks (SFNNs) can achieve performance on par with, or even exceeding, these state-of-the-art models, while being simpler, smaller, faster, and more robust. Our analysis indicates that, in many cases, univariate SFNNs are sufficient, implying that modeling interactions between multiple series may offer only marginal benefits. Even when inter-series relationships are strong, a basic multivariate SFNN still delivers competitive results. We also examine some key design choices and offer guidelines on making informed decisions. Additionally, we critique existing benchmarking practices and propose an improved evaluation protocol. Although SFNNs may not be optimal for every situation (hence the "almost" in our title), they serve as a strong baseline that future time series forecasting methods should always be compared against.

[LG-42] Make Autoregressive Great Again: Diffusion-Free Graph Generation with Next-Scale Prediction

Link: https://arxiv.org/abs/2503.23612
Authors: Samuel Belkadi, Steve Hong, Marian Chen
Categories: Machine Learning (cs.LG)
Comments: Draft #1

Abstract:Autoregressive models are popular generative models due to their speed and properties. However, they require an explicit sequence order, which contradicts the unordered nature of graphs. In contrast, diffusion models maintain permutation invariance and enable one-shot generation but require up to thousands of denoising steps and additional features, leading to high computational costs. Inspired by recent breakthroughs in image generation, especially the success of visual autoregressive methods, we propose MAG, a novel diffusion-free graph generation framework based on next-scale prediction. By leveraging a hierarchy of latent representations, the model progressively generates scales of the entire graph without the need for explicit node ordering. Extensive experiments on both generic and molecular graph datasets demonstrate that MAG delivers competitive performance compared to state-of-the-art methods, achieving up to three orders of magnitude in speedup during inference.

[LG-43] Autonomous Learning with High-Dimensional Computing Architecture Similar to von Neumann's

Link: https://arxiv.org/abs/2503.23608
Authors: Pentti Kanerva
Categories: Machine Learning (cs.LG)
Comments: 20 pages including references, all contained in a single .tex file

Abstract:We model human and animal learning by computing with high-dimensional vectors (H = 10,000 for example). The architecture resembles traditional (von Neumann) computing with numbers, but the instructions refer to vectors and operate on them in superposition. The architecture includes a high-capacity memory for vectors, analogue of the random-access memory (RAM) for numbers. The model’s ability to learn from data reminds us of deep learning, but with an architecture closer to biology. The architecture agrees with an idea from psychology that human memory and learning involve a short-term working memory and a long-term data store. Neuroscience provides us with a model of the long-term memory, namely, the cortex of the cerebellum. With roots in psychology, biology, and traditional computing, a theory of computing with vectors can help us understand how brains compute. Application to learning by robots seems inevitable, but there is likely to be more, including language. Ultimately we want to compute with no more material and energy than used by brains. To that end, we need a mathematical theory that agrees with psychology and biology, and is suitable for nanotechnology. We also need to exercise the theory in large-scale experiments. Computing with vectors is described here in terms familiar to us from traditional computing with numbers.
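
To give a feel for "computing with vectors in superposition", the sketch below uses operations that are common in the hyperdimensional computing literature: random bipolar vectors, binding by elementwise multiplication, and bundling by majority vote. These specific operations and all function names are assumptions chosen for illustration; the abstract does not state which vector operations the architecture uses.

```python
import random


def random_hv(dim, rng):
    """Random bipolar (+1/-1) vector of the given dimension."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Elementwise product; for bipolar vectors, binding is its own inverse."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Superpose vectors by majority vote (use an odd count to avoid ties)."""
    return [1 if sum(col) > 0 else -1 for col in zip(*vectors)]


def similarity(a, b):
    """Normalized dot product in [-1, 1]; near 0 for unrelated vectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


# Hypothetical usage with H = 10,000 dimensions, as in the abstract.
rng = random.Random(0)
a, b, c, unrelated = (random_hv(10_000, rng) for _ in range(4))
memory = bundle([a, b, c])          # one vector holding three items
recovered = bind(a, bind(a, b))     # unbinding with a returns b exactly
```

In this sketch the bundled vector stays measurably similar to each stored item while remaining nearly orthogonal to unrelated vectors, which is the property that lets a single vector act as a small set.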

[LG-44] Space of Data through the Lens of Multilevel Graph

Link: https://arxiv.org/abs/2503.23602
Authors: Marco Caputo, Michele Russo, Emanuela Merelli
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 18 pages, 11 figures, ITADATA 2024 conference

Abstract:This work seeks to tackle the inherent complexity of dataspaces by introducing a novel data structure that can represent datasets across multiple levels of abstraction, ranging from local to global. We propose the concept of a multilevel graph, which is equipped with two fundamental operations: contraction and expansion of its topology. This multilevel graph is specifically designed to fulfil the requirements for incremental abstraction and flexibility, as outlined in existing definitions of dataspaces. Furthermore, we provide a comprehensive suite of methods for manipulating this graph structure, establishing a robust framework for data analysis. While its effectiveness has been empirically validated for unstructured data, its application to structured data is also inherently viable. Preliminary results are presented through a real-world scenario based on a collection of dream reports.

[LG-45] Exploring GPT-4 for Robotic Agent Strategy with Real-Time State Feedback and a Reactive Behaviour Framework

Link: https://arxiv.org/abs/2503.23601
Authors: Thomas O’Brien, Ysobel Sims
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:We explore the use of GPT-4 on a humanoid robot in simulation and the real world as proof of concept of a novel large language model (LLM) driven behaviour method. LLMs have shown the ability to perform various tasks, including robotic agent behaviour. The problem involves prompting the LLM with a goal, and the LLM outputs the sub-tasks to complete to achieve that goal. Previous works focus on the executability and correctness of the LLM’s generated tasks. We propose a method that successfully addresses practical concerns around safety, transitions between tasks, time horizons of tasks and state feedback. In our experiments we have found that our approach produces output for feasible requests that can be executed every time, with smooth transitions. User requests are achieved most of the time across a range of goal time horizons.

[LG-46] Bridging conformal prediction and scenario optimization

Link: https://arxiv.org/abs/2503.23561
Authors: Niall O’Sullivan, Licio Romao, Kostas Margellos
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)

Abstract:Conformal prediction and scenario optimization constitute two important classes of statistical learning frameworks to certify decisions made using data. They have found numerous applications in control theory, machine learning and robotics. Despite intense research in both areas, and apparently similar results, a clear connection between these two frameworks has not been established. By focusing on the so-called vanilla conformal prediction, we show rigorously how to choose appropriate score functions and set predictor map to recover well-known bounds on the probability of constraint violation associated with scenario programs. We also show how to treat ranking of nonconformity scores as a one-dimensional scenario program with discarded constraints, and use such connection to recover vanilla conformal prediction guarantees on the validity of the set predictor. We also capitalize on the main developments of the scenario approach, and show how we could analyze calibration conditional conformal prediction under this lens. Our results establish a theoretical bridge between conformal prediction and scenario optimization.
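
As background for the "ranking of nonconformity scores" mentioned above, the sketch below computes the usual split (inductive) conformal threshold from calibration scores. Treating "vanilla" conformal prediction as the split variant is an assumption made here, and the function name is hypothetical; the scenario-program reformulation from the paper is not shown.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the k-th smallest calibration score with
    k = ceil((n + 1) * (1 - alpha)); infinity if k exceeds n.

    A prediction set {y : score(x, y) <= threshold} then covers the true
    label with probability at least 1 - alpha under exchangeability.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


# Hypothetical usage with seven calibration scores and alpha = 0.25.
q = conformal_threshold([5, 1, 4, 2, 7, 3, 6], alpha=0.25)
```

Picking the k-th ranked score amounts to discarding the n - k largest scores, which is the kind of ranking step the abstract relates to a one-dimensional scenario program with discarded constraints.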

[LG-47] Redundant feature screening method for human activity recognition based on attention purification mechanism

Link: https://arxiv.org/abs/2503.23537
Authors: Hanyu Liu, Xiaoyang Li, Yixuan Jiang, Haotian Tang, Dongchen Wu, Yameng Guo
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 7 figures

Abstract:In the field of sensor-based Human Activity Recognition (HAR), deep neural networks provide advanced technical support. Many studies have proven that recognition accuracy can be improved by increasing the depth or width of the network. However, for wearable devices, the balance between network performance and resource consumption is crucial. With minimum resource consumption as the basic principle, we propose a universal attention feature purification mechanism, called MSAP, which is suitable for multi-scale networks. The mechanism effectively solves the feature redundancy caused by the superposition of multi-scale features by means of inter-scale attention screening and connection method. In addition, we have designed a network correction module that integrates seamlessly between layers of individual network modules to mitigate inherent problems in deep networks. We also built an embedded deployment system that is in line with the current level of wearable technology to test the practical feasibility of the HAR model, and further prove the efficiency of the method. Extensive experiments on four public datasets show that the proposed method effectively reduces redundant features in filtered data and provides excellent performance with little resource consumption.

[LG-48] In-silico biological discovery with large perturbation models

Link: https://arxiv.org/abs/2503.23535
Authors: Djordje Miladinovic, Tobias Höppe, Mathieu Chevalley, Andreas Georgiou, Lachlan Stuart, Arash Mehrjou, Marcus Bantscheff, Bernhard Schölkopf, Patrick Schwab
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Methodology (stat.ME)

Abstract:Data generated in perturbation experiments link perturbations to the changes they elicit and therefore contain information relevant to numerous biological discovery tasks – from understanding the relationships between biological entities to developing therapeutics. However, these data encompass diverse perturbations and readouts, and the complex dependence of experimental outcomes on their biological context makes it challenging to integrate insights across experiments. Here, we present the Large Perturbation Model (LPM), a deep-learning model that integrates multiple, heterogeneous perturbation experiments by representing perturbation, readout, and context as disentangled dimensions. LPM outperforms existing methods across multiple biological discovery tasks, including in predicting post-perturbation transcriptomes of unseen experiments, identifying shared molecular mechanisms of action between chemical and genetic perturbations, and facilitating the inference of gene-gene interaction networks.

[LG-49] Towards Trustworthy GUI Agents: A Survey

Link: https://arxiv.org/abs/2503.23434
Authors: Yucheng Shi, Wenhao Yu, Wenlin Yao, Wenhu Chen, Ninghao Liu
Categories: Machine Learning (cs.LG)
Comments: 10 pages, work in process

Abstract:GUI agents, powered by large foundation models, can interact with digital interfaces, enabling various applications in web automation, mobile navigation, and software testing. However, their increasing autonomy has raised critical concerns about their security, privacy, and safety. This survey examines the trustworthiness of GUI agents in five critical dimensions: security vulnerabilities, reliability in dynamic environments, transparency and explainability, ethical considerations, and evaluation methodologies. We also identify major challenges such as vulnerability to adversarial attacks, cascading failure modes in sequential decision-making, and a lack of realistic evaluation benchmarks. These issues not only hinder real-world deployment but also call for comprehensive mitigation strategies beyond task success. As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential. This survey provides a foundation for advancing trustworthy GUI agents through systematic understanding and future research.

[LG-50] Solve sparse PCA problem by employing Hamiltonian system and leapfrog method

Link: https://arxiv.org/abs/2503.23335
Authors: Loc Hoang Tran
Categories: Machine Learning (cs.LG)
Comments: 2 tables

Abstract:Principal Component Analysis (PCA) is a widely utilized technique for dimensionality reduction; however, its inherent lack of interpretability, stemming from dense linear combinations of all features, limits its applicability in many domains. In this paper, we propose a novel sparse PCA algorithm that imposes sparsity through a smooth L1 penalty and leverages a Hamiltonian formulation solved via geometric integration techniques. Specifically, we implement two distinct numerical methods, one based on the Proximal Gradient (ISTA) approach and another employing a leapfrog (fourth-order Runge-Kutta) scheme, to minimize the energy function that balances variance maximization with sparsity enforcement. To extract a subset of sparse principal components, we further incorporate a deflation technique and subsequently transform the original high-dimensional face data into a lower-dimensional feature space. Experimental evaluations on a face recognition dataset, using both k-nearest neighbor and kernel ridge regression classifiers, demonstrate that the proposed sparse PCA methods consistently achieve higher classification accuracy than conventional PCA. Future research will extend this framework to integrate sparse PCA with modern deep learning architectures for multimodal recognition tasks.
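
The Proximal Gradient (ISTA) approach named above alternates a gradient step on the smooth part of the objective with soft-thresholding, the proximal operator of the L1 norm. The sketch below shows one such generic step; the function names, `step`, and `lam` are assumptions introduced here, and the paper's energy function, deflation, and leapfrog solver are not reproduced.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    out = []
    for x in v:
        if x > t:
            out.append(x - t)
        elif x < -t:
            out.append(x + t)
        else:
            out.append(0.0)  # small entries are set exactly to zero
    return out


def ista_step(x, grad, step, lam):
    """One ISTA iteration: gradient step on the smooth term,
    then soft-thresholding with threshold step * lam."""
    moved = [xi - step * gi for xi, gi in zip(x, grad)]
    return soft_threshold(moved, step * lam)


# Hypothetical usage: entries with magnitude below the threshold vanish.
w = soft_threshold([3.0, -0.5, -2.0], 1.0)
```

The exact zeros produced by the thresholding step are what make the resulting loadings sparse, and therefore easier to interpret than dense PCA components.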

[LG-51] Enhancing Physics-Informed Neural Networks with a Hybrid Parallel Kolmogorov-Arnold and MLP Architecture

Link: https://arxiv.org/abs/2503.23289
Authors: Zuyu Xu, Bin Lv
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Neural networks have emerged as powerful tools for modeling complex physical systems, yet balancing high accuracy with computational efficiency remains a critical challenge in their convergence behavior. In this work, we propose the Hybrid Parallel Kolmogorov-Arnold Network (KAN) and Multi-Layer Perceptron (MLP) Physics-Informed Neural Network (HPKM-PINN), a novel architecture that synergistically integrates parallelized KAN and MLP branches within a unified PINN framework. The HPKM-PINN introduces a scaling factor ξ to optimally balance the complementary strengths of KAN’s interpretable function approximation and MLP’s nonlinear feature learning, thereby enhancing predictive performance through a weighted fusion of their outputs. Through systematic numerical evaluations, we elucidate the impact of the scaling factor ξ on the model’s performance in both function approximation and partial differential equation (PDE) solving tasks. Benchmark experiments across canonical PDEs, such as the Poisson and Advection equations, demonstrate that HPKM-PINN achieves a marked decrease in loss values (reducing relative error by two orders of magnitude) compared to standalone KAN or MLP models. Furthermore, the framework exhibits numerical stability and robustness when applied to various physical systems. These findings highlight the HPKM-PINN’s ability to leverage KAN’s interpretability and MLP’s expressivity, positioning it as a versatile and scalable tool for solving complex PDE-driven problems in computational science and engineering.

[LG-52] Joint Source-Environment Adaptation for Deep Learning-Based Underwater Acoustic Source Ranging

Link: https://arxiv.org/abs/2503.23262
Authors: Dariush Kari, Andrew C. Singer
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Abstract:In this paper, we propose a method to adapt a pre-trained deep-learning-based model for underwater acoustic localization to a new environment. We use unsupervised domain adaptation to improve the generalization performance of the model, i.e., using an unsupervised loss, fine-tune the pre-trained network parameters without access to any labels of the target environment or any data used to pre-train the model. This method improves the pre-trained model prediction by coupling that with an almost independent estimation based on the received signal energy (that depends on the source). We show the effectiveness of this approach on Bellhop generated data in an environment similar to that of the SWellEx-96 experiment contaminated with real ocean noise from the KAM11 experiment.

[LG-53] Mismatch-Robust Underwater Acoustic Localization Using A Differentiable Modular Forward Model

Link: https://arxiv.org/abs/2503.23260
Authors: Dariush Kari, Yongjie Zhuang, Andrew C. Singer
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Abstract:In this paper, we study the underwater acoustic localization in the presence of environmental mismatch. Especially, we exploit a pre-trained neural network for the acoustic wave propagation in a gradient-based optimization framework to estimate the source location. To alleviate the effect of mismatch between the training data and the test data, we simultaneously optimize over the network weights at the inference time, and provide conditions under which this method is effective. Moreover, we introduce a physics-inspired modularity in the forward model that enables us to learn the path lengths of the multipath structure in an end-to-end training manner without access to the specific path labels. We investigate the validity of the assumptions in a simple yet illustrative environment model.

[LG-54] Joint Source-Environment Adaptation of Data-Driven Underwater Acoustic Source Ranging Based on Model Uncertainty

Link: https://arxiv.org/abs/2503.23258
Authors: Dariush Kari, Hari Vishnu, Andrew C. Singer
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Abstract:Adapting pre-trained deep learning models to new and unknown environments is a difficult challenge in underwater acoustic localization. We show that although pre-trained models have performance that suffers from mismatch between the training and test data, they generally exhibit a higher “implied uncertainty” in environments where there is more mismatch. Leveraging this notion of implied uncertainty, we partition the test samples into more certain and less certain sets, and implement an estimation method using the certain samples to improve the labeling for uncertain samples, which helps to adapt the model. We use an efficient method to quantify model prediction uncertainty, and an innovative approach to adapt a pre-trained model to unseen underwater environments at test time. This eliminates the need for labeled data from the target environment or the original training data. This adaptation is enhanced by integrating an independent estimate based on the received signal energy. We validate the approach extensively using real experimental data, as well as synthetic data consisting of model-generated signals with real ocean noise. The results demonstrate significant improvements in model prediction accuracy, underscoring the potential of the method to enhance underwater acoustic localization in diverse, noisy, and unknown environments.

[LG-55] UP-ROM: Uncertainty-Aware and Parametrised dynamic Reduced-Order Model application to unsteady flows

Link: https://arxiv.org/abs/2503.23236
Authors: Ismaël Zighed, Nicolas Thome, Patrick Gallinari
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:Reduced order models (ROMs) play a critical role in fluid mechanics by providing low-cost predictions, making them an attractive tool for engineering applications. However, for ROMs to be widely applicable, they must not only generalise well across different regimes, but also provide a measure of confidence in their predictions. While recent data-driven approaches have begun to address nonlinear reduction techniques to improve predictions in transient environments, challenges remain in terms of robustness and parametrisation. In this work, we present a nonlinear reduction strategy specifically designed for transient flows that incorporates parametrisation and uncertainty quantification. Our reduction strategy features a variational auto-encoder (VAE) that uses variational inference for confidence measurement. We use a latent space transformer that incorporates recent advances in attention mechanisms to predict dynamical systems. Attention’s versatility in learning sequences and capturing their dependence on external parameters enhances generalisation across a wide range of dynamics. Prediction, coupled with confidence, enables more informed decision making and addresses the need for more robust models. In addition, this confidence is used to cost-effectively sample the parameter space, improving model performance a priori across the entire parameter space without requiring evaluation data for the entire domain.

[LG-56] Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus

Link: https://arxiv.org/abs/2503.23229
Authors: Claas Beger, Carl-Leander Henneking
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models provide significant new opportunities for the generation of high-quality written works. However, their employment in the research community is inhibited by their tendency to hallucinate invalid sources and their lack of direct access to a knowledge base of relevant scientific articles. In this work, we present Citegeist, an application pipeline using dynamic Retrieval Augmented Generation (RAG) on the arXiv Corpus to generate a related work section and other citation-backed outputs. For this purpose, we employ a mixture of embedding-based similarity matching, summarization, and multi-stage filtering. To adapt to the continuous growth of the document base, we also present an optimized way of incorporating new and modified papers. To enable easy utilization in the scientific community, we release both a website (this https URL) and an implementation harness that works with several different LLM implementations.

[LG-57] Unsupervised Learning: Comparative Analysis of Clustering Techniques on High-Dimensional Data

Link: https://arxiv.org/abs/2503.23215
Authors: Vishnu Vardhan Baligodugula, Fathi Amsaad
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:This paper presents a comprehensive comparative analysis of prominent clustering algorithms (K-means, DBSCAN, and Spectral Clustering) on high-dimensional datasets. We introduce a novel evaluation framework that assesses clustering performance across multiple dimensionality reduction techniques (PCA, t-SNE, and UMAP) using diverse quantitative metrics. Experiments conducted on the MNIST, Fashion-MNIST, and UCI HAR datasets reveal that preprocessing with UMAP consistently improves clustering quality across all algorithms, with Spectral Clustering demonstrating superior performance on complex manifold structures. Our findings show that algorithm selection should be guided by data characteristics, with K-means excelling in computational efficiency, DBSCAN in handling irregular clusters, and Spectral Clustering in capturing complex relationships. This research contributes a systematic approach for evaluating and selecting clustering techniques for high-dimensional data applications.

[LG-58] A QUBO Framework for Team Formation

Link: https://arxiv.org/abs/2503.23209
Authors: Karan Vombatkere, Evimaria Terzi, Theodoros Lappas
Categories: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Social and Information Networks (cs.SI)
Comments:

Abstract:The team formation problem assumes a set of experts and a task, where each expert has a set of skills and the task requires some skills. The objective is to find a set of experts that maximizes coverage of the required skills while simultaneously minimizing the costs associated with the experts. Different definitions of cost have traditionally led to distinct problem formulations and algorithmic solutions. We introduce the unified TeamFormation formulation that captures all cost definitions for team formation problems that balance task coverage and expert cost. Specifically, we formulate three TeamFormation variants with different cost functions using quadratic unconstrained binary optimization (QUBO), and we evaluate two distinct general-purpose solution methods. We show that solutions based on the QUBO formulations of TeamFormation problems are at least as good as those produced by established baselines. Furthermore, we show that QUBO-based solutions leveraging graph neural networks can effectively learn representations of experts and skills to enable transfer learning, allowing node embeddings from one problem instance to be efficiently applied to another.
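
As a rough illustration of how a coverage-versus-cost trade-off can be written as a QUBO, the sketch below builds a small coefficient table and minimizes it by exhaustive search. It is not the paper's formulation: the pairwise overlap penalty is a second-order approximation of skill coverage (exact only when no required skill is held by more than two selected experts), and the names `build_qubo`, `energy`, `brute_force`, the weight `lam`, and the toy experts are all hypothetical.

```python
from itertools import combinations, product

def build_qubo(expert_skills, expert_cost, task_skills, lam=1.0):
    """Return QUBO coefficients {(i, j): weight} for binary variables x_i.

    Minimizing sum(Q[i, j] * x_i * x_j) trades expert cost (scaled by lam)
    against an approximate count of covered task skills.
    """
    n = len(expert_skills)
    Q = {}
    for i in range(n):
        # Linear term: pay the cost, gain one unit per required skill held.
        covered = len(expert_skills[i] & task_skills)
        Q[(i, i)] = lam * expert_cost[i] - covered
    for i, j in combinations(range(n), 2):
        # Quadratic term: remove the double count for skills both experts hold.
        overlap = len(expert_skills[i] & expert_skills[j] & task_skills)
        if overlap:
            Q[(i, j)] = float(overlap)
    return Q

def energy(Q, x):
    """Evaluate the QUBO objective for a 0/1 assignment x."""
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())

def brute_force(Q, n):
    """Exhaustive minimizer; only practical for very small n."""
    return min(product((0, 1), repeat=n), key=lambda x: energy(Q, x))

# Toy instance (illustrative values only).
skills = [{"ml", "db"}, {"db", "ui"}, {"ml", "db", "ui"}]
costs = [1.0, 1.0, 2.5]
task = {"ml", "db", "ui"}

Q = build_qubo(skills, costs, task, lam=0.5)
best = brute_force(Q, len(skills))
print(best, energy(Q, best))  # (1, 1, 0) -2.0
```

Here the two cheaper experts together cover all three required skills, so the minimizer picks them over the single expensive expert. A real instance would hand `Q` to a general-purpose QUBO solver instead of enumerating all assignments.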

[LG-59] Graph ODEs and Beyond: A Comprehensive Survey on Integrating Differential Equations with Graph Neural Networks

Link: https://arxiv.org/abs/2503.23167
Authors: Zewen Liu, Xiaoda Wang, Bohan Wang, Zijie Huang, Carl Yang, Wei Jin
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks (GNNs) and differential equations (DEs) are two rapidly advancing areas of research that have shown remarkable synergy in recent years. GNNs have emerged as powerful tools for learning on graph-structured data, while differential equations provide a principled framework for modeling continuous dynamics across time and space. The intersection of these fields has led to innovative approaches that leverage the strengths of both, enabling applications in physics-informed learning, spatiotemporal modeling, and scientific computing. This survey aims to provide a comprehensive overview of the burgeoning research at the intersection of GNNs and DEs. We categorize existing methods, discuss their underlying principles, and highlight their applications across domains such as molecular modeling, traffic prediction, and epidemic spreading. Furthermore, we identify open challenges and outline future research directions to advance this interdisciplinary field. A comprehensive paper list is provided at this https URL. This survey serves as a resource for researchers and practitioners seeking to understand and contribute to the fusion of GNNs and DEs.

[LG-60] The geomagnetic storm and Kp prediction using Wasserstein transformer

Link: https://arxiv.org/abs/2503.23102
Authors: Beibei Li
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Mathematical Physics (math-ph)
Comments:

Abstract:The accurate forecasting of geomagnetic activity is important. In this work, we present a novel multimodal Transformer-based framework for predicting the 3-day and 5-day planetary Kp index by integrating heterogeneous data sources, including satellite measurements, solar images, and Kp time series. A key innovation is the incorporation of the Wasserstein distance into the transformer and the loss function to align the probability distributions across modalities. Comparative experiments with the NOAA model show that the framework accurately captures both the quiet and storm phases of geomagnetic activity. This study underscores the potential of integrating machine learning techniques with traditional models for improved real-time forecasting.
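
For readers unfamiliar with the Wasserstein distance mentioned above, a minimal sketch of its simplest case may help: the 1-Wasserstein distance between two equal-size one-dimensional empirical samples, which reduces to the mean absolute difference of the sorted values. This is only the textbook 1D quantity, not the paper's loss or its transformer integration; the function name `wasserstein_1d` and the sample values are hypothetical.

```python
def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples."""
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    # In one dimension the optimal transport plan pairs sorted values in order.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Illustrative values only, standing in for two distributions to be aligned.
observed = [1.0, 2.0, 4.0]
predicted = [2.0, 5.0, 3.0]
print(wasserstein_1d(observed, predicted))  # 1.0
```

Unlike a pointwise error, the distance ignores the order of the samples and compares the two sets of values as distributions.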

[LG-61] Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion ISCA2025

Link: https://arxiv.org/abs/2503.23076
Authors: Arash Nasr-Esfahany, Mohammad Alizadeh, Victor Lee, Hanna Alam, Brett W. Coon, David Culler, Vidushi Dadu, Martin Dixon, Henry M. Levy, Santosh Pandey, Parthasarathy Ranganathan, Amir Yazdanbakhsh
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 15 pages, 17 figures, To be published in ISCA 2025

Abstract:Cycle-level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Unlike existing simulators and learning approaches that emulate each instruction, Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. It derives these performance distributions using simple analytical models that estimate bounds on performance induced by each microarchitectural component, providing a simple yet rich representation of a program’s performance characteristics across a large space of microarchitectural parameters. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator, with about 2% average Cycles-Per-Instruction (CPI) prediction error across a range of SPEC, open-source, and proprietary benchmarks. This enables rapid design-space exploration and performance sensitivity analyses that are currently infeasible, e.g., in about an hour, we conducted a first-of-its-kind fine-grained performance attribution to different microarchitectural components across a diverse set of programs, requiring nearly 150 million CPI evaluations.

[LG-62] TRACE: Intra-visit Clinical Event Nowcasting via Effective Patient Trajectory Encoding WWW’25

Link: https://arxiv.org/abs/2503.23072
Authors: Yuyang Liang, Yankai Chen, Yixiang Fang, Laks V. S. Lakshmanan, Chenhao Ma
Categories: Machine Learning (cs.LG)
Comments: Accepted by WWW’25 short paper track

Abstract:Electronic Health Records (EHR) have become a valuable resource for a wide range of predictive tasks in healthcare. However, existing approaches have largely focused on inter-visit event predictions, overlooking the importance of intra-visit nowcasting, which provides prompt clinical insights during an ongoing patient visit. To address this gap, we introduce the task of laboratory measurement prediction within a hospital visit. We study laboratory data, which has remained underexplored in previous work. We propose TRACE, a Transformer-based model designed for clinical event nowcasting by encoding patient trajectories. TRACE effectively handles long sequences and captures temporal dependencies through a novel timestamp embedding that integrates decay properties and periodic patterns of data. Additionally, we introduce a smoothed mask for denoising, improving the robustness of the model. Experiments on two large-scale electronic health record datasets demonstrate that the proposed model significantly outperforms previous methods, highlighting its potential for improving patient care through more accurate laboratory measurement nowcasting. The code is available at this https URL.

[LG-63] Unsupervised Anomaly Detection in Multivariate Time Series across Heterogeneous Domains

Link: https://arxiv.org/abs/2503.23060
Authors: Vincent Jacob, Yanlei Diao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The widespread adoption of digital services, along with the scale and complexity at which they operate, has made incidents in IT operations increasingly more likely, diverse, and impactful. This has led to the rapid development of a central aspect of “Artificial Intelligence for IT Operations” (AIOps), focusing on detecting anomalies in vast amounts of multivariate time series data generated by service entities. In this paper, we begin by introducing a unifying framework for benchmarking unsupervised anomaly detection (AD) methods, and highlight the problem of shifts in normal behaviors that can occur in practical AIOps scenarios. To tackle anomaly detection under domain shift, we then cast the problem in the framework of domain generalization and propose a novel approach, Domain-Invariant VAE for Anomaly Detection (DIVAD), to learn domain-invariant representations for unsupervised anomaly detection. Our evaluation results using the Exathlon benchmark show that the two main DIVAD variants significantly outperform the best unsupervised AD method in maximum performance, with 20% and 15% improvements in maximum peak F1-scores, respectively. Evaluation using the Application Server Dataset further demonstrates the broader applicability of our domain generalization methods.

[LG-64] VLM-C4L: Continual Core Dataset Learning with Corner Case Optimization via Vision-Language Models for Autonomous Driving

Link: https://arxiv.org/abs/2503.23046
Authors: Haibo Hu, Jiacheng Zuo, Yang Lou, Yufei Cui, Jianping Wang, Nan Guan, Jin Wang, Yung-Hui Li, Chun Jason Xue
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:With the widespread adoption and deployment of autonomous driving, handling complex environments has become an unavoidable challenge. Due to the scarcity and diversity of extreme scenario datasets, current autonomous driving models struggle to effectively manage corner cases. This limitation poses a significant safety risk: according to the National Highway Traffic Safety Administration (NHTSA), autonomous vehicle systems have been involved in hundreds of reported crashes annually in the United States, including crashes in corner cases such as sun glare and fog, a few of which were fatal. Furthermore, in order to consistently maintain a robust and reliable autonomous driving system, it is essential for models not only to perform well on routine scenarios but also to adapt to newly emerging scenarios, especially those corner cases that deviate from the norm. This requires a learning mechanism that incrementally integrates new knowledge without degrading previously acquired capabilities. However, to the best of our knowledge, no existing continual learning methods have been proposed to ensure consistent and scalable corner case learning in autonomous driving. To address these limitations, we propose VLM-C4L, a continual learning framework that introduces Vision-Language Models (VLMs) to dynamically optimize and enhance corner case datasets. VLM-C4L combines VLM-guided high-quality data extraction with a core data replay strategy, enabling the model to incrementally learn from diverse corner cases while preserving performance on previously learned routine scenarios, thus ensuring long-term stability and adaptability in real-world autonomous driving. We evaluate VLM-C4L on large-scale real-world autonomous driving datasets, including Waymo and the corner case dataset CODA.

[LG-65] Function Fitting Based on Kolmogorov-Arnold Theorem and Kernel Functions

Link: https://arxiv.org/abs/2503.23038
Authors: Jianpeng Liu, Qizhi Pan
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 12 figures

Abstract:This paper proposes a unified theoretical framework based on the Kolmogorov-Arnold representation theorem and kernel methods. By analyzing the mathematical relationship among kernels, B-spline basis functions in Kolmogorov-Arnold Networks (KANs), and the inner product operation in self-attention mechanisms, we establish a kernel-based feature fitting framework that unifies the two models as linear combinations of kernel functions. Under this framework, we propose a low-rank Pseudo-Multi-Head Self-Attention module (Pseudo-MHSA), which reduces the parameter count of traditional MHSA by nearly 50%. Furthermore, we design a Gaussian kernel multi-head self-attention variant (Gaussian-MHSA) to validate the effectiveness of nonlinear kernel functions in feature extraction. Experiments on the CIFAR-10 dataset demonstrate that the Pseudo-MHSA model achieves performance comparable to the ViT model of the same dimensionality under the MAE framework, and visualization analysis reveals that their multi-head distribution patterns are similar. Our code is publicly available.

[LG-66] Buyer-Initiated Auction Mechanism for Data Redemption in Machine Unlearning

Link: https://arxiv.org/abs/2503.23001
Authors: Bin Han, Di Feng, Jie Wang, Hans D. Schotten
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments: Submitted to IEEE GLOBECOM 2025

Abstract:The rapid growth of artificial intelligence (AI) has raised privacy concerns over user data, leading to regulations like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). With the essential toolbox provided by machine unlearning, AI service providers are now able to remove user data from their trained models as well as the training datasets, so as to comply with such regulations. However, extensive data redemption can be costly and degrade model accuracy. To balance the cost of unlearning and the privacy protection, we propose a buyer-initiated auction mechanism for data redemption, enabling the service provider to purchase data from willing users with appropriate compensation. This approach does not require the server to have any a priori knowledge about the users’ privacy preference, and provides an efficient solution for maximizing the social welfare in the investigated problem.

[LG-67] Multimodal machine learning with large language embedding model for polymer property prediction

Link: https://arxiv.org/abs/2503.22962
Authors: Tianren Zhang, Dai-Bei Yang
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph)
Comments:

Abstract:Contemporary large language models (LLMs), such as GPT-4 and Llama, have harnessed extensive computational power and diverse text corpora to achieve remarkable proficiency in interpreting and generating domain-specific content, including materials science. To leverage the domain knowledge embedded within these models, we propose a simple yet effective multimodal architecture, PolyLLMem, which integrates text embeddings generated by Llama 3 with molecular structure embeddings derived from Uni-Mol for polymer property prediction tasks. In our model, low-rank adaptation (LoRA) layers were also incorporated during the property prediction tasks to refine the embeddings based on our limited polymer dataset, thereby enhancing their chemical relevance for polymer SMILES representation. This balanced fusion of fine-tuned textual and structural information enables PolyLLMem to accurately predict a variety of polymer properties despite the scarcity of training data. Its performance is comparable to, and in some cases exceeds, that of graph-based models, as well as transformer-based models that typically require pretraining on millions of polymer samples. These findings demonstrate that LLMs, such as Llama, can effectively capture chemical information encoded in polymer PSMILES, and underscore the efficacy of multimodal fusion of LLM embeddings and molecular structure embeddings in overcoming data scarcity and accelerating the discovery of advanced polymeric materials.

[LG-68] MNT-TNN: Spatiotemporal Traffic Data Imputation via Compact Multimode Nonlinear Transform-based Tensor Nuclear Norm

Link: https://arxiv.org/abs/2503.22955
Authors: Yihang Lu, Mahwish Yousaf, Xianwei Meng, Enhong Chen
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Imputation of random or non-random missing data is a long-standing research topic and a crucial application for Intelligent Transportation Systems (ITS). However, with the advent of modern communication technologies such as Global Navigation Satellite Systems (GNSS), traffic data collection has outpaced traditional methods, introducing new challenges in random missing value imputation and increasing demands for spatiotemporal dependency modeling. To address these issues, we propose a novel spatiotemporal traffic imputation method, Multimode Nonlinear Transformed Tensor Nuclear Norm (MNT-TNN), grounded in the Transform-based Tensor Nuclear Norm (TTNN) optimization framework, which exhibits efficient mathematical representations and theoretical guarantees for the recovery of random missing values. Specifically, we strictly extend the single-mode transform in TTNN to a multimode transform with nonlinear activation, effectively capturing the intrinsic multimode spatiotemporal correlations and low-rankness of the traffic tensor, represented as location × location × time. To solve the nonconvex optimization problem, we design a proximal alternating minimization (PAM) algorithm with theoretical convergence guarantees. We suggest an Augmented Transform-based Tensor Nuclear Norm Families (ATTNNs) framework to enhance the imputation results of TTNN techniques, especially at very high missing rates. Extensive experiments on real datasets demonstrate that our proposed MNT-TNN and ATTNNs can outperform the compared state-of-the-art imputation methods, completing the benchmark of random missing traffic value imputation.

[LG-69] Graph Kolmogorov-Arnold Networks for Multi-Cancer Classification and Biomarker Identification: An Interpretable Multi-Omics Approach

Link: https://arxiv.org/abs/2503.22939
Authors: Fadi Alharbi, Nishant Budhiraja, Aleksandar Vakanski, Boyu Zhang, Murtada K. Elbashir, Mohanad Mohammed
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:The integration of multi-omics data presents a major challenge in precision medicine, requiring advanced computational methods for accurate disease classification and biological interpretation. This study introduces the Multi-Omics Graph Kolmogorov-Arnold Network (MOGKAN), a deep learning model that integrates messenger RNA, micro RNA sequences, and DNA methylation data with Protein-Protein Interaction (PPI) networks for accurate and interpretable cancer classification across 31 cancer types. MOGKAN employs a hybrid approach combining differential expression with DESeq2, Linear Models for Microarray (LIMMA), and Least Absolute Shrinkage and Selection Operator (LASSO) regression to reduce multi-omics data dimensionality while preserving relevant biological features. The model architecture is based on the Kolmogorov-Arnold theorem, using trainable univariate functions to enhance interpretability and feature analysis. MOGKAN achieves a classification accuracy of 96.28 percent and demonstrates low experimental variability, with a standard deviation that is reduced by 1.58 to 7.30 percent compared to Convolutional Neural Networks (CNNs) and Graph Neural Networks (GNNs). The biomarkers identified by MOGKAN have been validated as cancer-related markers through Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. The proposed model presents an ability to uncover molecular oncogenesis mechanisms by detecting phosphoinositide-binding substances and regulating sphingolipid cellular processes. By integrating multi-omics data with graph-based deep learning, our proposed approach demonstrates superior predictive performance and interpretability that has the potential to enhance the translation of complex multi-omics data into clinically actionable cancer diagnostics.

[LG-70] Learning Library Cell Representations in Vector Space

Link: https://arxiv.org/abs/2503.22900
Authors: Rongjian Liang, Yi-Chen Lu, Wen-Hao Liu, Haoxing Ren
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments:

Abstract:We propose Lib2Vec, a novel self-supervised framework to efficiently learn meaningful vector representations of library cells, enabling ML models to capture essential cell semantics. The framework comprises three key components: (1) an automated method for generating regularity tests to quantitatively evaluate how well cell representations reflect inter-cell relationships; (2) a self-supervised learning scheme that systematically extracts training data from Liberty files, removing the need for costly labeling; and (3) an attention-based model architecture that accommodates various pin counts and enables the creation of property-specific cell and arc embeddings. Experimental results demonstrate that Lib2Vec effectively captures functional and electrical similarities. Moreover, linear algebraic operations on cell vectors reveal meaningful relationships, such as vector(BUF) - vector(INV) + vector(NAND) ~ vector(AND), showcasing the framework’s nuanced representation capabilities. Lib2Vec also enhances downstream circuit learning applications, especially when labeled data is scarce.
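
The vector arithmetic quoted above, vector(BUF) - vector(INV) + vector(NAND) ~ vector(AND), can be illustrated with a toy sketch. The three-dimensional vectors below are hand-written for illustration (roughly "is inverting", "AND-like versus OR-like", and a constant bias); they are not Lib2Vec embeddings, and the names `CELLS`, `cosine`, and `analogy` are hypothetical.

```python
import math

# Hand-written toy vectors, NOT learned embeddings:
# (is_inverting, and_like(+1) / or_like(-1), constant bias)
CELLS = {
    "BUF":  (0.0,  0.0, 1.0),
    "INV":  (1.0,  0.0, 1.0),
    "AND":  (0.0,  1.0, 1.0),
    "NAND": (1.0,  1.0, 1.0),
    "OR":   (0.0, -1.0, 1.0),
    "NOR":  (1.0, -1.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, cells):
    """Return the cell nearest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(cells[a], cells[b], cells[c])]
    # Exclude the query cells so the answer is a different cell.
    candidates = [name for name in cells if name not in (a, b, c)]
    return max(candidates, key=lambda name: cosine(target, cells[name]))

print(analogy("BUF", "INV", "NAND", CELLS))  # AND
```

Subtracting INV from BUF removes the "inverting" component, so adding the result to NAND lands on its non-inverting counterpart. A regularity test of the kind the abstract describes would check that learned embeddings behave this way.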

[LG-71] Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models

Link: https://arxiv.org/abs/2503.22886
Authors: Ron Vainshtein, Zohar Rimon, Shie Mannor, Chen Tessler
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Recent advancements in imitation learning have led to transformer-based behavior foundation models (BFMs) that enable multi-modal, human-like control for humanoid agents. While excelling at zero-shot generation of robust behaviors, BFMs often require meticulous prompt engineering for specific tasks, potentially yielding suboptimal results. We introduce “Task Tokens”, a method to effectively tailor BFMs to specific tasks while preserving their flexibility. Our approach leverages the transformer architecture of BFMs to learn a new task-specific encoder through reinforcement learning, keeping the original BFM frozen. This allows incorporation of user-defined priors, balancing reward design and prompt engineering. By training a task encoder to map observations to tokens, used as additional BFM inputs, we guide performance improvement while maintaining the model’s diverse control characteristics. We demonstrate Task Tokens’ efficacy across various tasks, including out-of-distribution scenarios, and show their compatibility with other prompting modalities. Our results suggest that Task Tokens offer a promising approach for adapting BFMs to specific control tasks while retaining their generalization capabilities.

[LG-72] Harnessing uncertainty when learning through Equilibrium Propagation in neural networks

Link: https://arxiv.org/abs/2503.22810
Authors: Jonathan Peters, Philippe Talatchian
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Applied Physics (physics.app-ph)
Comments: 8 pages, 5 figures

Abstract:Equilibrium Propagation (EP) is a supervised learning algorithm that trains network parameters using local neuronal activity. This is in stark contrast to backpropagation, where updating the parameters of the network requires significant data shuffling. Avoiding data movement makes EP particularly compelling as a learning framework for energy-efficient training on neuromorphic systems. In this work, we assess the ability of EP to learn on hardware that contains physical uncertainties. This is particularly important for researchers concerned with hardware implementations of self-learning systems that utilize EP. Our results demonstrate that deep, multi-layer neural network architectures can be trained successfully using EP in the presence of finite uncertainties, up to a critical limit. This limit is independent of the training dataset, and can be scaled through sampling the network according to the central limit theorem. Additionally, we demonstrate improved model convergence and performance for finite levels of uncertainty on the MNIST, KMNIST and FashionMNIST datasets. Optimal performance is found for networks trained with uncertainties close to the critical limit. Our research supports future work to build self-learning hardware in situ with EP.

[LG-73] Policy Optimization and Multi-agent Reinforcement Learning for Mean-variance Team Stochastic Games

Link: https://arxiv.org/abs/2503.22779
Authors: Junkai Hu, Li Xia
Categories: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:We study a long-run mean-variance team stochastic game (MV-TSG), where each agent shares a common mean-variance objective for the system and takes actions independently to maximize it. MV-TSG has two main challenges. First, the variance metric is neither additive nor Markovian in a dynamic setting. Second, simultaneous policy updates of all agents lead to a non-stationary environment for each individual agent. Both challenges make dynamic programming inapplicable. In this paper, we study MV-TSGs from the perspective of sensitivity-based optimization. The performance difference and performance derivative formulas for joint policies are derived, which provide optimization information for MV-TSGs. We prove the existence of a deterministic Nash policy for this problem. Subsequently, we propose a Mean-Variance Multi-Agent Policy Iteration (MV-MAPI) algorithm with a sequential update scheme, where individual agent policies are updated one by one in a given order. We prove that the MV-MAPI algorithm converges to a first-order stationary point of the objective function. By analyzing the local geometry of stationary points, we derive specific conditions for stationary points to be (local) Nash equilibria, and further, strict local optima. To solve large-scale MV-TSGs in scenarios with unknown environmental parameters, we extend the idea of trust region methods to MV-MAPI and develop a multi-agent reinforcement learning algorithm named Mean-Variance Multi-Agent Trust Region Policy Optimization (MV-MATRPO). We derive a performance lower bound for each update of joint policies. Finally, numerical experiments on energy management in multiple microgrid systems are conducted.

[LG-74] Invariant Control Strategies for Active Flow Control using Graph Neural Networks

Link: https://arxiv.org/abs/2503.22775
Authors: Marius Kurz, Rohan Kaushik, Marcel Blind, Patrick Kopper, Anna Schwarz, Felix Rodach, Andrea Beck
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:Reinforcement learning (RL) has gained traction for active flow control tasks, with initial applications exploring drag mitigation via flow field augmentation around a two-dimensional cylinder. RL has since been extended to more complex turbulent flows and has shown significant potential in learning complex control strategies. However, such applications remain computationally challenging due to RL's sample inefficiency and the associated simulation costs. This is compounded by the lack of generalization capabilities of the trained policy networks, which are often implicitly tied to the input configurations of their training conditions. In this work, we propose the use of graph neural networks (GNNs) to address this particular limitation, effectively increasing the range of applicability and getting more value out of the upfront RL training cost. GNNs can naturally process unstructured, three-dimensional flow data, preserving spatial relationships without the constraints of a Cartesian grid. Additionally, they incorporate rotational, reflectional, and permutation invariance into the learned control policies, thus improving generalization and thereby removing the shortcomings of commonly used CNN or MLP architectures. To demonstrate the effectiveness of this approach, we revisit the well-established two-dimensional cylinder benchmark problem for active flow control. The RL training is implemented using Relexi, a high-performance RL framework, with flow simulations conducted in parallel using the high-order discontinuous Galerkin framework FLEXI. Our results show that GNN-based control policies achieve comparable performance to existing methods while benefiting from improved generalization properties. This work establishes GNNs as a promising architecture for RL-based flow control and highlights the capabilities of Relexi and FLEXI for large-scale RL applications in fluid dynamics.

[LG-75] Malicious and Unintentional Disclosure Risks in Large Language Models for Code Generation

Link: https://arxiv.org/abs/2503.22760
Authors: Rafiqul Rabin,Sean McGregor,Nick Judd
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: The 3rd International Workshop on Mining Software Repositories Applications for Privacy and Security (MSR4PS), co-located with SANER 2025

Abstract:This paper explores the risk that a large language model (LLM) trained for code generation on data mined from software repositories will generate content that discloses sensitive information included in its training data. We decompose this risk, known in the literature as "unintended memorization," into two components: unintentional disclosure (where an LLM presents secrets to users without the user seeking them out) and malicious disclosure (where an LLM presents secrets to an attacker equipped with partial knowledge of the training data). We observe that while existing work mostly anticipates malicious disclosure, unintentional disclosure is also a concern. We describe methods to assess unintentional and malicious disclosure risks side-by-side across different releases of training datasets and models. We demonstrate these methods through an independent assessment of the Open Language Model (OLMo) family of models and its Dolma training datasets. Our results show, first, that changes in data source and processing are associated with substantial changes in unintended memorization risk; second, that the same set of operational changes may increase one risk while mitigating another; and, third, that the risk of disclosing sensitive information varies not only by prompt strategies or test datasets but also by the types of sensitive information. These contributions rely on data mining to enable greater privacy and security testing required for the LLM training data supply chain.

[LG-76] Combating the Bullwhip Effect in Rival Online Food Delivery Platforms Using Deep Learning

Link: https://arxiv.org/abs/2503.22753
Authors: Tisha Ghosh
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Applications (stat.AP); Machine Learning (stat.ML)
Comments:

Abstract:The wastage of perishable items has led to significant health and economic crises, increasing business uncertainty and fluctuating customer demand. This issue is worsened by online food delivery services, where frequent and unpredictable orders create inefficiencies in supply chain management, contributing to the bullwhip effect. This effect results in stockouts, excess inventory, and inefficiencies. Accurate demand forecasting helps stabilize inventory, optimize supplier orders, and reduce waste. This paper presents a Third-Party Logistics (3PL) supply chain model involving restaurants, online food apps, and customers, along with a deep learning-based demand forecasting model using a two-phase Long Short-Term Memory (LSTM) network. Phase one, intra-day forecasting, captures short-term variations, while phase two, daily forecasting, predicts overall demand. A two-year dataset from January 2023 to January 2025 from Swiggy and Zomato is used, employing discrete event simulation and grid search for optimal LSTM hyperparameters. The proposed method is evaluated using RMSE, MAE, and R-squared score, with R-squared as the primary accuracy measure. Phase one achieves an R-squared score of 0.69 for Zomato and 0.71 for Swiggy with a training time of 12 minutes, while phase two improves to 0.88 for Zomato and 0.90 for Swiggy with a training time of 8 minutes. To mitigate demand fluctuations, restaurant inventory is dynamically managed using the newsvendor model, adjusted based on forecasted demand. The proposed framework significantly reduces the bullwhip effect, improving forecasting accuracy and supply chain efficiency. For phase one, supply chain instability decreases from 2.61 to 0.96, and for phase two, from 2.19 to 0.80. This demonstrates the model’s effectiveness in minimizing food waste and maintaining optimal restaurant inventory levels. 
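The abstract says restaurant inventory is managed with the newsvendor model, adjusted by forecasted demand, but does not give the formulation. As a rough illustration only, the sketch below applies the classic newsvendor critical-fractile rule. It assumes demand is normally distributed around the forecast; the function name, the cost parameters, and the normality assumption are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist


def newsvendor_order_quantity(forecast_mean, forecast_std,
                              underage_cost, overage_cost):
    """Order quantity from the classic newsvendor critical-fractile rule.

    Assumption (not from the paper): demand ~ Normal(forecast_mean, forecast_std).
    underage_cost: cost per unit of unmet demand (e.g. lost margin).
    overage_cost:  cost per unit of unsold stock (e.g. spoiled food).
    """
    if forecast_std <= 0:
        raise ValueError("forecast_std must be positive")
    if underage_cost <= 0 or overage_cost <= 0:
        raise ValueError("costs must be positive")
    # Critical ratio: the service level that balances the two costs.
    critical_ratio = underage_cost / (underage_cost + overage_cost)
    demand = NormalDist(mu=forecast_mean, sigma=forecast_std)
    # The optimal quantity is the demand quantile at the critical ratio.
    return demand.inv_cdf(critical_ratio)


if __name__ == "__main__":
    # Hypothetical numbers: an LSTM forecast of 100 orders with std 15.
    print(newsvendor_order_quantity(100.0, 15.0, underage_cost=4.0, overage_cost=1.0))
```

When spoilage is cheap relative to a lost sale, the rule orders above the forecast; when spoilage is expensive, it orders below it.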

[LG-77] Graph-Based Uncertainty-Aware Self-Training with Stochastic Node Labeling

Link: https://arxiv.org/abs/2503.22745
Authors: Tom Liu,Anna Wu,Chao Li
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Self-training has become a popular semi-supervised learning technique for leveraging unlabeled data. However, the over-confidence of pseudo-labels remains a key challenge. In this paper, we propose a novel graph-based uncertainty-aware self-training (GUST) framework to combat over-confidence in node classification. Drawing inspiration from the uncertainty integration idea introduced by Wang et al. \cite{wang2024uncertainty}, our method largely diverges from previous self-training approaches by focusing on stochastic node labeling grounded in the graph topology. Specifically, we deploy a Bayesian-inspired module to estimate node-level uncertainty, incorporate these estimates into the pseudo-label generation process via an expectation-maximization (EM)-like step, and iteratively update both node embeddings and adjacency-based transformations. Experimental results on several benchmark graph datasets demonstrate that our GUST framework achieves state-of-the-art performance, especially in settings where labeled data is extremely sparse.

[LG-78] Uncertainty-Aware Graph Self-Training with Expectation-Maximization Regularization

Link: https://arxiv.org/abs/2503.22744
Authors: Emily Wang,Michael Chen,Chao Li
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:In this paper, we propose a novel uncertainty-aware graph self-training approach for semi-supervised node classification. Our method introduces an Expectation-Maximization (EM) regularization scheme to incorporate an uncertainty mechanism during pseudo-label generation and model retraining. Unlike conventional graph self-training pipelines that rely on fixed pseudo-labels, our approach iteratively refines label confidences with an EM-inspired uncertainty measure. This ensures that the predictive model focuses on reliable graph regions while gradually incorporating ambiguous nodes. Inspired by prior work on uncertainty-aware self-training techniques \cite{wang2024uncertainty}, our framework is designed to handle noisy graph structures and feature spaces more effectively. Through extensive experiments on several benchmark graph datasets, we demonstrate that our method outperforms strong baselines by a margin of up to 2.5% in accuracy while maintaining lower variance in performance across multiple runs.

[LG-79] Adaptive State-Space Mamba for Real-Time Sensor Data Anomaly Detection

Link: https://arxiv.org/abs/2503.22743
Authors: Alice Zhang,Chao Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract:State-space modeling has emerged as a powerful paradigm for sequence analysis in various tasks such as natural language processing, time-series forecasting, and signal processing. In this work, we propose an Adaptive State-Space Mamba (ASSM) framework for real-time sensor data anomaly detection. While state-space models have been previously employed for image processing applications (e.g., style transfer \cite{wang2024stylemamba}), our approach leverages the core idea of sequential hidden states to tackle a significantly different domain: detecting anomalies on streaming sensor data. In particular, we introduce an adaptive gating mechanism that dynamically modulates the hidden state update based on contextual and learned statistical cues. This design ensures that our model remains computationally efficient and scalable, even under rapid data arrival rates. Extensive experiments on real-world and synthetic sensor datasets demonstrate that our method achieves superior detection performance compared to existing baselines. Our approach is easily extensible to other time-series tasks that demand rapid and reliable detection capabilities.

[LG-80] Concept Map Assessment Through Structure Classification

Link: https://arxiv.org/abs/2503.22741
Authors: Laís P. V. Vossen,Isabela Gasparini,Elaine H. T. Oliveira,Berrit Czinczel,Ute Harms,Lukas Menzel,Sebastian Gombert,Knut Neumann,Hendrik Drachsler
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Due to their versatility, concept maps are used in various educational settings and serve as tools that enable educators to comprehend students’ knowledge construction. An essential component for analyzing a concept map is its structure, which can be categorized into three distinct types: spoke, network, and chain. Understanding the predominant structure in a map offers insights into the student’s depth of comprehension of the subject. Therefore, this study examined 317 distinct concept map structures, classifying them into one of the three types, and used statistical and descriptive information from the maps to train multiclass classification models. As a result, we achieved an 86% accuracy in classification using a Decision Tree. This promising outcome can be employed in concept map assessment systems to provide real-time feedback to the student.

[LG-81] ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning

Link: https://arxiv.org/abs/2503.22738
Authors: Zhaorun Chen,Mintong Kang,Bo Li
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:

Abstract:Autonomous agents powered by foundation models have seen widespread adoption across various real-world applications. However, they remain highly vulnerable to malicious instructions and attacks, which can result in severe consequences such as privacy breaches and financial losses. More critically, existing guardrails for LLMs are not applicable due to the complex and dynamic nature of agents. To tackle these challenges, we propose ShieldAgent, the first guardrail agent designed to enforce explicit safety policy compliance for the action trajectory of other protected agents through logical reasoning. Specifically, ShieldAgent first constructs a safety policy model by extracting verifiable rules from policy documents and structuring them into a set of action-based probabilistic rule circuits. Given the action trajectory of the protected agent, ShieldAgent retrieves relevant rule circuits and generates a shielding plan, leveraging its comprehensive tool library and executable code for formal verification. In addition, given the lack of guardrail benchmarks for agents, we introduce ShieldAgent-Bench, a dataset with 3K safety-related pairs of agent instructions and action trajectories, collected via SOTA attacks across 6 web environments and 7 risk categories. Experiments show that ShieldAgent achieves SOTA on ShieldAgent-Bench and three existing benchmarks, outperforming prior methods by 11.3% on average with a high recall of 90.1%. Additionally, ShieldAgent reduces API queries by 64.7% and inference time by 58.2%, demonstrating its high precision and efficiency in safeguarding agents.

[LG-82] A Methodology to extract Geo-Referenced Standard Routes from AIS Data

Link: https://arxiv.org/abs/2503.22734
Authors: Michela Corvino,Filippo Daffinà,Chiara Francalanci,Paolo Giacomazzi,Martina Magliani,Paolo Ravanelli,Torbjorn Stahl
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Maritime AIS (Automatic Identification Systems) data serve as a valuable resource for studying vessel behavior. This study proposes a methodology to analyze routes between maritime points of interest and extract geo-referenced standard routes, as maritime patterns of life, from raw AIS data. The underlying assumption is that ships adhere to consistent patterns when travelling in certain maritime areas due to geographical, environmental, or economic factors. Deviations from these patterns may be attributed to weather conditions, seasonality, or illicit activities. This enables maritime surveillance authorities to analyze the navigational behavior between ports, providing insights on vessel route patterns, possibly categorized by vessel characteristics (type, flag, or size). Our methodological process begins by segmenting AIS data into distinct routes using a finite state machine (FSM), which describes routes as segments connecting pairs of points of interest. The extracted segments are aggregated based on their departure and destination ports and then modelled using iterative density-based clustering to connect these ports. The clustering parameters are assigned manually to a sample and then extended to the entire dataset using linear regression. Overall, the approach proposed in this paper is unsupervised and does not require any ground truth to be trained. The approach has been tested on a six-year AIS dataset covering the Arctic region and the Europe, Middle East, North Africa areas. The total size of our dataset is 1.15 Tbytes. The approach has proved effective in extracting standard routes, with less than 5% outliers, mostly due to routes with either their departure or their destination port not included in the test areas.
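The abstract describes the first step as a finite state machine that cuts an AIS track into segments connecting pairs of points of interest. The sketch below shows one way such a state machine could look. The three states, the input encoding (one point-of-interest label per AIS message, or `None` when the vessel is at sea), and the function name are all hypothetical simplifications, not the authors' implementation.

```python
def segment_routes(poi_sequence):
    """Split one vessel's track into (departure, destination, start, end) segments.

    poi_sequence: list with one entry per AIS message, holding the name of the
    point of interest the vessel is inside, or None when it is at sea.
    start and end are the indices of the first and last at-sea messages.
    """
    segments = []
    state = "UNKNOWN"      # no point of interest visited yet
    departure = None
    start_index = None
    for i, poi in enumerate(poi_sequence):
        if state == "UNKNOWN":
            if poi is not None:
                state, departure = "IN_POI", poi
        elif state == "IN_POI":
            if poi is None:
                # Vessel left the point of interest: a segment begins.
                state, start_index = "UNDERWAY", i
            else:
                departure = poi
        elif state == "UNDERWAY":
            if poi is not None:
                # Vessel reached a point of interest: the segment ends.
                segments.append((departure, poi, start_index, i - 1))
                state, departure = "IN_POI", poi
    return segments


if __name__ == "__main__":
    track = [None, "A", "A", None, None, "B", "B", None, "C"]
    print(segment_routes(track))
```

Messages before the first point of interest and after the last one are dropped, because they do not belong to a complete departure-destination pair.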

[LG-83] RBFleX-NAS: Training-Free Neural Architecture Search Using Radial Basis Function Kernel and Hyperparameter Detection

Link: https://arxiv.org/abs/2503.22733
Authors: Tomomasa Yamasaki,Zhehui Wang,Tao Luo,Niangjun Chen,Bo Wang
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 17 figures, IEEE Transactions on Neural Networks and Learning Systems

Abstract:Neural Architecture Search (NAS) is an automated technique to design optimal neural network architectures for a specific workload. Conventionally, evaluating candidate networks in NAS involves extensive training, which requires significant time and computational resources. To address this, training-free NAS has been proposed to expedite network evaluation with minimal search time. However, state-of-the-art training-free NAS algorithms struggle to precisely distinguish well-performing networks from poorly-performing networks, resulting in inaccurate performance predictions and consequently sub-optimal top-1 network accuracy. Moreover, they are less effective in activation function exploration. To tackle the challenges, this paper proposes RBFleX-NAS, a novel training-free NAS framework that accounts for both activation outputs and input features of the last layer with a Radial Basis Function (RBF) kernel. We also present a detection algorithm to identify optimal hyperparameters using the obtained activation outputs and input feature maps. We verify the efficacy of RBFleX-NAS over a variety of NAS benchmarks. RBFleX-NAS significantly outperforms state-of-the-art training-free NAS methods in terms of top-1 accuracy, achieving this with short search time in NAS-Bench-201 and NAS-Bench-SSS. In addition, it demonstrates higher Kendall correlation compared to layer-based training-free NAS algorithms. Furthermore, we propose NAFBee, a new activation design space that extends the activation type to encompass various commonly used functions. In this extended design space, RBFleX-NAS demonstrates its superiority by accurately identifying the best-performing network during activation function search, providing a significant advantage over other NAS algorithms.

[LG-84] MoRE-LLM: Mixture of Rule Experts Guided by a Large Language Model ICDM

Link: https://arxiv.org/abs/2503.22731
Authors: Alexander Koebler,Ingo Thon,Florian Buettner
Categories: Machine Learning (cs.LG)
Comments: 2024 IEEE International Conference on Data Mining (ICDM)

Abstract:To ensure the trustworthiness and interpretability of AI systems, it is essential to align machine learning models with human domain knowledge. This can be a challenging and time-consuming endeavor that requires close communication between data scientists and domain experts. Recent leaps in the capabilities of Large Language Models (LLMs) can help alleviate this burden. In this paper, we propose a Mixture of Rule Experts guided by a Large Language Model (MoRE-LLM) which combines a data-driven black-box model with knowledge extracted from an LLM to enable domain knowledge-aligned and transparent predictions. While the introduced Mixture of Rule Experts (MoRE) steers the discovery of local rule-based surrogates during training and their utilization for the classification task, the LLM is responsible for enhancing the domain knowledge alignment of the rules by correcting and contextualizing them. Importantly, our method does not rely on access to the LLM during test time and ensures interpretability while not being prone to LLM-based confabulations. We evaluate our method on several tabular data sets and compare its performance with interpretable and non-interpretable baselines. Besides performance, we evaluate our grey-box method with respect to the utilization of interpretable rules. In addition to our quantitative evaluation, we shed light on how the LLM can provide additional context to strengthen the comprehensibility and trustworthiness of the model’s reasoning process.

[LG-85] Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring

Link: https://arxiv.org/abs/2503.22730
Authors: Abdoulaye Sakho (LPSM),Emmanuel Malherbe,Carl-Erik Gauthier,Erwan Scornet (LPSM)
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This study investigates rare event detection on tabular data within binary classification. Standard techniques to handle class imbalance include SMOTE, which generates synthetic samples from the minority class. However, SMOTE is intrinsically designed for continuous input variables. In fact, despite SMOTE-NC, its default extension to handle mixed features (continuous and categorical variables), very few works propose procedures to synthesize mixed features. On the other hand, many real-world classification tasks, such as in the banking sector, deal with mixed features, which have a significant impact on predictive performance. To this end, we introduce MGS-GRF, an oversampling strategy designed for mixed features. This method uses a kernel density estimator with locally estimated full-rank covariances to generate continuous features, while categorical ones are drawn from the original samples through a generalized random forest. Empirically, contrary to SMOTE-NC, we show that MGS-GRF exhibits two important properties: (i) coherence, i.e., the ability to only generate combinations of categorical features that are already present in the original dataset, and (ii) association, i.e., the ability to preserve the dependence between continuous and categorical features. We also evaluate the predictive performance of LightGBM classifiers trained on data sets augmented with synthetic samples from various strategies. Our comparison is performed on simulated and public real-world data sets, as well as on a private data set from a leading financial institution. We observe that synthetic procedures that have the properties of coherence and association display better predictive performance in terms of various predictive metrics (PR and ROC AUC…), with MGS-GRF being the best one. Furthermore, our method exhibits promising results for the private banking application, with the development pipeline being compliant with regulatory constraints.

[LG-86] Uncertainty Weighted Gradients for Model Calibration

Link: https://arxiv.org/abs/2503.22725
Authors: Jinxu Lin,Linwei Tao,Minjing Dong,Chang Xu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Model calibration is essential for ensuring that the predictions of deep neural networks accurately reflect true probabilities in real-world classification tasks. However, deep networks often produce over-confident or under-confident predictions, leading to miscalibration. Various methods have been proposed to address this issue by designing effective loss functions for calibration, such as focal loss. In this paper, we analyze the effectiveness of focal loss and provide a unified loss framework of focal loss and its variants, where we mainly attribute their superiority in model calibration to the loss weighting factor that estimates sample-wise uncertainty. Based on our analysis, existing loss functions fail to achieve optimal calibration performance due to two main issues: misalignment during optimization and insufficient precision in uncertainty estimation. Specifically, focal loss cannot align sample uncertainty with gradient scaling, and a single logit cannot indicate the uncertainty. To address these issues, we reformulate the optimization from the perspective of gradients, which focuses on uncertain samples. Meanwhile, we propose using the Brier Score as the loss weight factor, which provides a more accurate uncertainty estimation via all the logits. Extensive experiments on various models and datasets demonstrate that our method achieves state-of-the-art (SOTA) performance.
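To make the point about using all the logits concrete, the sketch below computes the per-sample multi-class Brier score from a logit vector. It only illustrates the quantity named in the abstract. How the paper turns this score into a gradient weight is not stated there, so no weighting scheme is shown, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def brier_score(logits, true_class):
    """Per-sample multi-class Brier score: sum_k (p_k - y_k)^2.

    Ranges from 0 (confident and correct) to 2 (confident and wrong).
    Unlike the true-class probability alone, it depends on every class
    probability, and therefore on every logit.
    """
    probs = softmax(logits)
    return sum((p - (1.0 if k == true_class else 0.0)) ** 2
               for k, p in enumerate(probs))


if __name__ == "__main__":
    # Same true-class probability (0.5), different spread over wrong classes.
    print(brier_score([math.log(0.5), math.log(0.5), math.log(1e-12)], 0))
    print(brier_score([math.log(0.5), math.log(0.25), math.log(0.25)], 0))
```

The two example calls share the same true-class probability but yield different scores, which is the kind of information a weight based on a single logit cannot capture.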

[LG-87] A Spatial-temporal Deep Probabilistic Diffusion Model for Reliable Hail Nowcasting with Radar Echo Extrapolation

Link: https://arxiv.org/abs/2503.22724
Authors: Haonan Shi,Long Tian,Jie Tao,Yufei Li,Liming Wang,Xiyang Liu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Hail is a considerable contributor to meteorological disasters, and there is a great need to mitigate its socioeconomic effects through precise forecasts that have high resolution, long lead times, and local details with large landscapes. Existing medium-range weather forecasting methods primarily rely on changes in upper air currents and cloud layers to predict precipitation events, such as heavy rainfall, which are unsuitable for hail nowcasting since hail is mainly caused by low-altitude local strong convection associated with terrain. Additionally, radar captures the status of low cloud layers, such as water vapor, droplets, and ice crystals, providing rich signals suitable for hail nowcasting. To this end, we introduce a Spatial-Temporal gEnerAtive Model called SteamCast for hail nowcasting with radar echo extrapolation. It is a deep probabilistic diffusion model based on spatial-temporal representations including radar echoes as well as their position/time embeddings, which we trained on a historical reanalysis archive from Yan’an Meteorological Bureau in China, where crops such as apples suffer greatly from hail damage. Considering the short-term nature of hail, SteamCast provides 30-minute nowcasts at 6-minute intervals for a single radar reflectivity variable, across 9 different vertical angles, on a latitude-longitude grid with approximately 1 km * 1 km resolution per pixel in Yan’an City, China. By successfully fusing the spatial-temporal features of radar echoes, SteamCast delivers competitive, and in some cases superior, results compared to other deep learning-based models such as PredRNN and VMRNN.

[LG-88] PlatMetaX: An Integrated MATLAB platform for Meta-Black-Box Optimization

Link: https://arxiv.org/abs/2503.22722
Authors: Xu Yang,Rui Wang,Kaiwen Li,Wenhua Li,Tao Zhang,Fujun He
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:The landscape of optimization problems has become increasingly complex, necessitating the development of advanced optimization techniques. Meta-Black-Box Optimization (MetaBBO), which involves refining the optimization algorithms themselves via meta-learning, has emerged as a promising approach. Recognizing the limitations in existing platforms, we present PlatMetaX, a novel MATLAB platform for MetaBBO with reinforcement learning. PlatMetaX integrates the strengths of MetaBox and PlatEMO, offering a comprehensive framework for developing, evaluating, and comparing optimization algorithms. The platform is designed to handle a wide range of optimization problems, from single-objective to multi-objective, and is equipped with a rich set of baseline algorithms and evaluation metrics. We demonstrate the utility of PlatMetaX through extensive experiments and provide insights into its design and implementation. PlatMetaX is available at: this https URL.

[LG-89] PowerGNN: A Topology-Aware Graph Neural Network for Electricity Grids

Link: https://arxiv.org/abs/2503.22721
Authors: Dhruv Suri,Mohak Mangal
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:The increasing penetration of renewable energy sources introduces significant variability and uncertainty in modern power systems, making accurate state prediction critical for reliable grid operation. Conventional forecasting methods often neglect the power grid’s inherent topology, limiting their ability to capture complex spatio-temporal dependencies. This paper proposes a topology-aware Graph Neural Network (GNN) framework for predicting power system states under high renewable integration. We construct a graph-based representation of the power network, modeling buses and transmission lines as nodes and edges, and introduce a specialized GNN architecture that integrates GraphSAGE convolutions with Gated Recurrent Units (GRUs) to model both spatial and temporal correlations in system dynamics. The model is trained and evaluated on the NREL 118 test system using realistic, time-synchronous renewable generation profiles. Our results show that the proposed GNN outperforms baseline approaches including fully connected neural networks, linear regression, and rolling mean models, achieving substantial improvements in predictive accuracy. The GNN achieves average RMSEs of 0.13 to 0.17 across all predicted variables and demonstrates consistent performance across spatial locations and operational conditions. These results highlight the potential of topology-aware learning for scalable and robust power system forecasting in future grids with high renewable penetration.

[LG-90] Risk-Calibrated Affective Speech Recognition via Conformal Coverage Guarantees: A Stochastic Calibrative Framework for Emergent Uncertainty Quantification

Link: https://arxiv.org/abs/2503.22712
Authors: Zijun Jia
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Traffic safety challenges arising from extreme driver emotions highlight the urgent need for reliable emotion recognition systems. Traditional deep learning approaches in speech emotion recognition suffer from overfitting and poorly calibrated confidence estimates. We propose a framework integrating Conformal Prediction (CP) and Risk Control, using Mel-spectrogram features processed through a pre-trained convolutional neural network. Our key innovation is the development of a nonconformity score that heuristically measures how closely a classifier’s predictions align with given inputs. Through calibration samples, we compute this score and derive a statistically rigorous threshold based on user-specified risk level $\alpha$, constructing prediction sets with provable coverage guarantees ($\geq 1-\alpha$). The Risk Control framework enables task-specific adaptation through customizable loss functions, dynamically adjusting prediction set sizes while maintaining coverage guarantees. Cross-dataset experiments on IEMOCAP and TESS demonstrate: 1) Strict coverage guarantee, 2) Significant negative correlation between Average Prediction Set Size (APSS) and $\alpha$, revealing reduced model uncertainty under high-risk conditions. We further propose APSS as a novel metric for evaluating classification uncertainty. This approach enhances speech emotion recognition reliability, with direct applications in intelligent transportation systems and real-time emotion monitoring.
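The calibrate-then-threshold procedure described above follows the general shape of split conformal prediction, which can be sketched in a few lines. The abstract does not define the paper's nonconformity score, so this sketch assumes a common stand-in: one minus the probability the classifier assigns to a class. The function names are illustrative.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Threshold from calibration nonconformity scores for risk level alpha.

    Uses the standard finite-sample rank ceil((n + 1) * (1 - alpha)).
    Returns infinity when the calibration set is too small for that rank,
    in which case every class is included in the prediction set.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(calibration_scores)[rank - 1]


def prediction_set(class_probs, threshold):
    """Indices of all classes whose assumed score 1 - p_k is within the threshold."""
    return [k for k, p in enumerate(class_probs) if 1.0 - p <= threshold]


if __name__ == "__main__":
    # Hypothetical calibration scores: 1 - p(true class) on held-out samples.
    calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    q_hat = conformal_threshold(calibration, alpha=0.5)
    print(q_hat, prediction_set([0.7, 0.2, 0.1], q_hat))
```

A smaller `alpha` raises the threshold and enlarges the prediction sets, which is consistent with the negative correlation between APSS and $\alpha$ reported in the abstract.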

[LG-91] From Occurrence to Consequence: A Comprehensive Data-driven Analysis of Building Fire Risk

Link: https://arxiv.org/abs/2503.22689
Authors: Chenzhi Ma,Hongru Du,Shengzhi Luan,Ensheng Dong,Lauren M. Gardner,Thomas Gernay
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Applications (stat.AP)
Comments:

Abstract:Building fires pose a persistent threat to life, property, and infrastructure, emphasizing the need for advanced risk mitigation strategies. This study presents a data-driven framework analyzing U.S. fire risks by integrating over one million fire incident reports with diverse fire-relevant datasets, including social determinants, building inventories, weather conditions, and incident-specific factors. By adapting machine learning models, we identify key risk factors influencing fire occurrence and consequences. Our findings show that vulnerable communities, characterized by socioeconomic disparities or the prevalence of outdated or vacant buildings, face higher fire risks. Incident-specific factors, such as fire origins and safety features, strongly influence fire consequences. Buildings equipped with fire detectors and automatic extinguishing systems experience significantly lower fire spread and injury risks. By pinpointing high-risk areas and populations, this research supports targeted interventions, including mandating fire safety systems and providing subsidies for disadvantaged communities. These measures can enhance fire prevention, protect vulnerable groups, and promote safer, more equitable communities.

[LG-92] Truth in Text: A Meta-Analysis of ML-Based Cyber Information Influence Detection Approaches

Link: https://arxiv.org/abs/2503.22686
Authors: Jason M. Pittman
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 15 pages, 2 figures, 5 tables, 2 appendices

Abstract:Cyber information influence, or disinformation in general terms, is widely regarded as one of the biggest threats to social progress and government stability. From US presidential elections to European Union referendums and down to regional news reporting of wildfires, lies and post-truths have normalized radical decision-making. Accordingly, there has been an explosion in research seeking to detect disinformation in online media. The frontier of disinformation detection research is leveraging a variety of ML techniques such as traditional ML algorithms like Support Vector Machines, Random Forest, and Naïve Bayes. Other research has applied deep learning models including Convolutional Neural Networks, Long Short-Term Memory networks, and transformer-based architectures. Despite the overall success of such techniques, the literature demonstrates inconsistencies when viewed holistically, which limits our understanding of their true effectiveness. Accordingly, this work employed a two-stage meta-analysis to (a) demonstrate an overall meta statistic for ML model effectiveness in detecting disinformation and (b) investigate the same by subgroups of ML model types. The study found the majority of the 81 ML detection techniques sampled have greater than 80% accuracy, with a mean sample effectiveness of 79.18% accuracy. Meanwhile, subgroups demonstrated no statistically significant difference between approaches but revealed high within-group variance. Based on the results, this work recommends future work in replication and development of detection methods operating at the ML model level.

[LG-93] Solving the Best Subset Selection Problem via Suboptimal Algorithms

Link: https://arxiv.org/abs/2503.24300
Authors: Vikram Singh,Min Sun
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Best subset selection in linear regression is well known to be nonconvex and computationally challenging to solve, as the number of possible subsets grows rapidly with increasing dimensionality of the problem. As a result, finding the global optimal solution via an exact optimization method for a problem with dimensions in the thousands may take an impractical amount of CPU time. This suggests the importance of finding suboptimal procedures that can provide good approximate solutions using much less computational effort than exact methods. In this work, we introduce a new procedure and compare it with other popular suboptimal algorithms to solve the best subset selection problem. Extensive computational experiments using synthetic and real data have been performed. The results provide insights into the performance of these methods in different data settings. The new procedure is observed to be a competitive suboptimal algorithm for solving the best subset selection problem for high-dimensional data.

[LG-94] Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach

Link: https://arxiv.org/abs/2503.24271
Authors: Francesco Pio Ramunno, Paolo Massa, Vitaliy Kinakh, Brandon Panos, André Csillaghy, Slava Voloshynovskiy
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: Accepted for publication in A&A

Click to view abstract

Abstract:The spatial properties of the solar magnetic field are crucial to decoding the physical processes in the solar interior and their interplanetary effects. However, observations from older instruments, such as the Michelson Doppler Imager (MDI), have limited spatial or temporal resolution, which hinders the ability to study small-scale solar features in detail. Super-resolving these older datasets is essential for uniform analysis across different solar cycles, enabling better characterization of solar flares, active regions, and magnetic network dynamics. In this work, we introduce a novel diffusion model approach for Super-Resolution and we apply it to MDI magnetograms to match the higher-resolution capabilities of the Helioseismic and Magnetic Imager (HMI). By training a Latent Diffusion Model (LDM) with residuals on downscaled HMI data and fine-tuning it with paired MDI/HMI data, we can enhance the resolution of MDI observations from 2"/pixel to 0.5"/pixel. We evaluate the quality of the reconstructed images by means of classical metrics (e.g., PSNR, SSIM, FID and LPIPS) and we check if physical properties, such as the unsigned magnetic flux or the size of an active region, are preserved. We compare our model with different variations of LDM and Denoising Diffusion Probabilistic models (DDPMs), but also with two deterministic architectures already used in the past for performing the Super-Resolution task. Furthermore, we show with an analysis in the Fourier domain that the LDM with residuals can resolve features smaller than 2", and due to the probabilistic nature of the LDM, we can assess their reliability, in contrast with the deterministic models. Future studies aim to super-resolve the temporal scale of the solar MDI instrument so that we can also have a better overview of the dynamics of the old events.

[LG-95] Data-driven construction of a generalized kinetic collision operator from molecular dynamics

Link: https://arxiv.org/abs/2503.24208
Authors: Yue Zhao, William Burby, Andrew Christlieb, Huan Lei
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA); Plasma Physics (physics.plasm-ph)
Comments:

Click to view abstract

Abstract:We introduce a data-driven approach to learn a generalized kinetic collision operator directly from molecular dynamics. Unlike the conventional (e.g., Landau) models, the present operator takes an anisotropic form that accounts for a second energy transfer arising from the collective interactions between the pair of collision particles and the environment. Numerical results show that preserving the broadly overlooked anisotropic nature of the collision energy transfer is crucial for predicting the plasma kinetics with non-negligible correlations, where the Landau model shows limitations.

[LG-96] A Comparison of Parametric Dynamic Mode Decomposition Algorithms for Thermal-Hydraulics Applications

Link: https://arxiv.org/abs/2503.24205
Authors: Stefano Riva, Andrea Missaglia, Carolina Introini, In Cheol Bang, Antonio Cammi
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In recent years, algorithms aiming at learning models from available data have become quite popular due to two factors: 1) the significant developments in Artificial Intelligence techniques and 2) the availability of large amounts of data. Nevertheless, this topic has already been addressed by methodologies belonging to the Reduced Order Modelling framework, of which perhaps the most famous equation-free technique is Dynamic Mode Decomposition. This algorithm aims to learn the best linear model that represents the physical phenomena described by a time series dataset: its output is a best state operator of the underlying dynamical system that can be used, in principle, to advance the original dataset in time even beyond its span. However, in its standard formulation, this technique cannot deal with parametric time series, meaning that a different linear model has to be derived for each parameter realization. Research on this is ongoing, and some versions of a parametric Dynamic Mode Decomposition already exist. This work contributes to this research field by comparing the different algorithms presently deployed and assessing their advantages and shortcomings compared to each other. To this aim, three different thermal-hydraulics problems are considered: two benchmark ‘flow over cylinder’ test cases at diverse Reynolds numbers, whose datasets are, respectively, obtained with the FEniCS finite element solver and retrieved from the CFDbench dataset, and the DYNASTY experimental facility operating at Politecnico di Milano, which studies the natural circulation established by internally heated fluids for Generation IV nuclear applications, whose dataset was generated using the RELAP5 nodal solver.
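The "best linear model" idea can be made concrete with a minimal sketch. The code below fits the least-squares operator A that maps each snapshot to the next one, which is the core of standard (non-parametric) Dynamic Mode Decomposition. It deliberately omits the SVD-based rank reduction used in practice and says nothing about the parametric variants compared in the paper. The two-dimensional toy system and all names are assumptions made for illustration.

```python
import math

def fit_linear_operator(snapshots):
    """Least-squares A minimising sum_k ||x_{k+1} - A x_k||^2 for 2-D snapshots."""
    X, Y = snapshots[:-1], snapshots[1:]
    # C = Y X^T and G = X X^T (both 2x2); the solution is A = C G^{-1}.
    C = [[sum(y[i] * x[j] for x, y in zip(X, Y)) for j in range(2)] for i in range(2)]
    G = [[sum(x[i] * x[j] for x in X) for j in range(2)] for i in range(2)]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    G_inv = [[G[1][1] / det, -G[0][1] / det],
             [-G[1][0] / det, G[0][0] / det]]
    return [[sum(C[i][k] * G_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def step(A, x):
    """Advance the state by one time step with the linear model."""
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]

# Toy time series (illustrative): a damped rotation in the plane.
theta, rho = 0.3, 0.95
A_true = [[rho * math.cos(theta), -rho * math.sin(theta)],
          [rho * math.sin(theta), rho * math.cos(theta)]]
snapshots = [[1.0, 0.0]]
for _ in range(20):
    snapshots.append(step(A_true, snapshots[-1]))

A_fit = fit_linear_operator(snapshots)
forecast = step(A_fit, snapshots[-1])  # one step beyond the span of the data
```

A fitted operator of this kind is tied to the single parameter value that generated the snapshots, which is the limitation that parametric formulations aim to lift.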

[LG-97] Inductive Graph Representation Learning with Quantum Graph Neural Networks

Link: https://arxiv.org/abs/2503.24111
Authors: Arthur M. Faria, Ignacio F. Graña, Savvas Varsamopoulos
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 18 pages, 6 figures

Click to view abstract

Abstract:Quantum Graph Neural Networks (QGNNs) present a promising approach for combining quantum computing with graph-structured data processing. While classical Graph Neural Networks (GNNs) are renowned for their scalability and robustness, existing QGNNs often lack flexibility due to graph-specific quantum circuit designs, limiting their applicability to a narrower range of graph-structured problems, falling short of real-world scenarios. To address these limitations, we propose a versatile QGNN framework inspired by the classical GraphSAGE approach, utilizing quantum models as aggregators. In this work, we integrate established techniques for inductive representation learning on graphs with parametrized quantum convolutional and pooling layers, effectively bridging classical and quantum paradigms. The convolutional layer is flexible, enabling tailored designs for specific problems. Benchmarked on a node regression task with the QM9 dataset, we demonstrate that our framework successfully models a non-trivial molecular dataset, achieving performance comparable to classical GNNs. In particular, we show that our quantum approach exhibits robust generalization across molecules with varying numbers of atoms without requiring circuit modifications, slightly outperforming classical GNNs. Furthermore, we numerically investigate the scalability of the QGNN framework. Specifically, we demonstrate the absence of barren plateaus in our architecture as the number of qubits increases, suggesting that the proposed quantum model can be extended to handle larger and more complex graph-based problems effectively.

[LG-98] New universal operator approximation theorem for encoder-decoder architectures (Preprint)

Link: https://arxiv.org/abs/2503.24092
Authors: Janek Gödeke, Pascal Fernsel
Categories: Functional Analysis (math.FA); Machine Learning (cs.LG); General Topology (math.GN)
Comments: 34 pages

Click to view abstract

Abstract:Motivated by the rapidly growing field of mathematics for operator approximation with neural networks, we present a novel universal operator approximation theorem for a broad class of encoder-decoder architectures. In this study, we focus on approximating continuous operators in \mathcal{C}(\mathcal{X}, \mathcal{Y}), where \mathcal{X} and \mathcal{Y} are infinite-dimensional normed or metric spaces, and we consider uniform convergence on compact subsets of \mathcal{X}. Unlike standard results in the operator learning literature, we investigate the case where the approximating operator sequence can be chosen independently of the compact sets. Taking a topological perspective, we analyze different types of operator approximation and show that compact-set-independent approximation is a strictly stronger property in most relevant operator learning frameworks. To establish our results, we introduce a new approximation property tailored to encoder-decoder architectures, which enables us to prove a universal operator approximation theorem ensuring uniform convergence on every compact subset. This result unifies and extends existing universal operator approximation theorems for various encoder-decoder architectures, including classical DeepONets, BasisONets, special cases of MIONets, architectures based on frames and other related approaches.
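The two notions of approximation contrasted in the abstract can be written side by side. The formulation below is one reading of the abstract, not the paper's own statement. The symbols G (target operator), G_n (approximants), K (compact set) and d_{\mathcal{Y}} (metric on \mathcal{Y}) are notation assumed here for illustration.

```latex
% Compact-set-dependent approximation: the sequence may change with K.
\forall K \subset \mathcal{X} \text{ compact} \;\;
\exists \, (G_n^{K})_{n \in \mathbb{N}} : \quad
\sup_{x \in K} d_{\mathcal{Y}}\bigl(G_n^{K}(x), G(x)\bigr) \xrightarrow[n \to \infty]{} 0

% Compact-set-independent approximation: one sequence works for every K.
\exists \, (G_n)_{n \in \mathbb{N}} \;\;
\forall K \subset \mathcal{X} \text{ compact} : \quad
\sup_{x \in K} d_{\mathcal{Y}}\bigl(G_n(x), G(x)\bigr) \xrightarrow[n \to \infty]{} 0
```

The only difference is the order of the quantifiers. Swapping them is what makes the second property the stronger one, since a single sequence must then serve all compact sets at once.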

[LG-99] Controlled Latent Diffusion Models for 3D Porous Media Reconstruction

Link: https://arxiv.org/abs/2503.24083
Authors: Danilo Naiff, Bernardo P. Schaeffer, Gustavo Pires, Dragan Stojkovic, Thomas Rapstine, Fabio Ramos
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments: 58 pages

Click to view abstract

Abstract:Three-dimensional digital reconstruction of porous media presents a fundamental challenge in geoscience, requiring simultaneous resolution of fine-scale pore structures while capturing representative elementary volumes. We introduce a computational framework that addresses this challenge through latent diffusion models operating within the EDM framework. Our approach reduces dimensionality via a custom variational autoencoder trained in binary geological volumes, improving efficiency and also enabling the generation of larger volumes than previously possible with diffusion models. A key innovation is our controlled unconditional sampling methodology, which enhances distribution coverage by first sampling target statistics from their empirical distributions, then generating samples conditioned on these values. Extensive testing on four distinct rock types demonstrates that conditioning on porosity - a readily computable statistic - is sufficient to ensure a consistent representation of multiple complex properties, including permeability, two-point correlation functions, and pore size distributions. The framework achieves better generation quality than pixel-space diffusion while enabling significantly larger volume reconstruction (256-cube voxels) with substantially reduced computational requirements, establishing a new state-of-the-art for digital rock physics applications.

[LG-100] Riemannian Multiplicative Update for Sparse Simplex constraint using oblique rotation manifold

Link: https://arxiv.org/abs/2503.24075
Authors: Flavia Esposito, Andersen Ang
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure

Click to view abstract

Abstract:We propose a new manifold optimization method to solve low-rank problems with sparse simplex constraints (the variables are simultaneously nonnegative, sparse, and sum to 1), which are beneficial in applications. The proposed approach exploits oblique rotation manifolds, rewrites the problem, and introduces a new Riemannian optimization method. Experiments on synthetic datasets show the effectiveness of the proposed method compared to the standard Euclidean method.

[LG-101] Physics-informed neural networks for hidden boundary detection and flow field reconstruction

Link: https://arxiv.org/abs/2503.24074
Authors: Yongzheng Zhu, Weizheng Chen, Jian Deng, Xin Bian
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 21 pages, 17 figures

Click to view abstract

Abstract:Simultaneously detecting hidden solid boundaries and reconstructing flow fields from sparse observations poses a significant inverse challenge in fluid mechanics. This study presents a physics-informed neural network (PINN) framework designed to infer the presence, shape, and motion of static or moving solid boundaries within a flow field. By integrating a body fraction parameter into the governing equations, the model enforces no-slip/no-penetration boundary conditions in solid regions while preserving conservation laws of fluid dynamics. Using partial flow field data, the method simultaneously reconstructs the unknown flow field and infers the body fraction distribution, thereby revealing solid boundaries. The framework is validated across diverse scenarios, including incompressible Navier-Stokes and compressible Euler flows, such as steady flow past a fixed cylinder, an inline oscillating cylinder, and subsonic flow over an airfoil. The results demonstrate accurate detection of hidden boundaries, reconstruction of missing flow data, and estimation of trajectories and velocities of a moving body. Further analysis examines the effects of data sparsity, velocity-only measurements, and noise on inference accuracy. The proposed method exhibits robustness and versatility, highlighting its potential for applications when only limited experimental or numerical data are available.

[LG-102] AutoML Algorithms for Online Generalized Additive Model Selection: Application to Electricity Demand Forecasting

Link: https://arxiv.org/abs/2503.24019
Authors: Keshav Das, Julie Keisler, Margaux Brégère, Amaury Durand
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 13 pages, 1 figure

Click to view abstract

Abstract:Electricity demand forecasting is key to ensuring that supply meets demand, lest the grid black out. Reliable short-term forecasts may be obtained by combining a Generalized Additive Model (GAM) with a State-Space model (Obst et al., 2021), leading to an adaptive (or online) model. A GAM is an over-parameterized linear model defined by a formula and a state-space model involves hyperparameters. Both the formula and adaptation parameters have to be fixed before model training and have a huge impact on the model’s predictive performance. We propose optimizing them using the DRAGON package of Keisler (2025), originally designed for neural architecture search. This work generalizes it for automated online generalized additive model selection by defining an efficient modeling of the search space (namely, the space of the GAM formulae and adaptation parameters). Its application to short-term French electricity demand forecasting demonstrates the relevance of the approach.

[LG-103] The more the merrier: logical and multistage processors in credit scoring

Link: https://arxiv.org/abs/2503.23979
Authors: Arturo Pérez-Peralta, Sandra Benítez-Peña, Rosa E. Lillo
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 34 pages, 14 figures

Click to view abstract

Abstract:Machine Learning algorithms are ubiquitous in key decision-making contexts such as organizational justice or healthcare, which has spawned a great demand for fairness in these procedures. In this paper we focus on the application of fair ML in finance, more concretely on the use of fairness techniques on credit scoring. This paper makes two contributions. On the one hand, it addresses the existent gap concerning the application of established methods in the literature to the case of multiple sensitive variables through the use of a new technique called logical processors (LP). On the other hand, it also explores the novel method of multistage processors (MP) to investigate whether the combination of fairness methods can work synergistically to produce solutions with improved fairness or accuracy. Furthermore, we examine the intersection of these two lines of research by exploring the integration of fairness methods in the multivariate case. The results are very promising and suggest that logical processors are an appropriate way of handling multiple sensitive variables. Furthermore, multistage processors are capable of improving the performance of existing methods.

[LG-104] Detecting Localized Density Anomalies in Multivariate Data via Coin-Flip Statistics

Link: https://arxiv.org/abs/2503.23927
Authors: Sebastian Springer, Andre Scaffidi, Maximilian Autenrieth, Gabriella Contardo, Alessandro Laio, Roberto Trotta, Heikki Haario
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Detecting localized density differences in multivariate data is a crucial task in computational science. Such anomalies can indicate a critical system failure, lead to a groundbreaking scientific discovery, or reveal unexpected changes in data distribution. We introduce EagleEye, an anomaly detection method to compare two multivariate datasets with the aim of identifying local density anomalies, namely over- or under-densities affecting only localised regions of the feature space. Anomalies are detected by modelling, for each point, the ordered sequence of its neighbours’ membership label as a coin-flipping process and monitoring deviations from the expected behaviour of such process. A unique advantage of our method is its ability to provide an accurate, entirely unsupervised estimate of the local signal purity. We demonstrate its effectiveness through experiments on both synthetic and real-world datasets. In synthetic data, EagleEye accurately detects anomalies in multiple dimensions even when they affect a tiny fraction of the data. When applied to a challenging resonant anomaly detection benchmark task in simulated Large Hadron Collider data, EagleEye successfully identifies particle decay events present in just 0.3% of the dataset. In global temperature data, EagleEye uncovers previously unidentified, geographically localised changes in temperature fields that occurred in the most recent years. Thanks to its key advantages of conceptual simplicity, computational efficiency, trivial parallelisation, and scalability, EagleEye is widely applicable across many fields.
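The coin-flip view of neighbour labels can be sketched in a few lines. The code below pools a reference set and a sample set, looks at the membership labels of the k nearest neighbours of a query point, and scores them with a binomial upper-tail probability under the null hypothesis that each label is an independent coin flip. This is a simplified illustration of the idea described in the abstract, not the EagleEye statistic itself. The toy data and all function names are assumptions.

```python
import math

def binom_upper_tail(successes, n, p):
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(successes, n + 1))

def neighbour_labels(query, reference, sample, k):
    """Labels (0 = reference, 1 = sample) of the k nearest pooled points, by distance."""
    pooled = [(math.dist(query, x), 0) for x in reference]
    pooled += [(math.dist(query, x), 1) for x in sample]
    pooled.sort(key=lambda t: t[0])
    return [label for _, label in pooled[:k]]

def overdensity_pvalue(query, reference, sample, k):
    """Small values suggest a local excess of sample points around `query`."""
    p = len(sample) / (len(sample) + len(reference))  # coin bias under the null
    labels = neighbour_labels(query, reference, sample, k)
    return binom_upper_tail(sum(labels), k, p)

# Toy data (illustrative): two interleaved grids, plus a tight clump of 12
# extra sample points near (2.27, 2.27) playing the role of the anomaly.
reference = [(float(i), float(j)) for i in range(5) for j in range(5)]
sample = [(i + 0.5, j + 0.5) for i in range(5) for j in range(5)]
sample += [(2.25 + 0.02 * a, 2.25 + 0.02 * b) for a in range(4) for b in range(3)]

p_anomaly = overdensity_pvalue((2.27, 2.27), reference, sample, k=10)
p_background = overdensity_pvalue((0.3, 0.2), reference, sample, k=10)
```

Under-densities could be scored the same way with the lower tail. The method in the paper monitors the whole ordered sequence of labels rather than a single fixed k, which this sketch does not attempt.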

[LG-105] Feature learning from non-Gaussian inputs: the case of Independent Component Analysis in high dimensions

Link: https://arxiv.org/abs/2503.23896
Authors: Fabiola Ricci, Lorenzo Bardone, Sebastian Goldt
Categories: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Click to view abstract

Abstract:Deep neural networks learn structured features from complex, non-Gaussian inputs, but the mechanisms behind this process remain poorly understood. Our work is motivated by the observation that the first-layer filters learnt by deep convolutional neural networks from natural images resemble those learnt by independent component analysis (ICA), a simple unsupervised method that seeks the most non-Gaussian projections of its inputs. This similarity suggests that ICA provides a simple, yet principled model for studying feature learning. Here, we leverage this connection to investigate the interplay between data structure and optimisation in feature learning for the most popular ICA algorithm, FastICA, and stochastic gradient descent (SGD), which is used to train deep networks. We rigorously establish that FastICA requires at least n \gtrsim d^4 samples to recover a single non-Gaussian direction from d-dimensional inputs on a simple synthetic data model. We show that vanilla online SGD outperforms FastICA, and prove that the optimal sample complexity n \gtrsim d^2 can be reached by smoothing the loss, albeit in a data-dependent way. We finally demonstrate the existence of a search phase for FastICA on ImageNet, and discuss how the strong non-Gaussianity of said images compensates for the poor sample complexity of FastICA.

[LG-106] Adaptive Attention-Based Model for 5G Radio-based Outdoor Localization

Link: https://arxiv.org/abs/2503.23810
Authors: Ilayda Yaman, Guoda Tian, Fredrik Tufvesson, Ove Edfors, Zhengya Zhang, Liang Liu
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 6 pages, 6 figures

Click to view abstract

Abstract:Radio-based localization in dynamic environments, such as urban and vehicular settings, requires systems that can efficiently adapt to varying signal conditions and environmental changes. Factors such as multipath interference and obstructions introduce different levels of complexity that affect the accuracy of the localization. Although generalized models offer broad applicability, they often struggle to capture the nuances of specific environments, leading to suboptimal performance in real-world deployments. In contrast, specialized models can be tailored to particular conditions, enabling more precise localization by effectively handling domain-specific variations and noise patterns. However, deploying multiple specialized models requires an efficient mechanism to select the most appropriate one for a given scenario. In this work, we develop an adaptive localization framework that combines shallow attention-based models with a router/switching mechanism based on a single-layer perceptron (SLP). This enables seamless transitions between specialized localization models optimized for different conditions, balancing accuracy, computational efficiency, and robustness to environmental variations. We design three low-complexity localization models tailored for distinct scenarios, optimized for reduced computational complexity, test time, and model size. The router dynamically selects the most suitable model based on real-time input characteristics. The proposed framework is validated using real-world vehicle localization data collected from a massive MIMO base station (BS), demonstrating its ability to seamlessly adapt to diverse deployment conditions while maintaining high localization accuracy.

[LG-107] Force-Free Molecular Dynamics Through Autoregressive Equivariant Networks

Link: https://arxiv.org/abs/2503.23794
Authors: Fabian L. Thiemann, Thiago Reschützegger, Massimiliano Esposito, Tseden Taddese, Juan D. Olarte-Plata, Fausto Martelli
Categories: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: 25 pages total (19 manuscript, 6 SI). 5 figures in manuscript, 3 figures and 2 tables in SI

Click to view abstract

Abstract:Molecular dynamics (MD) simulations play a crucial role in scientific research. Yet their computational cost often limits the timescales and system sizes that can be explored. Most data-driven efforts have been focused on reducing the computational cost of accurate interatomic forces required for solving the equations of motion. Despite their success, however, these machine learning interatomic potentials (MLIPs) are still bound to small time-steps. In this work, we introduce TrajCast, a transferable and data-efficient framework based on autoregressive equivariant message passing networks that directly updates atomic positions and velocities lifting the constraints imposed by traditional numerical integration. We benchmark our framework across various systems, including a small molecule, crystalline material, and bulk liquid, demonstrating excellent agreement with reference MD simulations for structural, dynamical, and energetic properties. Depending on the system, TrajCast allows for forecast intervals up to 30\times larger than traditional MD time-steps, generating over 15 ns of trajectory data per day for a solid with more than 4,000 atoms. By enabling efficient large-scale simulations over extended timescales, TrajCast can accelerate materials discovery and explore physical phenomena beyond the reach of traditional simulations and experiments. An open-source implementation of TrajCast is accessible under this https URL.

[LG-108] Scalable Geometric Learning with Correlation-Based Functional Brain Networks

Link: https://arxiv.org/abs/2503.23653
Authors: Kisung You, Hae-Jeong Park
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The correlation matrix is a central representation of functional brain networks in neuroimaging. Traditional analyses often treat pairwise interactions independently in a Euclidean setting, overlooking the intrinsic geometry of correlation matrices. While earlier attempts have embraced the quotient geometry of the correlation manifold, they remain limited by computational inefficiency and numerical instability, particularly in high-dimensional contexts. This paper presents a novel geometric framework that employs diffeomorphic transformations to embed correlation matrices into a Euclidean space, preserving salient manifold properties and enabling large-scale analyses. The proposed method integrates with established learning algorithms - regression, dimensionality reduction, and clustering - and extends naturally to population-level inference of brain networks. Simulation studies demonstrate both improved computational speed and enhanced accuracy compared to conventional manifold-based approaches. Moreover, applications in real neuroimaging scenarios illustrate the framework’s utility, enhancing behavior score prediction, subject fingerprinting in resting-state fMRI, and hypothesis testing in electroencephalogram data. An open-source MATLAB toolbox is provided to facilitate broader adoption and advance the application of correlation geometry in functional brain network research.

[LG-109] Learning a Single Index Model from Anisotropic Data with vanilla Stochastic Gradient Descent

Link: https://arxiv.org/abs/2503.23642
Authors: Guillaume Braun, Minh Ha Quang, Masaaki Imaizumi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We investigate the problem of learning a Single Index Model (SIM), a popular model for studying the ability of neural networks to learn features, from anisotropic Gaussian inputs by training a neuron using vanilla Stochastic Gradient Descent (SGD). While the isotropic case has been extensively studied, the anisotropic case has received less attention and the impact of the covariance matrix on the learning dynamics remains unclear. For instance, Mousavi-Hosseini et al. (2023b) proposed a spherical SGD that requires a separate estimation of the data covariance matrix, thereby oversimplifying the influence of covariance. In this study, we analyze the learning dynamics of vanilla SGD under the SIM with anisotropic input data, demonstrating that vanilla SGD automatically adapts to the data’s covariance structure. Leveraging these results, we derive upper and lower bounds on the sample complexity using a notion of effective dimension that is determined by the structure of the covariance matrix instead of the input data dimension.

[LG-110] Online Convex Optimization and Integral Quadratic Constraints: A new approach to regret analysis

Link: https://arxiv.org/abs/2503.23600
Authors: Fabian Jakob, Andrea Iannelli
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:We propose a novel approach for analyzing dynamic regret of first-order constrained online convex optimization algorithms for strongly convex and Lipschitz-smooth objectives. Crucially, we provide a general analysis that is applicable to a wide range of first-order algorithms that can be expressed as an interconnection of a linear dynamical system in feedback with a first-order oracle. By leveraging Integral Quadratic Constraints (IQCs), we derive a semi-definite program which, when feasible, provides a regret guarantee for the online algorithm. For this, the concept of variational IQCs is introduced as the generalization of IQCs to time-varying monotone operators. Our bounds capture the temporal rate of change of the problem in the form of the path length of the time-varying minimizer and the objective function variation. In contrast to standard results in OCO, our results require neither the assumption of gradient boundedness nor that of a bounded feasible set. Numerical analyses showcase the ability of the approach to capture the dependence of the regret on the function class condition number.

[LG-111] Multi-Objective Optimization and Hyperparameter Tuning With Desirability Functions

Link: https://arxiv.org/abs/2503.23595
Authors: Thomas Bartz-Beielstein
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:The goal of this article is to provide an introduction to the desirability function approach to multi-objective optimization (direct and surrogate model-based), and multi-objective hyperparameter tuning. This work is based on the paper by Kuhn (2016). It presents a Python implementation of Kuhn’s R package desirability. The Python package spotdesirability is available as part of the sequential parameter optimization framework. After a brief introduction to the desirability function approach is presented, three examples are given that demonstrate how to use the desirability functions for classical optimization, surrogate-model based optimization, and hyperparameter tuning.
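For readers unfamiliar with the desirability approach, a minimal sketch may help. Each objective is mapped to a score in [0, 1] by a piecewise function of the Derringer-Suich type, and the scores are combined by a geometric mean, so a single unacceptable objective drives the overall score to zero. The function names below are hypothetical and do not reproduce the API of spotdesirability or of the R package desirability.

```python
import math

def d_max(y, low, high, scale=1.0):
    """Larger-is-better desirability: 0 below `low`, 1 above `high`."""
    if y <= low:
        return 0.0
    if y >= high:
        return 1.0
    return ((y - low) / (high - low)) ** scale

def d_min(y, low, high, scale=1.0):
    """Smaller-is-better desirability: 1 below `low`, 0 above `high`."""
    if y <= low:
        return 1.0
    if y >= high:
        return 0.0
    return ((high - y) / (high - low)) ** scale

def d_overall(scores):
    """Geometric mean of the individual desirabilities."""
    if any(s == 0.0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# Illustrative tuning trade-off: maximise accuracy, minimise training time (s).
accuracy_score = d_max(0.9, low=0.5, high=1.0)
time_score = d_min(30.0, low=10.0, high=110.0)
combined = d_overall([accuracy_score, time_score])
```

Reducing several objectives to one number in this way lets any single-objective optimizer or surrogate model handle the multi-objective problem. The `scale` exponent controls how strictly values near the acceptable bound are penalised.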

[LG-112] p-Adic Polynomial Regression as Alternative to Neural Network for Approximating p-Adic Functions of Many Variables

Link: https://arxiv.org/abs/2503.23488
Authors: Alexander P. Zubarev
Categories: Mathematical Physics (math-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA); Number Theory (math.NT); Optimization and Control (math.OC)
Comments: 10 pages

Click to view abstract

Abstract:A method for approximating continuous functions \mathbb{Z}_p^n \rightarrow \mathbb{Z}_p by a linear superposition of continuous functions \mathbb{Z}_p \rightarrow \mathbb{Z}_p is presented and a polynomial regression model is constructed that allows approximating such functions with any degree of accuracy. A physical interpretation of such a model is given and possible methods for its training are discussed. The proposed model can be considered as a simple alternative to possible p-adic models based on neural network architecture.

[LG-113] Accelerated Stein Variational Gradient Flow

Link: https://arxiv.org/abs/2503.23462
Authors: Viktor Stein, Wuchen Li
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Submitted to GSI’25, 9 pages, 2 figures, comments welcome

Click to view abstract

Abstract:Stein variational gradient descent (SVGD) is a kernel-based particle method for sampling from a target distribution, e.g., in generative modeling and Bayesian inference. SVGD does not require estimating the gradient of the log-density, which is called score estimation. In practice, SVGD can be slow compared to score-estimation based sampling algorithms. To design fast and efficient high-dimensional sampling algorithms, we introduce ASVGD, an accelerated SVGD, based on an accelerated gradient flow in a metric space of probability densities following Nesterov’s method. We then derive a momentum-based discrete-time sampling algorithm, which evolves a set of particles deterministically. To stabilize the particles’ momentum update, we also study a Wasserstein metric regularization. For the generalized bilinear kernel and the Gaussian kernel, toy numerical examples with varied target distributions demonstrate the effectiveness of ASVGD compared to SVGD and other popular sampling methods.
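As background, the baseline SVGD update that the paper accelerates can be sketched as follows. The code implements plain SVGD for one-dimensional particles with a Gaussian (RBF) kernel and a fixed bandwidth. It contains none of the momentum or Wasserstein regularization of ASVGD. The Gaussian target, the step size, the bandwidth and all names are illustrative assumptions.

```python
import math

def svgd_step(particles, score, step_size=0.1, bandwidth=1.0):
    """One deterministic SVGD update for 1-D particles with an RBF kernel."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * bandwidth ** 2))
            # First term: kernel-weighted score, pulls xi towards high density.
            # Second term: gradient of the kernel with respect to xj, a
            # repulsion that keeps the particles from collapsing together.
            phi += k * score(xj) - ((xj - xi) / bandwidth ** 2) * k
        updated.append(xi + step_size * phi / n)
    return updated

# Illustrative target: a unit-variance Gaussian centred at 2, whose score
# (gradient of the log-density) is known in closed form.
target_mean = 2.0
score = lambda x: -(x - target_mean)

particles = [-6.0 + 2.0 * i / 9 for i in range(10)]  # start far from the target
for _ in range(1000):
    particles = svgd_step(particles, score)

particle_mean = sum(particles) / len(particles)
```

Every particle interacts with every other one at each step, so the cost per iteration is quadratic in the number of particles. Reducing the number of iterations needed is one reason accelerated variants are of interest.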

[LG-114] DGSAM: Domain Generalization via Individual Sharpness-Aware Minimization

Link: https://arxiv.org/abs/2503.23430
Authors: Youngjun Song, Youngsik Hwang, Jonghun Lee, Heechang Lee, Dong-Young Lim
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP)
Comments:

Click to view abstract

Abstract:Domain generalization (DG) aims to learn models that can generalize well to unseen domains by training only on a set of source domains. Sharpness-Aware Minimization (SAM) has been a popular approach for this, aiming to find flat minima in the total loss landscape. However, we show that minimizing the total loss sharpness does not guarantee sharpness across individual domains. In particular, SAM can converge to fake flat minima, where the total loss may exhibit flat minima, but sharp minima are present in individual domains. Moreover, the current perturbation update in gradient ascent steps is ineffective in directly updating the sharpness of individual domains. Motivated by these findings, we introduce a novel DG algorithm, Decreased-overhead Gradual Sharpness-Aware Minimization (DGSAM), that applies gradual domain-wise perturbation to reduce sharpness consistently across domains while maintaining computational efficiency. Our experiments demonstrate that DGSAM outperforms state-of-the-art DG methods, achieving improved robustness to domain shifts and better performance across various benchmarks, while reducing computational overhead compared to SAM.
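For context, the standard SAM update applied to the total loss over source domains can be sketched as below. This is plain SAM, the baseline the abstract criticises, and not the DGSAM algorithm, whose domain-wise perturbation schedule the abstract does not spell out. The two quadratic "domain" losses, the hyperparameters and all names are illustrative assumptions.

```python
import math

# Two toy source domains with quadratic losses of different curvature.
DOMAINS = [
    {"curvature": (4.0, 1.0), "center": (1.0, 0.0)},
    {"curvature": (1.0, 1.0), "center": (-1.0, 2.0)},
]

def domain_grad(w, d):
    """Gradient of 0.5 * sum_k h_k * (w_k - c_k)^2 for one domain."""
    return [h * (wk - c) for h, wk, c in zip(d["curvature"], w, d["center"])]

def total_loss(w):
    """Average of the per-domain losses."""
    per_domain = [
        0.5 * sum(h * (wk - c) ** 2
                  for h, wk, c in zip(d["curvature"], w, d["center"]))
        for d in DOMAINS
    ]
    return sum(per_domain) / len(DOMAINS)

def total_grad(w):
    grads = [domain_grad(w, d) for d in DOMAINS]
    return [sum(g[k] for g in grads) / len(DOMAINS) for k in range(len(w))]

def sam_step(w, lr=0.1, rho=0.05):
    """Ascend to the worst-case point in a rho-ball, then descend from there."""
    g = total_grad(w)
    norm = math.sqrt(sum(gk * gk for gk in g))
    if norm == 0.0:
        return list(w)  # already stationary, no perturbation direction
    w_adv = [wk + rho * gk / norm for wk, gk in zip(w, g)]  # ascent step
    g_adv = total_grad(w_adv)                               # gradient at w_adv
    return [wk - lr * gk for wk, gk in zip(w, g_adv)]       # descent step

w = [3.0, -2.0]
start_loss = total_loss(w)
for _ in range(300):
    w = sam_step(w)
```

Note that the perturbation direction is computed from the averaged gradient only, so the different curvatures of the two domains (4.0 versus 1.0 along the first coordinate) never enter the ascent step individually. That is the kind of blind spot the abstract attributes to SAM on the total loss.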

[LG-115] Quantum-Assisted Machine Learning Models for Enhanced Weather Prediction

Link: https://arxiv.org/abs/2503.23408
Authors: Saiyam Sakhuja,Shivanshu Siyanwal,Abhishek Tiwari,Britant,Savita Kashyap
Subjects: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments:

Abstract:Quantum Machine Learning (QML) presents itself as a revolutionary approach to weather forecasting by using quantum computing to improve predictive modeling capabilities. In this study, we apply QML models, including Quantum Gated Recurrent Units (QGRUs), Quantum Neural Networks (QNNs), Quantum Long Short-Term Memory (QLSTM), Variational Quantum Circuits (VQCs), and Quantum Support Vector Machines (QSVMs), to analyze meteorological time-series data from the ERA5 dataset. Our methodology includes preprocessing meteorological features and implementing QML architectures for both classification and regression tasks. The results demonstrate that QML models can achieve reasonable accuracy in both prediction and classification tasks, particularly in binary classification. However, challenges such as quantum hardware limitations and noise affect scalability and generalization. This research provides insights into the feasibility of QML for weather prediction, paving the way for further exploration of hybrid quantum-classical frameworks to enhance meteorological forecasting.

[LG-116] Reinforcement Learning for Active Matter

Link: https://arxiv.org/abs/2503.23308
Authors: Wenjie Cai,Gongyi Wang,Yu Zhang,Xiang Qu,Zihan Huang
Subjects: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Robotics (cs.RO); Biological Physics (physics.bio-ph)
Comments: 16 pages, 8 figures

Abstract:Active matter refers to systems composed of self-propelled entities that consume energy to produce motion, exhibiting complex non-equilibrium dynamics that challenge traditional models. With the rapid advancements in machine learning, reinforcement learning (RL) has emerged as a promising framework for addressing the complexities of active matter. This review systematically introduces the integration of RL for guiding and controlling active matter systems, focusing on two key aspects: optimal motion strategies for individual active particles and the regulation of collective dynamics in active swarms. We discuss the use of RL to optimize the navigation, foraging, and locomotion strategies for individual active particles. In addition, the application of RL in regulating collective behaviors is also examined, emphasizing its role in facilitating the self-organization and goal-directed control of active swarms. This investigation offers valuable insights into how RL can advance the understanding, manipulation, and control of active matter, paving the way for future developments in fields such as biological systems, robotics, and medical science.

[LG-117] SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System

Link: https://arxiv.org/abs/2503.23108
Authors: Hyeongju Kim,Jinhyeok Yang,Yechan Yu,Seunghun Ji,Jacob Morton,Frederik Bous,Joon Byun,Juheon Lee
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: 19 pages, preprint

Abstract:We present a novel text-to-speech (TTS) system, namely SupertonicTTS, for improved scalability and efficiency in speech synthesis. SupertonicTTS is comprised of three components: a speech autoencoder for continuous latent representation, a text-to-latent module leveraging flow-matching for text-to-latent mapping, and an utterance-level duration predictor. To enable a lightweight architecture, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and employing cross-attention for text-speech alignment, thus eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion that accelerates loss convergence and stabilizes text-speech alignment. Experimental results demonstrate that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models. Audio samples demonstrating the capabilities of SupertonicTTS are available at: this https URL.

[LG-118] Engineering Microbial Symbiosis for Mars Habitability

Link: https://arxiv.org/abs/2503.23015
Authors: Randall R. Correll,Simon P. Worden
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
Comments: 25 pages, 1 figure

Abstract:The colonization of Mars presents extraordinary challenges, including radiation exposure, low atmospheric pressure, and toxic regolith. Recent advancements in synthetic biology and genetic engineering offer unprecedented opportunities to address these obstacles by utilizing terrestrial extremophiles and engineered organisms. This paper examines the potential for creating symbiotic relationships between terrestrial microbes and hypothetical Martian life forms, should they exist, to support a sustainable human presence on Mars. Inspired by natural examples of endosymbiosis, such as mitochondria and chloroplasts, we propose methods to engineer life forms capable of enduring Martian conditions. Key components include experimental designs, laboratory simulations, and bioengineering approaches essential to this endeavor. The ethical, political, and technological challenges of introducing engineered life to Mars are critically evaluated, with an emphasis on international collaboration and robust planetary protection policies. This research underscores engineered symbiosis as a transformative strategy for enabling life to adapt and thrive on Mars while advancing humanity’s aspirations for interplanetary habitation and exploration. By addressing these challenges, this work highlights a path toward sustainable life on Mars, reflecting both scientific ingenuity and ethical stewardship.

[LG-119] Nested Stochastic Gradient Descent for (Generalized) Sinkhorn Distance-Regularized Distributionally Robust Optimization

Link: https://arxiv.org/abs/2503.22923
Authors: Yufeng Yang,Yi Zhou,Zhaosong Lu
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 30 pages, 20 figures, 1 table

Abstract:Distributionally robust optimization (DRO) is a powerful technique to train robust models against data distribution shift. This paper aims to solve regularized nonconvex DRO problems, where the uncertainty set is modeled by a so-called generalized Sinkhorn distance and the loss function is nonconvex and possibly unbounded. Such a distance allows modeling the uncertainty of distributions with different probability supports and divergence functions. For this class of regularized DRO problems, we derive a novel dual formulation taking the form of nested stochastic programming, where the dual variable depends on the data sample. To solve the dual problem, we provide theoretical evidence to design a nested stochastic gradient descent (SGD) algorithm, which leverages stochastic approximation to estimate the nested stochastic gradients. We study the convergence rate of nested SGD and establish polynomial iteration and sample complexities that are independent of the data size and parameter dimension, indicating its potential for solving large-scale DRO problems. We conduct numerical experiments to demonstrate the efficiency and robustness of the proposed algorithm.

[LG-120] Quantum Doeblin Coefficients: Interpretations and Applications

Link: https://arxiv.org/abs/2503.22823
Authors: Ian George,Christoph Hirche,Theshani Nuradha,Mark M. Wilde
Subjects: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 88 pages, 2 figures

Abstract:In classical information theory, the Doeblin coefficient of a classical channel provides an efficiently computable upper bound on the total-variation contraction coefficient of the channel, leading to what is known as a strong data-processing inequality. Here, we investigate quantum Doeblin coefficients as a generalization of the classical concept. In particular, we define various new quantum Doeblin coefficients, one of which has several desirable properties, including concatenation and multiplicativity, in addition to being efficiently computable. We also develop various interpretations of two of the quantum Doeblin coefficients, including representations as minimal singlet fractions, exclusion values, reverse max-mutual and oveloH informations, reverse robustnesses, and hypothesis testing reverse mutual and oveloH informations. Our interpretations of quantum Doeblin coefficients as either entanglement-assisted or unassisted exclusion values are particularly appealing, indicating that they are proportional to the best possible error probabilities one could achieve in state-exclusion tasks by making use of the channel. We also outline various applications of quantum Doeblin coefficients, ranging from limitations on quantum machine learning algorithms that use parameterized quantum circuits (noise-induced barren plateaus), on error mitigation protocols, on the sample complexity of noisy quantum hypothesis testing, on the fairness of noisy quantum models, and on mixing times of time-varying channels. All of these applications make use of the fact that quantum Doeblin coefficients appear in upper bounds on various trace-distance contraction coefficients of a channel. Furthermore, in all of these applications, our analysis using Doeblin coefficients provides improvements of various kinds over contributions from prior literature, both in terms of generality and being efficiently computable.

[LG-121] Congenital Heart Disease Classification Using Phonocardiograms: A Scalable Screening Tool for Diverse Environments

Link: https://arxiv.org/abs/2503.22773
Authors: Abdul Jabbar,Ethan Grooby,Jack Crozier,Alexander Gallon,Vivian Pham,Khawza I Ahmad,Md Hassanuzzaman,Raqibul Mostafa,Ahsan H. Khandoker,Faezeh Marzbanrad
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures

Abstract:Congenital heart disease (CHD) is a critical condition that demands early detection, particularly in infancy and childhood. This study presents a deep learning model designed to detect CHD using phonocardiogram (PCG) signals, with a focus on its application in global health. We evaluated our model on several datasets, including the primary dataset from Bangladesh, achieving a high accuracy of 94.1%, sensitivity of 92.7%, and specificity of 96.3%. The model also demonstrated robust performance on the public PhysioNet Challenge 2022 and 2016 datasets, underscoring its generalizability to diverse populations and data sources. We assessed the performance of the algorithm for single and multiple auscultation sites on the chest, demonstrating that the model maintains over 85% accuracy even when using a single location. Furthermore, our algorithm was able to achieve an accuracy of 80% on low-quality recordings, which cardiologists deemed non-diagnostic. This research suggests that an AI-driven digital stethoscope could serve as a cost-effective screening tool for CHD in resource-limited settings, enhancing clinical decision support and ultimately improving patient outcomes.

[LG-122] Multiple Embeddings for Quantum Machine Learning

Link: https://arxiv.org/abs/2503.22758
Authors: Siyu Han,Lihan Jia,Lanzhe Guo
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:

Abstract:This work addresses the insufficient fitting capability of current quantum machine learning methods, which results from over-reliance on a single data embedding strategy. We propose a novel quantum machine learning framework that integrates multiple quantum data embedding strategies, allowing the model to fully exploit the diversity of quantum computing when processing various datasets. Experimental results validate the effectiveness of the proposed framework, demonstrating significant improvements over existing state-of-the-art methods and achieving superior performance in practical applications.

[LG-123] Symmetry-Informed Graph Neural Networks for Carbon Dioxide Isotherm and Adsorption Prediction in Aluminum-Substituted Zeolites

Link: https://arxiv.org/abs/2503.22737
Authors: Marko Petković,José-Manuel Vicent Luna,Elīza Beate Dinne,Vlado Menkovski,Sofía Calero
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Abstract:Accurately predicting adsorption properties in nanoporous materials using Deep Learning models remains a challenging task. This challenge becomes even more pronounced when attempting to generalize to structures that were not part of the training data. In this work, we introduce SymGNN, a graph neural network architecture that leverages material symmetries to improve adsorption property prediction. By incorporating symmetry operations into the message-passing mechanism, our model enhances parameter sharing across different zeolite topologies, leading to improved generalization. We evaluate SymGNN on both interpolation and generalization tasks, demonstrating that it successfully captures key adsorption trends, including the influence of both the framework and aluminium distribution on CO2 adsorption. Furthermore, we apply our model to the characterization of experimental adsorption isotherms, using a genetic algorithm to infer likely aluminium distributions. Our results highlight the effectiveness of machine learning models trained on simulations for studying real materials and suggest promising directions for fine-tuning with experimental data and generative approaches for the inverse design of multifunctional nanomaterials.

Information Retrieval

[IR-0] Combining Query Performance Predictors: A Reproducibility Study

Link: https://arxiv.org/abs/2503.24251
Authors: Sourav Saha,Suchana Datta,Dwaipayan Roy,Mandar Mitra,Derek Greene
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:A large number of approaches to Query Performance Prediction (QPP) have been proposed over the last two decades. As early as 2009, Hauff et al. [28] explored whether different QPP methods may be combined to improve prediction quality. Since then, significant research has been done both on QPP approaches and on their evaluation. This study revisits Hauff et al.’s work to assess the reproducibility of their findings in the light of new prediction methods, evaluation metrics, and datasets. We expand the scope of the earlier investigation by: (i) considering post-retrieval methods, including supervised neural techniques (only pre-retrieval techniques were studied in [28]); (ii) using sMARE for evaluation, in addition to the traditional correlation coefficients and RMSE; and (iii) experimenting with additional datasets (Clueweb09B and TREC DL). Our results largely support previous claims, but we also present several interesting findings. We interpret these findings by taking a more nuanced look at the correlation between QPP methods, examining whether they capture diverse information or rely on overlapping factors.
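
The sMARE metric mentioned in the abstract can be illustrated with a small sketch. It assumes the commonly used definition, in which each query's absolute difference between its predicted rank and its actual-effectiveness rank is divided by the number of queries and then averaged. The function names and the tie handling (ties broken by input order) are assumptions and may differ from the evaluation code used in the study.

```python
def ranks(scores):
    """Rank positions (1 = highest score); ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def smare(predicted_scores, actual_scores):
    """Scaled mean absolute rank error between two per-query score lists."""
    if len(predicted_scores) != len(actual_scores):
        raise ValueError("both lists must cover the same queries")
    n = len(predicted_scores)
    predicted_ranks = ranks(predicted_scores)
    actual_ranks = ranks(actual_scores)
    # Per-query scaled absolute rank error, averaged over all queries.
    total = sum(abs(p - a) for p, a in zip(predicted_ranks, actual_ranks))
    return total / (n * n)


if __name__ == "__main__":
    qpp_scores = [0.9, 0.4, 0.7, 0.1]         # hypothetical predictor outputs
    average_precision = [0.8, 0.5, 0.3, 0.2]  # hypothetical per-query AP
    print(smare(qpp_scores, average_precision))
```

Lower values indicate closer agreement between the predicted and actual query rankings; a perfect ranking gives 0.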

[IR-1] Music Information Retrieval on Representative Mexican Folk Vocal Melodies Through MIDI Feature Extraction

Link: https://arxiv.org/abs/2503.24243
Authors: Mario Alberto Vallejo Reyes
Subjects: Sound (cs.SD); Information Retrieval (cs.IR)
Comments: 10 pages, 5 figures, 2 tables

Abstract:This study analyzes representative Mexican folk vocal melodies using MIDI feature extraction, examining ambitus, pitch-class entropy, and interval distribution. It also explores the relationship between these features and song popularity, as measured by Spotify plays. The study employs MATLAB and the MIDI Toolbox for extracting musical features and performing statistical analysis. The findings reveal a significant variation in ambitus, with values ranging from 8 to 27 semitones, indicating a diverse compositional style and vocal demand across the genre. The analysis of pitch-class entropy showcases a broad spectrum of melodic complexity, with Armando Manzanero’s ‘Somos Novios’ displaying the highest entropy, suggesting varied and complex melodic structures, while traditional pieces like ‘La Bamba’ exhibit lower entropy, indicating simpler, more repetitive patterns. The interval distribution predominantly features prime intervals (P1) and major and minor seconds (M2, m2), pointing to a compositional preference for close, contiguous intervals that contribute to the melodies’ accessibility and appeal. Statistical analysis does not establish a significant correlation between the ambitus or entropy and the number of Spotify plays.
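
The two main features named in the abstract can be computed from a list of MIDI note numbers as sketched below. The study itself uses MATLAB and the MIDI Toolbox; this Python version is only an illustration, and it makes its own assumptions: entropy is unweighted (every note counts once, regardless of duration) and reported in bits, which may not match the toolbox's normalization. The melody in the usage example is made up.

```python
import math
from collections import Counter


def ambitus(midi_pitches):
    """Melodic range in semitones: highest minus lowest MIDI note number."""
    return max(midi_pitches) - min(midi_pitches)


def pitch_class_entropy(midi_pitches):
    """Shannon entropy (bits) of the pitch-class distribution.

    Pitch class is the MIDI note number modulo 12, so octaves are merged.
    """
    counts = Counter(pitch % 12 for pitch in midi_pitches)
    total = len(midi_pitches)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    melody = [60, 62, 64, 65, 67, 65, 64, 62, 60]  # hypothetical melody
    print("ambitus:", ambitus(melody))
    print("pitch-class entropy:", pitch_class_entropy(melody))
```

A melody that keeps returning to a few pitch classes yields low entropy, while one that spreads its notes evenly across many pitch classes yields high entropy.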

[IR-2] Text2Tracks: Prompt-based Music Recommendation via Generative Retrieval

Link: https://arxiv.org/abs/2503.24193
Authors: Enrico Palumbo,Gustavo Penha,Andreas Damianou,José Luis Redondo García,Timothy Christopher Heath,Alice Wang,Hugues Bouchard,Mounia Lalmas
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:In recent years, Large Language Models (LLMs) have enabled users to provide highly specific music recommendation requests using natural language prompts (e.g. “Can you recommend some old classics for slow dancing?”). In this setup, the recommended tracks are predicted by the LLM in an autoregressive way, i.e. the LLM generates the track titles one token at a time. While intuitive, this approach has several limitations. First, it is based on a general purpose tokenization that is optimized for words rather than for track titles. Second, it necessitates an additional entity resolution layer that matches the track title to the actual track identifier. Third, the number of decoding steps scales linearly with the length of the track title, slowing down inference. In this paper, we propose to address the task of prompt-based music recommendation as a generative retrieval task. Within this setting, we introduce novel, effective, and efficient representations of track identifiers that significantly outperform commonly used strategies. We introduce Text2Tracks, a generative retrieval model that learns a mapping from a user’s music recommendation prompt to the relevant track IDs directly. Through an offline evaluation on a dataset of playlists with language inputs, we find that (1) the strategy to create IDs for music tracks is the most important factor for the effectiveness of Text2Tracks, and semantic IDs significantly outperform commonly used strategies that rely on song titles as identifiers; and (2) provided with the right choice of track identifiers, Text2Tracks outperforms sparse and dense retrieval solutions trained to retrieve tracks from language prompts.

[IR-3] On the Reproducibility of Learned Sparse Retrieval Adaptations for Long Documents ECIR2025

Link: https://arxiv.org/abs/2503.23824
Authors: Emmanouil Georgios Lionis,Jia-Huei Ju
Subjects: Information Retrieval (cs.IR)
Comments: This is a preprint of our paper accepted at ECIR 2025

Abstract:Document retrieval is one of the most challenging tasks in Information Retrieval. It requires handling longer contexts, often resulting in higher query latency and increased computational overhead. Recently, Learned Sparse Retrieval (LSR) has emerged as a promising approach to address these challenges. Some have proposed adapting the LSR approach to longer documents by aggregating segmented documents using different post-hoc methods, including n-grams and proximity scores, adjusting representations, and learning to ensemble all signals. In this study, we aim to reproduce and examine the mechanisms of adapting LSR for long documents. Our reproducibility experiments confirmed the importance of specific segments, with the first segment consistently dominating document retrieval performance. Furthermore, we re-evaluate two recently proposed methods, ExactSDM and SoftSDM, across varying document lengths, from short (up to 2 segments) to longer (3+ segments). We also designed multiple analyses to probe the reproduced methods and shed light on the impact of global information on adapting LSR to longer contexts. The complete code and implementation for this project are available at: this https URL.

[IR-4] Understanding Visual Saliency of Outlier Items in Product Search

Link: https://arxiv.org/abs/2503.23596
Authors: Fatemeh Sarvi,Mohammad Aliannejadi,Sebastian Schelter,Maarten de Rijke
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:In two-sided marketplaces, items compete for user attention, which translates to revenue for suppliers. Item exposure, indicated by the amount of attention items receive in a ranking, can be influenced by factors like position bias. Recent work suggests that inter-item dependencies, such as outlier items in a ranking, also affect item exposure. Outlier items are items that observably deviate from the other items in a ranked list. Understanding outlier items is crucial for determining an item’s exposure distribution. In our previous work, we investigated the impact of different presentational features on users’ perception of outliers in search results. In this work, we focus on two key questions left unanswered by our previous work: (i) What is the effect of isolated bottom-up visual factors on item outlierness in product lists? (ii) How do top-down factors influence users’ perception of item outlierness in a realistic online shopping scenario? We start with bottom-up factors and employ visual saliency models to evaluate their ability to detect outlier items in product lists purely based on visual attributes. Then, to examine top-down factors, we conduct eye-tracking experiments on an online shopping task. Moreover, we employ eye-tracking not only to be closer to the real-world case but also to address the accuracy problem of reaction time in the visual search task. Our experiments show the ability of visual saliency models to detect bottom-up factors, consistently highlighting areas with strong visual contrasts. The results of our eye-tracking experiment for lists without outliers show that despite being less visually attractive, product descriptions captured attention the fastest, indicating the importance of top-down factors. In our eye-tracking experiments, we observed that outlier items engaged users for longer durations compared to non-outlier items.

[IR-5] Design and Experimental Validation of an Autonomous USV for Sensor Fusion-Based Navigation in GNSS-Denied Environments

Link: https://arxiv.org/abs/2503.23445
Authors: Samuel Cohen-Salmon,Itzik Klein
Subjects: Robotics (cs.RO); Information Retrieval (cs.IR)
Comments: submitted to IEEE OCEANS 2025 Brest

Abstract:This paper presents the design, development, and experimental validation of MARVEL, an autonomous unmanned surface vehicle built for real-world testing of sensor fusion-based navigation algorithms in GNSS-denied environments. MARVEL was developed under strict constraints of cost-efficiency, portability, and seaworthiness, with the goal of creating a modular, accessible platform for high-frequency data acquisition and experimental learning. It integrates electromagnetic logs, Doppler velocity logs, inertial sensors, and real-time kinematic GNSS positioning. MARVEL enables real-time, in-situ validation of advanced navigation and AI-driven algorithms using redundant, synchronized sensors. Field experiments demonstrate the system’s stability, maneuverability, and adaptability in challenging sea conditions. The platform offers a novel, scalable approach for researchers seeking affordable, open-ended tools to evaluate sensor fusion techniques under real-world maritime constraints.

[IR-6] Filtering with Time-frequency Analysis: An Adaptive and Lightweight Model for Sequential Recommender Systems Based on Discrete Wavelet Transform

Link: https://arxiv.org/abs/2503.23436
Authors: Sheng Lu,Mingxi Ge,Jiuyi Zhang,Wanli Zhu,Guanjin Li,Fangming Gu
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Sequential Recommender Systems (SRS) aim to model sequential behaviors of users to capture their interests, which usually evolve over time. Transformer-based SRS have achieved distinguished successes recently. However, studies reveal that the self-attention mechanism in Transformer-based models is essentially a low-pass filter and ignores high-frequency information, potentially including meaningful user interest patterns. This motivates us to seek better filtering technologies for SRS, and we find that Discrete Wavelet Transform (DWT), a well-known time-frequency analysis technique from the digital signal processing field, can effectively process both low-frequency and high-frequency information. We design an adaptive time-frequency filter with the DWT technique, which decomposes user interests into multiple signals at different frequencies and times, and can automatically learn the weights of these signals. Furthermore, we develop DWTRec, a model for sequential recommendation built entirely on the adaptive time-frequency filter. Thanks to the fast DWT technique, DWTRec has lower theoretical time and space complexity and is proficient in modeling long sequences. Experiments show that our model outperforms state-of-the-art baseline models on datasets with different domains, sparsity levels and average sequence lengths. In particular, our model shows a larger performance gain over previous models as the sequence grows longer, which demonstrates another advantage of our model.
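
To make the low-frequency and high-frequency split concrete, here is a minimal one-level Haar DWT and its inverse. The abstract does not say which wavelet DWTRec uses or how its learnable filter weights are applied, so the Haar choice, the function names, and the toy sequence are all assumptions for illustration.

```python
import math


def haar_dwt(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    # Scaled pairwise sums keep the slowly varying (low-frequency) trend.
    approx = [(signal[i] + signal[i + 1]) / scale
              for i in range(0, len(signal), 2)]
    # Scaled pairwise differences keep the rapid (high-frequency) changes.
    detail = [(signal[i] - signal[i + 1]) / scale
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the original sequence."""
    scale = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / scale)
        signal.append((a - d) / scale)
    return signal


if __name__ == "__main__":
    sequence = [1.0, 1.0, 4.0, 0.0, 2.0, 2.0, 3.0, 5.0]  # toy 1-D sequence
    low, high = haar_dwt(sequence)
    print("approximation:", low)
    print("detail:", high)
    print("reconstructed:", haar_idwt(low, high))
```

Each level costs time linear in the sequence length, which is consistent with the complexity advantage the abstract claims for DWT-based filtering. A learnable filter in this spirit would reweight `approx` and `detail` before inverting.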

[IR-7] LIRA: A Learning-based Query-aware Partition Framework for Large-scale ANN Search WWW2025

Link: https://arxiv.org/abs/2503.23409
Authors: Ximu Zeng,Liwei Deng,Penghao Chen,Xu Chen,Han Su,Kai Zheng
Subjects: Information Retrieval (cs.IR); Databases (cs.DB)
Comments: This paper is accepted by WWW 2025

Abstract:Approximate nearest neighbor search is fundamental in information retrieval. Previous partition-based methods enhance search efficiency by probing partial partitions, yet they face two common issues. In the query phase, a common strategy is to probe partitions based on the distance ranks of a query to partition centroids, which inevitably probes irrelevant partitions as it ignores data distribution. In the partition construction phase, all partition-based methods face the boundary problem that separates a query’s nearest neighbors to multiple partitions, resulting in a long-tailed kNN distribution and degrading the optimal nprobe (i.e., the number of probing partitions). To address this gap, we propose LIRA, a LearnIng-based queRy-aware pArtition framework. Specifically, we propose a probing model to directly probe the partitions containing the kNN of a query, which can reduce probing waste and allow for query-aware probing with nprobe individually. Moreover, we incorporate the probing model into a learning-based redundancy strategy to mitigate the adverse impact of the long-tailed kNN distribution on search efficiency. Extensive experiments on real-world vector datasets demonstrate the superiority of LIRA in the trade-off among accuracy, latency, and query fan-out. The codes are available at this https URL.
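
The baseline that the abstract criticizes, probing the `nprobe` partitions whose centroids are closest to the query, can be sketched as follows. This is a generic partition-based search, not LIRA's learned probing model. The toy 2-D points, the fixed centroids, and all function names are assumptions, chosen so that the boundary problem shows up.

```python
import math


def nearest_centroids(query, centroids, nprobe):
    """Indices of the nprobe centroids closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))
    return order[:nprobe]


def build_partitions(points, centroids):
    """Assign each point index to the partition of its nearest centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for index, point in enumerate(points):
        partitions[nearest_centroids(point, centroids, 1)[0]].append(index)
    return partitions


def search(query, points, centroids, partitions, k, nprobe):
    """Exact kNN restricted to the points stored in the probed partitions."""
    candidates = [i for c in nearest_centroids(query, centroids, nprobe)
                  for i in partitions[c]]
    return sorted(candidates, key=lambda i: math.dist(query, points[i]))[:k]


if __name__ == "__main__":
    centroids = [(0.0, 0.0), (10.0, 0.0)]
    points = [(4.0, 0.0), (6.0, 0.0), (0.0, 1.0), (10.0, 1.0)]
    partitions = build_partitions(points, centroids)
    query = (4.9, 0.0)  # sits close to the partition boundary
    print(search(query, points, centroids, partitions, k=2, nprobe=1))
    print(search(query, points, centroids, partitions, k=2, nprobe=2))
```

In this toy setup the query's two true nearest neighbors fall in different partitions, so `nprobe=1` misses one of them and only `nprobe=2` recovers both. That is the boundary effect the abstract describes.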

[IR-8] RuleAgent: Discovering Rules for Recommendation Denoising with Autonomous Language Agents

Link: https://arxiv.org/abs/2503.23374
Authors: Zongwei Wang,Min Gao,Junliang Yu,Yupeng Hou,Shazia Sadiq,Hongzhi Yin
Subjects: Information Retrieval (cs.IR)
Comments: 11 pages, 4 figures

Abstract:The implicit feedback (e.g., clicks) in real-world recommender systems is often prone to severe noise caused by unintentional interactions, such as misclicks or curiosity-driven behavior. A common approach to denoising this feedback is manually crafting rules based on observations of training loss patterns. However, this approach is labor-intensive and the resulting rules often lack generalization across diverse scenarios. To overcome these limitations, we introduce RuleAgent, a language-agent-based framework which mimics real-world data experts to autonomously discover rules for recommendation denoising. Unlike the high-cost process of manual rule mining, RuleAgent offers rapid and dynamic rule discovery, ensuring adaptability to evolving data and varying scenarios. To achieve this, RuleAgent is equipped with tailored profile, memory, planning, and action modules and leverages reflection mechanisms to enhance its reasoning capabilities for rule discovery. Furthermore, to avoid the frequent retraining in rule discovery, we propose LossEraser, an unlearning strategy that streamlines training without compromising denoising performance. Experiments on benchmark datasets demonstrate that, compared with existing denoising methods, RuleAgent not only achieves the best recommendation performance but also produces generalizable denoising rules, assisting researchers in efficient data cleaning.

[IR-9] Graph-Structured Driven Dual Adaptation for Mitigating Popularity Bias

Link: https://arxiv.org/abs/2503.23358
Authors: Miaomiao Cai,Lei Chen,Yifan Wang,Zhiyong Cheng,Min Zhang,Meng Wang
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Popularity bias challenges recommender systems by causing uneven recommendation performance and amplifying the Matthew effect. Limited user-item interactions confine unpopular items within embedding neighborhoods of few users, leading to representation collapse and reduced model generalization. Existing supervised alignment and reweighting methods mitigate this bias but have key limitations: (1) ignoring inherent variability across Graph Convolutional Networks (GCNs) layers, causing negative effects in deeper layers; (2) reliance on fixed hyperparameters to balance item popularity, restricting adaptability and increasing complexity. To address these issues, we propose the Graph-Structured Dual Adaptation Framework (GSDA). Our theoretical analysis identifies a crucial limitation of supervised alignment methods caused by over-smoothing in GCNs. As GCN layers deepen, popular and unpopular items increasingly lose distinctiveness, quantified by reduced conditional entropy. This diminished distinctiveness weakens supervised alignment effectiveness in mitigating popularity bias. Motivated by this, GSDA captures structural and distribution characteristics from the adjacency matrix through a dual adaptive strategy. First, a hierarchical adaptive alignment mechanism uses the adjacency matrix’s Frobenius norm for layer-specific weight decay, countering conditional entropy reduction effects at deeper layers. Second, a distribution-aware dynamic contrast weighting strategy, guided by a real-time Gini coefficient, removes dependence on fixed hyperparameters, enabling adaptability to diverse data. Experiments on three benchmark datasets demonstrate GSDA significantly alleviates popularity bias and consistently outperforms state-of-the-art recommendation methods. 
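
The Gini coefficient that guides GSDA's contrast weighting can be illustrated with a short sketch over item interaction counts. How GSDA turns this value into a dynamic weight is not stated in the abstract, so only the coefficient itself is shown. The function name and the sample counts are assumptions.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form over ascending values with 1-based ranks:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum((i + 1) * v for i, v in enumerate(values))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


if __name__ == "__main__":
    item_interactions = [120, 3, 5, 1, 40, 2]  # hypothetical popularity counts
    print("popularity Gini:", gini(item_interactions))
```

A value near 0 means interactions are spread evenly across items, while a value near the maximum of (n - 1) / n means a few popular items dominate.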

[IR-10] CAWAL: A novel unified analytics framework for enterprise web applications and multi-server environments

Link: https://arxiv.org/abs/2503.23244
Authors: Özkan Canay,Ümit Kocabıçak
Subjects: Human-Computer Interaction (cs.HC); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Comments: This is a preprint version of a research article printed in journal. The manuscript includes 21 pages, 10 figures, and 3 tables

Abstract:In web analytics, cloud-based solutions have limitations in data ownership and privacy, whereas client-side user tracking tools face challenges such as data accuracy and a lack of server-side metrics. This paper presents the Combined Analytics and Web Application Log (CAWAL) framework as an alternative model and an on-premises framework, offering web analytics with application logging integration. CAWAL enables precise data collection and cross-domain tracking in web farms while complying with data ownership and privacy regulations. The framework also improves software diagnostics and troubleshooting by incorporating application-specific data into analytical processes. Integrated into an enterprise-grade web application, CAWAL has demonstrated superior performance, achieving approximately 24% and 85% lower response times compared to Open Web Analytics (OWA) and Matomo, respectively. The empirical evaluation demonstrates that the framework eliminates certain limitations in existing tools and provides a robust data infrastructure for enhanced web analytics.

[IR-11] Reproducibility Companion Paper: In-processing User Constrained Dominant Sets for User-Oriented Fairness in Recommender Systems

Link: https://arxiv.org/abs/2503.23040
Authors: Yixiu Liu, Zehui He, Yuyuan Li, Zhongxuan Han, Chaochao Chen, Xiaolin Zheng
Categories: Information Retrieval (cs.IR)
Comments: 4 pages

Abstract:In this paper, we reproduce the experimental results presented in our earlier work titled “In-processing User Constrained Dominant Sets for User-Oriented Fairness in Recommender Systems”, which was presented in the proceedings of the 31st ACM International Conference on this http URL. This work aims to verify the effectiveness of our previously proposed method and provide guidance for reproducibility. We present detailed descriptions of our preprocessed datasets, the structure of our source code, configuration file settings, experimental environment, and the reproduced experimental results.

[IR-12] Imagine All The Relevance: Scenario-Profiled Indexing with Knowledge Expansion for Dense Retrieval

Link: https://arxiv.org/abs/2503.23033
Authors: Sangam Lee, Ryang Heo, SeongKu Kang, Dongha Lee
Categories: Information Retrieval (cs.IR)
Comments: 9 pages

Abstract:Existing dense retrieval models struggle with reasoning-intensive retrieval tasks because they fail to capture implicit relevance that requires reasoning beyond surface-level semantic information. To address these challenges, we propose Scenario-Profiled Indexing with Knowledge Expansion (SPIKE), a dense retrieval framework that explicitly indexes implicit relevance by decomposing documents into scenario-based retrieval units. SPIKE organizes documents into scenarios, which encapsulate the reasoning process necessary to uncover implicit relationships between hypothetical information needs and document content. SPIKE constructs a scenario-augmented dataset using a powerful teacher large language model (LLM), then distills these reasoning capabilities into a smaller, efficient scenario generator. During inference, SPIKE incorporates scenario-level relevance alongside document-level relevance, enabling reasoning-aware retrieval. Extensive experiments demonstrate that SPIKE consistently enhances retrieval performance across various query types and dense retrievers. It also enhances the retrieval experience for users through scenarios and offers valuable contextual information for LLMs in retrieval-augmented generation (RAG).
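
The abstract says that scenario-level relevance is used alongside document-level relevance at inference time, but it does not give the combination rule. The sketch below illustrates one plausible reading under stated assumptions: cosine similarity as the relevance measure, the best-matching scenario as the scenario-level signal, and a fixed mixing weight `doc_weight`. All of these choices, and the function names, are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_score(query_vec, doc_vec, scenario_vecs, doc_weight=0.5):
    """Mix document-level and scenario-level relevance (assumed rule)."""
    doc_rel = cosine(query_vec, doc_vec)
    # Assumption: the best-matching scenario represents the document.
    scenario_rel = max(
        (cosine(query_vec, s) for s in scenario_vecs), default=0.0
    )
    return doc_weight * doc_rel + (1.0 - doc_weight) * scenario_rel


# Toy embeddings: document A looks unrelated on the surface, but one of its
# scenarios matches the query; document B is a moderate surface match only.
query = [1.0, 0.0]
score_a = combined_score(query, [0.0, 1.0], [[1.0, 0.0]])
score_b = combined_score(query, [0.6, 0.8], [[0.0, 1.0]])
print("A:", score_a, "B:", score_b)
```

In this toy setup, document B wins on document-level similarity alone, while document A ranks higher once its scenario embedding is taken into account.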

[IR-13] Federated Semantic Learning for Privacy-preserving Cross-domain Recommendation

Link: https://arxiv.org/abs/2503.23026
Authors: Ziang Lu, Lei Guo, Xu Yu, Zhiyong Cheng, Xiaohui Han, Lei Zhu
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:In the evolving landscape of recommender systems, the challenge of effectively conducting privacy-preserving Cross-Domain Recommendation (CDR), especially under strict non-overlapping constraints, has emerged as a key focus. Although extensive research has made significant progress, several limitations still exist: 1) Previous semantic-based methods fail to deeply exploit rich textual information, since they quantize the text into codes, losing its original rich semantics. 2) Current solutions rely solely on the text modality, while the synergistic effects with the ID modality are ignored. 3) Existing studies do not consider the impact of irrelevant semantic features, leading to inaccurate semantic representation. To address these challenges, we introduce federated semantic learning and devise FFMSR as our solution. For Limitation 1, we locally learn items’ semantic encodings from their original texts by a multi-layer semantic encoder, and then cluster them on the server to facilitate the transfer of semantic knowledge between domains. To tackle Limitation 2, we integrate both ID and text modalities on the clients, and utilize them to learn different aspects of items. To handle Limitation 3, a Fast Fourier Transform (FFT)-based filter and a gating mechanism are developed to alleviate the impact of irrelevant semantic information in the local model. We conduct extensive experiments on two real-world datasets, and the results demonstrate the superiority of our FFMSR method over other SOTA methods. Our source codes are publicly available at: this https URL.
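
To make the idea of an FFT-based filter combined with a gate more concrete, here is a minimal sketch in plain Python. It is not the FFMSR implementation: the fixed low-pass mask (`keep`), the scalar gate (`gate_logit`), and all function names are assumptions standing in for whatever filter and gating parameters the paper actually uses. The FFT is a small radix-2 Cooley-Tukey routine, so the input length must be a power of two.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT via the conjugate trick."""
    n = len(spectrum)
    conj = fft([z.conjugate() for z in spectrum])
    return [z.conjugate() / n for z in conj]


def low_pass_filter(embedding, keep):
    """Zero every frequency bin farther than `keep` from the DC bin."""
    n = len(embedding)
    spectrum = fft(embedding)
    for k in range(n):
        if min(k, n - k) > keep:
            spectrum[k] = 0j
    return [z.real for z in ifft(spectrum)]


def gated_mix(original, filtered, gate_logit):
    """Blend filtered and original features with a sigmoid gate."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * f + (1.0 - gate) * o for o, f in zip(original, filtered)]


# Toy "semantic embedding": a constant signal plus a high-frequency ripple.
embedding = [3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0]
filtered = low_pass_filter(embedding, keep=1)
mixed = gated_mix(embedding, filtered, gate_logit=0.0)
print("filtered:", [round(v, 6) for v in filtered])
print("mixed:", [round(v, 6) for v in mixed])
```

In this toy input the ripple sits entirely in the highest frequency bin, so the filter leaves only the constant component, and the gate then decides how much of the filtered signal replaces the original.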

[IR-14] DAT: Dynamic Alpha Tuning for Hybrid Retrieval in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2503.23013
Authors: Hsin-Ling Hsu, Jengnan Tzeng
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Hybrid retrieval techniques in Retrieval-Augmented Generation (RAG) systems enhance information retrieval by combining dense and sparse (e.g., BM25-based) retrieval methods. However, existing approaches struggle with adaptability, as fixed weighting schemes fail to adjust to different queries. To address this, we propose DAT (Dynamic Alpha Tuning), a novel hybrid retrieval framework that dynamically balances dense retrieval and BM25 for each query. DAT leverages a large language model (LLM) to evaluate the effectiveness of the top-1 results from both retrieval methods, assigning an effectiveness score to each. It then calibrates the optimal weighting factor through effectiveness score normalization, ensuring a more adaptive and query-aware weighting between the two approaches. Empirical results show that DAT consistently and significantly outperforms fixed-weighting hybrid retrieval methods across various evaluation metrics. Even on smaller models, DAT delivers strong performance, highlighting its efficiency and adaptability.
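
The per-query weighting described in the abstract can be sketched as follows. This is an illustration under assumptions, not the authors' code: the effectiveness scores are passed in as plain numbers (in DAT they come from an LLM judging each retriever's top-1 result), alpha is taken as the dense score's share of the two scores, and retrieval scores are min-max normalized before fusion. The function names and the fallback of 0.5 when both scores are zero are hypothetical.

```python
def dynamic_alpha(dense_effectiveness, bm25_effectiveness):
    """Weight for dense retrieval from two effectiveness scores (assumed rule)."""
    total = dense_effectiveness + bm25_effectiveness
    if total == 0:
        return 0.5  # Assumption: fall back to equal weighting.
    return dense_effectiveness / total


def min_max(scores):
    """Scale a {doc_id: score} mapping into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(dense_scores, bm25_scores, alpha):
    """Rank documents by alpha * dense + (1 - alpha) * BM25."""
    dense_n, bm25_n = min_max(dense_scores), min_max(bm25_scores)
    docs = set(dense_n) | set(bm25_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1.0 - alpha) * bm25_n.get(d, 0.0)
        for d in docs
    }
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical retrieval scores for one query.
dense = {"a": 0.9, "b": 0.5, "c": 0.1}
bm25 = {"a": 2.0, "b": 8.0, "c": 5.0}

# The LLM judged the dense top-1 result far more useful: alpha leans dense.
print(hybrid_rank(dense, bm25, dynamic_alpha(4, 1)))
# The LLM judged the BM25 top-1 result more useful: alpha leans sparse.
print(hybrid_rank(dense, bm25, dynamic_alpha(1, 4)))
```

With the same retrieval scores, the ranking flips depending on which retriever's top result was judged more effective, which is the query-aware behavior that a fixed alpha cannot provide.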

Attachment Download

Click to download today's full paper list
