arXiv Daily Papers | 2025-03-07

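This post lists the latest papers retrieved from arXiv.org on 2025-03-07. It updates automatically and is grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.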

Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

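[Quick Read]: This paper addresses the modeling of long-range dependencies in natural language. It establishes a bipartite mutual information scaling law that is distinct from, and scales independently of, the conventional two-point mutual information, and that characterizes a key property of long-context language modeling. Building on this law, the paper formulates the Long-context Language Modeling (L²M) condition, which relates a model's capacity for effective long-context modeling to how its latent state size scales. Experiments on Transformers and state space models validate the theory, which offers theoretical guidance for extending large language models to longer context lengths.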

Link: https://arxiv.org/abs/2503.04725
Authors: Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 29 pages, 12 figures, 1 table

Abstract:We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L²M) condition, which relates a model’s capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.
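
The two-point quantity that the abstract contrasts with the bipartite law is an ordinary mutual information between two discrete variables. As a rough illustration only, the Python sketch below computes a plug-in estimate of I(X; Y) from paired samples. The function name `mutual_information` and the toy data are assumptions made for this example and do not come from the paper. Reading the bipartite case as the same quantity taken between two adjacent blocks of text is our interpretation, and this naive estimator would not be practical for long blocks.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in nats from a list of (x, y) samples.

    Hypothetical helper for illustration; not the estimator used in the paper.
    """
    n = len(pairs)
    joint = Counter(pairs)                 # counts of each (x, y) pair
    count_x = Counter(x for x, _ in pairs)  # marginal counts of x
    count_y = Counter(y for _, y in pairs)  # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy data: perfectly correlated symbols versus independent symbols.
correlated = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
print(mutual_information(correlated))
print(mutual_information(independent))
```

For a two-point measurement, `x` and `y` would be single tokens a fixed distance apart; the estimate shrinks toward zero as the two positions become statistically independent.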

[NLP-1] LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

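[Quick Read]: This paper targets the limitations of existing LLM-based speech dialogue systems: fine-tuning requirements, high computational overhead, and text-speech misalignment, along with the loss of linguistic capability that comes from modifying the LLM to support speech. Its core proposal is LLMVoX, a lightweight (30M-parameter), LLM-agnostic, autoregressive streaming text-to-speech (TTS) system. LLMVoX decouples speech synthesis from LLM processing through a multi-queue token streaming architecture, producing high-quality speech at low latency while fully preserving the base LLM's linguistic capabilities. Its plug-and-play design extends to different tasks, and it generalizes to new languages with dataset adaptation alone, reaching a low Character Error Rate (CER) on an Arabic speech task. The key innovations are that the decoupling supports infinite-length dialogues and that LLMVoX can be combined with a vision-language model, without additional multimodal training, to form a unified model with speech, text, and vision capabilities.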

Link: https://arxiv.org/abs/2503.04724
Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI); Linköping University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page is available at this https URL .

[NLP-2] Shifting Long-Context LLMs Research from Input to Output

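[Quick Read]: This paper addresses the weakness of long-context LLMs at long-output generation. Current research concentrates on extending comprehension of input context and pays far less attention to generating long text that is coherent, rich in content, and logically consistent. Such tasks include novel writing, long-term planning, and complex reasoning; they require models both to understand extensive context and to produce high-quality long text, and they expose a clear capability gap in current LLMs. The paper calls for NLP research to shift its focus toward foundational models built specifically for high-quality long-form generation, and stresses the importance and practical potential of this direction.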

Link: https://arxiv.org/abs/2503.04723
Authors: Yuhao Wu, Yushi Bai, Zhiqing Hu, Shangqing Tu, Ming Shan Hee, Juanzi Li, Roy Ka-Wei Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Recent advancements in long-context Large Language Models (LLMs) have primarily concentrated on processing extended input contexts, resulting in significant strides in long-context comprehension. However, the equally critical aspect of generating long-form outputs has received comparatively less attention. This paper advocates for a paradigm shift in NLP research toward addressing the challenges of long-output generation. Tasks such as novel writing, long-term planning, and complex reasoning require models to understand extensive contexts and produce coherent, contextually rich, and logically consistent extended text. These demands highlight a critical gap in current LLM capabilities. We underscore the importance of this under-explored domain and call for focused efforts to develop foundational LLMs tailored for generating high-quality, long-form outputs, which hold immense potential for real-world applications.

[NLP-3] Enough Coin Flips Can Make LLMs Act Bayesian

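[Quick Read]: This paper asks whether the reasoning that large language models (LLMs) show under in-context learning (ICL) from few-shot examples follows a Bayesian framework for structured reasoning or relies only on pattern matching. Using a controlled setting of biased coin flips, it analyzes LLM behavior in zero-shot and few-shot conditions. It finds that LLMs often hold biased priors but, given in-context evidence, update them in a way that broadly follows Bayesian posterior updating, and that attention magnitude has negligible effect on Bayesian inference. Deviations from Bayesian behavior are attributed primarily to miscalibrated priors rather than to flawed updates.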

Link: https://arxiv.org/abs/2503.04722
Authors: Ritwik Gupta, Rodolfo Corona, Jiaxin Ge, Eric Wang, Dan Klein, Trevor Darrell, David M. Chan
Affiliations: University of California, Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs utilize ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching. Using a controlled setting of biased coin flips, we find that: (1) LLMs often possess biased priors, causing initial divergence in zero-shot settings, (2) in-context evidence outweighs explicit bias instructions, (3) LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors rather than flawed updates, and (4) attention magnitude has negligible effect on Bayesian inference. With sufficient demonstrations of biased coin flips via ICL, LLMs update their priors in a Bayesian manner.
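
The "Bayesian posterior update" that the abstract uses as its reference behavior can be made concrete with the standard Beta-Bernoulli model for coin flips. The following is a minimal sketch under that assumption; the function names and the example prior are hypothetical and are not taken from the paper's experiments.

```python
def beta_posterior(prior_heads, prior_tails, flips):
    """Return the (alpha, beta) parameters of the Beta posterior.

    `flips` is a sequence of 'H' and 'T' outcomes; the prior is expressed
    as pseudo-counts of heads and tails.
    """
    heads = flips.count("H")
    tails = flips.count("T")
    return prior_heads + heads, prior_tails + tails


def posterior_mean(alpha, beta):
    """Expected probability of heads under a Beta(alpha, beta) belief."""
    return alpha / (alpha + beta)


# A prior biased toward heads is gradually overridden by tails-heavy evidence.
biased_prior = (8.0, 2.0)
flips = "T" * 70 + "H" * 30
alpha, beta = beta_posterior(*biased_prior, flips)
print(posterior_mean(*biased_prior))  # belief before seeing any flips
print(posterior_mean(alpha, beta))    # belief after 100 observed flips
```

With few flips the biased prior dominates the estimate, and with many flips the observed frequency dominates, which mirrors the abstract's point that enough in-context demonstrations bring the model's behavior close to the Bayesian answer.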

[NLP-4] Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities

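[Quick Read]: This paper addresses the limitations of current evaluation methods for Spoken Dialogue Models (SDMs), especially for assessing natural, interactive conversational behavior in full-duplex SDMs. Existing evaluations are usually confined to turn-based metrics or corpus-level analyses (such as turn gaps and pauses) and cannot capture the complex interactive behavior of full-duplex systems. The paper proposes Full-Duplex-Bench, a new benchmark that systematically evaluates key conversational behaviors: pause handling, backchanneling, turn-taking, and interruption management. Automatic metrics make the assessment of SDMs' interactive performance consistent and reproducible. The goal is to advance spoken dialogue modeling and encourage more interactive, natural dialogue systems.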

Link: https://arxiv.org/abs/2503.04721
Authors: Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, Hung-yi Lee
Affiliations: Graduate Institute of Communication Engineering, National Taiwan University; UC Berkeley; University of Washington; MIT CSAIL
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Spoken dialogue modeling introduces unique challenges beyond text-based language modeling, demanding robust turn-taking, backchanneling, and real-time interaction. Although most Spoken Dialogue Models (SDMs) rely on half-duplex processing (handling speech one turn at a time), emerging full-duplex SDMs can listen and speak simultaneously, enabling more natural and engaging conversations. However, current evaluations of such models remain limited, often focusing on turn-based metrics or high-level corpus analyses (e.g., turn gaps, pauses). To address this gap, we present Full-Duplex-Bench, a new benchmark that systematically evaluates key conversational behaviors: pause handling, backchanneling, turn-taking, and interruption management. Our framework uses automatic metrics for consistent and reproducible assessments of SDMs’ interactive performance. By offering an open and standardized evaluation benchmark, we aim to advance spoken dialogue modeling and encourage the development of more interactive and natural dialogue systems.

[NLP-5] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

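[Quick Read]: This paper addresses the fact that existing reasoning language models cannot precisely control the length of their chain-of-thought at test time, which prevents fine-grained allocation of compute and performance across tasks. It proposes Length Controlled Policy Optimization (LCPO), a reinforcement learning method that optimizes for both accuracy and adherence to a user-specified length constraint, and uses it to train L1, a reasoning language model whose outputs satisfy the length constraint given in the prompt. Besides enabling precise control of reasoning length, the method unexpectedly reveals strong short chain-of-thought capability, and it outperforms the state-of-the-art S1 method for length control.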

Link: https://arxiv.org/abs/2503.04697
Authors: Pranjal Aggarwal, Sean Welleck
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Reasoning language models have shown an uncanny ability to improve performance at test-time by "thinking longer", that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning language model that produces outputs satisfying a length constraint given in its prompt. L1’s length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. For instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at this https URL.
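
One way to picture an objective that "optimizes for accuracy and adherence to user-specified length constraints" is a scalar reward that pays for a correct answer and subtracts a penalty proportional to the gap between the requested and the generated length. The sketch below is only an assumed form written for illustration: the function name, the absolute-difference penalty, and the weight `alpha` are hypothetical and should not be read as the paper's exact objective.

```python
def length_controlled_reward(is_correct, target_tokens, actual_tokens, alpha=0.001):
    """Illustrative reward: correctness minus a scaled length deviation.

    `alpha` (an assumed hyperparameter) sets how strongly a deviation from
    the requested length is punished relative to getting the answer right.
    """
    correctness = 1.0 if is_correct else 0.0
    return correctness - alpha * abs(target_tokens - actual_tokens)


# A correct answer at the requested length earns the full reward, while a
# correct answer that overshoots the target by 500 tokens earns less.
print(length_controlled_reward(True, 512, 512))
print(length_controlled_reward(True, 512, 1012))
print(length_controlled_reward(False, 512, 512))
```

Under a reward of this shape, a policy trained with reinforcement learning is pushed toward answers that are both correct and close to the length stated in the prompt, which is the trade-off the abstract describes.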

[NLP-6] UIPE: Enhancing LLM Unlearning by Removing Knowledge Related to Forgetting Targets

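[Quick Read]: This paper addresses the performance bottleneck of conventional unlearning methods for Large Language Models (LLMs), which inevitably absorb harmful information during training. Existing methods, such as gradient ascent-based approaches, focus on forgetting the target data and overlook how strongly logically related knowledge affects unlearning. Through theoretical and experimental analysis, the paper shows that a model can reconstruct forgotten target content by reasoning over logically related knowledge, and that this is a key cause of poor unlearning performance. To address this, it proposes Unlearning Improvement via Parameter Extrapolation (UIPE), which removes knowledge highly correlated with the forgetting targets. Experiments show that UIPE significantly improves several mainstream LLM unlearning methods on the TOFU benchmark.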

Link: https://arxiv.org/abs/2503.04693
Authors: Wenyu Wang, Mengqi Zhang, Xiaotian Ye, Zhaochun Ren, Zhumin Chen, Pengjie Ren
Affiliations: Shandong University; Beijing University of Posts and Telecommunications; Leiden University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) inevitably acquire harmful information during training on massive datasets. LLM unlearning aims to eliminate the influence of such harmful information while maintaining the model’s overall performance. Existing unlearning methods, represented by gradient ascent-based approaches, primarily focus on forgetting target data while overlooking the crucial impact of logically related knowledge on the effectiveness of unlearning. In this paper, through both theoretical and experimental analyses, we first demonstrate that a key reason for the suboptimal unlearning performance is that models can reconstruct the target content through reasoning with logically related knowledge. To address this issue, we propose Unlearning Improvement via Parameter Extrapolation (UIPE), a method that removes knowledge highly correlated with the forgetting targets. Experimental results show that UIPE significantly enhances the performance of various mainstream LLM unlearning methods on the TOFU benchmark.

[NLP-7] Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

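[Quick Read]: This paper addresses gaps in how the reasoning abilities of large language models (LLMs) are evaluated in medicine, examining the reasoning process in full rather than only the quality of the final output. Although reasoning-enhanced LLMs have advanced markedly, the quality and reliability of their reasoning have not been adequately validated in highly specialized, complex medical settings, particularly on tasks such as diagnostic recommendations, triage, and treatment planning. A standard method for objectively measuring the quality and efficiency of these models' reasoning is also missing.

To address these challenges, the study presents MedR-Bench, a new benchmark dataset for evaluating medical reasoning. It contains 1,453 structured patient cases extracted from case reports, each with a reasoning reference. The benchmark covers common and rare diseases across 13 body systems and 10 specialty disorders, and it defines a framework of three key clinical stages (assessment recommendation, diagnostic decision-making, and treatment planning) to evaluate LLMs across the entire patient journey.

To evaluate reasoning more effectively, the team also built an agentic evaluation system called Reasoning Evaluator. It automatically and objectively quantifies free-text reasoning responses in terms of efficiency, factuality, and completeness, and it scales by dynamically searching and cross-referencing. Using this approach, the researchers tested five state-of-the-art reasoning LLMs. Current models exceed 85% accuracy on relatively simple diagnostic tasks, and their reasoning is generally factual (factuality scores above 90%), but they remain clearly limited on more complex clinical reasoning tasks and often omit critical reasoning steps.

In short, the paper's key contribution is a new evaluation framework and toolset that give a more precise picture of the strengths and weaknesses of current clinical LLMs and point to clear directions for improving them.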

Link: https://arxiv.org/abs/2503.04691
Authors: Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Weike Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie
Affiliations: Shanghai Jiao Tong University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The latest reasoning-enhanced large language models (reasoning LLMs), such as DeepSeek-R1 and OpenAI-o3, have demonstrated remarkable success. However, the application of such reasoning enhancements to the highly professional medical domain has not been clearly evaluated, particularly with regard to not only assessing the final generation but also examining the quality of their reasoning processes. In this study, we present MedR-Bench, a reasoning-focused medical evaluation benchmark comprising 1,453 structured patient cases with reasoning references mined from case reports. Our benchmark spans 13 body systems and 10 specialty disorders, encompassing both common and rare diseases. In our evaluation, we introduce a versatile framework consisting of three critical clinical stages: assessment recommendation, diagnostic decision-making, and treatment planning, comprehensively capturing the LLMs’ performance across the entire patient journey in healthcare. For metrics, we propose a novel agentic system, Reasoning Evaluator, designed to automate and objectively quantify free-text reasoning responses in a scalable manner from the perspectives of efficiency, factuality, and completeness by dynamically searching and performing cross-referencing checks. As a result, we assess five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and others. Our results reveal that current LLMs can handle relatively simple diagnostic tasks with sufficient critical assessment results, achieving accuracy generally over 85%. However, they still struggle with more complex tasks, such as assessment recommendation and treatment planning. In reasoning, their reasoning processes are generally reliable, with factuality scores exceeding 90%, though they often omit critical reasoning steps. Our study clearly reveals further development directions for current clinical LLMs.

[NLP-8] DIMSUM: Discourse in Mathematical Reasoning as a Supervision Module

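[Quick Read]: This paper asks how to improve model reasoning on the GSM8k dataset, particularly when a model may have memorized the training set rather than learned to reason. It argues that recent gains by large language models (LLMs) on this dataset may stem from exposure to a broader pretraining data distribution rather than from better reasoning. As a remedy, it proposes discourse structure as a new source of information for improving reasoning. With discourse structure added, even a model such as Llama2 13b improves by up to 160%, and large-model performance on out-of-distribution examples improves substantially.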

Link: https://arxiv.org/abs/2503.04685
Authors: Krish Sharma, Niyar R Barman, Nicholas Asher, Akshay Chaturvedi
Affiliations: IRIT, Toulouse; NIT Silchar
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We look at reasoning on GSM8k, a dataset of short texts presenting primary school math problems. We find, with Mirzadeh et al. (2024), that current LLM progress on the data set may not be explained by better reasoning but by exposure to a broader pretraining data distribution. We then introduce a novel information source for helping models with less data or inferior training reason better: discourse structure. We show that discourse structure improves performance for models like Llama2 13b by up to 160%. Even for models that have most likely memorized the data set, adding discourse structural information to the model still improves predictions and dramatically improves large model performance on out of distribution examples.

[NLP-9] LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue NAACL2025

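[Quick Read]: This paper addresses two main problems in User Satisfaction Estimation (USE): limited understanding of the underlying reasons for user dissatisfaction, and the high cost of annotating user intentions. It proposes PRAISE (Plan and Retrieval Alignment for Interpretable Satisfaction Estimation), an interpretable framework built from three modules. The Strategy Planner develops strategies, expressed as natural language criteria, for classifying user satisfaction; the Feature Retriever draws on the knowledge of Large Language Models (LLMs) and retrieves relevant features from utterances; and the Score Analyzer evaluates the strategy predictions and produces the final satisfaction classification. PRAISE reaches state-of-the-art performance on three USE benchmarks, offers instance-level explanations, and is more efficient because it does not need LLMs at inference time.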

Link: https://arxiv.org/abs/2503.04675
Authors: Sangyeop Kim, Sohhyung Park, Jaewon Jung, Jinseok Kim, Sungzoon Cho
Affiliations: Seoul National University; Coxwave
Categories: Computation and Language (cs.CL)
Comments: Accepted by NAACL 2025

Abstract:Understanding user satisfaction with conversational systems, known as User Satisfaction Estimation (USE), is essential for assessing dialogue quality and enhancing user experiences. However, existing methods for USE face challenges due to limited understanding of underlying reasons for user dissatisfaction and the high costs of annotating user intentions. To address these challenges, we propose PRAISE (Plan and Retrieval Alignment for Interpretable Satisfaction Estimation), an interpretable framework for effective user satisfaction prediction. PRAISE operates through three key modules. The Strategy Planner develops strategies, which are natural language criteria for classifying user satisfaction. The Feature Retriever then incorporates knowledge on user satisfaction from Large Language Models (LLMs) and retrieves relevance features from utterances. Finally, the Score Analyzer evaluates strategy predictions and classifies user satisfaction. Experimental results demonstrate that PRAISE achieves state-of-the-art performance on three benchmarks for the USE task. Beyond its superior performance, PRAISE offers additional benefits. It enhances interpretability by providing instance-level explanations through effective alignment of utterances with strategies. Moreover, PRAISE operates more efficiently than existing approaches by eliminating the need for LLMs during the inference phase.

[NLP-10] An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding AAAI2025

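[Quick Read]: This paper addresses insufficient shared representations and the negative effect of redundant features in Multi-Task Learning (MTL), with the aim of improving language understanding, especially in data-constrained and noisy scenarios. It proposes a principled multi-task representation learning framework (InfoMTL) built on two principles. First, a shared information maximization principle learns shared representations that are more sufficient for all target tasks, avoiding the insufficiency caused by representation compression. Second, a task-specific information minimization principle compresses task-irrelevant redundant information while preserving the information needed for each target task, which improves the efficiency and robustness of multi-task prediction. Experiments show that the method outperforms 12 comparative methods on several classification benchmarks and yields representations that are more sufficient, data-efficient, and robust.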

Link: https://arxiv.org/abs/2503.04667
Authors: Dou Hu, Lingwei Wei, Wei Zhou, Songlin Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 11 pages, accepted to AAAI 2025 (main conference), the code is available at this https URL

Abstract:This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language models (PLMs) under the multi-task paradigm. Firstly, a shared information maximization principle is proposed to learn more sufficient shared representations for all target tasks. It can avoid the insufficiency issue arising from representation compression in the multi-task paradigm. Secondly, a task-specific information minimization principle is designed to mitigate the negative effect of potential redundant features in the input for each task. It can compress task-irrelevant redundant information and preserve necessary information relevant to the target for multi-task prediction. Experiments on six classification benchmarks show that our method outperforms 12 comparative multi-task methods under the same multi-task settings, especially in data-constrained and noisy scenarios. Extensive experiments demonstrate that the learned representations are more sufficient, data-efficient, and robust.

[NLP-11] Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment

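[Quick Read]: This paper addresses data scarcity in multilingual preference alignment: how to align large language models (LLMs) with human preferences when multilingual preference data is limited. It proposes capturing the preferences learned by a well-aligned English model through implicit rewards and transferring them to other languages through iterative training. The key step is to build an implicit reward model from the logits of an English model aligned with Direct Preference Optimization (DPO) and of its corresponding reference model, then use it to annotate preference relations in cross-lingual instruction-following pairs, which transfers preference knowledge from English to other languages. The approach significantly reduces the need for large amounts of multilingual preference data.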

Link: https://arxiv.org/abs/2503.04647
Authors: Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract:Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that captures learned preferences from well-aligned English models by implicit rewards and transfers them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data. The code is available at this https URL.
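
In the DPO formulation, the implicit reward of a response is the scaled difference between its log-probability under the aligned policy and under the reference model, r(x, y) = β · (log π(y|x) − log π_ref(y|x)). The sketch below shows how such a score could be used to label a preference pair. It assumes the sequence log-probabilities have already been computed elsewhere; the function names, the dictionary layout, the value of `beta`, and the numbers are all hypothetical and are not the authors' code.

```python
def implicit_reward(policy_logprob, reference_logprob, beta=0.1):
    """DPO-style implicit reward from two sequence log-probabilities."""
    return beta * (policy_logprob - reference_logprob)


def label_preference(response_a, response_b, beta=0.1):
    """Return which of two responses the implicit reward prefers.

    Each response is a dict with keys 'text', 'policy_logprob', and
    'reference_logprob' (an assumed layout for this example).
    """
    reward_a = implicit_reward(
        response_a["policy_logprob"], response_a["reference_logprob"], beta
    )
    reward_b = implicit_reward(
        response_b["policy_logprob"], response_b["reference_logprob"], beta
    )
    if reward_a >= reward_b:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"chosen": chosen["text"], "rejected": rejected["text"]}


# Made-up log-probabilities for two candidate responses to one instruction.
a = {"text": "response A", "policy_logprob": -10.0, "reference_logprob": -12.0}
b = {"text": "response B", "policy_logprob": -9.0, "reference_logprob": -8.0}
print(label_preference(a, b))
```

Pairs labeled this way could then serve as training data for a further round of DPO fine-tuning, which matches the iterative procedure outlined in the abstract.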

[NLP-12] IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval NAACL2025

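[Quick Read]: This paper addresses the lack of evaluation of instruction-following ability for Information Retrieval (IR) in expert domains. It introduces IFIR, the first comprehensive benchmark for this setting, with 2,426 high-quality examples in eight subsets across four specialized domains (finance, law, healthcare, and science literature) that simulate varied real-world retrieval tasks. Its key contributions are instructions at different levels of complexity, which allow detailed analysis of instruction-following ability, and a new evaluation method based on a Large Language Model (LLM) that gives a more precise and reliable assessment. Experiments show that current models struggle with complex, domain-specific instructions, and the in-depth analysis of these limitations offers guidance for future retrieval models.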

Link: https://arxiv.org/abs/2503.04644
Authors: Tingyu Song, Guo Gan, Mingsheng Shang, Yilun Zhao
Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences; Zhejiang University; Yale University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: NAACL 2025 Main

Abstract:We introduce IFIR, the first comprehensive benchmark designed to evaluate instruction-following information retrieval (IR) in expert domains. IFIR includes 2,426 high-quality examples and covers eight subsets across four specialized domains: finance, law, healthcare, and science literature. Each subset addresses one or more domain-specific retrieval tasks, replicating real-world scenarios where customized instructions are critical. IFIR enables a detailed analysis of instruction-following retrieval capabilities by incorporating instructions at different levels of complexity. We also propose a novel LLM-based evaluation method to provide a more precise and reliable assessment of model performance in following instructions. Through extensive experiments on 15 frontier retrieval models, including those based on LLMs, our results reveal that current models face significant challenges in effectively following complex, domain-specific instructions. We further provide in-depth analyses to highlight these limitations, offering valuable insights to guide future advancements in retriever development.

[NLP-13] Mark Your LLM: Detecting the Misuse of Open-Source Large Language Models via Watermarking ICLR2025

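[Quick Read]: This paper addresses the detection of potential misuse of open-source Large Language Models (LLMs) in two main scenarios: intellectual property (IP) violation and LLM usage violation. Existing watermarking techniques either apply only at inference time, which does not suit open-source LLMs, or mainly target classification LLMs rather than generative ones, so they do not meet the needs of misuse detection in an open-source setting. The paper explores inference-time watermark distillation and backdoor watermarking in these scenarios and proposes comprehensive evaluation methods that measure how further fine-tuning affects the watermarks and how the watermarks affect LLM performance. Experiments show that backdoor watermarking detects IP violation effectively, while inference-time watermark distillation applies to both scenarios but is less robust to further fine-tuning and has a larger impact on LLM performance. The authors point to more advanced watermarking methods for open-source LLMs as an important future direction.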

Link: https://arxiv.org/abs/2503.04636
Authors: Yijie Xu, Aiwei Liu, Xuming Hu, Lijie Wen, Hui Xiong
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University; The Hong Kong University of Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted by the 1st Workshop on GenAI Watermarking, collocated with ICLR 2025

Abstract:As open-source large language models (LLMs) like Llama3 become more capable, it is crucial to develop watermarking techniques to detect their potential misuse. Existing watermarking methods either add watermarks during LLM inference, which is unsuitable for open-source LLMs, or primarily target classification LLMs rather than recent generative LLMs. Adapting these watermarks to open-source LLMs for misuse detection remains an open challenge. This work defines two misuse scenarios for open-source LLMs: intellectual property (IP) violation and LLM Usage Violation. Then, we explore the application of inference-time watermark distillation and backdoor watermarking in these contexts. We propose comprehensive evaluation methods to assess the impact of various real-world further fine-tuning scenarios on watermarks and the effect of these watermarks on LLM performance. Our experiments reveal that backdoor watermarking could effectively detect IP Violation, while inference-time watermark distillation is applicable in both scenarios but less robust to further fine-tuning and has a more significant impact on LLM performance compared to backdoor watermarking. Exploring more advanced watermarking methods for open-source LLMs to detect their misuse should be an important future direction.

[NLP-14] SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing

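[Quick Read]: This paper addresses the quality gap in survey papers generated automatically with large language models (LLMs), especially in outline quality and citation accuracy. It proposes SurveyForge, which first generates an outline by analyzing the logical structure of human-written surveys and consulting retrieved domain-related articles. Then, using high-quality papers that its scholar navigation agent retrieves from memory, SurveyForge generates and refines the content of the article. For evaluation, the authors build SurveyBench, which includes 100 human-written survey papers and assesses AI-generated surveys along three dimensions: reference, outline, and content quality. Experiments show that SurveyForge outperforms earlier work such as AutoSurvey.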

Link: https://arxiv.org/abs/2503.04629
Authors: Xiangchao Yan, Shiyang Feng, Jiakang Yuan, Renqiu Xia, Bin Wang, Bo Zhang, Lei Bai
Affiliations: Shanghai Artificial Intelligence Laboratory; Fudan University; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: Code and dataset are available for downloading at: this https URL. 22 pages, 10 figures

Abstract:Survey papers play a crucial role in scientific research, especially given the rapid growth of research publications. Recently, researchers have begun using LLMs to automate survey generation for better efficiency. However, the quality gap between LLM-generated surveys and those written by humans remains significant, particularly in terms of outline quality and citation accuracy. To close these gaps, we introduce SurveyForge, which first generates the outline by analyzing the logical structure of human-written outlines and referring to the retrieved domain-related articles. Subsequently, leveraging high-quality papers retrieved from memory by our scholar navigation agent, SurveyForge can automatically generate and refine the content of the generated article. Moreover, to achieve a comprehensive evaluation, we construct SurveyBench, which includes 100 human-written survey papers for win-rate comparison and assesses AI-generated survey papers across three dimensions: reference, outline, and content quality. Experiments demonstrate that SurveyForge can outperform previous works such as AutoSurvey.

[NLP-15] START: Self-taught Reasoner with Tools

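[Quick Read]: This paper addresses the tendency of large reasoning models (LRMs) to hallucinate and run inefficiently on complex reasoning tasks because they rely solely on internal reasoning. It proposes START (Self-Taught Reasoner with Tools), a long chain-of-thought reasoning LLM that integrates external tools. Through code execution, START performs complex computations, checks itself, explores diverse methods, and debugs itself, which addresses these limitations. Its key innovation is a self-learning framework with two techniques: 1) Hint-infer, which shows that inserting hand-designed hints (for example, "Wait, maybe using Python here is a good idea.") during inference effectively stimulates the model's ability to use external tools without any demonstration data; and 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT), which combines Hint-infer with RFT by scoring, filtering, and modifying the tool-invoking reasoning trajectories that the LRM generates via Hint-infer, and then fine-tuning the LRM on them. With this framework, QwQ-32B is fine-tuned into START, which reaches accuracies of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3% on PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), respectively. It significantly outperforms the base QwQ-32B and performs comparably to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.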

Link: https://arxiv.org/abs/2503.04625
Authors: Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 38 pages, 5 figures and 6 tables

Abstract:Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of a LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by a LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
zh

[NLP-16] SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Modeling

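Quick read: This paper addresses data sparsity in analyzing dynamic user sentiment on e-commerce platforms, specifically in sentiment analysis on streaming reviews, where sparsity arises in temporal, spatial, and combined forms. Traditional methods analyze static reviews and cannot capture how a user's sentiment ratings and review text evolve together over time; existing streaming-review methods remove that limitation but still suffer from severe sparsity. The proposed solution is SynGraph, a framework that divides users into mid-tail, long-tail, and extreme scenarios and applies LLM-augmented enhancements within a dynamic graph structure to alleviate sparsity, improving the accuracy and effectiveness of sentiment modeling on streaming reviews.

Link: https://arxiv.org/abs/2503.04619
Authors: Xin Zhang,Qiyu Wei,Yingjie Zhu,Linhai Zhang,Deyu Zhou,Sophia Ananiadou
Affiliations: The University of Manchester; Harbin Institute of Technology; King’s College London; Southeast University
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 17 figures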

Abstract:User reviews on e-commerce platforms exhibit dynamic sentiment patterns driven by temporal and contextual factors. Traditional sentiment analysis methods focus on static reviews, failing to capture the evolving temporal relationship between user sentiment rating and textual content. Sentiment analysis on streaming reviews addresses this limitation by modeling and predicting the temporal evolution of user sentiments. However, it suffers from data sparsity, manifesting in temporal, spatial, and combined forms. In this paper, we introduce SynGraph, a novel framework designed to address data sparsity in sentiment analysis on streaming reviews. SynGraph alleviates data sparsity by categorizing users into mid-tail, long-tail, and extreme scenarios and incorporating LLM-augmented enhancements within a dynamic graph-based structure. Experiments on real-world datasets demonstrate its effectiveness in addressing sparsity and improving sentiment modeling in streaming reviews.

[NLP-17] Better Process Supervision with Bi-directional Rewarding Signals

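Quick read: This paper addresses a limitation of existing process supervision methods such as process reward models (PRMs): they reward only the steps taken so far and do not model the distance to the final target. The key solution is BiRM, a bidirectional process supervision model that both evaluates the correctness of previous steps and predicts the probability of future success. The design is inspired by the A* algorithm, in which an effective supervisory signal considers both the cost already incurred and the estimated remaining cost. Experiments on mathematical reasoning tasks show that BiRM evaluates reasoning steps more precisely and outperforms existing methods under both sampling-based and search-based strategies.

Link: https://arxiv.org/abs/2503.04618
Authors: Wenxiang Chen,Wei He,Zhiheng Xi,Honglin Guo,Boyang Hong,Jiazheng Zhang,Rui Zheng,Nijun Li,Tao Gui,Yun Li,Qi Zhang,Xuanjing Huang
Affiliations: School of Computer Science, Fudan University; Cognitive AI Lab, Shanghai Huawei Technologies, China; Institute of Modern Languages and Linguistics, Fudan University
Categories: Computation and Language (cs.CL)
Comments: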

Abstract:Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
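The A*-style idea in the abstract (combine what has been earned so far with an estimate of what lies ahead) can be illustrated with a small sketch. This is not the paper's formula: the additive combination, the weight `beta`, and the function names are assumptions made for illustration, and the scores are made-up numbers standing in for model outputs.

```python
from typing import List, Tuple

# (candidate text, backward score for steps so far, forward estimate of final success)
Candidate = Tuple[str, float, float]


def bidirectional_score(backward: float, forward: float, beta: float = 1.0) -> float:
    """Combine a backward-looking score with a forward-looking estimate.

    Assumption: a simple weighted sum, in the spirit of A*'s f = g + h.
    """
    return backward + beta * forward


def best_of_n(candidates: List[Candidate], beta: float = 1.0) -> str:
    """Return the candidate whose combined score is highest."""
    best = max(candidates, key=lambda c: bidirectional_score(c[1], c[2], beta))
    return best[0]


# Hypothetical scores: A looks better so far, B looks more likely to succeed.
candidates = [
    ("solution A", 0.9, 0.2),
    ("solution B", 0.7, 0.8),
]

print(best_of_n(candidates))            # uses both signals
print(best_of_n(candidates, beta=0.0))  # backward signal only, like a one-directional reward
```

Setting `beta` to zero recovers a purely backward-looking ranking, which makes the effect of the forward term easy to see on toy inputs.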

[NLP-18] HalluCounter: Reference-free LLM Hallucination Detection in the Wild!

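Quick read: This paper addresses the low detection accuracy of existing reference-free hallucination detection (RFHD) methods, which fail to capture query-response alignment patterns, as well as the lack of large-scale, cross-domain benchmark datasets, since most existing datasets are limited in size and scope. It proposes HalluCounter, which combines response-response and query-response consistency and alignment patterns to train a classifier that detects hallucinations and returns a confidence score together with an optimal response. It also introduces HalluCounterEval, a multi-domain benchmark of synthetically generated and human-curated samples. The method achieves over 90% average confidence in hallucination detection across datasets and outperforms state-of-the-art approaches by a significant margin.

Link: https://arxiv.org/abs/2503.04615
Authors: Ashok Urlana,Gopichand Kanumolu,Charaka Vinayak Kumar,Bala Mallikarjunarao Garlapati,Rahul Mishra
Affiliations: IIIT Hyderabad; TCS Research, Hyderabad, India; University of Oslo
Categories: Computation and Language (cs.CL)
Comments: 30 pages, 4 figures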

Abstract:Response consistency-based, reference-free hallucination detection (RFHD) methods do not depend on internal model states, such as generation probabilities or gradients, which Grey-box models typically rely on but are inaccessible in closed-source LLMs. However, their inability to capture query-response alignment patterns often results in lower detection accuracy. Additionally, the lack of large-scale benchmark datasets spanning diverse domains remains a challenge, as most existing datasets are limited in size and scope. To this end, we propose HalluCounter, a novel reference-free hallucination detection method that utilizes both response-response and query-response consistency and alignment patterns. This enables the training of a classifier that detects hallucinations and provides a confidence score and an optimal response for user queries. Furthermore, we introduce HalluCounterEval, a benchmark dataset comprising both synthetically generated and human-curated samples across multiple domains. Our method outperforms state-of-the-art approaches by a significant margin, achieving over 90% average confidence in hallucination detection across datasets.

[NLP-19] Towards Data-Efficient Language Models: A Child-Inspired Approach to Language Learning

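Quick read: This paper asks how to train language models (LMs) on far less data, in a way that mimics how children learn language. The key solution is a carefully curated small dataset (the 10M-word BabyLM data filtered to 8.5M words) supplemented with 1.5M words of television dialogue, which mimics children's exposure to language through media, together with a vocabulary reduced to 32,000 tokens to match the limited vocabulary of early language acquisition. Curriculum learning further improves the model, and the results underline the importance of dataset selection, vocabulary scaling, and curriculum learning for building more data-efficient language models.

Link: https://arxiv.org/abs/2503.04611
Authors: Mohammad Amin Ghanizadeh,Mohammad Javad Dousti
Affiliations: Department of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Categories: Computation and Language (cs.CL)
Comments: 5 pages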

Abstract:In this work, we describe our approach to the BabyLM Challenge, which trains language models (LMs) on significantly less data than traditional large language models (LLMs), using methods inspired by how human children learn. While a human child is exposed to far less linguistic input than an LLM, they still achieve remarkable language understanding and generation abilities. To this end, we develop a model trained on a curated dataset consisting of 10 million words, primarily sourced from child-directed transcripts. The 2024 BabyLM Challenge initial dataset of 10M words is filtered to 8.5M. Next, it is supplemented with a randomly selected subset of the TVR dataset consisting of 1.5M words of television dialogues. The latter dataset ensures that, like children, the model is also exposed to language through media. Furthermore, we reduce the vocabulary size to 32,000 tokens, aligning it with the limited vocabulary of children in the early stages of language acquisition. We use curriculum learning and are able to match the baseline on certain benchmarks while surpassing the baseline on others. Additionally, incorporating common LLM training datasets, such as MADLAD-400, degrades performance. These findings underscore the importance of dataset selection, vocabulary scaling, and curriculum learning in creating more data-efficient language models that better mimic human learning processes.

[NLP-20] The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

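Quick read: This paper addresses the intrinsic limitations of the two dominant paradigms in text-to-video (T2V) generation: autoregressive language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. It proposes LanDiff, a hybrid framework that combines the strengths of both through coarse-to-fine generation. Its three key innovations are: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations at a compression ratio of roughly 14,000×; (2) a language model that generates semantic tokens carrying high-level semantic relationships; and (3) a streaming diffusion model that refines the coarse semantics into high-fidelity video. LanDiff scores 85.43 on the VBench T2V benchmark, surpassing open-source models such as Hunyuan Video (13B) and commercial models such as Sora, Keling, and Hailuo, and it also achieves state-of-the-art performance in long video generation.

Link: https://arxiv.org/abs/2503.04606
Authors: Aoxiong Yin,Kai Shen,Yichong Leng,Xu Tan,Xinyu Zhou,Juncheng Li,Siliang Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: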

Abstract:Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a ~14,000× compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Keling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at this https URL.

[NLP-21] HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Quick read: This paper addresses the difficulty of training deep Transformer networks, in particular the effect of where layer normalization (LN) is placed. Pre-Norm makes training easier thanks to its more prominent identity path but often yields suboptimal performance in large language models (LLMs); Post-Norm performs better but trains less stably. To resolve this trade-off, the paper proposes HybridNorm, which combines the advantages of both: each Transformer block applies QKV normalization inside the attention mechanism and Post-Norm in the feed-forward network (FFN). The design improves training stability and model performance, particularly for LLMs, and experiments show that HybridNorm outperforms both Pre-Norm and Post-Norm on dense and sparse architectures.

Link: https://arxiv.org/abs/2503.04598
Authors: Zhijian Zhuo,Yutao Zeng,Ya Wang,Sijun Zhang,Jian Yang,Xiaoqing Li,Xun Zhou,Jinwen Ma
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the location of layer normalization. While Pre-Norm structures facilitate easier training due to their more prominent identity path, they often yield suboptimal performance compared to Post-Norm. In this paper, we propose HybridNorm, a straightforward yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm approaches. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. This design not only stabilizes training but also enhances performance, particularly in the context of LLMs. Comprehensive experiments in both dense and sparse architectures show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches, achieving state-of-the-art results across various benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at this https URL.
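The three normalization placements discussed in this entry can be written out side by side. This is a sketch under assumptions: the symbols $X$ (block input), $Y$ (output of the attention sub-layer), $X'$ (block output), $W_Q, W_K, W_V$, and the single-head form are illustrative notation rather than the paper's, and the hybrid form is only one reading of the abstract's description; the exact definition should be checked against the paper.

```latex
% Pre-Norm block (standard form)
\[
Y = X + \mathrm{Attn}(\mathrm{Norm}(X)), \qquad
X' = Y + \mathrm{FFN}(\mathrm{Norm}(Y))
\]

% Post-Norm block (standard form)
\[
Y = \mathrm{Norm}(X + \mathrm{Attn}(X)), \qquad
X' = \mathrm{Norm}(Y + \mathrm{FFN}(Y))
\]

% Hybrid placement as described in the abstract (assumed form):
% normalize Q, K, V inside attention, then apply Post-Norm around the FFN.
\[
Q = X W_Q, \quad K = X W_K, \quad V = X W_V
\]
\[
\mathrm{Attn}_{\mathrm{QKV}}(X) =
\mathrm{softmax}\!\left(\frac{\mathrm{Norm}(Q)\,\mathrm{Norm}(K)^{\top}}{\sqrt{d_k}}\right)\mathrm{Norm}(V)
\]
\[
Y = X + \mathrm{Attn}_{\mathrm{QKV}}(X), \qquad
X' = \mathrm{Norm}(Y + \mathrm{FFN}(Y))
\]
```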

[NLP-22] Compositional Causal Reasoning Evaluation in Language Models

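Quick read: This paper asks how to systematically evaluate causal reasoning and compositional reasoning in generative AI models. It proposes a unified perspective, compositional causal reasoning (CCR): the ability to infer how causal measures compose and how causal quantities propagate through graphs. The authors build a framework for systematically evaluating CCR for the average treatment effect (ATE) and the probability of necessity and sufficiency (PNS), and demonstrate it by designing CCR tasks for language models in the LLama, Phi, and GPT families. For every model except o1, CCR errors increased with the complexity of the causal paths, exposing the limits of these models on complex causal relationships.

Link: https://arxiv.org/abs/2503.04556
Authors: Jacqueline R. M. A. Maasch,Alihan Hüyük,Xinnuo Xu,Aditya V. Nori,Javier Gonzalez
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: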

Abstract:Causal reasoning and compositional reasoning are two core aspirations in generative AI. Measuring the extent of these behaviors requires principled evaluation methods. We explore a unified perspective that considers both behaviors simultaneously, termed compositional causal reasoning (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. We instantiate a framework for the systematic evaluation of CCR for the average treatment effect and the probability of necessity and sufficiency. As proof of concept, we demonstrate the design of CCR tasks for language models in the LLama, Phi, and GPT families. On a math word problem, our framework revealed a range of taxonomically distinct error patterns. Additionally, CCR errors increased with the complexity of causal paths for all models except o1.

[NLP-23] Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation

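Quick read: This paper addresses the weak performance of LLM-based machine translation (MT) in low-resource and out-of-domain settings. It proposes compositional translation, a new LLM-based paradigm: the source sentence is decomposed into simpler phrases, each phrase is translated with the help of retrieved similar demonstrations, and the LLM is then prompted to translate the full sentence using the self-generated phrase-translation pairs. The approach relies on short phrases being intrinsically easier to translate and easier to match with relevant examples, which makes it especially useful when the selection pool is small or out of domain.

Link: https://arxiv.org/abs/2503.04554
Authors: Armel Zebaze,Benoît Sagot,Rachel Bawden
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: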

Abstract:The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. Machine Translation (MT) has been shown to benefit from in-context examples, in particular when they are semantically similar to the sentence to translate. In this paper, we propose a new LLM-based translation paradigm, compositional translation, to replace naive few-shot MT with similarity-based demonstrations. An LLM is used to decompose a sentence into simpler phrases, and then to translate each phrase with the help of retrieved demonstrations. Finally, the LLM is prompted to translate the initial sentence with the help of the self-generated phrase-translation pairs. Our intuition is that this approach should improve translation because these shorter phrases should be intrinsically easier to translate and easier to match with relevant examples. This is especially beneficial in low-resource scenarios, and more generally whenever the selection pool is small or out of domain. We show that compositional translation boosts LLM translation performance on a wide range of popular MT benchmarks, including FLORES 200, NTREX 128 and TICO-19. Code and outputs are available at this https URL
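The three stages described in the abstract (decompose, translate each phrase with retrieved demonstrations, then translate the full sentence with the phrase-translation pairs) can be sketched as a small pipeline. Everything here is illustrative: the prompt format, the word-overlap retriever, and the names `retrieve`, `format_prompt`, and `compositional_translate` are assumptions, and the toy `llm` and `decompose` callables merely stand in for real model calls.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source text, translation)


def retrieve(query: str, pool: List[Pair], k: int = 2) -> List[Pair]:
    """Rank pool entries by word overlap with the query (stand-in for a real retriever)."""
    q = set(query.lower().split())
    ranked = sorted(pool, key=lambda p: len(q & set(p[0].lower().split())), reverse=True)
    return ranked[:k]


def format_prompt(demos: List[Pair], source: str) -> str:
    """Build a few-shot prompt: demonstrations first, then the text to translate."""
    blocks = [f"Source: {s}\nTranslation: {t}" for s, t in demos]
    blocks.append(f"Source: {source}\nTranslation:")
    return "\n\n".join(blocks)


def compositional_translate(
    sentence: str,
    decompose: Callable[[str], List[str]],
    llm: Callable[[str], str],
    pool: List[Pair],
    k: int = 2,
) -> str:
    # 1) Split the sentence into simpler phrases.
    phrases = decompose(sentence)
    # 2) Translate each phrase with demonstrations retrieved for that phrase.
    phrase_pairs = [(p, llm(format_prompt(retrieve(p, pool, k), p))) for p in phrases]
    # 3) Translate the full sentence, using the self-generated pairs as demonstrations.
    return llm(format_prompt(phrase_pairs, sentence))


# Toy stand-ins so the sketch runs without a model: "translation" is upper-casing.
def toy_llm(prompt: str) -> str:
    return prompt.rsplit("Source: ", 1)[1].split("\n")[0].upper()


def toy_decompose(sentence: str) -> List[str]:
    return sentence.split(", ")


pool = [("good evening", "bonsoir"), ("my house", "ma maison")]
result = compositional_translate("good morning, my friend", toy_decompose, toy_llm, pool)
print(result)
```

Passing the model and the decomposer in as callables keeps the control flow visible and lets either stage be swapped without touching the rest.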

[NLP-24] An Empirical Study on Eliciting and Improving R1-like Reasoning Models

Quick read: This paper studies how to improve large reasoning models, focusing on reinforcement learning (RL) training and tool manipulation. Through systematic experiments it shows that RL training consistently improves the Qwen2.5-32B base model in both response length and test accuracy, and that even a model that already performs well, DeepSeek-R1-Distill-Qwen-1.5B, can be refined further by RL, reaching 39.33% accuracy on AIME 2024. The paper also highlights tool manipulation, which reaches 86.67% accuracy on AIME 2024 with greedy search. These results support the technical path toward slow-thinking models in the STILL project.

Link: https://arxiv.org/abs/2503.04548
Authors: Zhipeng Chen,Yingqian Min,Beichen Zhang,Jie Chen,Jinhao Jiang,Daixuan Cheng,Wayne Xin Zhao,Zheng Liu,Xu Miao,Yang Lu,Lei Fang,Zhongyuan Wang,Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; BAAI; DataCanvas Alaya NeW
Categories: Computation and Language (cs.CL)
Comments: Technical Report on Slow Thinking with LLMs: Part III

Abstract:In this report, we present the third technical report on the development of slow-thinking models as part of the STILL project. As the technical pathway becomes clearer, scaling RL training has become a central technique for implementing such reasoning models. We systematically experiment with and document the effects of various factors influencing RL training, conducting experiments on both base models and fine-tuned models. Specifically, we demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models, enhancing both response length and test accuracy. Furthermore, we show that even when a model like DeepSeek-R1-Distill-Qwen-1.5B has already achieved a high performance level, it can be further refined through RL training, reaching an accuracy of 39.33% on AIME 2024. Beyond RL training, we also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models. This approach achieves a remarkable accuracy of 86.67% with greedy search on AIME 2024, underscoring its effectiveness in enhancing model capabilities. We release our resources at the STILL project website: this https URL.

[NLP-25] Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model

Quick read: This paper addresses the limited performance of multimodal large language models (MLLMs) in specific applications. Tuning MLLMs for downstream tasks faces two key challenges: Task-Expert Specialization, where distribution shift between pre-training and target datasets limits target-task performance, and Open-World Stabilization, where catastrophic forgetting erases the model's general knowledge. The paper systematically reviews recent MLLM tuning methods and groups them into three paradigms: Selective Tuning, Additive Tuning, and Reparameterization Tuning. It benchmarks these strategies across popular MLLM architectures and diverse downstream tasks to establish standardized evaluation and systematic tuning principles, identifies open problems and future directions, and provides a public repository that tracks new developments.

Link: https://arxiv.org/abs/2503.04543
Authors: Wenke Huang,Jian Liang,Xianda Guo,Yiyang Fang,Guancheng Wan,Xuankun Rong,Chi Wen,Zekun Shi,Qingyun Li,Didi Zhu,Yanbiao Ma,Ke Liang,Bin Yang,He Li,Jiawei Shao,Mang Ye,Bo Du
Affiliations: School of Computer Science, Wuhan University, Wuhan, 430072, China; School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin, 150001, China; Department of Computer Science and Technology, Zhejiang University, Hangzhou, 310058, China; School of Artificial Intelligence, Xidian University, Xian, 710071, China; College of Computer Science and Technology, National University of Defense Technology, Changsha, 410073, China; Institute of Artificial Intelligence (TeleAI), China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. While MLLMs demonstrate remarkable versatility, they show limited performance on specialized applications. Moreover, tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert Specialization, where distribution shifts between pre-training and target datasets constrain target performance, and Open-World Stabilization, where catastrophic forgetting erases the model's general knowledge. In this work, we systematically review recent advancements in MLLM tuning methodologies, classifying them into three paradigms: (I) Selective Tuning, (II) Additive Tuning, and (III) Reparameterization Tuning. Furthermore, we benchmark these tuning strategies across popular MLLM architectures and diverse downstream tasks to establish standardized evaluation analysis and systematic tuning principles. Finally, we highlight several open challenges in this domain and propose future research directions. To facilitate ongoing progress in this rapidly evolving field, we provide a public repository that continuously tracks developments: this https URL.

[NLP-26] Large Language Models in Bioinformatics: A Survey

Quick read: This survey examines the transformative impact of large language models (LLMs) on bioinformatics, covering genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics. It systematically reviews recent progress and discusses key challenges: data scarcity, computational complexity, and cross-omics integration. It identifies multimodal learning, hybrid AI models, and clinical applications as future directions for realizing the potential of LLMs in bioinformatics and precision medicine.

Link: https://arxiv.org/abs/2503.04490
Authors: Zhenyu Wang,Zikang Wang,Jiyue Jiang,Pengan Chen,Xiangyu Shi,Yu Li
Affiliations: The Chinese University of Hong Kong; Peking University Third Hospital; The Hong Kong Polytechnic University; The University of Hong Kong
Categories: Computation and Language (cs.CL); Genomics (q-bio.GN)
Comments:

Abstract:Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single-cell data. This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics. Meanwhile, we also discuss several key challenges, including data scarcity, computational complexity, and cross-omics integration, and explore future directions such as multimodal learning, hybrid AI models, and clinical applications. By offering a comprehensive perspective, this paper underscores the transformative potential of LLMs in driving innovations in bioinformatics and precision medicine.

[NLP-27] Generalized Interpolating Discrete Diffusion

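Quick read: This paper addresses the inability of existing language models to revise tokens they have already generated, and the limitations of current diffusion models for text generation. It proposes general interpolating discrete diffusion (GIDD), a family of processes that generalizes masked diffusion and, with a derived theoretical foundation, allows greater flexibility in designing the noising process. Using a novel diffusion evidence lower bound (ELBO), it achieves compute-matched state-of-the-art performance in diffusion language modeling. A hybrid of masking and uniform noise further improves sample quality and lets the model correct its own mistakes, an area where autoregressive models have long struggled.

Link: https://arxiv.org/abs/2503.04482
Authors: Dimitri von Rütte,Janis Fluri,Yuhui Ding,Antonio Orvieto,Bernhard Schölkopf,Thomas Hofmann
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: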

Abstract:While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion and derive the theoretical backbone of a family of general interpolating discrete diffusion (GIDD) processes offering greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD’s flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Our code and models are open-source: this https URL
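The hybrid of masking and uniform noise mentioned in the abstract can be illustrated with a toy forward (noising) step. This is a sketch under assumptions, not the paper's process: it assumes each token is kept with probability `alpha` and otherwise drawn from a mixture of a uniform distribution over the vocabulary (weight `p_uniform`) and a mask token; the names `noise_tokens`, `alpha`, and `p_uniform` and the `[MASK]` symbol are illustrative.

```python
import random
from typing import List, Optional

MASK = "[MASK]"


def noise_tokens(
    tokens: List[str],
    alpha: float,
    p_uniform: float,
    vocab: List[str],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Corrupt a token sequence with a mix of masking and uniform noise.

    Each token is kept with probability alpha. A corrupted token becomes a
    uniformly random vocabulary item with probability p_uniform, else MASK.
    """
    if rng is None:
        rng = random.Random()
    noised = []
    for tok in tokens:
        if rng.random() < alpha:
            noised.append(tok)                # keep the clean token
        elif rng.random() < p_uniform:
            noised.append(rng.choice(vocab))  # uniform noise: may be any token
        else:
            noised.append(MASK)               # masking noise
    return noised


vocab = ["the", "cat", "sat", "on", "mat"]
tokens = ["the", "cat", "sat"]

# p_uniform = 0 gives pure masking; p_uniform > 0 also injects wrong tokens,
# which a denoiser would then have to learn to detect and correct.
print(noise_tokens(tokens, alpha=0.5, p_uniform=0.2, vocab=vocab, rng=random.Random(0)))
```

With uniform noise in the mix, a visible token is no longer guaranteed to be correct, which is what gives a denoising model a reason to revise tokens rather than only fill in masks.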

[NLP-28] Guiding LLMs to Generate High-Fidelity and High-Quality Counterfactual Explanations for Text Classification

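Quick read: This paper addresses counterfactual (CF) generation for explaining deep learning models: existing methods require task-specific fine-tuning and produce low-quality text, while large language models (LLMs), though strong at text generation, struggle to produce label-flipping counterfactuals without fine-tuning. It proposes two simple classifier-guided approaches that use classifier information to steer LLM counterfactual generation, require no task-specific fine-tuning, outperform state-of-the-art methods, and work across different LLMs. The study also shows that augmenting training data with the generated counterfactuals improves classifier robustness, and it identifies a critical issue: LLMs rely on parametric knowledge rather than faithfully following the classifier.

Link: https://arxiv.org/abs/2503.04463
Authors: Van Bach Nguyen,Christin Seifert,Jörg Schlötterer
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: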

Abstract:The need for interpretability in deep learning has driven interest in counterfactual explanations, which identify minimal changes to an instance that change a model’s prediction. Current counterfactual (CF) generation methods require task-specific fine-tuning and produce low-quality text. Large Language Models (LLMs), though effective for high-quality text generation, struggle with label-flipping counterfactuals (i.e., counterfactuals that change the prediction) without fine-tuning. We introduce two simple classifier-guided approaches to support counterfactual generation by LLMs, eliminating the need for fine-tuning while preserving the strengths of LLMs. Despite their simplicity, our methods outperform state-of-the-art counterfactual generation methods and are effective across different LLMs, highlighting the benefits of guiding counterfactual generation by LLMs with classifier information. We further show that data augmentation by our generated CFs can improve a classifier’s robustness. Our analysis reveals a critical issue in counterfactual generation by LLMs: LLMs rely on parametric knowledge rather than faithfully following the classifier.

[NLP-29] Quantifying patterns of punctuation in modern Chinese prose

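Quick read: This paper investigates universal features of punctuation distributions in Chinese and Western texts, their relation to Zipf's law, and whether distances between punctuation marks follow a specific statistical model. Analyzing Chinese and Western literary works, it finds that inter-punctuation distances follow a discrete Weibull distribution and that Zipf's law holds for both kinds of text when punctuation is taken into account. For the Chinese texts, sentence-ending punctuation, which determines sentence length, deviates more from this pattern, supporting the presence of complex multifractal sentence structures, most clearly in Gao Xingjian's “Soul Mountain”. The findings indicate that Chinese and Western texts share universal patterns in punctuation and word-frequency distributions.

Link: https://arxiv.org/abs/2503.04449
Authors: Michał Dolina,Jakub Dec,Stanisław Drożdż,Jarosław Kwapień,Jin Liu,Tomasz Stanisz
Affiliations: Faculty of Computer Science and Telecommunications, Cracow University of Technology; Complex Systems Theory Department, Institute of Nuclear Physics, Polish Academy of Sciences; School of Modern Languages, Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: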

Abstract:Recent research shows that punctuation patterns in texts exhibit universal features across languages. Analysis of Western classical literature reveals that the distribution of spaces between punctuation marks aligns with a discrete Weibull distribution, typically used in survival analysis. By extending this analysis to Chinese literature represented here by three notable contemporary works, it is shown that Zipf’s law applies to Chinese texts similarly to Western texts, where punctuation patterns also improve adherence to the law. Additionally, the distance distribution between punctuation marks in Chinese texts follows the Weibull model, though larger spacing is less frequent than in English translations. Sentence-ending punctuation, representing sentence length, diverges more from this pattern, reflecting greater flexibility in sentence length. This variability supports the formation of complex, multifractal sentence structures, particularly evident in Gao Xingjian’s “Soul Mountain”. These findings demonstrate that both Chinese and Western texts share universal punctuation and word distribution patterns, underscoring their broad applicability across languages.
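The discrete Weibull model mentioned in the abstract can be illustrated numerically. A minimal sketch, assuming the common parameterization P(k) = q^((k-1)^β) − q^(k^β) for distances k = 1, 2, ..., with 0 < q < 1 and β > 0; the function name and the parameter values are illustrative and are not fitted values from the paper.

```python
def discrete_weibull_pmf(k: int, q: float, beta: float) -> float:
    """Probability that the distance between consecutive punctuation marks equals k.

    Assumed form: P(k) = q**((k-1)**beta) - q**(k**beta), for k >= 1.
    """
    if k < 1:
        return 0.0
    return q ** ((k - 1) ** beta) - q ** (k ** beta)


# Illustrative parameters only.
q, beta = 0.8, 1.2

# The probabilities telescope, so they sum to (almost exactly) 1 over a long range.
total = sum(discrete_weibull_pmf(k, q, beta) for k in range(1, 201))

# With beta = 1 the model reduces to a geometric distribution: q**(k-1) * (1 - q).
geometric_gap = max(
    abs(discrete_weibull_pmf(k, q, 1.0) - q ** (k - 1) * (1 - q)) for k in range(1, 50)
)

print(total, geometric_gap)
```

Values of β above 1 make long gaps rarer than a geometric model would predict, and values below 1 make them more common, which is why β is a convenient single-number summary of a text's punctuation rhythm.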

[NLP-30] A Dataset for Analysing News Framing in Chinese Media

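Quick read: This paper addresses news framing detection for Chinese. Existing automatic news framing detection datasets cover various languages, but none focuses on Chinese, with its complex character meanings and unique linguistic features. The paper introduces the first Chinese News Framing dataset, usable on its own or as a supplement to the SemEval-2023 Task 3 dataset. Baselines obtained by fine-tuning XLM-RoBERTa-Base and by prompting GPT-4o in the zero-shot setting demonstrate the dataset's value: the F1-micro score for Chinese is 0.719 using only the Chinese dataset and 0.753 when the SemEval dataset is augmented with Chinese samples.

Link: https://arxiv.org/abs/2503.04439
Authors: Owen Cook,Yida Mu,Xinye Yang,Xingyi Song,Kalina Bontcheva
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: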

Abstract:Framing is an essential device in news reporting, allowing the writer to influence public perceptions of current affairs. While there are existing automatic news framing detection datasets in various languages, none of them focus on news framing in the Chinese language which has complex character meanings and unique linguistic features. This study introduces the first Chinese News Framing dataset, to be used as either a stand-alone dataset or a supplementary resource to the SemEval-2023 task 3 dataset. We detail its creation and we run baseline experiments to highlight the need for such a dataset and create benchmarks for future research, providing results obtained through fine-tuning XLM-RoBERTa-Base and using GPT-4o in the zero-shot setting. We find that GPT-4o performs significantly worse than fine-tuned XLM-RoBERTa across all languages. For the Chinese language, we obtain an F1-micro (the performance metric for SemEval task 3, subtask 2) score of 0.719 using only samples from our Chinese News Framing dataset and a score of 0.753 when we augment the SemEval dataset with Chinese news framing samples. With positive news frame detection results, this dataset is a valuable resource for detecting news frames in the Chinese language and is a valuable supplement to the SemEval-2023 task 3 dataset.
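The F1-micro metric quoted in the abstract pools true positives, false positives, and false negatives over all documents and labels before computing F1. A minimal sketch for the multi-label case follows; the function name and the frame labels in the example are illustrative placeholders, not the dataset's label set.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label predictions.

    Counts are pooled across all documents before computing F1,
    so frequent labels weigh more than rare ones.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed labels
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical frame labels for two articles.
gold = [{"Economic", "Morality"}, {"Political"}]
pred = [{"Economic"}, {"Political", "Morality"}]

# tp = 2, fp = 1, fn = 1, so F1 = 4 / 6
score = micro_f1(gold, pred)
print(round(score, 3))
```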

[NLP-31] Revisiting the Othello World Model Hypothesis ICLR

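Quick read: This paper tests and extends the claim that generative language models can induce an “Othello world model” from sequences of Othello board states, that is, predict the next move from previous moves while representing the board layout. It sets up an Othello task and trains and evaluates seven language models (GPT-2, T5, Bart, Flan-T5, Mistral, LLaMA-2, and Qwen2.5) on board-feature learning and next-move prediction. All models reach up to 99% accuracy in unsupervised grounding and learn highly similar board features, providing stronger evidence for the Othello World Model Hypothesis.

Link: https://arxiv.org/abs/2503.04421
Authors: Yifei Yuan,Anders Søgaard
Affiliations: University of Copenhagen, Denmark
Categories: Computation and Language (cs.CL)
Comments: ICLR World Models Workshop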

Abstract:Li et al. (2023) used the Othello board game as a test case for the ability of GPT-2 to induce world models, and were followed up by Nanda et al. (2023b). We briefly discuss the original experiments, expanding them to include more language models with more comprehensive probing. Specifically, we analyze sequences of Othello board states and train the model to predict the next move based on previous moves. We evaluate seven language models (GPT-2, T5, Bart, Flan-T5, Mistral, LLaMA-2, and Qwen2.5) on the Othello task and conclude that these models not only learn to play Othello, but also induce the Othello board layout. We find that all models achieve up to 99% accuracy in unsupervised grounding and exhibit high similarity in the board features they learned. This provides considerably stronger evidence for the Othello World Model Hypothesis than previous works.

[NLP-32] Can Large Language Models Predict Antimicrobial Resistance Gene?

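Quick read: This paper asks how generative large language models can be used more flexibly for DNA sequence analysis and classification, to overcome the limitations of traditional Transformer encoder-based models. Encoder-based models such as DNABERT and Nucleotide Transformer perform well on DNA sequence classification, but Transformer decoder-based generative models remain underexplored in this field. The paper evaluates how well generative LLMs handle DNA sequences with various labels and how performance changes when additional textual information is provided. The experiments show that generative LLMs offer comparable or potentially better predictions when sequence and textual information are combined.

Link: https://arxiv.org/abs/2503.04413
Authors: Hyunwoo Yoo
Affiliations: Drexel University
Categories: Computation and Language (cs.CL)
Comments: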

Abstract:This study demonstrates that generative large language models can be utilized in a more flexible manner for DNA sequence analysis and classification tasks compared to traditional transformer encoder-based models. While recent encoder-based models such as DNABERT and Nucleotide Transformer have shown significant performance in DNA sequence classification, transformer decoder-based generative models have not yet been extensively explored in this field. This study evaluates how effectively generative Large Language Models handle DNA sequences with various labels and analyzes performance changes when additional textual information is provided. Experiments were conducted on antimicrobial resistance genes, and the results show that generative Large Language Models can offer comparable or potentially better predictions, demonstrating flexibility and accuracy when incorporating both sequence and textual information. The code and data used in this work are available at the following GitHub repository: this https URL.

[NLP-33] Comparative Study of Zero-Shot Cross-Lingual Transfer for Bodo POS and NER Tagging Using Gemini 2.0 Flash Thinking Experimental Model

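[Quick read]: This paper addresses the scarcity of tools and data for part-of-speech (POS) tagging and named entity recognition (NER) in low-resource languages such as Bodo. It compares two zero-shot cross-lingual transfer methods built on Google's Gemini 2.0 Flash Thinking Experiment model: (1) translating English sentences directly into Bodo and then transferring the tags, and (2) prompt-based tag transfer on parallel English-Bodo sentence pairs. Both methods use the model's machine translation and cross-lingual understanding capabilities to project English POS and NER annotations onto Bodo text. Both show promise, but the prompt-based method performs better, particularly for NER. The paper evaluates the two transfer strategies, analyzes the effects of translation quality, grammatical divergence, and the inherent difficulty of zero-shot cross-lingual transfer, and points to directions for building high-accuracy Bodo POS and NER taggers.

Link: https://arxiv.org/abs/2503.04405
Authors: Sanjib Narzary,Bihung Brahma,Haradip Mahilary,Mahananda Brahma,Bidisha Som,Sukumar Nandi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Submitted to SpringerNature MTAP journal. This article has not been reviewed yet. Submitting for public review!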

Abstract:Named Entity Recognition (NER) and Part-of-Speech (POS) tagging are critical tasks for Natural Language Processing (NLP), yet their availability for low-resource languages (LRLs) like Bodo remains limited. This article presents a comparative empirical study investigating the effectiveness of Google’s Gemini 2.0 Flash Thinking Experiment model for zero-shot cross-lingual transfer of POS and NER tagging to Bodo. We explore two distinct methodologies: (1) direct translation of English sentences to Bodo followed by tag transfer, and (2) prompt-based tag transfer on parallel English-Bodo sentence pairs. Both methods leverage the machine translation and cross-lingual understanding capabilities of Gemini 2.0 Flash Thinking Experiment to project English POS and NER annotations onto Bodo text in CONLL-2003 format. Our findings reveal the capabilities and limitations of each approach, demonstrating that while both methods show promise for bootstrapping Bodo NLP, prompt-based transfer exhibits superior performance, particularly for NER. We provide a detailed analysis of the results, highlighting the impact of translation quality, grammatical divergences, and the inherent challenges of zero-shot cross-lingual transfer. The article concludes by discussing future research directions, emphasizing the need for hybrid approaches, few-shot fine-tuning, and the development of dedicated Bodo NLP resources to achieve high-accuracy POS and NER tagging for this low-resource language.

[NLP-34] TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models

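[Quick read]: This paper addresses how large language models (LLMs) can understand tabular data under parameter-efficient fine-tuning (PEFT), in particular how to serialize two-dimensional structured information into a one-dimensional sequence while still representing the table structure. It proposes TableLoRA, a module that serializes tables using special tokens together with a special token encoder and uses 2D LoRA to capture low-rank information about cell positions, strengthening the LLM's understanding of table structure. Experiments on multiple table-related datasets show that TableLoRA outperforms vanilla LoRA and several table encoding methods used as controls, especially in low-parameter settings.

Link: https://arxiv.org/abs/2503.04396
Authors: Xinyi He,Yihao Liu,Mengyu Zhou,Yeye He,Haoyu Dong,Shi Han,Zejian Yuan,Dongmei Zhang
Affiliations: Xi’an Jiaotong University; Peking University; Microsoft Research
Categories: Computation and Language (cs.CL)
Comments: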

Abstract:Tabular data are crucial in many fields and their understanding by large language models (LLMs) under high parameter efficiency paradigm is important. However, directly applying parameter-efficient fine-tuning (PEFT) techniques to tabular tasks presents significant challenges, particularly in terms of better table serialization and the representation of two-dimensional structured information within a one-dimensional sequence. To address this, we propose TableLoRA, a module designed to improve LLMs’ understanding of table structure during PEFT. It incorporates special tokens for serializing tables with special token encoder and uses 2D LoRA to encode low-rank information on cell positions. Experiments on four tabular-related datasets demonstrate that TableLoRA consistently outperforms vanilla LoRA and surpasses various table encoding methods tested in control experiments. These findings reveal that TableLoRA, as a table-specific LoRA, enhances the ability of LLMs to process tabular data effectively, especially in low-parameter settings, demonstrating its potential as a robust solution for handling table-related tasks.
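
The serialization idea in the abstract can be illustrated with a minimal sketch. It flattens a table into a token list while recording a (row, column) index for every token, which is the kind of two-dimensional position information a 2D LoRA could consume. The token names `[TAB]`, `[ROW]`, `[CELL]` and the function `serialize_table` are hypothetical; the abstract does not specify the actual tokens or encoder.

```python
from typing import List, Tuple

# Hypothetical special tokens; the paper's real token names are not given in the abstract.
TAB, ROW, CELL = "[TAB]", "[ROW]", "[CELL]"


def serialize_table(rows: List[List[str]]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Flatten a table into tokens, keeping a (row, col) index for every token."""
    tokens = [TAB]
    positions = [(0, 0)]  # (0, 0) is used here for the table-level token
    for r, row in enumerate(rows, start=1):
        tokens.append(ROW)
        positions.append((r, 0))  # column 0 marks a row-level token
        for c, cell in enumerate(row, start=1):
            tokens.append(CELL)
            positions.append((r, c))
            for word in cell.split():  # whitespace split stands in for a real tokenizer
                tokens.append(word)
                positions.append((r, c))
    return tokens, positions


tokens, positions = serialize_table([["city", "population"], ["Oslo", "700 000"]])
for tok, pos in zip(tokens, positions):
    print(pos, tok)
```

Every token carries the coordinates of its cell, so the one-dimensional sequence still exposes the table's two-dimensional layout.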

[NLP-35] Shaping Shared Languages: Human and Large Language Models Inductive Biases in Emergent Communication

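[Quick read]: This paper investigates how artificial languages evolve when optimised for different users (humans and large language models, LLMs) and how they perform when humans and LLMs collaborate. It focuses on how such languages form and support reliable communication under three interaction settings: Human-Human, LLM-LLM, and Human-LLM.

The approach uses a classical referential game and compares the languages that emerge under each condition. Languages optimised for either humans or LLMs enable reliable referential communication, but they differ subtly. Interaction between humans and LLMs reduces these differences and yields vocabularies that are more human-like than LLM-like. The results underline the importance of including human interaction in LLM training and suggest that communicative success is a promising reward signal.

Link: https://arxiv.org/abs/2503.04395
Authors: Tom Kouwenhoven,Max Peeperkorn,Roy de Kleijn,Tessa Verhoef
Affiliations: Leiden Institute of Advanced Computer Science, Leiden University; School of Computing, University of Kent; Cognitive Psychology Unit, Institute of Psychology, Leiden
Categories: Computation and Language (cs.CL)
Comments: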

Abstract:Languages are shaped by the inductive biases of their users. Using a classical referential game, we investigate how artificial languages evolve when optimised for inductive biases in humans and large language models (LLMs) via Human-Human, LLM-LLM and Human-LLM experiments. We show that referentially grounded vocabularies emerge that enable reliable communication in all conditions, even when humans and LLMs collaborate. Comparisons between conditions reveal that languages optimised for LLMs subtly differ from those optimised for humans. Interestingly, interactions between humans and LLMs alleviate these differences and result in vocabularies which are more human-like than LLM-like. These findings advance our understanding of how inductive biases in LLMs play a role in the dynamic nature of human language and contribute to maintaining alignment in human and machine communication. In particular, our work underscores the need to think of new methods that include human interaction in the training processes of LLMs, and shows that using communicative success as a reward signal can be a fruitful, novel direction.

[NLP-36] More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG

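[Quick read]: This paper studies how the number of documents affects language model performance in retrieval-augmented generation (RAG). It evaluates models as the number of retrieved documents grows while the context length and the position of the relevant information are held constant. The key is an experimental framework that controls these two factors so that the effect of document count can be isolated; the results show that handling multiple documents is a challenge separate from handling long contexts. The datasets and code are released for reproduction.

Link: https://arxiv.org/abs/2503.04388
Authors: Shahar Levy,Nir Mazor,Lihi Shalmon,Michael Hassid,Gabriel Stanovsky
Affiliations: School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
Categories: Computation and Language (cs.CL)
Comments: Preprint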

Abstract:Retrieval-augmented generation (RAG) provides LLMs with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: this https URL .
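
The control described above (same context length, same position for the relevant information, different document counts) can be sketched roughly as follows. This is only an illustration under assumptions: length is measured in whitespace-separated words, the relevant document is always placed first, and distractors are truncated or repeated to fill the budget. The function `build_context` is hypothetical and is not the authors' dataset construction.

```python
from itertools import cycle, islice
from typing import List


def build_context(relevant: str, distractors: List[str], n_docs: int, budget: int) -> List[str]:
    """Return n_docs documents whose total word count is exactly `budget`.

    The relevant document is kept whole and placed first, so its position is fixed;
    the remaining budget is split evenly across n_docs - 1 distractor documents.
    Assumes every distractor contains at least one word.
    """
    if n_docs < 2:
        raise ValueError("need the relevant document plus at least one distractor")
    rel_words = relevant.split()
    n_distract = n_docs - 1
    remaining = budget - len(rel_words)
    if remaining < n_distract:
        raise ValueError("budget too small for the requested number of documents")
    base, extra = divmod(remaining, n_distract)
    docs = [" ".join(rel_words)]
    for i in range(n_distract):
        source = distractors[i % len(distractors)].split()
        size = base + (1 if i < extra else 0)
        # Truncate long distractors, repeat short ones, to hit the exact size.
        docs.append(" ".join(islice(cycle(source), size)))
    return docs


relevant = "the treaty was signed in 1648"
distractors = ["alpha beta gamma delta", "one two three"]
for n in (2, 4, 8):
    docs = build_context(relevant, distractors, n, budget=40)
    print(n, sum(len(d.split()) for d in docs))
```

With the total length pinned, any change in downstream accuracy across `n` can be attributed to the number of documents rather than to context length.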

[NLP-37] TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge

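[Quick read]: This paper addresses limitations of existing LLM-as-a-judge methods in predicting numerical scores. Fine-tuning with cross-entropy (CE) loss ignores the numeric nature of score prediction, while recent regression-aware fine-tuning improves numerical prediction but does not use chain-of-thought (CoT) reasoning. The paper proposes TRACT (Two-stage Regression-Aware fine-tuning with CoT), which combines CoT reasoning with regression-aware training. In the first stage a seed model is fine-tuned to generate CoTs, which serve as supervision for the second-stage fine-tuning; the training objective combines a CE loss for learning CoT reasoning with a regression-aware loss for score prediction. Experiments across multiple datasets and models show that TRACT significantly outperforms existing methods, and ablations confirm the importance of each component.

Link: https://arxiv.org/abs/2503.04381
Authors: Cheng-Han Chiang,Hung-yi Lee,Michal Lukasik
Affiliations: National Taiwan University; Google Research
Categories: Computation and Language (cs.CL)
Comments: Codes and models are available at this https URL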

Abstract:The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where a numerical assessment is assigned by an LLM to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, seed LLM is fine-tuned to generate CoTs, which serve as supervision for the second stage fine-tuning. The training objective of TRACT combines the CE loss for learning the CoT reasoning capabilities, and the regression-aware loss for the score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
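
A rough sketch of a combined objective of this kind is given below. The abstract only states that a CE loss on the CoT is combined with a regression-aware loss on the score; the specific form used here (squared error between the target and the expected score under the model's score-token probabilities, added with a weight) is an assumption for illustration, and all function names are hypothetical.

```python
import math
from typing import Dict, List


def cross_entropy(token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the reference CoT tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def expected_score(score_probs: Dict[int, float]) -> float:
    """Expected score under the renormalised probabilities of the score tokens."""
    total = sum(score_probs.values())
    return sum(score * p for score, p in score_probs.items()) / total


def combined_loss(cot_token_probs: List[float], score_probs: Dict[int, float],
                  target: float, weight: float = 1.0) -> float:
    """CE on the CoT plus a weighted squared error on the expected score (assumed form)."""
    regression = (expected_score(score_probs) - target) ** 2
    return cross_entropy(cot_token_probs) + weight * regression


# Probabilities the model assigns to each reference CoT token, and to scores 1..5.
cot_probs = [0.9, 0.8, 0.95]
score_probs = {1: 0.05, 2: 0.05, 3: 0.2, 4: 0.5, 5: 0.2}
print(combined_loss(cot_probs, score_probs, target=4.0))
```

Unlike plain CE on the score token, the regression term penalises a prediction of 1 more than a prediction of 3 when the target is 4, which is the numeric structure the abstract says CE ignores.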

[NLP-38] Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks

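[Quick read]: This paper addresses how to use inference-time scaling to improve performance on open-ended tasks. Existing techniques rely on answers being verifiable, which confines them to domains such as math, coding, and logical reasoning. Inspired by how humans make a first attempt, obtain feedback, and improve, the paper trains dedicated Feedback and Edit models; at inference time several models cooperate to generate an initial response, give feedback on it, and edit it. When scaled optimally, a system built on 70B-parameter models from the Llama 3 family reaches 92.7 on Arena Hard, state of the art as of 5 March 2025.

Link: https://arxiv.org/abs/2503.04378
Authors: Zhilin Wang,Jiaqi Zeng,Olivier Delalleau,Daniel Egert,Ellie Evans,Hoo-Chang Shin,Felipe Soares,Yi Dong,Oleksii Kuchaiev
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 2 figures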

Abstract:Inference-Time Scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for inference-time scaling require tasks to have answers that can be verified, limiting their application to domains such as math, coding and logical reasoning. We take inspiration from how humans make first attempts, ask for detailed feedback from others and make improvements based on such feedback across a wide spectrum of open-ended endeavors. To this end, we collect data for and train dedicated Feedback and Edit Models that are capable of performing inference-time scaling for open-ended general-domain tasks. In our setup, one model generates an initial response, a second model gives feedback on it, and a third model uses that feedback to edit the response. We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback and edited responses. When scaled optimally, our setup based on 70B models from the Llama 3 family can reach SoTA performance on Arena Hard at 92.7 as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 with 90.4 and DeepSeek R1 with 92.3.
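
The three-model setup can be sketched as a small loop. The four callables stand in for models and are hypothetical; in particular, the final `select` step is an assumption, since the abstract does not say how one response is chosen among the edited candidates.

```python
from typing import Callable, List


def feedback_edit(
    prompt: str,
    generate: Callable[[str], str],
    give_feedback: Callable[[str, str], str],
    edit: Callable[[str, str, str], str],
    select: Callable[[str, List[str]], str],
    n_drafts: int = 2,
    n_feedback: int = 2,
) -> str:
    """Scale drafts and feedback, edit each draft per feedback, then pick one response."""
    candidates: List[str] = []
    for _ in range(n_drafts):
        draft = generate(prompt)                       # model 1: initial response
        for _ in range(n_feedback):
            feedback = give_feedback(prompt, draft)    # model 2: feedback on the draft
            candidates.append(edit(prompt, draft, feedback))  # model 3: edited response
    return select(prompt, candidates)


# Toy stand-ins so the sketch runs without any model.
answer = feedback_edit(
    "Explain RoPE briefly.",
    generate=lambda p: "draft answer",
    give_feedback=lambda p, d: "be more concise",
    edit=lambda p, d, f: f"{d} (revised: {f})",
    select=lambda p, cands: cands[0],
)
print(answer)
```

The two knobs `n_drafts` and `n_feedback` correspond to the scaling axes named in the abstract: the number of initial drafts and the amount of feedback and edited responses.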

[NLP-39] Assumed Identities: Quantifying Gender Bias in Machine Translation of Ambiguous Occupational Terms

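[Quick read]: This paper addresses the way machine translation systems can reflect and reinforce social stereotypes in gender-ambiguous cases. When a translation lacks explicit guidance or contextual cues, a model may systematically associate certain occupations with a particular gender, producing gender bias. Instance-level evaluation with a single gold answer cannot handle such cases, because no single correct translation exists. The paper instead evaluates gender bias over aggregated model responses: it proposes a method for detecting gender imbalances between source texts and translations, builds a benchmark dataset of ambiguous English inputs, and introduces probability-based metrics that quantify how far a model diverges from normative standards.

Link: https://arxiv.org/abs/2503.04372
Authors: Orfeas Menis Mastromichalakis,Giorgos Filandrianos,Maria Symeonaki,Giorgos Stamou
Affiliations: School of Electrical and Computer Engineering, National Technical University of Athens; Department of Social Policy, Panteion University of Social and Political Sciences
Categories: Computation and Language (cs.CL)
Comments: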

Abstract:Machine Translation (MT) systems frequently encounter ambiguous scenarios where they must assign gender to certain occupations when translating without explicit guidance or contextual cues. While individual translations in such cases may not be inherently biased, systematic patterns, such as the repeated association of certain professions with specific genders, can emerge, reflecting and perpetuating societal stereotypes. This ambiguity challenges traditional instance-level single-answer evaluation approaches, as no single gold standard translation exists. To address this, we propose an approach that evaluates gender bias through aggregated model responses. Specifically, we introduce a methodology to detect gender imbalances between source texts and translations, a benchmarking dataset with ambiguous English inputs, and probability-based metrics to quantify a model’s divergence from normative standards or reference distributions.
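
One way to picture an aggregate, probability-based measure is shown below. The gender labels of many translations of the same ambiguous source are turned into a distribution and compared with a reference distribution. Using KL divergence and a uniform reference is an assumption made here for illustration; the abstract does not name the metrics the paper actually defines, and the labels `m` and `f` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def gender_distribution(labels: List[str]) -> Dict[str, float]:
    """Empirical distribution of gender labels over a set of translations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {gender: n / total for gender, n in counts.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in bits; q must be non-zero wherever p is non-zero."""
    return sum(pv * math.log2(pv / q[g]) for g, pv in p.items() if pv > 0)


# Gender assigned to "the doctor" across eight translations of ambiguous inputs.
observed = gender_distribution(["m"] * 6 + ["f"] * 2)
reference = {"m": 0.5, "f": 0.5}  # assumed normative reference
print(kl_divergence(observed, reference))
```

No single translation is judged wrong here; only the aggregate skew away from the reference is measured, which matches the abstract's point that bias shows up as a systematic pattern rather than in individual outputs.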

[NLP-40] Lost in Literalism: How Supervised Training Shapes Translationese in LLMs

[Quick read]: This paper addresses translationese, meaning overly literal and unnatural translations, in machine translation systems built on large language models (LLMs) with supervised fine-tuning (SFT). Although LLMs are pre-trained on large corpora of natural text, their translations are still affected by biases introduced during SFT, which lowers quality. The paper mitigates these biases by polishing golden references and filtering unnatural training instances. Empirical evaluation, using both human judgments and automatic metrics, shows that these methods significantly reduce translationese and improve naturalness. The work highlights the value of training-aware adjustments for more fluent, target-language-consistent translations.

Link: https://arxiv.org/abs/2503.04369
Authors: Yafu Li,Ronghao Zhang,Zhilin Wang,Huajian Zhang,Leyang Cui,Yongjing Yin,Tong Xiao,Yue Zhang
Affiliations: Shanghai AI Laboratory; Westlake University; Zhejiang University; Northeastern University
Categories: Computation and Language (cs.CL)
Comments: 19 pages

Abstract:Large language models (LLMs) have achieved remarkable success in machine translation, demonstrating impressive performance across diverse languages. However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge in LLM-based translation systems. Despite their pre-training on vast corpora of natural utterances, LLMs exhibit translationese errors and generate unexpected unnatural translations, stemming from biases introduced during supervised fine-tuning (SFT). In this work, we systematically evaluate the prevalence of translationese in LLM-generated translations and investigate its roots during supervised training. We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances. Empirical evaluations demonstrate that these approaches significantly reduce translationese while improving translation naturalness, validated by human evaluations and automatic metrics. Our findings highlight the need for training-aware adjustments to optimize LLM translation outputs, paving the way for more fluent and target-language-consistent translations. We release the data and code at this https URL.

[NLP-41] Exploring the Multilingual NLG Evaluation Abilities of LLM-Based Evaluators

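[Quick read]: This paper addresses the fact that existing work has not fully explored how the evaluation ability of large language models (LLMs) varies across languages in multilingual natural language generation (NLG) evaluation. It assesses 10 recent LLMs on high-resource and low-resource languages through correlation analysis, perturbation attacks, and fine-tuning on language-specific data. It finds that removing the reference answer from the prompt and using large-parameter LLM-based evaluators gives better performance across languages, and it stresses that evaluation in low-resource languages needs more attention.

Link: https://arxiv.org/abs/2503.04360
Authors: Jiayi Chang,Mingqi Gao,Xinyu Hu,Xiaojun Wan
Affiliations: Wangxuan Institute of Computer Technology, Peking University; School of Data Science and Intelligent Media, Communication University of China
Categories: Computation and Language (cs.CL)
Comments: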

Abstract:Previous research has shown that LLMs have potential in multilingual NLG evaluation tasks. However, existing research has not fully explored the differences in the evaluation capabilities of LLMs across different languages. To this end, this study provides a comprehensive analysis of the multilingual evaluation performance of 10 recent LLMs, spanning high-resource and low-resource languages through correlation analysis, perturbation attacks, and fine-tuning. We found that 1) excluding the reference answer from the prompt and using large-parameter LLM-based evaluators leads to better performance across various languages; 2) most LLM-based evaluators show a higher correlation with human judgments in high-resource languages than in low-resource languages; 3) in the languages where they are most sensitive to such attacks, they also tend to exhibit the highest correlation with human judgments; and 4) fine-tuning with data from a particular language yields a broadly consistent enhancement in the model’s evaluation performance across diverse languages. Our findings highlight the imbalance in LLMs’ evaluation capabilities across different languages and suggest that low-resource language scenarios deserve more attention.

[NLP-42] Layer-Specific Scaling of Positional Encodings for Superior Long-Context Modeling

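[Quick read]: This paper addresses the "lost-in-the-middle" problem in large language models (LLMs) handling long-context inputs, where key information in the middle of the context tends to be underrepresented or lost. Experiments suggest that the problem may stem from the rapid long-term decay of Rotary Position Embedding (RoPE). The paper proposes a layer-specific positional encoding scaling method that assigns a distinct scaling factor to each layer, slowing the decay caused by RoPE so that the model attends more to the middle of the context. A genetic algorithm combined with Bezier curves selects the scaling factor for each layer efficiently, and layer-specific interpolation improves the model's extrapolation ability. The method markedly alleviates the lost-in-the-middle problem and improves average accuracy on the Key-Value Retrieval dataset by up to 20%.

Link: https://arxiv.org/abs/2503.04355
Authors: Zhenghua Wang,Yiran Ding,Changze Lv,Zhibo Xu,Tianlong Li,Tianyuan Shi,Xiaoqing Zheng,Xuanjing Huang
Affiliations: Fudan University; Shanghai Key Laboratory of Intelligent Information Processing; Hangzhou Dianzi University
Categories: Computation and Language (cs.CL)
Comments: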

Abstract:Although large language models (LLMs) have achieved significant progress in handling long-context inputs, they still suffer from the "lost-in-the-middle" problem, where crucial information in the middle of the context is often underrepresented or lost. Our extensive experiments reveal that this issue may arise from the rapid long-term decay in Rotary Position Embedding (RoPE). To address this problem, we propose a layer-specific positional encoding scaling method that assigns distinct scaling factors to each layer, slowing down the decay rate caused by RoPE to make the model pay more attention to the middle context. A specially designed genetic algorithm is employed to efficiently select the optimal scaling factors for each layer by incorporating Bezier curves to reduce the search space. Through comprehensive experimentation, we demonstrate that our method significantly alleviates the "lost-in-the-middle" problem. Our approach results in an average accuracy improvement of up to 20% on the Key-Value Retrieval dataset. Furthermore, we show that layer-specific interpolation, as opposed to uniform interpolation across all layers, enhances the model’s extrapolation capabilities when combined with PI and Dynamic-NTK positional encoding schemes.
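
The role of the Bezier curve can be illustrated with a short sketch: instead of searching one scaling factor per layer, a search procedure only needs to choose a few control points, and the per-layer factors are read off the curve. The sketch assumes that the factor scales positions by division, as in position interpolation, and that the curve is sampled at evenly spaced layers; both are assumptions, and the genetic algorithm itself is not shown. Function names are hypothetical.

```python
import math
from typing import List


def bezier_scaling_factors(control: List[float], n_layers: int) -> List[float]:
    """Sample a Bezier curve (Bernstein form) at n_layers evenly spaced points."""
    degree = len(control) - 1
    factors = []
    for layer in range(n_layers):
        t = layer / (n_layers - 1) if n_layers > 1 else 0.0
        value = sum(
            math.comb(degree, k) * (1 - t) ** (degree - k) * t ** k * c
            for k, c in enumerate(control)
        )
        factors.append(value)
    return factors


def rope_angle(position: int, pair_index: int, head_dim: int,
               scale: float, base: float = 10000.0) -> float:
    """RoPE rotation angle for one frequency pair, with the position divided by `scale`."""
    inv_freq = base ** (-2 * pair_index / head_dim)
    return (position / scale) * inv_freq


# Three control points describe the factors of all 32 layers.
factors = bezier_scaling_factors([1.0, 4.0, 1.0], n_layers=32)
angles = [rope_angle(position=1024, pair_index=0, head_dim=64, scale=s) for s in factors]
print(factors[0], max(factors), factors[-1])
```

A larger factor makes distant positions look closer to a given layer, which slows the decay of attention with distance in that layer; the search space shrinks from one value per layer to the handful of control points.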

[NLP-43] Adding Alignment Control to Language Models

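[Quick read]: This paper addresses the fact that the desired strength of post-training alignment in language models (LMs) varies with individual preferences. It proposes CLM, which adds an identity layer before the model's initial layers and performs preference learning on that layer only, mapping unaligned input token embeddings into the aligned space. The method performs comparably to full fine-tuning, and at inference time an interpolation coefficient controls the degree of alignment, giving clear interpolation and extrapolation behavior.

Link: https://arxiv.org/abs/2503.04346
Authors: Wenhong Zhu,Weinan Zhang,Rui Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: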

Abstract:Post-training alignment has increasingly become a crucial factor in enhancing the usability of language models (LMs). However, the strength of alignment varies depending on individual preferences. This paper proposes a method to incorporate alignment control into a single model, referred to as CLM. This approach adds one identity layer preceding the initial layers and performs preference learning only on this layer to map unaligned input token embeddings into the aligned space. Experimental results demonstrate that this efficient fine-tuning method performs comparable to full fine-tuning. During inference, the input embeddings are processed through the aligned and unaligned layers, which are then merged through the interpolation coefficient. By controlling this parameter, the alignment exhibits a clear interpolation and extrapolation phenomenon.
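
The inference-time control can be pictured as a linear blend of the two embedding paths. The formula below, `(1 - lam) * unaligned + lam * aligned`, is an assumed form of the merge; the abstract only says the outputs are merged through an interpolation coefficient. The function name `merge` is hypothetical.

```python
from typing import List


def merge(unaligned: List[float], aligned: List[float], lam: float) -> List[float]:
    """Blend unaligned and aligned embeddings; lam in [0, 1] interpolates, lam > 1 extrapolates."""
    return [(1 - lam) * u + lam * a for u, a in zip(unaligned, aligned)]


unaligned = [0.0, 2.0]
aligned = [1.0, 4.0]
for lam in (0.0, 0.5, 1.0, 2.0):
    print(lam, merge(unaligned, aligned, lam))
```

Setting `lam` to 0 recovers the unaligned embeddings, 1 gives the fully aligned ones, and values above 1 push further in the alignment direction, which is the extrapolation the abstract mentions.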

[NLP-44] In-depth Analysis of Graph-based RAG in a Unified Framework

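[Quick read]: This paper addresses how to compare existing graph-based Retrieval-Augmented Generation (RAG) methods systematically and comprehensively under the same experimental settings, and how effective they are on different question-answering (QA) tasks. It proposes a unified framework that covers all graph-based RAG methods and compares them in extensive experiments. Beyond assessing existing methods, it combines existing techniques into new variants for specific QA and abstract QA tasks that outperform the current state of the art.

Link: https://arxiv.org/abs/2503.04338
Authors: Yingli Zhou,Yaodong Su,Youran Sun,Shu Wang,Taotao Wang,Runyuan He,Yongwei Zhang,Sicong Liang,Xilin Liu,Yuchi Ma,Yixiang Fang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Databases (cs.DB)
Comments: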

Abstract:Graph-based Retrieval-Augmented Generation (RAG) has proven effective in integrating external knowledge into large language models (LLMs), improving their factual accuracy, adaptability, interpretability, and trustworthiness. A number of graph-based RAG methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework to incorporate all graph-based RAG methods from a high-level perspective. We then extensively compare representative graph-based RAG methods over a range of question-answering (QA) datasets, from specific questions to abstract questions, and examine the effectiveness of all methods, providing a thorough analysis of graph-based RAG approaches. As a byproduct of our experimental analysis, we are also able to identify new variants of the graph-based RAG methods over specific QA and abstract QA tasks respectively, by combining existing techniques, which outperform the state-of-the-art methods. Finally, based on these findings, we offer promising research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide new valuable insights for future research.

[NLP-45] Solving Word-Sense Disambiguation and Word-Sense Induction with Dictionary Examples

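[Quick read]: This paper addresses the lack of large task-specific datasets that less-resourced languages face when using modern Transformer-based large language models (LLMs), particularly for word-sense disambiguation (WSD) and word-sense induction (WSI). Linguistic resources such as dictionaries hold a great deal of information but are rarely used in this setting. The paper approaches both tasks indirectly through the word-in-context (WiC) task, which only requires deciding whether a given word has different senses in two sentences and so needs no pre-built sense inventory with enough examples per sense, something rarely available for less-resourced languages. It uses LLMs to generate WiC sentence pairs from dictionary examples and shows that models trained on them outperform existing models on WiC, WSD, and WSI.

Link: https://arxiv.org/abs/2503.04328
Authors: Tadej Škvorc,Marko Robnik-Šikonja
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 1 figure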

Abstract:Many less-resourced languages struggle with a lack of large, task-specific datasets that are required for solving relevant tasks with modern transformer-based large language models (LLMs). On the other hand, many linguistic resources, such as dictionaries, are rarely used in this context despite their large information contents. We show how LLMs can be used to extend existing language resources in less-resourced languages for two important tasks: word-sense disambiguation (WSD) and word-sense induction (WSI). We approach the two tasks through the related but much more accessible word-in-context (WiC) task where, given a pair of sentences and a target word, a classification model is tasked with predicting whether the sense of a given word differs between sentences. We demonstrate that a well-trained model for this task can distinguish between different word senses and can be adapted to solve the WSD and WSI tasks. The advantage of using the WiC task, instead of directly predicting senses, is that the WiC task does not need pre-constructed sense inventories with a sufficient number of examples for each sense, which are rarely available in less-resourced languages. We show that sentence pairs for the WiC task can be successfully generated from dictionary examples using LLMs. The resulting prediction models outperform existing models on WiC, WSD, and WSI tasks. We demonstrate our methodology on the Slovene language, where a monolingual dictionary is available, but word-sense resources are tiny.
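
The shape of WiC training data derived from a dictionary can be shown with a simplified sketch. It only pairs the example sentences already listed under each sense of a headword; the paper additionally uses LLMs to generate sentence pairs, which is not reproduced here. The sense identifiers, the English example entry, and the function `wic_pairs` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def wic_pairs(entry: Dict[str, List[str]]) -> List[Tuple[str, str, int]]:
    """Build WiC pairs from one dictionary entry.

    `entry` maps a sense id to its example sentences.
    The label is 1 when both sentences illustrate the same sense, 0 otherwise.
    """
    tagged = [(sense, sentence) for sense, examples in entry.items() for sentence in examples]
    return [
        (sent_a, sent_b, int(sense_a == sense_b))
        for (sense_a, sent_a), (sense_b, sent_b) in combinations(tagged, 2)
    ]


entry = {
    "bank.1": ["She sat on the river bank.", "The bank was muddy after the rain."],
    "bank.2": ["The bank raised its interest rates."],
}
for sent_a, sent_b, label in wic_pairs(entry):
    print(label, "|", sent_a, "|", sent_b)
```

The pairs need only sense boundaries within an entry, not a corpus annotated against a full sense inventory, which is the advantage of the WiC formulation the abstract points out.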

[NLP-46] Computational Law: Datasets, Benchmarks, and Ontologies

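[Quick read]: This paper reviews the legal datasets, benchmarks, and ontologies proposed in recent years for computational law. Machine learning and deep learning models need considerable domain-specific data to reach high performance in the legal domain, and semantic resources such as ontologies are valuable for building large-scale computational legal systems and keeping them interoperable. By surveying existing work, the paper aims to guide researchers and practitioners who develop and evaluate methods and systems for computational law.

Link: https://arxiv.org/abs/2503.04305
Authors: Dilek Küçük,Fazli Can
Affiliations: TÜBİTAK Marmara Research Center; Bilkent University
Categories: Computation and Language (cs.CL)
Comments: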

Abstract:Recent developments in computer science and artificial intelligence have also contributed to the legal domain, as revealed by the number and range of related publications and applications. Machine and deep learning models require considerable amount of domain-specific data for training and comparison purposes, in order to attain high-performance in the legal domain. Additionally, semantic resources such as ontologies are valuable for building large-scale computational legal systems, in addition to ensuring interoperability of such systems. Considering these aspects, we present an up-to-date review of the literature on datasets, benchmarks, and ontologies proposed for computational law. We believe that this comprehensive and recent review will help researchers and practitioners when developing and testing approaches and systems for computational law.

[NLP-47] Dual-Class Prompt Generation: Enhancing Indonesian Gender-Based Hate Speech Detection through Data Augmentation

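[Quick read]: This paper addresses the difficulty of detecting gender-based hate speech in Indonesian social media, where labeled datasets are limited and class imbalance has left fine-grained categories such as gender-targeted hate speech understudied. It compares three data augmentation techniques: backtranslation, single-class prompt generation (using only hate speech examples), and the proposed dual-class prompt generation (using both hate speech and non-hate speech examples). By drawing on examples from both classes, the dual-class method gives the best classification results (88.5% accuracy and 88.1% F1-score with Random Forest). Semantic similarity analysis and t-SNE visualizations indicate that its samples are more novel and diverse while retaining class characteristics, which eases detection when data are scarce.

Link: https://arxiv.org/abs/2503.04279
Authors: Muhammad Amien Ibrahim,Faisal,Tora Sangputra Yopie Winarto,Zefanya Delvin Sulistiya
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to the 8th World Conference on Computing and Communication Technologies (WCCCT 2025)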

Abstract:Detecting gender-based hate speech in Indonesian social media remains challenging due to limited labeled datasets. While binary hate speech classification has advanced, a more granular category like gender-targeted hate speech is understudied because of class imbalance issues. This paper addresses this gap by comparing three data augmentation techniques for Indonesian gender-based hate speech detection. We evaluate backtranslation, single-class prompt generation (using only hate speech examples), and our proposed dual-class prompt generation (using both hate speech and non-hate speech examples). Experiments show all augmentation methods improve classification performance, with our dual-class approach achieving the best results (88.5% accuracy, 88.1% F1-score using Random Forest). Semantic similarity analysis reveals dual-class prompt generation produces the most novel content, while T-SNE visualizations confirm these samples occupy distinct feature space regions while maintaining class characteristics. Our findings suggest that incorporating examples from both classes helps language models generate more diverse yet representative samples, effectively addressing limited data challenges in specialized hate speech detection.

[NLP-48] On Fact and Frequency: LLM Responses to Misinformation Expressed with Uncertainty

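[Quick read]: This paper asks whether large language models (LLMs) change their fact-checking classification when false claims are rewritten as statements expressing uncertainty, and which factors drive any change. It studies how three widely used LLMs (GPT-4o, LlaMA3, DeepSeek-v2) respond to propositions that were verified as false and then transformed into uncertain statements.

The analysis examines how classifications change across types of uncertainty expression. In 25% of cases the LLMs reclassify a proposition from false to not-false. The change cannot be explained by predictors to which humans are expected to be sensitive (modality, linguistic cues, or argumentation strategy), with the exception of doxastic transformations, which use cue phrases such as "It is believed ...". The study also elicits a judgment unrelated to truth value, namely the LLM's estimate of how often people make the uncertain statement, and finds a small but significant correlation between the judgment of fact and the estimate of frequency.

Link: https://arxiv.org/abs/2503.04271
Authors: Yana van de Sande,Gunes Açar,Thabo van Woudenberg,Martha Larson
Affiliations: Centre for Language Studies; iHub; Institute for Computing and Information Sciences; Radboud University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 4 pages, 1 figure, 3 tables, conference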

Abstract:We study LLM judgments of misinformation expressed with uncertainty. Our experiments study the response of three widely used LLMs (GPT-4o, LlaMA3, DeepSeek-v2) to misinformation propositions that have been verified false and then are transformed into uncertain statements according to an uncertainty typology. Our results show that after transformation, LLMs change their fact-checking classification from false to not-false in 25% of the cases. Analysis reveals that the change cannot be explained by predictors to which humans are expected to be sensitive, i.e., modality, linguistic cues, or argumentation strategy. The exception is doxastic transformations, which use linguistic cue phrases such as “It is believed …”. To gain further insight, we prompt the LLM to make another judgment about the transformed misinformation statements that is not related to truth value. Specifically, we study LLM estimates of the frequency with which people make the uncertain statement. We find a small but significant correlation between judgment of fact and estimation of frequency.

[NLP-49] DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models

[Quick read]: This paper addresses challenges faced by inference-time alignment of large language models (LLMs) with humans, such as limited scalability caused by policy-specific value functions and latency during inference. It proposes Diffusion-styled Preference Optimization (DiffPO), which performs alignment directly at the sentence level and so avoids the latency of token-level generation. DiffPO is designed as a plug-and-play module that can be integrated with various base models to improve their alignment. Experiments on AlpacaEval 2, MT-bench, and HH-RLHF show superior alignment performance with a favorable trade-off between alignment quality and inference-time latency, and model-agnostic scalability that significantly improves large models such as Llama-3-70B.

Link: https://arxiv.org/abs/2503.04240
Authors: Ruizhe Chen,Wenhao Chai,Zhifei Yang,Xiaotian Zhang,Joey Tianyi Zhou,Tony Quek,Soujanya Poria,Zuozhu Liu
Affiliations: Zhejiang University; SUTD (Singapore University of Technology and Design); University of Washington; Peking University; A*STAR Centre for Frontier AI Research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However, these approaches still face challenges, such as limited scalability due to policy-specific value functions and latency during the inference phase. In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic solution for aligning LLMs with humans. By directly performing alignment at sentence level, DiffPO avoids the time latency associated with token-level generation. Designed as a plug-and-play module, DiffPO can be seamlessly integrated with various base models to enhance their alignment. Extensive experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings, achieving a favorable trade-off between alignment quality and inference-time latency. Furthermore, DiffPO demonstrates model-agnostic scalability, significantly improving the performance of large models such as Llama-3-70B.

[NLP-50] TGEA: An error-annotated dataset and benchmark tasks for text generation from pretrained language models ACL2021

【速读】：该论文旨在解决预训练语言模型（Pretrained Language Models, PLMs）在文本生成任务中的错误分析与诊断问题。为深入了解PLMs的文本生成能力并进行诊断性评估，论文提出了TGEA（Error-Annotated Dataset for Text Generation from PLMs），这是一个包含多基准任务的错误注释数据集。解决方案的关键在于构建了一个全面标注的数据集，通过精心挑选的提示词引导GPT-2生成候选句子，并从中选择47K句进行人工错误标注，最终检测出12k个错误句。论文创建了一种错误分类法，涵盖24类基于语言学和常识知识的错误，并为每个错误提供了详尽的标注，包括错误片段、相关片段、最小修正、错误类型及其背后的原因。此外，TGEA被用作基准数据集，提出了包括错误检测、错误类型分类、相关片段检测及错误原因生成等一系列自动诊断任务，以促进PLMs生成文本的自动化错误检测与修正研究。

Link: https://arxiv.org/abs/2503.04232
Authors: Jie He, Bo Peng, Yi Liao, Qun Liu, Deyi Xiong
Affiliations: College of Intelligence and Computing, Tianjin University, China; Huawei Noah’s Ark Lab, Hong Kong, China
Categories: Computation and Language (cs.CL)
Comments: ACL 2021

Abstract: In order to deeply understand the capability of pretrained language models in text generation and conduct a diagnostic evaluation, we propose TGEA, an error-annotated dataset with multiple benchmark tasks for text generation from pretrained language models (PLMs). We use carefully selected prompt words to guide GPT-2 to generate candidate sentences, from which we select 47K for error annotation. Crowdsourced workers manually check each of these sentences and detect 12K erroneous sentences. We create an error taxonomy to cover 24 types of errors occurring in these erroneous sentences according to the nature of errors with respect to linguistics and knowledge (e.g., common sense). For each erroneous span in PLM-generated sentences, we also detect another span that is closely associated with it. Each error is hence manually labeled with comprehensive annotations, including the span of the error, the associated span, minimal correction to the error, the type of the error, and rationale behind the error. Apart from the fully annotated dataset, we also present a detailed description of the data collection procedure, statistics and analysis of the dataset. This is the first dataset with comprehensive annotations for PLM-generated texts, which facilitates the diagnostic evaluation of PLM-based text generation. Furthermore, we use TGEA as a benchmark dataset and propose a series of automatic diagnosis tasks, including error detection, error type classification, associated span detection, and error rationale generation, to further promote future study on the automatic error detection and correction on texts generated by pretrained language models.

[NLP-51] FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion

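[Quick read]: This paper aims to resolve the trade-off between performance and resource efficiency by integrating the strengths of heterogeneous source LLMs into more compact target LLMs. Its key contributions are a specialized data construction protocol tailored to different tasks and domains, and the two-stage FuseChat-3.0 training pipeline: first, supervised fine-tuning (SFT) aligns the target model's distribution with that of the source models; second, Direct Preference Optimization (DPO) applies preferences from multiple source models to further optimize the target model. The fusion approach substantially improves instruction following, general knowledge, mathematics, and coding. With Llama-3.1-8B-Instruct as the target model, it gains 37.1 and 30.1 points on AlpacaEval-2 and Arena-Hard, respectively.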

Link: https://arxiv.org/abs/2503.04222
Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Canbin Huang, Guosheng Liang, Xiaojun Quan
Affiliations: School of Computer Science and Engineering, Sun Yat-sen University
Categories: Computation and Language (cs.CL)
Comments: Technical report

Abstract: We introduce FuseChat-3.0, a suite of large language models (LLMs) developed by integrating the strengths of heterogeneous source LLMs into more compact target LLMs. Our source models include the powerful Gemma-2-27B-it, Mistral-Large-Instruct-2407, Qwen-2.5-72B-Instruct, and Llama-3.1-70B-Instruct. For target models, we focus on three widely-used smaller variants (Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-7B-Instruct) along with two ultra-compact options, Llama-3.2-3B-Instruct and Llama-3.2-1B-Instruct. To leverage the diverse capabilities of these source models, we develop a specialized data construction protocol tailored to various tasks and domains. The FuseChat-3.0 training pipeline consists of two key stages: (1) supervised fine-tuning (SFT) to align the target and source model distributions, and (2) Direct Preference Optimization (DPO) to apply preferences from multiple source LLMs to fine-tune the target model. The resulting FuseChat-3.0 models exhibit significant performance gains across tasks such as instruction following, general knowledge, mathematics, and coding. As illustrated in Figure 1, using Llama-3.1-8B-Instruct as the target model, our fusion approach achieves an average improvement of 6.8 points across 14 benchmarks. Moreover, it demonstrates remarkable gains of 37.1 points and 30.1 points on the instruction-following benchmarks AlpacaEval-2 and Arena-Hard, respectively. Our code, models, and datasets are available at this https URL.
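
The second stage of the pipeline relies on Direct Preference Optimization. As a rough illustration of what that objective computes, the sketch below evaluates the standard DPO loss for a single preference pair. It is not taken from the FuseChat-3.0 code: the function names, the `beta` value, and the log-probabilities in the usage note are illustrative assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trainable policy
    and under a frozen reference model. beta=0.1 is an assumed value.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)
```

When the policy equals the reference model, the loss is log 2; it falls below that once the policy raises the chosen response relative to the rejected one, for example `dpo_loss(-9.0, -13.0, -10.0, -12.0)`.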

[NLP-52] Knowledge-Decoupled Synergetic Learning: An MLLM based Collaborative Approach to Few-shot Multimodal Dialogue Intention Recognition

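[Quick read]: This paper tackles few-shot multimodal dialogue intention recognition, a key challenge in e-commerce. Prior methods mainly improve classification ability through post-training, but the authors find that training for this problem involves two interconnected tasks that exhibit a seesaw effect in multi-task learning, caused by knowledge interference from the superposition of weight matrix updates. To address this, they propose Knowledge-Decoupled Synergetic Learning (KDSL). The key idea is to use smaller models to transform knowledge into interpretable rules, combine this with post-training of larger models, and have the large and small multimodal LLMs collaborate at prediction time. On two real Taobao datasets, the method improves online weighted F1 over the state-of-the-art method by 6.37% and 6.28%, respectively.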

Link: https://arxiv.org/abs/2503.04201
Authors: Bin Chen, Yu Zhang, Hongfei Ye, Ziyi Huang, Hongyang Chen
Affiliations: University of Chinese Academy of Sciences; Zhejiang University; Zhejiang Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Few-shot multimodal dialogue intention recognition is a critical challenge in the e-commerce domain. Previous methods have primarily enhanced model classification capabilities through post-training techniques. However, our analysis reveals that training for few-shot multimodal dialogue intention recognition involves two interconnected tasks, leading to a seesaw effect in multi-task learning. This phenomenon is attributed to knowledge interference stemming from the superposition of weight matrix updates during the training process. To address these challenges, we propose Knowledge-Decoupled Synergetic Learning (KDSL), which mitigates these issues by utilizing smaller models to transform knowledge into interpretable rules, while applying the post-training of larger models. By facilitating collaboration between the large and small multimodal large language models for prediction, our approach demonstrates significant improvements. Notably, we achieve outstanding results on two real Taobao datasets, with enhancements of 6.37% and 6.28% in online weighted F1 scores compared to the state-of-the-art method, thereby validating the efficacy of our framework.

[NLP-53] Measuring temporal effects of agent knowledge by date-controlled tool use

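[Quick read]: This paper studies how temporal effects in tool-based knowledge acquisition degrade the performance of LLM agents. The authors build a tool-based out-of-sample testing framework that measures the knowledge variability of agents under distinct date-controlled tools (DCTs). They show that the temporal effects of the search engine translate into tool-dependent agent performance, but can be alleviated by the choice of base model and by explicit reasoning instructions such as chain-of-thought prompting. The study argues that agent evaluation should take a dynamic view that accounts for the temporal influence of tools and for updates to external resources.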

Link: https://arxiv.org/abs/2503.04188
Authors: R. Patrick Xian, Qiming Cui, Stefan Bauer, Reza Abbasi-Asl
Affiliations: UC San Francisco (UCSF); UC Berkeley; Technical University of Munich & Helmholtz AI
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: comments welcome

Abstract: Temporal progression is an integral part of knowledge accumulation and update. Web search is frequently adopted as grounding for agent knowledge, yet its inappropriate configuration affects the quality of agent responses. Here, we construct a tool-based out-of-sample testing framework to measure the knowledge variability of large language model (LLM) agents from distinct date-controlled tools (DCTs). We demonstrate the temporal effects of an LLM agent as a writing assistant, which can use web search to help complete scientific publication abstracts. We show that temporal effects of the search engine translate into tool-dependent agent performance but can be alleviated with base model choice and explicit reasoning instructions such as chain-of-thought prompting. Our results indicate that agent evaluation should take a dynamical view and account for the temporal influence of tools and the updates of external resources.
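
A date-controlled tool can be pictured as a search wrapper that hides anything published after a chosen cutoff. The sketch below shows that idea on an in-memory result list. The `SearchResult` type, the `date_controlled` function, and the sample entries are hypothetical and are not the paper's implementation; a real DCT would wrap a web search engine's date filter.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical search hit with a publication date."""
    title: str
    published: date


def date_controlled(results, cutoff):
    """Keep only results published on or before the cutoff date."""
    return [r for r in results if r.published <= cutoff]


# Toy corpus (made up for illustration).
corpus = [
    SearchResult("older preprint", date(2021, 5, 1)),
    SearchResult("recent preprint", date(2024, 11, 20)),
]
visible = date_controlled(corpus, cutoff=date(2022, 12, 31))
```

Running the same agent prompt against several cutoffs and comparing the outputs is one way to expose the tool-dependent variability the abstract describes.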

[NLP-54] Large-Scale AI in Telecom: Charting the Roadmap for Innovation, Scalability, and Enhanced Digital Experiences

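[Quick read]: This white paper addresses the performance bottlenecks and limits on innovation that modern telecom networks face as their challenges grow more complex. It focuses on developing and deploying Large Telecom Models (LTMs), generative AI models tailored to network management, resource allocation, and optimization. The key lies in the tailored architecture of LTMs and their potential to improve network scalability, performance, and user-centric innovation, which together form a comprehensive technical roadmap for the telecom industry.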

Link: https://arxiv.org/abs/2503.04184
Authors: Adnan Shahid, Adrian Kliks, Ahmed Al-Tahmeesschi, Ahmed Elbakary, Alexandros Nikou, Ali Maatouk, Ali Mokh, Amirreza Kazemi, Antonio De Domenico, Athanasios Karapantelakis, Bo Cheng, Bo Yang, Bohao Wang, Carlo Fischione, Chao Zhang, Chaouki Ben Issaid, Chau Yuen, Chenghui Peng, Chongwen Huang, Christina Chaccour, Christo Kurisummoottil Thomas, Dheeraj Sharma, Dimitris Kalogiros, Dusit Niyato, Eli De Poorter, Elissa Mhanna, Emilio Calvanese Strinati, Faouzi Bader, Fathi Abdeldayem, Fei Wang, Fenghao Zhu, Gianluca Fontanesi, Giovanni Geraci, Haibo Zhou, Hakimeh Purmehdi, Hamed Ahmadi, Hang Zou, Hongyang Du, Hoon Lee, Howard H. Yang, Iacopo Poli, Igor Carron, Ilias Chatzistefanidis, Inkyu Lee, Ioannis Pitsiorlas, Jaron Fontaine, Jiajun Wu, Jie Zeng, Jinan Li, Jinane Karam, Johny Gemayel, Juan Deng, Julien Frison, Kaibin Huang, Kehai Qiu, Keith Ball, Kezhi Wang, Kun Guo, Leandros Tassiulas, Lecorve Gwenole, Liexiang Yue, Lina Bariah, Louis Powell, Marcin Dryjanski, Maria Amparo Canaveras Galdon, Marios Kountouris, Maryam Hafeez, Maxime Elkael, Mehdi Bennis, Mehdi Boudjelli, Meiling Dai, Merouane Debbah, Michele Polese, Mohamad Assaad, Mohamed Benzaghta, Mohammad Al Refai, Moussab Djerrab, Mubeen Syed, Muhammad Amir, Na Yan, Najla Alkaabi, Nan Li, Nassim Sehad, Navid Nikaein, Omar Hashash, Pawel Sroka, Qianqian Yang, Qiyang Zhao, Rasoul Nikbakht Silab, Rex Ying, Roberto Morabito, Rongpeng Li, Ryad Madi, Salah Eddine El Ayoubi, Salvatore D’Oro, Samson Lasaulce, Serveh Shalmashi, Sige Liu, Sihem Cherrared, Swarna Bindu Chetty
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:This white paper discusses the role of large-scale AI in the telecommunications industry, with a specific focus on the potential of generative AI to revolutionize network functions and user experiences, especially in the context of 6G systems. It highlights the development and deployment of Large Telecom Models (LTMs), which are tailored AI models designed to address the complex challenges faced by modern telecom networks. The paper covers a wide range of topics, from the architecture and deployment strategies of LTMs to their applications in network management, resource allocation, and optimization. It also explores the regulatory, ethical, and standardization considerations for LTMs, offering insights into their future integration into telecom infrastructure. The goal is to provide a comprehensive roadmap for the adoption of LTMs to enhance scalability, performance, and user-centric innovation in telecom networks.

[NLP-55] TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records

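[Quick read]: This paper addresses the unique challenges LLMs face with longitudinal electronic health records (EHRs), in particular their largely unexplored ability to reason over temporal dependencies across multiple patient visits and time frames. The key solution is TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a framework that grounds instruction-response pairs in different parts of a patient's record and treats this grounding as a critical dimension of both instruction evaluation and instruction tuning. The authors also develop TIMER-Bench, the first time-aware benchmark for temporal reasoning over longitudinal EHRs, and TIMER-Instruct, an instruction-tuning methodology for temporal reasoning. Models fine-tuned with TIMER-Instruct improve by 7.3% on human-generated benchmarks and by 9.2% on TIMER-Bench, which indicates that temporal instruction tuning improves reasoning over EHRs.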

Link: https://arxiv.org/abs/2503.04176
Authors: Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, Nigam Shah
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

Abstract: Large language models (LLMs) have emerged as promising tools for assisting in medical tasks, yet processing Electronic Health Records (EHRs) presents unique challenges due to their longitudinal nature. While LLMs’ capabilities to perform medical tasks continue to improve, their ability to reason over temporal dependencies across multiple patient visits and time frames remains unexplored. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a framework that incorporates instruction-response pairs grounded in different parts of a patient’s record as a critical dimension in both instruction evaluation and tuning for longitudinal clinical records. We develop TIMER-Bench, the first time-aware benchmark that evaluates temporal reasoning capabilities over longitudinal EHRs, as well as TIMER-Instruct, an instruction-tuning methodology for LLMs to learn reasoning over time. We demonstrate that models fine-tuned with TIMER-Instruct improve performance by 7.3% on human-generated benchmarks and 9.2% on TIMER-Bench, indicating that temporal instruction-tuning improves model performance for reasoning over EHR.

[NLP-56] BPQA Dataset: Evaluating How Well Language Models Leverage Blood Pressures to Answer Biomedical Questions

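[Quick read]: This paper investigates two questions: first, whether language models (LMs) can effectively leverage clinical measurements such as blood pressure to answer related medical questions; second, how to improve LM performance on medical question-answering tasks that involve measurements. The authors build BPQA, a new dataset of 100 verified blood-pressure-related medical QA pairs, and evaluate four LMs (BERT, BioBERT, MedAlpaca, and GPT-3.5). The larger models (GPT-3.5 and MedAlpaca) benefit more from blood pressure information than the smaller ones (BERT and BioBERT). Augmenting measurements with labels improves the domain-specific LMs (BioBERT and MedAlpaca), which suggests that retrieval may help such models. The takeaway is that model scale and a measurement-labeling strategy both matter for how well LMs handle clinical measurement data.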

Link: https://arxiv.org/abs/2503.04155
Authors: Chi Hang, Ruiqi Deng, Lavender Yao Jiang, Zihao Yang, Anton Alyakin, Daniel Alber, Eric Karl Oermann
Affiliations: NYU Center for Data Science; NYU Grossman School of Medicine; NYU Langone Health; Washington University, Saint Louis
Categories: Computation and Language (cs.CL)
Comments: 9 pages

Abstract: Clinical measurements such as blood pressures and respiration rates are critical in diagnosing and monitoring patient outcomes. They are an important component of biomedical data, which can be used to train transformer-based language models (LMs) for improving healthcare delivery. It is, however, unclear whether LMs can effectively interpret and use clinical measurements. We investigate two questions: First, can LMs effectively leverage clinical measurements to answer related medical questions? Second, how to enhance an LM’s performance on medical question-answering (QA) tasks that involve measurements? We performed a case study on blood pressure readings (BPs), a vital sign routinely monitored by medical professionals. We evaluated the performance of four LMs: BERT, BioBERT, MedAlpaca, and GPT-3.5, on our newly developed dataset, BPQA (Blood Pressure Question Answering). BPQA contains 100 medical QA pairs that were verified by medical students and designed to rely on BPs. We found that GPT-3.5 and MedAlpaca (larger and medium sized LMs) benefit more from the inclusion of BPs than BERT and BioBERT (small sized LMs). Further, augmenting measurements with labels improves the performance of BioBERT and MedAlpaca (domain specific LMs), suggesting that retrieval may be useful for improving domain-specific LMs.

[NLP-57] Ticktack: Long Span Temporal Alignment of Large Language Models Leveraging Sexagenary Cycle Time Expression

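[Quick read]: This paper addresses temporal misalignment of LLMs over long time spans: when data spans thousands of years, temporal information is sparse, which leads to insufficient learning or catastrophic forgetting. The proposed solution has three parts. First, it replaces the Gregorian year expression used by LLMs with the sexagenary year expression, which gives a more uniform distribution at yearly granularity. Second, it uses polar coordinates to model the 60 terms of the sexagenary cycle and the order of years within each term, with additional temporal encoding so that LLMs understand them. Third, it proposes a temporal representational alignment approach for post-training LLMs that distinguishes time points along with their relevant knowledge, improving performance on time-related tasks, especially over long spans. Experiments support the effectiveness of the method.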

Link: https://arxiv.org/abs/2503.04150
Authors: Xue Han, Qian Hu, Yitong Wang, Wenchun Gao, Lianlian Zhang, Qing Wang, Lijun Mei, Chao Deng, Junlan Feng
Affiliations: China Mobile Research Institute; JIUTIAN Team
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Large language models (LLMs) suffer from temporal misalignment issues, especially across long spans of time. The issue arises because LLMs are trained on large amounts of data in which temporal information is rather sparse over long times, such as thousands of years, resulting in insufficient learning or catastrophic forgetting by the LLMs. This paper proposes a methodology named “Ticktack” for addressing the LLM’s long-time span misalignment in a yearly setting. Specifically, we first propose to utilize the sexagenary year expression instead of the Gregorian year expression employed by LLMs, achieving a more uniform distribution in yearly granularity. Then, we employ polar coordinates to model the sexagenary cycle of 60 terms and the year order within each term, with additional temporal encoding to ensure LLMs understand them. Finally, we present a temporal representational alignment approach for post-training LLMs that effectively distinguishes time points with relevant knowledge, hence improving performance on time-related tasks, particularly over a long period. We also create a long time span benchmark for evaluation. Experimental results prove the effectiveness of our proposal.
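
The sexagenary expression and its polar-coordinate view can be illustrated with a small sketch. It maps a CE year to its position in the 60-term cycle and then to a (radius, angle) pair, where the angle marks the term and the radius counts how many cycles have elapsed. This is one plausible reading of the abstract and not the paper's actual encoding: the function names and the choice of radius are assumptions, and the sketch ignores the fact that the traditional year begins at the lunar new year.

```python
import math

# Pinyin names of the 10 heavenly stems and 12 earthly branches.
STEMS = ["jia", "yi", "bing", "ding", "wu", "ji", "geng", "xin", "ren", "gui"]
BRANCHES = ["zi", "chou", "yin", "mao", "chen", "si",
            "wu", "wei", "shen", "you", "xu", "hai"]


def sexagenary_index(year):
    """Position 0..59 of a CE year in the 60-term cycle (1984 is jia-zi, index 0)."""
    if year < 1:
        raise ValueError("this sketch only handles CE years")
    return (year - 4) % 60


def sexagenary_name(year):
    """Stem-branch name of the year, e.g. 'jia-zi'."""
    idx = sexagenary_index(year)
    return f"{STEMS[idx % 10]}-{BRANCHES[idx % 12]}"


def polar_encoding(year):
    """Return (radius, angle): radius is the cycle count, angle the term position."""
    idx = sexagenary_index(year)
    radius = (year - 4) // 60          # assumed: one step outward per 60-year cycle
    angle = 2 * math.pi * idx / 60     # 60 terms spread evenly around the circle
    return radius, angle
```

Under this encoding, 1984 and 2044 share the same angle (both are jia-zi years) and differ by one in radius, so years with the same term name line up along a ray.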

[NLP-58] Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination

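[Quick read]: This paper addresses the validity and reliability of Code LLM benchmarks under the risk of data contamination. Current benchmarks rely on fixed, human-created datasets, and this static setup is vulnerable to contamination; existing mitigation methods are limited by high human effort and imbalanced problem complexity. The authors propose a new benchmarking suite whose key idea is to use multiple agents to dynamically generate semantically equivalent variations of a problem while keeping its core logic unchanged, so that the reasoning ability of Code LLMs can be evaluated consistently and reliably even under potential contamination.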

Link: https://arxiv.org/abs/2503.04149
Authors: Simin Chen, Pranav Pusarla, Baishakhi Ray
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: this https URL

Abstract: The rapid evolution of code large language models underscores the need for effective and transparent benchmarking of their reasoning capabilities. However, the current benchmarking approach heavily depends on publicly available, human-created datasets. The widespread use of these fixed benchmark datasets makes the benchmarking process static and thus particularly susceptible to data contamination, an unavoidable consequence of the extensive data collection processes used to train Code LLMs. Existing approaches that address data contamination often suffer from human effort limitations and imbalanced problem complexity. To tackle these challenges, we propose a novel benchmarking suite for evaluating Code LLMs under potential data contamination. Given a seed programming problem, the suite employs multiple agents to extract and modify the context without altering the core logic, generating semantically equivalent variations. We introduce a dynamic data generation method and conduct empirical studies on two seed datasets across 21 Code LLMs. Results show that the suite effectively benchmarks reasoning capabilities under contamination risks while generating diverse problem sets to ensure consistent and reliable evaluations.

[NLP-59] HEISIR: Hierarchical Expansion of Inverted Semantic Indexing for Training-free Retrieval of Conversational Data using LLMs NAACL2025

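[Quick read]: This paper addresses information retrieval that captures semantic intent in conversational data, where existing methods either struggle to capture intent or require extensive labeling and fine-tuning. The proposed solution, HEISIR (Hierarchical Expansion of Inverted Semantic Indexing for Retrieval), improves semantic understanding through an optimized data ingestion process, with no resource-intensive labeling or model adaptation. HEISIR works in two steps, (1) Hierarchical Triplets Formulation and (2) Adjunct Augmentation, which produce semantic indices made of Subject-Verb-Object-Adjunct (SVOA) quadruplets; this structured representation captures the underlying semantic information in dialogue content. HEISIR achieves high retrieval performance at low retrieval-time latency, outperforms fine-tuned models across various embedding types and language models, and also opens up intent and topic analysis for dialogue systems.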

Link: https://arxiv.org/abs/2503.04141
Authors: Sangyeop Kim, Hangyeul Lee, Yohan Lee
Affiliations: Coxwave; Seoul National University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted by NAACL 2025 (Findings)

Abstract:The growth of conversational AI services has increased demand for effective information retrieval from dialogue data. However, existing methods often face challenges in capturing semantic intent or require extensive labeling and fine-tuning. This paper introduces HEISIR (Hierarchical Expansion of Inverted Semantic Indexing for Retrieval), a novel framework that enhances semantic understanding in conversational data retrieval through optimized data ingestion, eliminating the need for resource-intensive labeling or model adaptation. HEISIR implements a two-step process: (1) Hierarchical Triplets Formulation and (2) Adjunct Augmentation, creating semantic indices consisting of Subject-Verb-Object-Adjunct (SVOA) quadruplets. This structured representation effectively captures the underlying semantic information from dialogue content. HEISIR achieves high retrieval performance while maintaining low latency during the actual retrieval process. Our experimental results demonstrate that HEISIR outperforms fine-tuned models across various embedding types and language models. Beyond improving retrieval capabilities, HEISIR also offers opportunities for intent and topic analysis in conversational data, providing a versatile solution for dialogue systems.
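
To give a feel for what an index of SVOA quadruplets might look like, here is a small sketch of an inverted index keyed by quadruplets. It is a stand-in under stated assumptions, not HEISIR itself: the quadruplets are written by hand (the paper derives them with LLMs), matching uses plain token overlap in place of embeddings, and the names `SVOA` and `SemanticIndex` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SVOA:
    """One Subject-Verb-Object-Adjunct quadruplet."""
    subject: str
    verb: str
    obj: str
    adjunct: str = ""

    def text(self):
        parts = (self.subject, self.verb, self.obj, self.adjunct)
        return " ".join(p for p in parts if p)


class SemanticIndex:
    """Inverted index: quadruplet -> ids of the messages it was extracted from."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, message_id, quads):
        for quad in quads:
            self._postings[quad].add(message_id)

    def search(self, query, top_k=3):
        """Rank messages by the best Jaccard token overlap of any of their quadruplets."""
        q_tokens = set(query.lower().split())
        scores = defaultdict(float)
        for quad, ids in self._postings.items():
            tokens = set(quad.text().lower().split())
            overlap = len(q_tokens & tokens) / len(q_tokens | tokens)
            if overlap > 0:
                for message_id in ids:
                    scores[message_id] = max(scores[message_id], overlap)
        return sorted(scores, key=lambda m: (-scores[m], m))[:top_k]


index = SemanticIndex()
index.add("m1", [SVOA("user", "wants", "refund", "for a late order")])
index.add("m2", [SVOA("user", "asks", "password reset", "on mobile app")])
hits = index.search("refund late order")
```

Because all the extraction work happens at ingestion time, the query path is only a lookup and a ranking step, which matches the low retrieval-time latency the abstract emphasizes.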

[NLP-60] Biological Sequence with Language Model Prompting: A Survey

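[Quick read]: This survey examines prompt-based methods applied to biological sequences (DNA, RNA, proteins) and drug discovery tasks, in particular the use of LLMs for domain-specific bioinformatics problems. Its focus is how prompt engineering lets LLMs handle challenging tasks such as promoter sequence prediction, protein structure modeling, and drug-target binding affinity prediction, often with limited labeled data. It also discusses the transformative potential of prompting in bioinformatics and key challenges such as data scarcity, multimodal fusion, and computational resource limits. The survey is intended both as a primer for newcomers and as a catalyst for further innovation.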

Link: https://arxiv.org/abs/2503.04135
Authors: Jiyue Jiang, Zikang Wang, Yuheng Shan, Heyan Chai, Jiayi Li, Zixian Ma, Xinrui Zhang, Yu Li
Affiliations: The Chinese University of Hong Kong; The Hong Kong Polytechnic University; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Large language models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains. Notably, recent studies have demonstrated that large language models significantly enhance the efficiency of biomolecular analysis and synthesis, attracting widespread attention from academia and medicine. In this paper, we systematically investigate the application of prompt-based methods with LLMs to biological sequences, including DNA, RNA, proteins, and drug discovery tasks. Specifically, we focus on how prompt engineering enables LLMs to tackle domain-specific problems, such as promoter sequence prediction, protein structure modeling, and drug-target binding affinity prediction, often with limited labeled data. Furthermore, our discussion highlights the transformative potential of prompting in bioinformatics while addressing key challenges such as data scarcity, multimodal fusion, and computational resource limitations. Our aim is for this paper to function both as a foundational primer for newcomers and a catalyst for continued innovation within this dynamic field of study.

[NLP-61] Uncovering Gaps in How Humans and LLMs Interpret Subjective Language ICLR2025

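[Quick read]: This paper addresses misalignment in how LLMs interpret and act on subjective human instructions. When users or developers steer LLMs with natural-language phrases such as "enthusiastic" or "witty", the model's actual operational semantics may differ from what humans expect. The key solution is TED (thesaurus error detector): it constructs a thesaurus that captures whether two phrases have similar operational semantics according to the LLM, then surfaces failures by finding disagreements between this thesaurus and a human-constructed reference. For example, Mistral 7B Instruct produces more harassing outputs when it edits text to be witty, and Llama 3 8B Instruct produces dishonest articles when instructed to make them enthusiastic. The method requires no direct supervision of model outputs; it uncovers unexpected behavior by examining relationships between abstract concepts.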

Link: https://arxiv.org/abs/2503.04113
Authors: Erik Jones, Arjun Patrawala, Jacob Steinhardt
Affiliations: UC Berkeley
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published at ICLR 2025

Abstract:Humans often rely on subjective natural language to direct language models (LLMs); for example, users might instruct the LLM to write an enthusiastic blogpost, while developers might train models to be helpful and harmless using LLM-based edits. The LLM’s operational semantics of such subjective phrases – how it adjusts its behavior when each phrase is included in the prompt – thus dictates how aligned it is with human intent. In this work, we uncover instances of misalignment between LLMs’ actual operational semantics and what humans expect. Our method, TED (thesaurus error detector), first constructs a thesaurus that captures whether two phrases have similar operational semantics according to the LLM. It then elicits failures by unearthing disagreements between this thesaurus and a human-constructed reference. TED routinely produces surprising instances of misalignment; for example, Mistral 7B Instruct produces more harassing outputs when it edits text to be witty, and Llama 3 8B Instruct produces dishonest articles when instructed to make the articles enthusiastic. Our results demonstrate that humans can uncover unexpected LLM behavior by scrutinizing relationships between abstract concepts, without supervising outputs directly.
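
The comparison step of TED, finding disagreements between an LLM-side thesaurus and a human reference, can be sketched as follows. Everything here is a toy under stated assumptions: the vectors are invented numbers standing in for whatever representation of operational semantics the method derives from the model, and the cosine threshold, the function names, and the empty human reference are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def find_disagreements(llm_vectors, human_similar, threshold=0.8):
    """Return phrase pairs where the LLM-side thesaurus and the human reference differ.

    llm_vectors: phrase -> vector describing its effect on the model (toy values here).
    human_similar: set of frozenset pairs that humans judge to be similar.
    """
    failures = []
    for a, b in combinations(sorted(llm_vectors), 2):
        llm_similar = cosine(llm_vectors[a], llm_vectors[b]) >= threshold
        expected = frozenset((a, b)) in human_similar
        if llm_similar != expected:
            failures.append((a, b, llm_similar))
    return failures


# Made-up vectors for illustration only.
toy_vectors = {
    "witty": [1.0, 0.1],
    "harassing": [0.9, 0.2],
    "enthusiastic": [0.0, 1.0],
}
failures = find_disagreements(toy_vectors, human_similar=set())
```

In this toy setup the only flagged pair is ("harassing", "witty"): the model-side vectors treat the two phrases as similar while the human reference does not, which is the kind of disagreement the abstract reports for Mistral 7B Instruct.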

[NLP-62] LLMs Can Generate a Better Answer by Aggregating Their Own Responses

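[Quick read]: This paper addresses the poor performance that results when LLMs rely on their own discriminative ability to correct errors or select responses on complex problems. Self-correction and response selection are limited because LLMs struggle with discriminative judgment without explicit supervision. The authors propose Generative Self-Aggregation (GSA), a prompting method that does not depend on discriminative ability: it samples multiple diverse responses from the model and aggregates them into an improved solution. Unlike self-consistency (SC), which needs specific verifiable tokens to enable majority voting, GSA uses the model's generative ability to synthesize a new response from the context of multiple samples, so it also applies to open-ended tasks. Experiments show improved response quality on mathematical reasoning, knowledge-based problems, and open-ended generation tasks such as code synthesis and conversational responses.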

Link: https://arxiv.org/abs/2503.04104
Authors: Zichong Li, Xinyu Feng, Yuheng Cai, Zixuan Zhang, Tianyi Liu, Chen Liang, Weizhu Chen, Haoyu Wang, Tuo Zhao
Affiliations: Georgia Tech; Microsoft Azure; Amazon; University at Albany
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have shown remarkable capabilities across tasks, yet they often require additional prompting techniques when facing complex problems. While approaches like self-correction and response selection have emerged as popular solutions, recent studies have shown these methods perform poorly when relying on the LLM itself to provide feedback or selection criteria. We argue this limitation stems from the fact that common LLM post-training procedures lack explicit supervision for discriminative judgment tasks. In this paper, we propose Generative Self-Aggregation (GSA), a novel prompting method that improves answer quality without requiring the model’s discriminative capabilities. GSA first samples multiple diverse responses from the LLM, then aggregates them to obtain an improved solution. Unlike previous approaches, our method does not require the LLM to correct errors or compare response quality; instead, it leverages the model’s generative abilities to synthesize a new response based on the context of multiple samples. While GSA shares similarities with the self-consistency (SC) approach for response aggregation, SC requires specific verifiable tokens to enable majority voting. In contrast, our approach is more general and can be applied to open-ended tasks. Empirical evaluation demonstrates that GSA effectively improves response quality across various tasks, including mathematical reasoning, knowledge-based problems, and open-ended generation tasks such as code synthesis and conversational responses.
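
The contrast between self-consistency and GSA can be made concrete with a short sketch. Self-consistency votes over short final answers, while GSA feeds the sampled responses back to the model and asks for a newly written answer. The prompt wording, the function names, and the `generate` callable (any function that maps a prompt string to a model response) are assumptions for illustration, not the paper's exact prompts or code.

```python
from collections import Counter


def self_consistency(final_answers):
    """Majority vote; only works when answers are short and directly comparable."""
    return Counter(final_answers).most_common(1)[0][0]


def build_aggregation_prompt(question, responses):
    """Pack the sampled responses into one prompt that asks for a synthesized answer."""
    numbered = "\n\n".join(
        f"Response {i}:\n{text}" for i, text in enumerate(responses, start=1)
    )
    return (
        f"Question:\n{question}\n\n"
        f"{numbered}\n\n"
        "Using the responses above as context, write a new, improved answer."
    )


def generative_self_aggregation(question, generate, n_samples=3):
    """Sample n responses, then ask the same model to synthesize a new one."""
    responses = [generate(question) for _ in range(n_samples)]
    return generate(build_aggregation_prompt(question, responses))
```

Note that the aggregation step never asks the model to rank or correct the samples, and it needs no extractable final answer, which is why the approach extends to open-ended outputs such as code or dialogue.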

[NLP-63] Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English

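[Quick read]: This paper addresses bias in LLM reasoning across dialectal variations such as African American English (AAE). The authors develop an experimental framework that combines LLM-based dialect conversion with established linguistic analyses to compare LLM performance on Standard American English (SAE) and AAE prompts. They find that, compared with equivalent SAE questions, AAE inputs receive less accurate responses and simpler reasoning chains and explanations, with the largest disparities in social science and humanities domains. The results reveal systematic differences in how LLMs process language varieties and raise questions about developing and deploying these systems in multilingual and multidialectal settings. The code is publicly available.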

Link: https://arxiv.org/abs/2503.04099
Authors: Runtao Zhou, Guangya Wan, Saadia Gabriel, Sheng Li, Alexander J Gates, Maarten Sap, Thomas Hartvigsen
Affiliations: University of Virginia; University of California, Los Angeles; Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ARR Under Review, First two authors contribute equally

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop an experimental framework comparing LLM performance given Standard American English (SAE) and AAE prompts, combining LLM-based dialect conversion with established linguistic analyses. We find that LLMs consistently produce less accurate responses and simpler reasoning chains and explanations for AAE inputs compared to equivalent SAE questions, with disparities most pronounced in social science and humanities domains. These findings highlight systematic differences in how LLMs process and reason about different language varieties, raising important questions about the development and deployment of these systems in our multilingual and multidialectal world. Our code repository is publicly available at this https URL.

[NLP-64] Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts

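[Quick read]: This paper addresses the output bias that existing chart benchmarks overlook, where multimodal large language models (MLLMs) answer from parametric memory instead of the chart content. The key solution is a new Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to force counterfactual reasoning based on the chart. The authors also propose HAI, a human-AI interactive data synthesis approach that combines the efficient text-editing ability of LLMs with human expert knowledge to generate diverse, high-quality HQA data at low cost, and use it to build Chart-HQA, a challenging benchmark synthesized from publicly available data sources. The core of the approach is to use counterfactual reasoning and a high-quality dataset to evaluate the generalization and reasoning balance of MLLMs.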

Link: https://arxiv.org/abs/2503.04095
Authors: Xiangnan Chen, Yuancheng Fang, Qian Xiao, Juncheng Li, Jun Lin, Siliang Tang, Yi Yang, Yueting Zhuang
Affiliations: Zhejiang University; Alibaba Group; DAMO Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract: Multimodal Large Language Models (MLLMs) have garnered significant attention for their strong visual-semantic understanding. Most existing chart benchmarks evaluate MLLMs’ ability to parse information from charts to answer questions. However, they overlook the inherent output biases of MLLMs, where models rely on their parametric memory to answer questions rather than genuinely understanding the chart content. To address this limitation, we introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content. Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of LLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost. Using HAI, we construct Chart-HQA, a challenging benchmark synthesized from publicly available data sources. Evaluation results on 18 MLLMs of varying model sizes reveal that current models face significant generalization challenges and exhibit imbalanced reasoning performance on the HQA task.

[NLP-65] PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks

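[Quick read]: This paper addresses the need for fast and accurate parsing of document images, which are used ever more widely in production and daily life as digitalization advances. It presents PP-DocBee, a multimodal large language model for end-to-end document image understanding. The key to the solution is a data synthesis strategy tailored to document scenarios, used to build a diverse dataset that improves generalization, combined with training techniques including dynamic proportional sampling, data preprocessing, and OCR postprocessing. Experiments show state-of-the-art results on English document understanding benchmarks, and performance above existing open-source and commercial models on Chinese document understanding.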

Link: https://arxiv.org/abs/2503.04065
Authors: Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu
Affiliations: Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: With the rapid advancement of digitalization, various document images are being applied more extensively in production and daily life, and there is an increasingly urgent need for fast and accurate parsing of the content in document images. Therefore, this report presents PP-DocBee, a novel multimodal large language model designed for end-to-end document image understanding. First, we develop a data synthesis strategy tailored to document scenarios in which we build a diverse dataset to improve the model generalization. Then, we apply a few training techniques, including dynamic proportional sampling, data preprocessing, and OCR postprocessing strategies. Extensive evaluations demonstrate the superior performance of PP-DocBee, achieving state-of-the-art results on English document understanding benchmarks and even outperforming existing open source and commercial models in Chinese document understanding. The source code and pre-trained models are publicly available at this https URL.

[NLP-66] Uncovering inequalities in new knowledge learning by large language models across different languages

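Quick read: This paper examines inequalities across languages in how large language models (LLMs) learn new knowledge. It considers four dimensions: effectiveness, transferability, prioritization, and robustness. Extensive experiments in two settings (in-context learning and fine-tuning), with both proprietary and open-source models, show that low-resource languages are at a disadvantage on all four dimensions. The key contribution is a systematic analysis and validation of these cross-lingual inequalities, intended to raise awareness of them and to encourage more inclusive and equitable future LLMs.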

Link: https://arxiv.org/abs/2503.04064
Authors: Chenglong Wang, Haoyu Tang, Xiyuan Yang, Yueqi Xie, Jina Suh, Sunayana Sitaram, Junming Huang, Yu Xie, Zhaoya Gong, Xing Xie, Fangzhao Wu
Affiliations: School of Urban Planning and Design, Peking University Shenzhen Graduate School, Shenzhen, China; Key Laboratory of Earth Surface System and Human-Earth Relations of Ministry of Natural Resources of China, Peking University Shenzhen Graduate School, Shenzhen, China; School of Computer Science and Technology, University of Science and Technology of China, Hefei, China; Microsoft Research Asia, Beijing, China; School of Computer Science, Wuhan University, Wuhan, China; Hong Kong University of Science and Technology, Hong Kong, China; Paul and Marcia Center on Contemporary China, Princeton University, Princeton, USA; Microsoft Research, Redmond, USA; Microsoft Research, Bengaluru, India; Center for Social Research, Guanghua School of Management, Peking University, Beijing, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:As large language models (LLMs) gradually become integral tools for problem solving in daily life worldwide, understanding linguistic inequality is becoming increasingly important. Existing research has primarily focused on static analyses that assess the disparities in the existing knowledge and capabilities of LLMs across languages. However, LLMs are continuously evolving, acquiring new knowledge to generate up-to-date, domain-specific responses. Investigating linguistic inequalities within this dynamic process is, therefore, also essential. In this paper, we explore inequalities in new knowledge learning by LLMs across different languages and four key dimensions: effectiveness, transferability, prioritization, and robustness. Through extensive experiments under two settings (in-context learning and fine-tuning) using both proprietary and open-source models, we demonstrate that low-resource languages consistently face disadvantages across all four dimensions. By shedding light on these disparities, we aim to raise awareness of linguistic inequalities in LLMs’ new knowledge learning, fostering the development of more inclusive and equitable future LLMs.

[NLP-67] Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge

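Quick read: This paper addresses the challenges data watermarking faces in tracking and verifying ownership of language model training data. Existing techniques focus mainly on effective memorization after pretraining and overlook the risk of watermarks being filtered out during data preprocessing, possible forgetting during post-training, and the difficulty of verification under API-only access. The key idea is a new data watermarking method that generates coherent, plausible passages describing a fictitious entity and its attributes, so that the watermark blends into the training data, which aids memorization and makes it harder to detect lexically. The paper also shows that increasing the watermarks' density, length, and attribute diversity strengthens memorization, and that the watermarks remain robust throughout LLM development, including continual pretraining and supervised fine-tuning. Finally, it shows that the watermarks can be evaluated via question answering even under API-only access.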

Link: https://arxiv.org/abs/2503.04036
Authors: Xinyue Cui, Johnny Tian-Zheng Wei, Swabha Swayamdipta, Robin Jia
Affiliations: University of Southern California
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Data watermarking in language models injects traceable signals, such as specific token sequences or stylistic patterns, into copyrighted text, allowing copyright holders to track and verify training data ownership. Previous data watermarking techniques primarily focus on effective memorization after pretraining, while overlooking challenges that arise in other stages of the LLM pipeline, such as the risk of watermark filtering during data preprocessing, or potential forgetting through post-training, or verification difficulties due to API-only access. We propose a novel data watermarking approach that injects coherent and plausible yet fictitious knowledge into training data using generated passages describing a fictitious entity and its associated attributes. Our watermarks are designed to be memorized by the LLM through seamlessly integrating in its training data, making them harder to detect lexically during preprocessing. We demonstrate that our watermarks can be effectively memorized by LLMs, and that increasing our watermarks’ density, length, and diversity of attributes strengthens their memorization. We further show that our watermarks remain robust throughout LLM development, maintaining their effectiveness after continual pretraining and supervised finetuning. Finally, we show that our data watermarks can be evaluated even under API-only access via question answering.
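
The abstract's idea of a fictitious-knowledge watermark, checked later through question answering, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the templates, the function names (`make_watermark_passages`, `make_qa_probes`, `probe_accuracy`), and the substring-based scoring are hypothetical and are not the authors' implementation, which uses generated passages rather than fixed templates.

```python
import random


def make_watermark_passages(entity, attributes, n_passages=3, seed=0):
    """Build short passages that each state every attribute of a fictitious entity.

    Hypothetical template-based stand-in for generated watermark text.
    """
    rng = random.Random(seed)
    templates = [
        "The {attr} of {entity} is {value}.",
        "{entity} is known for its {attr}: {value}.",
        "When people describe {entity}, they mention that its {attr} is {value}.",
    ]
    passages = []
    for _ in range(n_passages):
        items = list(attributes.items())
        rng.shuffle(items)  # vary attribute order between passages
        sentences = [
            rng.choice(templates).format(entity=entity, attr=attr, value=value)
            for attr, value in items
        ]
        passages.append(" ".join(sentences))
    return passages


def make_qa_probes(entity, attributes):
    """One question per attribute; the gold answer is the attribute value."""
    return [(f"What is the {attr} of {entity}?", value) for attr, value in attributes.items()]


def probe_accuracy(answer_fn, probes):
    """Fraction of probes whose gold answer appears in the model's reply.

    answer_fn is any callable mapping a question string to an answer string,
    for example a thin wrapper around an API-only model.
    """
    hits = sum(1 for question, gold in probes if gold.lower() in answer_fn(question).lower())
    return hits / len(probes)
```

In this sketch, a model that never saw the passages should score near zero on the probes, while a model that memorized them should score high; that gap is the ownership signal the abstract describes.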

[NLP-68] Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting

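Quick read: This paper addresses a limitation of existing benchmarks for large language models (LLMs) in bioinformatics: they cannot effectively evaluate model performance across diverse tasks. It proposes Bio-benchmark, a comprehensive prompting-based benchmarking framework comprising 30 key bioinformatics tasks that cover proteins, RNA, drugs, electronic health records, and traditional Chinese medicine. The key to the solution is a framework that evaluates models' intrinsic capabilities in zero-shot and few-shot Chain-of-Thought (CoT) settings without fine-tuning, together with a new tool, BioFinder, whose improved answer extraction raises extraction accuracy by about 30% and makes evaluation more efficient.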

Link: https://arxiv.org/abs/2503.04013
Authors: Jiyue Jiang, Pengan Chen, Jiuming Wang, Dongchen He, Ziqin Wei, Liang Hong, Licheng Zong, Sheng Wang, Qinze Yu, Zixian Ma, Yanyu Chen, Yimin Fan, Xiangyu Shi, Jiawei Sun, Chuan Wu, Yu Li
Affiliations: The Chinese University of Hong Kong; The University of Hong Kong; Shanghai AI Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have become important tools in solving biological problems, offering improvements in accuracy and adaptability over conventional methods. Several benchmarks have been proposed to evaluate the performance of these LLMs. However, current benchmarks can hardly evaluate the performance of these models across diverse tasks effectively. In this paper, we introduce a comprehensive prompting-based benchmarking framework, termed Bio-benchmark, which includes 30 key bioinformatics tasks covering areas such as proteins, RNA, drugs, electronic health records, and traditional Chinese medicine. Using this benchmark, we evaluate six mainstream LLMs, including GPT-4o and Llama-3.1-70b, using 0-shot and few-shot Chain-of-Thought (CoT) settings without fine-tuning to reveal their intrinsic capabilities. To improve the efficiency of our evaluations, we demonstrate BioFinder, a new tool for extracting answers from LLM responses, which increases extraction accuracy by around 30% compared to existing methods. Our benchmark results show the biological tasks suitable for current LLMs and identify specific areas requiring enhancement. Furthermore, we propose targeted prompt engineering strategies for optimizing LLM performance in these contexts. Based on these findings, we provide recommendations for the development of more robust LLMs tailored for various biological applications. This work offers a comprehensive evaluation framework and robust tools to support the application of LLMs in bioinformatics.

[NLP-69] RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models

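Quick read: This paper addresses the shortcomings of existing Multimodal Large Language Models (MLLMs) in the medical domain, particularly in understanding and analyzing retinal images. Although some general-domain MLLMs have been adapted to medicine, such as LLaVA-Med, they still lack sufficient specialization for retinal images. Medical experts, by contrast, emphasize the importance of quantitative analysis in disease detection and interpretation. This exposes a gap between general-domain and medical-domain MLLMs: general-domain models are broadly applicable but lack the precise specialized knowledge that diagnostic and interpretative tasks require.

To address these challenges, the paper proposes RetinalGPT, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. The key elements are compiling a large retinal image dataset, developing a new data pipeline, and applying customized visual instruction tuning to improve retinal image analysis and enrich medical knowledge. Experiments show that RetinalGPT substantially outperforms general-domain MLLMs at disease diagnosis on 8 benchmark retinal datasets. RetinalGPT also supports quantitative analyses such as lesion localization, a step toward an interpretable, end-to-end clinical research framework built on large language models.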

Link: https://arxiv.org/abs/2503.03987
Authors: Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce RetinalGPT, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to both enhance retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms general-domain MLLMs by a large margin in the diagnosis of retinal diseases on 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at this https URL.

[NLP-70] Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

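Quick read: This paper addresses understanding and reasoning over non-speech sounds and music, which is crucial for humans and AI agents to interact effectively with their environments. It introduces Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 relies on three elements: (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. The paper also extends audio understanding to long audio segments (30 seconds to 5 minutes) for the first time and proposes LongAudio, a large, novel dataset for training ALMs on long-audio captioning and question answering. Fine-tuning AF2 on LongAudio yields strong performance on the proposed LongAudioBench, an expert-annotated benchmark for evaluating long-audio understanding. Extensive ablation studies confirm the effectiveness of the approach.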

Link: https://arxiv.org/abs/2503.03983
Authors: Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, Bryan Catanzaro
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: this https URL.

[NLP-71] ReasonGraph: Visualisation of Reasoning Paths

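Quick read: This paper addresses the difficulty of analyzing the reasoning processes of large language models (LLMs), which are complex and lack effective visualization tools. It presents ReasonGraph, a web-based platform that supports sequential and tree-based reasoning methods and integrates with major LLM providers and over fifty state-of-the-art models. ReasonGraph offers an intuitive user interface, configurable visualization parameters, and a modular framework that makes extension efficient. Its key contribution is a unified visualization framework that reduces the cognitive load of analyzing complex reasoning paths, improves error detection in logical processes, and enables more effective development of LLM-based applications. The platform is open source, which promotes accessibility and reproducibility in LLM reasoning analysis.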

Link: https://arxiv.org/abs/2503.03979
Authors: Zongqian Li, Ehsan Shareghi, Nigel Collier
Affiliations: University of Cambridge; Monash University; University of Cambridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The reasoning processes of Large Language Models (LLMs) are challenging to analyze due to their complexity and the lack of organized visualization tools. We present ReasonGraph, a web-based platform for visualizing and analyzing LLM reasoning processes. It supports both sequential and tree-based reasoning methods while integrating with major LLM providers and over fifty state-of-the-art models. ReasonGraph incorporates an intuitive UI with meta reasoning method selection, configurable visualization parameters, and a modular framework that facilitates efficient extension. Our evaluation shows high parsing reliability, efficient processing, and strong usability across various downstream applications. By providing a unified visualization framework, ReasonGraph reduces cognitive load in analyzing complex reasoning paths, improves error detection in logical processes, and enables more effective development of LLM-based applications. The platform is open-source, promoting accessibility and reproducibility in LLM reasoning analysis.
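
To make "visualizing a sequential reasoning path" concrete, here is a minimal sketch that turns numbered reasoning steps into Mermaid flowchart text. It is not ReasonGraph's parser: the `Step N:` input convention, the function name `steps_to_mermaid`, and the choice of Mermaid as the output format are all assumptions made for illustration.

```python
import re


def steps_to_mermaid(text):
    """Convert lines of the form 'Step N: ...' into a top-down Mermaid flowchart.

    Each step becomes a node, and consecutive steps are joined by an edge.
    """
    steps = re.findall(r"Step\s+(\d+):\s*(.+)", text)
    lines = ["flowchart TD"]
    for number, body in steps:
        label = body.strip().replace('"', "'")  # keep the quoted label well-formed
        lines.append(f'    S{number}["{label}"]')
    for (src, _), (dst, _) in zip(steps, steps[1:]):
        lines.append(f"    S{src} --> S{dst}")
    return "\n".join(lines)
```

A tree-based method would need branching edges (one parent, several children) instead of a single chain; the node and edge representation stays the same.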

[NLP-72] Preliminary Report: Enhancing Role Differentiation in Conversational HCI Through Chromostereopsis

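Quick read: This paper addresses the lack of an intuitive mechanism for distinguishing roles in text-based AI interfaces, aiming to convey role hierarchy implicitly and give the interface a subtle sense of physical space. The key idea is to use chromostereopsis, a perceptual phenomenon in which color contrast induces depth perception, to visually differentiate conversational roles.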

Link: https://arxiv.org/abs/2503.03968
Authors: Matteo Grella
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: Preliminary Report, 8 pages, 1 figure

Abstract:We propose leveraging chromostereopsis, a perceptual phenomenon inducing depth perception through color contrast, as a novel approach to visually differentiating conversational roles in text-based AI interfaces. This method aims to implicitly communicate role hierarchy and add a subtle sense of physical space.

[NLP-73] On the Acquisition of Shared Grammatical Representations in Bilingual Language Models

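Quick read: This paper investigates how crosslingual transfer works in contemporary multilingual models, and in particular what happens when a monolingual language model begins training on a second language. The approach is to train bilingual models while controlling the amount of data per language and the order of language exposure, and to look for evidence of shared multilingual representations using structural priming, a method for studying grammatical representations. After controlling for training data quantity and exposure order, crosslingual structural priming effects are asymmetrical across language pairs and directions, which may shape hypotheses about structural priming in humans. Priming effects are also weaker for less similar language pairs, pointing to potential limits of crosslingual transfer learning and shared representations for typologically diverse languages. The core contribution is an experimental design that exposes these asymmetries and limits, deepening understanding of how multilingual models work.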

Link: https://arxiv.org/abs/2503.03962
Authors: Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K. Bergen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While crosslingual transfer is crucial to contemporary language models’ multilingual capabilities, how it occurs is not well understood. In this paper, we ask what happens to a monolingual language model when it begins to be trained on a second language. Specifically, we train small bilingual models for which we control the amount of data for each language and the order of language exposure. To find evidence of shared multilingual representations, we turn to structural priming, a method used to study grammatical representations in humans. We first replicate previous crosslingual structural priming results and find that after controlling for training data quantity and language exposure, there are asymmetrical effects across language pairs and directions. We argue that this asymmetry may shape hypotheses about human structural priming effects. We also find that structural priming effects are less robust for less similar language pairs, highlighting potential limitations of crosslingual transfer learning and shared representations for typologically diverse languages.

[NLP-74] Performance Comparison of Large Language Models on Advanced Calculus Problems

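Quick read: This paper evaluates the accuracy, reliability, and problem-solving ability of seven Large Language Models (LLMs) on advanced calculus problems. The models are ChatGPT 4o, Gemini Advanced with 1.5 Pro, Copilot Pro, Claude 3.5 Sonnet, Meta AI, Mistral AI, and Perplexity. The evaluation uses 32 test problems worth 320 points in total, covering vector calculations, geometric interpretations, integral evaluations, and optimization tasks. A key finding is the importance of re-prompting for obtaining accurate solutions. The study also reveals trends and patterns in each model's performance across problem types, giving insight into the current capabilities and limitations of LLMs in calculus. The analysis highlights the robustness of models such as ChatGPT 4o and Mistral AI across problem types, and the weaknesses of others, such as Gemini Advanced with 1.5 Pro and Meta AI, on complex integral and optimization problems. Overall, the study offers a useful reference for educators, researchers, and developers applying LLMs to mathematics teaching and practice.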

Link: https://arxiv.org/abs/2503.03960
Authors: In Hak Moon
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper presents an in-depth analysis of the performance of seven different Large Language Models (LLMs) in solving a diverse set of advanced calculus problems. The study evaluates the accuracy, reliability, and problem-solving capabilities of ChatGPT 4o, Gemini Advanced with 1.5 Pro, Copilot Pro, Claude 3.5 Sonnet, Meta AI, Mistral AI, and Perplexity. The assessment was conducted through a series of thirty-two test problems, encompassing a total of 320 points. The problems covered various topics, from vector calculations and geometric interpretations to integral evaluations and optimization tasks. The results highlight significant trends and patterns in the models’ performance, revealing both their strengths and weaknesses. For instance, models like ChatGPT 4o and Mistral AI demonstrated consistent accuracy across various problem types, indicating their robustness and reliability in mathematical problem-solving, while models such as Gemini Advanced with 1.5 Pro and Meta AI exhibited specific weaknesses, particularly in complex problems involving integrals and optimization, suggesting areas for targeted improvements. The study also underscores the importance of re-prompting in achieving accurate solutions, as seen in several instances where models initially provided incorrect answers but corrected them upon re-prompting. Overall, this research provides valuable insights into the current capabilities and limitations of LLMs in the domain of calculus, with the detailed analysis of each model’s performance on specific problems offering a comprehensive understanding of their strengths and areas for improvement, contributing to the ongoing development and refinement of LLM technology. The findings are particularly relevant for educators, researchers, and developers seeking to leverage LLMs for educational and practical applications in mathematics.

[NLP-75] c-Habilidad: Skill Classification for Bridging Education and Employment

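Quick read: This paper addresses the lack of datasets for evaluating skill extraction and classification models on Spanish-language resumes, and the problem of distinguishing hard skills from soft skills while annotating the differences between knowledge, skills, and abilities. Its key contribution is a Spanish-language dataset for skill extraction and classification, together with an annotation methodology and deep learning baselines, to advance robust solutions for skill classification. The work fills a gap in Spanish-language skill evaluation and provides a basis for assessing model reliability and precision.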

Link: https://arxiv.org/abs/2503.03932
Authors: Sabur Butt, Hector G. Ceballos, Diana P. Madera
Affiliations: Tecnológico de Monterrey (Monterrey Institute of Technology and Higher Education)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Job application and assessment processes have evolved significantly in recent years, largely due to advancements in technology and changes in the way companies operate. Skill extraction and classification remain an important component of the modern hiring process as it provides a more objective way to evaluate candidates and automatically align their skills with the job requirements. However, to effectively evaluate the skills, the skill extraction tools must recognize varied mentions of skills on resumes, including direct mentions, implications, synonyms, acronyms, phrases, and proficiency levels, and differentiate between hard and soft skills. While tools like LLMs (Large Language Models) help extract and categorize skills from job applications, there’s a lack of comprehensive datasets for evaluating the effectiveness of these models in accurately identifying and classifying skills in Spanish-language job applications. This gap hinders our ability to assess the reliability and precision of the models, which is crucial for ensuring that the selected candidates truly possess the required skills for the job. In this paper, we develop a Spanish language dataset for skill extraction and classification, provide annotation methodology to distinguish between knowledge, skill, and abilities, and provide deep learning baselines to advance robust solutions for skill classification.

[NLP-76] Personalized Federated Fine-tuning for Heterogeneous Data: An Automatic Rank Learning Approach via Two-Level LoRA

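Quick read: This paper addresses the challenge of heterogeneous data in personalized federated fine-tuning of language models, where clients collaboratively fine-tune a pretrained language model (e.g., BERT or GPT) without sharing local data while also achieving personalization. Existing methods typically use parameter-efficient fine-tuning such as Low-Rank Adaptation (LoRA), but they usually depend on predefined maximal and minimal ranks and cannot adapt to the diverse data sources of different clients. The paper proposes PF2LoRA, a new algorithm whose key element is an automatic rank learning approach based on two-level LoRA. PF2LoRA learns two levels of adaptation simultaneously: the first learns a common adapter for all clients, and the second fosters personalization for each client. Its main advantage is that it adaptively determines a suitable rank from each client's data rather than relying on a fixed rank. A synthetic example shows how PF2LoRA automatically learns the ground-truth rank for each client, and the second-level adaptation adds only a small number of parameters. Experiments show that PF2LoRA significantly outperforms existing federated fine-tuning methods on natural language understanding and generation tasks.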

Link: https://arxiv.org/abs/2503.03920
Authors: Jie Hao, Yuman Wu, Ali Payani, Myungjin Lee, Mingrui Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 28 pages, 5 figures

Abstract:We study the task of personalized federated fine-tuning with heterogeneous data in the context of language models, where clients collaboratively fine-tune a language model (e.g., BERT, GPT) without sharing their local data, achieving personalization simultaneously. While recent efforts have applied parameter-efficient fine-tuning techniques like low-rank adaptation (LoRA) in federated settings, they typically use single or multiple independent low-rank adapters with predefined maximal and minimal ranks, which may not be optimal for diverse data sources over clients. To address this issue, we propose PF2LoRA, a new personalized federated fine-tuning algorithm built on a novel automatic rank learning approach via two-level LoRA. Given the pretrained language model whose weight is frozen, our algorithm aims to learn two levels of adaptation simultaneously: the first level aims to learn a common adapter for all clients, while the second level fosters individual client personalization. A key advantage of PF2LoRA is its ability to adaptively determine a suitable rank based on an individual client’s data, rather than relying on a predefined rank that is agnostic to data heterogeneity. We present a synthetic example that highlights how PF2LoRA automatically learns the ground-truth rank for each client, tailoring the adaptation to match the properties of their individual data. Notably, this approach introduces minimal additional memory overhead, as the second-level adaptation comprises a small number of parameters compared to the first level. Our experiments on natural language understanding and generation tasks demonstrate that PF2LoRA significantly outperforms existing federated fine-tuning methods.
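
The two-level structure described in the abstract (a frozen pretrained weight, a common adapter shared by all clients, and a small per-client adapter) can be sketched with NumPy. This is a sketch under stated assumptions, not the PF2LoRA algorithm: it assumes the two low-rank updates are simply added to the frozen weight, it covers only the forward pass, and it does not implement the automatic rank learning. The names `init_adapter` and `two_level_forward` are hypothetical.

```python
import numpy as np


def init_adapter(rng, d_in, d_out, rank):
    """Standard LoRA-style init: small random A, zero B, so the update starts at zero."""
    A = rng.normal(scale=0.01, size=(rank, d_in))
    B = np.zeros((d_out, rank))
    return A, B


def two_level_forward(x, W0, common, personal):
    """Forward pass with a frozen weight plus two low-rank updates.

    Assumption: the effective weight is W0 + B_c @ A_c + B_k @ A_k, where
    (A_c, B_c) is shared by all clients and (A_k, B_k) belongs to client k.
    """
    A_c, B_c = common
    A_k, B_k = personal
    W_eff = W0 + B_c @ A_c + B_k @ A_k
    return x @ W_eff.T


rng = np.random.default_rng(0)
d_in, d_out = 6, 4
W0 = rng.normal(size=(d_out, d_in))                  # frozen pretrained weight
common = init_adapter(rng, d_in, d_out, rank=3)      # first level, shared
personal = init_adapter(rng, d_in, d_out, rank=1)    # second level, per client
x = rng.normal(size=(2, d_in))
y = two_level_forward(x, W0, common, personal)       # equals x @ W0.T at initialization
```

With these illustrative sizes, the per-client adapter of rank 1 holds `d_in + d_out` parameters against `3 * (d_in + d_out)` for the common adapter, which mirrors the abstract's point that the second level adds only a small number of parameters.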

[NLP-77] AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County

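Quick read: This paper addresses the identification and removal of racial covenants in historical property records, a pressing task for many jurisdictions. California mandated in 2021 that all counties implement such a process, but the volume of documents (Santa Clara County alone has over 24 million property deed documents) makes purely manual review infeasible. The key to the solution is a new approach based on an open large language model (LLM), developed in partnership with the Santa Clara County Clerk-Recorder's Office and fine-tuned to detect racial covenants with high precision and recall. The authors estimate that the system saves 86,500 person-hours of manual work and costs less than 2% of a comparable off-the-shelf closed model.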

Link: https://arxiv.org/abs/2503.03888
Authors: Faiz Surani, Mirac Suzgun, Vyoma Raman, Christopher D. Manning, Peter Henderson, Daniel E. Ho
Affiliations: Stanford University; Princeton University
Categories: Computation and Language (cs.CL)
Comments: this https URL

Abstract:Many jurisdictions have moved to identify and strike racial covenants from property deeds, including California, which mandated in 2021 that all counties implement such a process. Yet the scale can be overwhelming, with Santa Clara County (SCC) alone having over 24 million property deed documents, making purely manual review infeasible. We present a novel approach to addressing this pressing issue, developed through a partnership with the SCC Clerk-Recorder’s Office. First, we leverage an open large language model, fine-tuned to detect racial covenants with high precision and recall. We estimate that this system reduces manual efforts by 86,500 person hours and costs less than 2% of the cost for a comparable off-the-shelf closed model. Second, we illustrate the County’s integration of this model into responsible operational practice, including legal review and the creation of a historical registry, and release our model to assist the hundreds of jurisdictions engaged in similar efforts. Finally, our results reveal distinct periods of utilization of racial covenants, sharp geographic clustering, and the disproportionate role of a small number of developers in maintaining housing discrimination. We estimate that by 1950, one in four properties across the County were subject to racial covenants.
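
Precision and recall are the two quantities the abstract uses to describe the detector, so a short worked example may help. The deed identifiers and counts below are invented for illustration and are not results from the paper.

```python
def precision_recall(flagged, actual):
    """Precision and recall of a set of flagged documents against the true set.

    precision = flagged documents that really contain a covenant / all flagged
    recall    = flagged documents that really contain a covenant / all true covenants
    """
    flagged, actual = set(flagged), set(actual)
    true_positives = len(flagged & actual)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical toy data: the model flags 4 deeds; 5 deeds truly contain a covenant.
flagged = {"deed-1", "deed-2", "deed-3", "deed-4"}
actual = {"deed-2", "deed-3", "deed-4", "deed-5", "deed-6"}
precision, recall = precision_recall(flagged, actual)  # 3/4 and 3/5
```

In a review pipeline like the one described, low precision wastes reviewer time on false alarms, while low recall leaves covenants undetected, which is why both figures are reported.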

[NLP-78] LEWIS (LayEr WIse Sparsity) – A Training Free Guided Model Merging Approach ICLR2025

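Quick read: This paper addresses a limitation of existing model merging methods: data-free merging can create a multi-task model but struggles to noticeably improve the downstream model's performance on a particular task-specific benchmark. It proposes LEWIS (Layer Wise Sparsity), a guided model-merging framework. The key idea is to use activation-based layer importance to dynamically adjust the layer-wise task-vector sparsity required during merging, with a calibration dataset used to prioritize the task-specific knowledge in critical layers. This steers the merged model toward the best performance on benchmarks that resemble the calibration dataset. Experiments show that LEWIS improves merged code instruction-following and math-solving models by up to 4% and 11.3%, respectively, outperforming unguided approaches that use uniform sparsity.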

Link: https://arxiv.org/abs/2503.03874
Authors: Hetarth Chopra, Vidhi Rambhia, Vikram Adve
Affiliations: Siebel School of Computing and Data Science, University of Illinois at Urbana Champaign
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: Accepted at ICLR 2025 Workshop: SLLM (Sparsity in Large Language Models)

Abstract:As specialized large language models (LLMs) become increasingly prevalent, model merging methods are being used to combine them to create a single multi-task model without requiring any additional data or training. However, these approaches fall short when the objective of merging is to increase the downstream model’s performance on a particular task-specific benchmark. In this work, we propose LEWIS (Layer Wise Sparsity), a guided model-merging framework that uses activation-based layer importance to dynamically adjust layer-wise task-vector sparsity required for the merge process. LEWIS uses a calibration dataset to prioritize critical layers during the task-vector pruning process required for model merging. This approach guides existing merging methods by preserving essential layer-wise task-specific knowledge while ensuring the merged model performs the best at benchmarks resembling the calibration dataset. Our experiments demonstrate the effectiveness of LEWIS with performance improvements of code instruction-following and math-solving models created through model merging up to 4 percent and 11.3 percent, respectively, outperforming unguided data-less model merging approaches that use uniform-sparsity.
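
The idea of layer-wise task-vector sparsity can be sketched as follows. This is an illustrative simplification, not the LEWIS implementation: the per-layer importance scores are taken as given (the paper derives them from activations on a calibration dataset), the linear mapping from importance to density and its bounds are assumptions, and only one fine-tuned model is merged into the base. The function names are hypothetical.

```python
import numpy as np


def layer_densities(importance, min_density=0.1, max_density=0.9):
    """Map per-layer importance scores linearly onto [min_density, max_density]."""
    imp = np.asarray(importance, dtype=float)
    span = imp.max() - imp.min()
    norm = (imp - imp.min()) / span if span > 0 else np.full_like(imp, 0.5)
    return min_density + norm * (max_density - min_density)


def magnitude_prune(task_vector, density):
    """Keep the largest-magnitude entries of a task vector and zero out the rest.

    Ties at the threshold are all kept, so slightly more than the target
    number of entries may survive.
    """
    flat = np.abs(task_vector).ravel()
    k = int(round(density * flat.size))
    if k <= 0:
        return np.zeros_like(task_vector)
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    return np.where(np.abs(task_vector) >= threshold, task_vector, 0.0)


def merge_layers(base, finetuned, importance, min_density=0.1, max_density=0.9):
    """Add a pruned task vector (finetuned - base) to each base layer.

    More important layers keep a denser task vector; less important layers
    are pruned harder.
    """
    densities = layer_densities(importance, min_density, max_density)
    return [
        w_base + magnitude_prune(w_ft - w_base, density)
        for w_base, w_ft, density in zip(base, finetuned, densities)
    ]
```

A uniform-sparsity baseline corresponds to passing equal importance scores for every layer, which gives each layer the same density.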

[NLP-79] Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions

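Quick read: This paper asks why, as language model capabilities improve, smaller models with particular design choices sometimes outperform larger ones. It focuses on quantifying how design choices (such as model size, amount of training data, and architectural decisions) affect performance, and on how these factors jointly determine a model's final capabilities.
The key to the solution is a framework that analyzes the data and architectural characteristics of 92 open-source pretrained models. It finds that including factors beyond model size and number of training tokens (such as data composition and specific architectural choices) yields a relative 3-28% improvement in the ability to predict downstream performance. This indicates that considering multiple design factors together matters more than relying on scale alone, and it lays a foundation for systematic study of how model development choices shape final capabilities.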

Link: https://arxiv.org/abs/2503.03862
Authors: Emmy Liu, Amanda Bertsch, Lintang Sutawika, Lindia Tjuatja, Patrick Fernandes, Lara Marinov, Michael Chen, Shreya Singhal, Carolin Lawrence, Aditi Raghunathan, Kiril Gashteovski, Graham Neubig
Affiliations: Carnegie Mellon University, Language Technologies Institute; Instituto Superior Técnico (Lisbon ELLIS Unit), Instituto de Telecomunicações; NEC Laboratories Europe, Germany; Center for Advanced Interdisciplinary Research, Ss. Cyril and Methodius Uni. of Skopje
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we can achieve a relative 3-28% increase in ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as the trade-off between language and code tasks at 15-25% code, as well as the better performance of some architectural decisions such as choosing rotary over learned embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.

[NLP-80] Vision-Language Models Struggle to Align Entities across Modalities

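Quick read: This paper addresses a research gap in cross-modal entity linking, the ability to align entities and their attributes across modalities. This ability is essential for real-world applications such as multimodal code generation, fake news detection, and scene understanding, but it has not been thoroughly studied. The key contribution is a new task and benchmark, MATE, with 5.5k evaluation instances that pair visual scenes with their textual representations, plus a question-answering task that requires retrieving one attribute of an object in one modality from a unique attribute of that object in the other modality. The results show that although chain-of-thought prompting can improve Vision-Language Models (VLMs) in some cases, they remain far from human-level performance, which underscores the need for further research and shows that MATE is a strong benchmark.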

Link: https://arxiv.org/abs/2503.03854
Authors: Iñigo Alonso, Ander Salaberria, Gorka Azkune, Jeremy Barnes, Oier Lopez de Lacalle
Affiliations: Institute for Language, Cognition and Computation, University of Edinburgh; HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Cross-modal entity linking refers to the ability to align entities and their attributes across different modalities. While cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation, fake news detection, or scene understanding, it has not been thoroughly studied in the literature. In this paper, we introduce a new task and benchmark to address this gap. Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations. To evaluate cross-modal entity linking performance, we design a question-answering task that involves retrieving one attribute of an object in one modality based on a unique attribute of that object in another modality. We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans, particularly as the number of objects in the scene increases. Our analysis also shows that, while chain-of-thought prompting can improve VLM performance, models remain far from achieving human-level proficiency. These findings highlight the need for further research in cross-modal entity linking and show that MATE is a strong benchmark to support that progress.
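
The question format described in the abstract (use a unique attribute of an object to point at it, then ask for another of its attributes) can be sketched as a small instance generator. This is a hypothetical illustration and not the MATE construction code: the real benchmark splits attributes between an image and a textual scene representation, whereas here both sides live in one dictionary per object, and the attribute names are invented.

```python
def make_instance(objects, pointer_attr, target_attr):
    """Build one question/answer pair from a scene.

    The pointer attribute must identify exactly one object; the question then
    asks for that object's target attribute. In a cross-modal setting the
    pointer would be visible in one modality and the target in the other.
    """
    values = [obj[pointer_attr] for obj in objects]
    unique = [obj for obj in objects if values.count(obj[pointer_attr]) == 1]
    if not unique:
        raise ValueError(f"no object has a unique {pointer_attr!r}")
    obj = unique[0]
    question = (
        f"What is the {target_attr} of the object whose "
        f"{pointer_attr} is {obj[pointer_attr]}?"
    )
    return question, obj[target_attr]


# Hypothetical three-object scene.
scene = [
    {"name": "mug", "color": "red", "shape": "cylinder"},
    {"name": "box", "color": "blue", "shape": "cube"},
    {"name": "ball", "color": "blue", "shape": "sphere"},
]
question, answer = make_instance(scene, pointer_attr="color", target_attr="name")
```

Adding more objects to the scene makes the linking step harder because more candidates share attribute values, which is consistent with the abstract's observation that VLM performance drops as the number of objects grows.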

[NLP-81] Multi-Agent Systems Powered by Large Language Models: Applications in Swarm Intelligence

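Quick read: This paper addresses how to integrate Large Language Models (LLMs) into multi-agent simulations, replacing agents' hard-coded programs so that agents generate behavior adaptively from environmental data. The key to the solution is a toolchain that connects LLMs to the NetLogo simulation platform, using NetLogo's Python extension to communicate with GPT-4o through the OpenAI API. The toolchain supports prompt-driven behavior generation with both structured, rule-based prompts and autonomous, knowledge-driven prompts, which makes it possible to study self-organizing processes in complex systems and to induce emergent behaviors in multi-agent environments.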

Link: https://arxiv.org/abs/2503.03800
Authors: Cristian Jimenez-Romero, Alper Yegenoglu, Christian Blum
Affiliations: ETIS-Lab Faculty of Computer Science, CY Cergy Paris University; alper.yegenoglu@rwth-aachen.de (email address only; no institution given); Artificial Intelligence Research Institute (IIIA-CSIC), Campus of the UAB, Bellaterra, Spain
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This work examines the integration of large language models (LLMs) into multi-agent simulations by replacing the hard-coded programs of agents with LLM-driven prompts. The proposed approach is showcased in the context of two examples of complex systems from the field of swarm intelligence: ant colony foraging and bird flocking. Central to this study is a toolchain that integrates LLMs with the NetLogo simulation platform, leveraging its Python extension to enable communication with GPT-4o via the OpenAI API. This toolchain facilitates prompt-driven behavior generation, allowing agents to respond adaptively to environmental data. For both example applications mentioned above, we employ both structured, rule-based prompts and autonomous, knowledge-driven prompts. Our work demonstrates how this toolchain enables LLMs to study self-organizing processes and induce emergent behaviors within multi-agent environments, paving the way for new approaches to exploring intelligent systems and modeling swarm intelligence inspired by natural phenomena. We provide the code, including simulation files and data at this https URL.
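
The prompt-driven agent step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the action vocabulary, the prompt wording, and the `llm` callable are hypothetical stand-ins, and the stub below replaces the real GPT-4o call so that the example runs offline. It is not the authors' toolchain.

```python
from typing import Callable, Dict

# Hypothetical action vocabulary; the paper's actual prompt format is not shown here.
ALLOWED_ACTIONS = ("move-forward", "turn-left", "turn-right", "pick-up-food", "drop-food")


def build_prompt(observation: Dict[str, object]) -> str:
    """Turn an agent's local observation into a rule-style text prompt."""
    facts = "; ".join(f"{key}={value}" for key, value in sorted(observation.items()))
    return (
        "You control one ant in a foraging simulation. "
        f"Observation: {facts}. "
        f"Reply with exactly one of: {', '.join(ALLOWED_ACTIONS)}."
    )


def choose_action(
    observation: Dict[str, object],
    llm: Callable[[str], str],
    fallback: str = "move-forward",
) -> str:
    """Ask the language model for an action and reject anything off-vocabulary."""
    reply = llm(build_prompt(observation)).strip().lower()
    return reply if reply in ALLOWED_ACTIONS else fallback


def stub_llm(prompt: str) -> str:
    """Offline stand-in for a real chat-completion call (assumption)."""
    return "pick-up-food" if "food_here=True" in prompt else "turn-left"
```

Validating the reply against a fixed vocabulary keeps the simulation loop safe when the model returns free-form text; in a real setup the stub would be replaced by an API call.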

[NLP-82] Sarcasm Detection as a Catalyst: Improving Stance Detection with Cross-Target Capabilities

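Quick read: This paper addresses stance detection (SD), where the subtlety and complexity of text from online platforms, and sarcastic language in particular, make it hard for existing algorithms to identify an author's stance. To cope with the shortage of annotated training data for new targets, it also performs cross-target stance detection (CTSD). The approach fine-tunes BERT and RoBERTa models and concatenates additional deep-learning layers on top of them; sarcasm-detection pre-training then injects sarcasm knowledge into the model, which markedly reduces misclassification of sarcastic text. According to the abstract, the model with sarcasm pre-training correctly predicts 85% of the in-domain texts that were misclassified without it, which raises the average macro F1-score, and the CTSD task reaches performance comparable to the in-domain task. These results support sarcasm detection as an effective intermediate transfer-learning task.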

Link: https://arxiv.org/abs/2503.03787
Authors: Gibson Nkhata, Shi Yin Hong, Susan Gauch
Affiliations: University of Arkansas
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 2 pages, 5 figures, published in International Journal On Advances in Intelligent Systems, volume 17, numbers 3 and 4. arXiv admin note: text overlap with arXiv:2503.03172

Abstract:Stance Detection (SD) has become a critical area of interest due to its applications in various contexts, leading to increased research within NLP. Yet the subtlety and complexity of texts sourced from online platforms, which often contain sarcastic language, pose significant challenges for SD algorithms in accurately determining the author's stance. This paper addresses this by employing sarcasm for SD. It also tackles the issue of insufficient annotated data for training SD models on new targets by conducting Cross-Target SD (CTSD). The proposed approach involves fine-tuning BERT and RoBERTa models followed by concatenating additional deep learning layers. The approach is assessed against various State-Of-The-Art baselines for SD, demonstrating superior performance using publicly available datasets. Notably, our model outperforms the best SOTA models on both in-domain SD and CTSD tasks even before the incorporation of sarcasm-detection pre-training. The integration of sarcasm knowledge into the model significantly reduces misclassifications of sarcastic text elements in SD, allowing our model to accurately predict 85% of texts that were previously misclassified without sarcasm-detection pre-training on in-domain SD. This enhancement contributes to an increase in the model's average macro F1-score. The CTSD task achieves performance comparable to that of the in-domain task despite using zero-shot fine-tuning. We also reveal that the success of the transfer-learning framework relies on the correlation between the lexical attributes of sarcasm detection and SD. This study represents the first exploration of sarcasm detection as an intermediate transfer-learning task within the context of SD while also leveraging the concatenation of BERT or RoBERTa with other deep-learning techniques. The proposed approach establishes a foundational baseline for future research in this domain.
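
Since the results above are reported as macro F1, a small sketch of how that metric is computed may help. The stance labels and the toy predictions are made up for illustration and are not taken from the paper's datasets.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0 by convention here.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

Because every class counts equally, a rare stance class that is always missed pulls the score down as much as a frequent one, which is why macro F1 is a common choice for imbalanced stance data.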

[NLP-83] M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

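Quick read: This paper tackles the performance challenges that omni-modal multimodal large language models (Omni-MLLMs) face in cross-modal understanding and generation, particularly when data quantity and convergence rate differ significantly across modalities. It proposes two strategies: a step balance strategy during pre-training to handle the imbalance in per-modality data volume, and a dynamically adaptive balance strategy during instruction tuning to synchronize training progress across modalities and ensure optimal convergence. The work also prioritizes preserving strong performance on pure-text tasks so that the model keeps its language understanding ability. Together these measures give M2-omni comprehensive modality support and strong performance on multimodal tasks.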

Link: https://arxiv.org/abs/2502.18778
Authors: Qingpei Guo, Kaiyou Song, Zipeng Feng, Ziping Ma, Qinglong Zhang, Sirui Gao, Xuzheng Yu, Yunxiao Sun, Tai-Wei Chang, Jingdong Chen, Ming Yang, Jun Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves performance competitive with GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers Large Language Models (LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input, generating multimodal sequences interleaving with audio, image, or text outputs, thereby enabling an advanced and interactive real-time experience. The training of such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize the modality-wise training progress, ensuring optimal convergence. Notably, we prioritize preserving strong performance on pure text tasks to maintain the robustness of M2-omni’s language understanding capability throughout the training process. To the best of our knowledge, M2-omni is currently a very competitive open-source alternative to GPT-4o, characterized by its comprehensive modality and task support, as well as its exceptional performance. We expect M2-omni will advance the development of omni-MLLMs, thus facilitating future research in this domain.

[NLP-84] Scaling Rich Style-Prompted Text-to-Speech Datasets

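Quick read: This paper asks how rich speech style tags (such as guttural, nasal, pained) can be annotated automatically at scale, given that existing large-scale datasets cover only basic tags (such as low-pitched, slow, loud). Earlier work relied on small-scale human annotation, which does not extend to larger datasets. The key idea is to combine off-the-shelf text embedders, speech embedders, classifiers, and an audio language model to scale rich tag annotation automatically for the first time. With this method the authors build the ParaSpeechCaps dataset, which covers 59 style tags, including speaker-level intrinsic tags and utterance-level situational tags, and they show that it improves style consistency and speech quality while laying groundwork for future research.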

Link: https://arxiv.org/abs/2503.04713
Authors: Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi
Affiliations: Department of Computer Science, The University of Texas at Austin; Department of Computer Science and Data Science, New York University
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at this https URL.

Computer Vision

[CV-0] FluidNexus: 3D Fluid Reconstruction and Prediction from a Single Video CVPR2025

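Quick read: This paper reconstructs and predicts 3D fluid appearance and velocity from a single-view video, a task for which current methods require multi-view videos. It proposes FluidNexus, a framework that bridges video generation and physics simulation. The key idea is to synthesize multiple novel-view videos as references for reconstruction, realized by two core components: (1) a novel-view video synthesizer that combines frame-wise view synthesis with video diffusion refinement to generate realistic videos; and (2) a physics-integrated particle representation that couples differentiable simulation and rendering to support 3D fluid reconstruction and prediction at the same time. For evaluation, the authors collect two new real-world fluid datasets featuring textured backgrounds and object interactions.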

Link: https://arxiv.org/abs/2503.04720
Authors: Yue Gao, Hong-Xing Yu, Bo Zhu, Jiajun Wu
Affiliations: Stanford University; Microsoft; Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project website: this https URL

Abstract:We study reconstructing and predicting 3D fluid appearance and velocity from a single video. Current methods require multi-view videos for fluid reconstruction. We present FluidNexus, a novel framework that bridges video generation and physics simulation to tackle this task. Our key insight is to synthesize multiple novel-view videos as references for reconstruction. FluidNexus consists of two key components: (1) a novel-view video synthesizer that combines frame-wise view synthesis with video diffusion refinement for generating realistic videos, and (2) a physics-integrated particle representation coupling differentiable simulation and rendering to simultaneously facilitate 3D fluid reconstruction and prediction. To evaluate our approach, we collect two new real-world fluid datasets featuring textured backgrounds and object interactions. Our method enables dynamic novel view synthesis, future prediction, and interaction simulation from a single fluid video. Project website: this https URL.

[CV-1] Floxels: Fast Unsupervised Voxel Based Scene Flow Estimation CVPR2025

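Quick read: This paper mitigates several limitations of scene flow estimation, specifically the runtime, convergence, and result-quality shortcomings of existing optimization-based methods. Its contributions are: (1) a simple voxel grid-based model that improves on the standard MLP-based formulation in multiple dimensions; (2) a new multi-frame loss formulation; and (3) Floxels, a new method that combines the two. On the Argoverse 2 benchmark, Floxels is surpassed only by EulerFlow among unsupervised methods while reaching comparable performance at a fraction of the computational cost: it runs roughly 60-140x faster than EulerFlow, and about 14x faster than NSFP, a faster but lower-quality baseline.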

Link: https://arxiv.org/abs/2503.04718
Authors: David T. Hoffmann, Syed Haseeb Raza, Hanqiu Jiang, Denis Tananaev, Steffen Klingenhoefer, Martin Meinke
Affiliations: Robert Bosch GmbH; University of Freiburg; CARIAD SE
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted at CVPR 2025

Abstract:Scene flow estimation is a foundational task for many robotic applications, including robust dynamic object detection, automatic labeling, and sensor synchronization. Two types of approaches to the problem have evolved: 1) supervised and 2) optimization-based methods. Supervised methods are fast during inference and achieve high-quality results; however, they are limited by the need for large amounts of labeled training data and are susceptible to domain gaps. In contrast, unsupervised test-time optimization methods do not face the problem of domain gaps but usually suffer from substantial runtime, exhibit artifacts, or fail to converge to the right solution. In this work, we mitigate several limitations of existing optimization-based methods. To this end, we 1) introduce a simple voxel grid-based model that improves over the standard MLP-based formulation in multiple dimensions and 2) introduce a new multiframe loss formulation. 3) We combine both contributions in our new method, termed Floxels. On the Argoverse 2 benchmark, Floxels is surpassed only by EulerFlow among unsupervised methods while achieving comparable performance at a fraction of the computational cost. Floxels achieves a massive speedup of roughly 60-140x over EulerFlow, reducing the runtime from a day to 10 minutes per sequence. Over the faster but low-quality baseline, NSFP, Floxels achieves a speedup of ~14x.
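
To make the idea of a voxel grid-based flow model concrete, here is a minimal sketch in which flow vectors are stored per voxel and each point reads the vector of the voxel it falls into. The sparse dictionary grid, the nearest-voxel lookup without interpolation, and the default origin and voxel size are all assumptions made for illustration; the abstract does not describe Floxels at this level of detail.

```python
import math


def voxel_index(point, origin, voxel_size):
    """Integer (i, j, k) index of the voxel containing a 3D point."""
    return tuple(math.floor((p - o) / voxel_size) for p, o in zip(point, origin))


def lookup_flow(point, flow_grid, origin=(0.0, 0.0, 0.0), voxel_size=1.0):
    """Read the flow vector stored for the point's voxel (zero flow if empty)."""
    return flow_grid.get(voxel_index(point, origin, voxel_size), (0.0, 0.0, 0.0))


def warp(points, flow_grid, origin=(0.0, 0.0, 0.0), voxel_size=1.0):
    """Move every point by the flow of its voxel."""
    return [
        tuple(p + f for p, f in zip(point, lookup_flow(point, flow_grid, origin, voxel_size)))
        for point in points
    ]
```

In a test-time optimization setting the stored vectors would be the free parameters that are fitted to each sequence, in place of the weights of an MLP.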

[CV-2] Iris Style Transfer: Enhancing Iris Recognition with Style Features and Privacy Preservation through Neural Style Transfer

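Quick read: This paper addresses the limited robustness of iris authentication and identification methods to geometric transformations such as rotation and perspective changes, together with growing security and privacy concerns about iris attacks. The key idea is to use neural style transfer to separate and extract the style features of iris texture, and to show that these style features provide a more reliable basis for recognition and are more robust to geometric transformations than traditional features. The paper further proposes using neural style transfer to mask identifiable iris style features, protecting sensitive biometric information while preserving the utility of eye images for tasks such as eye segmentation and gaze estimation. This opens new avenues for iris-oriented, secure, and privacy-aware biometric systems.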

Link: https://arxiv.org/abs/2503.04707
Authors: Mengdi Wang, Efe Bozkir, Enkelejda Kasneci
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 14 pages main paper, 4 pages appendix

Abstract:Iris texture is widely regarded as a gold standard biometric modality for authentication and identification. The demand for robust iris recognition methods, coupled with growing security and privacy concerns regarding iris attacks, has escalated recently. Inspired by neural style transfer, an advanced technique that leverages neural networks to separate content and style features, we hypothesize that iris texture’s style features provide a reliable foundation for recognition and are more resilient to variations like rotation and perspective shifts than traditional approaches. Our experimental results support this hypothesis, showing a significantly higher classification accuracy compared to conventional features. Further, we propose using neural style transfer to mask identifiable iris style features, ensuring the protection of sensitive biometric information while maintaining the utility of eye images for tasks like eye segmentation and gaze estimation. This work opens new avenues for iris-oriented, secure, and privacy-aware biometric systems.
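
In classical neural style transfer, style is usually summarized by Gram matrices of feature maps. The sketch below assumes that representation purely for illustration (the abstract does not say which style features the authors extract) and uses tiny hand-written feature maps instead of real network activations.

```python
def gram_matrix(features):
    """Channel-by-channel correlations of flattened feature maps.

    `features` is a list of channels, each a flat list of activations over
    the same spatial positions.
    """
    positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(channel_i, channel_j)) / positions for channel_j in features]
        for channel_i in features
    ]


def reorder_positions(features, order):
    """Apply the same spatial reordering to every channel."""
    return [[channel[k] for k in order] for channel in features]
```

Because each entry sums over all positions, reordering the positions leaves the Gram matrix unchanged. That gives one intuition for why such statistics depend less on spatial layout than raw texture patterns do, although a real image rotation does not act as an exact permutation of convolutional features.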

[CV-3] DEAL-YOLO: Drone-based Efficient Animal Localization using YOLO ICLR2025

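Quick read: This paper targets the high cost and low accuracy of wildlife detection, especially small-animal detection, under complex and erratic environmental conditions. It proposes DEAL-YOLO, whose main ingredients are: (1) multi-objective loss functions such as Wise IoU (WIoU) and Normalized Wasserstein Distance (NWD), which prioritize pixels near the centre of the bounding box for smoother localization with fewer abrupt deviations; (2) Linear Deformable (LD) convolutions for efficient feature extraction, improving accuracy while keeping computation low; and (3) a Scaled Sequence Feature Fusion (SSFF) module that captures inter-scale relationships, strengthens feature representation, and improves metrics through optimized multi-scale fusion. DEAL-YOLO also uses a two-stage inference paradigm that refines selected regions to improve localization and confidence for small objects. Compared with vanilla Yolov8-N, the model uses up to 69.5% fewer parameters.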

Link: https://arxiv.org/abs/2503.04698
Authors: Aditya Prashant Naidu, Hem Gosalia, Ishaan Gakhar, Shaurya Singh Rathore, Krish Didwania, Ujjwal Verma
Affiliations: Manipal Institute of Technology; Manipal Academy of Higher Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted as a Poster at the ML4RS Workshop at ICLR 2025

Abstract:Although advances in deep learning and aerial surveillance technology are improving wildlife conservation efforts, complex and erratic environmental conditions still pose a problem, requiring innovative solutions for cost-effective small animal detection. This work introduces DEAL-YOLO, a novel approach that improves small object detection in Unmanned Aerial Vehicle (UAV) images by using multi-objective loss functions like Wise IoU (WIoU) and Normalized Wasserstein Distance (NWD), which prioritize pixels near the centre of the bounding box, ensuring smoother localization and reducing abrupt deviations. Additionally, the model is optimized through efficient feature extraction with Linear Deformable (LD) convolutions, enhancing accuracy while maintaining computational efficiency. The Scaled Sequence Feature Fusion (SSFF) module enhances object detection by effectively capturing inter-scale relationships, improving feature representation, and boosting metrics through optimized multiscale fusion. Comparison with baseline models reveals high efficacy with up to 69.5% fewer parameters compared to vanilla Yolov8-N, highlighting the robustness of the proposed modifications. Through this approach, our paper aims to facilitate the detection of endangered species, animal population analysis, habitat monitoring, biodiversity research, and various other applications that enrich wildlife conservation efforts. DEAL-YOLO employs a two-stage inference paradigm for object detection, refining selected regions to improve localization and confidence. This approach enhances performance, especially for small instances with low objectness scores.
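
The Normalized Wasserstein Distance mentioned above has a simple closed form when each box is modelled as a 2D Gaussian, which is how the metric is usually defined for tiny-object detection. The sketch below follows that standard definition; the normalizer `c` is dataset-dependent and its default here is only a placeholder, and how DEAL-YOLO weights this term against WIoU is not stated in the abstract.

```python
import math


def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between two (cx, cy, w, h) boxes.

    Each box is treated as a Gaussian with mean (cx, cy) and standard
    deviations (w / 2, h / 2), so pixels near the centre carry the most weight.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_squared = (
        (cxa - cxb) ** 2
        + (cya - cyb) ** 2
        + ((wa - wb) / 2) ** 2
        + ((ha - hb) / 2) ** 2
    )
    return math.exp(-math.sqrt(w2_squared) / c)


def nwd_loss(pred_box, target_box, c=12.8):
    """Loss form: zero for identical boxes, approaching one as they diverge."""
    return 1.0 - nwd(pred_box, target_box, c)
```

Unlike IoU, this similarity stays smooth and non-zero even when two small boxes do not overlap at all, which is the property that makes it attractive for small objects.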

[CV-4] Teach YOLO to Remember: A Self-Distillation Approach for Continual Object Detection

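Quick read: This paper addresses catastrophic forgetting in continual learning (CL) when one-stage anchor-free object detectors such as YOLO are trained on class incremental learning (CIL) tasks. Existing work suggests that standard Learning without Forgetting (LwF) may be ineffective for one-stage detectors because noisy regression outputs risk transferring corrupted knowledge. In response, the paper proposes YOLO LwF, a self-distillation approach tailored to YOLO; when coupled with a replay memory, it significantly mitigates forgetting. Compared with previous approaches, it improves mean average precision (mAP) by +2.1% on VOC and +2.9% on COCO, reaching state-of-the-art performance.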

Link: https://arxiv.org/abs/2503.04688
Authors: Riccardo De Monte, Davide Dalle Pezze, Gian Antonio Susto
Affiliations: University of Padova, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-time object detectors like YOLO achieve exceptional performance when trained on large datasets for multiple epochs. However, in real-world scenarios where data arrives incrementally, neural networks suffer from catastrophic forgetting, leading to a loss of previously learned knowledge. To address this, prior research has explored strategies for Class Incremental Learning (CIL) in Continual Learning for Object Detection (CLOD), with most approaches focusing on two-stage object detectors. However, existing work suggests that Learning without Forgetting (LwF) may be ineffective for one-stage anchor-free detectors like YOLO due to noisy regression outputs, which risk transferring corrupted knowledge. In this work, we introduce YOLO LwF, a self-distillation approach tailored for YOLO-based continual object detection. We demonstrate that when coupled with a replay memory, YOLO LwF significantly mitigates forgetting. Compared to previous approaches, it achieves state-of-the-art performance, improving mAP by +2.1% and +2.9% on the VOC and COCO benchmarks, respectively.
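
A generic Learning-without-Forgetting objective adds a distillation term that keeps the new model's outputs close to those of the frozen previous model. The sketch below shows only that generic form, with a mean-squared distillation term and a hypothetical `weight` parameter; how YOLO LwF handles the noisy regression outputs is not described in the abstract and is not reproduced here.

```python
def distillation_loss(student_outputs, teacher_outputs):
    """Mean squared difference between new-model and old-model outputs."""
    return sum(
        (s - t) ** 2 for s, t in zip(student_outputs, teacher_outputs)
    ) / len(student_outputs)


def total_loss(task_loss, student_outputs, teacher_outputs, weight=1.0):
    """Loss on the new classes plus a penalty for drifting from the old model."""
    return task_loss + weight * distillation_loss(student_outputs, teacher_outputs)
```

With a replay memory, the same combined loss would be evaluated on batches that mix new-task images with a small stored set of earlier images.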

[CV-5] What Are You Doing? A Closer Look at Controllable Human Video Generation

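Quick read: This paper addresses the lack of a high-quality benchmark for comprehensively evaluating human actions and interactions in video generation. Existing datasets such as TikTok and TED-Talks lack the diversity and complexity needed to reflect the capabilities of video generation models. The paper introduces 'What Are You Doing?' (WYD), a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans, consisting of 1,544 captioned videos annotated with 56 fine-grained categories. With this dataset and the accompanying automatic metrics, the authors systematically measure multiple aspects of human generation and analyze seven state-of-the-art controllable image-to-video models in depth. The data and code are released to support further research.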

Link: https://arxiv.org/abs/2503.04666
Authors: Emanuele Bugliarello, Anurag Arnab, Roni Paiss, Pieter-Jan Kindermans, Cordelia Schmid
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human generation. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing 'What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose and validate automatic metrics that leverage our annotations and better capture human evaluations. Equipped with our dataset and metrics, we perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation, showing how WYD provides novel insights about the capabilities of these models. We release our data and code to drive forward progress in human video generation modeling at this https URL.

[CV-6] Implicit Neural Representation for Video and Image Super-Resolution

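Quick read: This paper addresses super-resolution of low-resolution videos and images, aiming to produce high-quality high-resolution output efficiently from low-resolution inputs alone. It uses an implicit neural representation (INR): a neural network implicitly encodes spatial and temporal features, and reconstruction is performed from the low-resolution inputs and a 3D high-resolution grid. The method, SR-INR, does not rely on computationally intensive optical flow or motion estimation, yet keeps details consistent across frames and images and achieves strong temporal stability. Its simple structure delivers results on par with or better than existing super-resolution methods at lower computational cost.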

Link: https://arxiv.org/abs/2503.04665
Authors: Mary Aiyetigbo, Wanqi Yuan, Feng Luo, Nianyi Li
Affiliations: Clemson University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a novel approach for super-resolution that utilizes implicit neural representation (INR) to effectively reconstruct and enhance low-resolution videos and images. By leveraging the capacity of neural networks to implicitly encode spatial and temporal features, our method facilitates high-resolution reconstruction using only low-resolution inputs and a 3D high-resolution grid. This results in an efficient solution for both image and video super-resolution. Our proposed method, SR-INR, maintains consistent details across frames and images, achieving impressive temporal stability without relying on the computationally intensive optical flow or motion estimation typically used in other video super-resolution techniques. The simplicity of our approach contrasts with the complexity of many existing methods, making it both effective and efficient. Experimental evaluations show that SR-INR delivers results on par with or superior to state-of-the-art super-resolution methods, while maintaining a more straightforward structure and reduced computational demands. These findings highlight the potential of implicit neural representations as a powerful tool for reconstructing high-quality, temporally consistent video and image signals from low-resolution data.
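
An implicit neural representation treats a video as a continuous function of normalized coordinates (x, y, t), so a higher-resolution output is obtained simply by querying that function on a denser grid. The sketch below illustrates only this querying step; `toy_inr` is a hand-written stand-in for a trained network, and none of the names come from the SR-INR code.

```python
def coordinate_grid(width, height, frames):
    """Normalized (x, y, t) coordinates in [0, 1] for every pixel of every frame."""
    def axis(count):
        return [i / (count - 1) if count > 1 else 0.0 for i in range(count)]

    return [(x, y, t) for t in axis(frames) for y in axis(height) for x in axis(width)]


def render(inr, width, height, frames):
    """Evaluate the representation at every coordinate of the requested grid."""
    return [inr(x, y, t) for x, y, t in coordinate_grid(width, height, frames)]


def toy_inr(x, y, t):
    """Stand-in for a trained network: a smooth function of the coordinates."""
    return 0.5 * x + 0.25 * y + 0.25 * t
```

Calling `render(toy_inr, 2, 2, 1)` and `render(toy_inr, 4, 4, 1)` samples the same underlying signal at two resolutions, which is the sense in which the representation is resolution-independent.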

[CV-7] RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining

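Quick read: This paper addresses two main challenges in building medical image retrieval systems: the definition of "similar images" varies across medical contexts, and large-scale, high-quality retrieval datasets and benchmarks are lacking. It proposes a methodology that uses dense radiology reports to define image-wise similarity orderings at multiple granularities in a scalable, fully automatic way. With this approach the authors construct two retrieval datasets, MIMIC-IR for chest X-rays and CTRATE-IR for CT scans, with detailed image-image ranking annotations conditioned on diverse anatomical structures. They also develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which perform strongly on traditional image-image and image-report retrieval, support flexible retrieval conditioned on anatomical structures described in text, and achieve state-of-the-art results on 77 of 78 metrics.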

Link: https://arxiv.org/abs/2503.04653
Authors: Tengfei Zhang, Ziheng Zhao, Chaoyi Wu, Xiao Zhou, Ya Zhang, Yangfeng Wang, Weidi Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
Comments:

Abstract:Developing advanced medical imaging retrieval systems is challenging due to the varying definitions of 'similar images' across different medical contexts. This challenge is compounded by the lack of large-scale, high-quality medical imaging retrieval datasets and benchmarks. In this paper, we propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities in a scalable and fully automatic manner. Using this approach, we construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans, providing detailed image-image ranking annotations conditioned on diverse anatomical structures. Furthermore, we develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks. These systems also enable flexible, effective image retrieval conditioned on specific anatomical structures described in text, achieving state-of-the-art results on 77 out of 78 metrics.

[CV-8] Transferable Foundation Models for Geometric Tasks on Point Cloud Representations: Geometric Neural Operators

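Quick read: This paper develops pretrained Geometric Neural Operators (GNPs) that serve as foundation models for extracting geometric features. The central question is how GNPs can learn robust latent representations of point-cloud geometry in order to estimate metric, curvature, and other shape-related features for downstream geometric tasks. The pretrained GNPs can estimate geometric properties of surfaces of arbitrary shape and topology robustly in the presence of noise, approximate solutions of geometric partial differential equations (PDEs) on manifolds, and solve shape-deformation problems such as curvature-driven flows. The authors also release a package of code and weights so that the pretrained GNPs can be reused as components in existing and new data-processing pipelines.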

Link: https://arxiv.org/abs/2503.04649
Authors: Blaine Quackenbush, Paul J. Atzberger
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Comments:

Abstract:We introduce methods for obtaining pretrained Geometric Neural Operators (GNPs) that can serve as basal foundation models for use in obtaining geometric features. These can be used within data processing pipelines for machine learning tasks and numerical methods. We show how our GNPs can be trained to learn robust latent representations for the differential geometry of point-clouds to provide estimates of metric, curvature, and other shape-related features. We demonstrate how our pre-trained GNPs can be used (i) to estimate the geometric properties of surfaces of arbitrary shape and topologies with robustness in the presence of noise, (ii) to approximate solutions of geometric partial differential equations (PDEs) on manifolds, and (iii) to solve equations for shape deformations such as curvature driven flows. We also release a package of the codes and weights for using our pre-trained GNPs for processing point cloud representations. This allows for incorporating our pre-trained GNPs as components for reuse within existing and new data processing pipelines. The GNPs also can be used as part of numerical solvers involving geometry or as part of methods for performing inference and other geometric tasks.

[CV-9] Simulating the Real World: A Unified Survey of Multimodal Generative Models

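Quick read: This survey addresses a central challenge in artificial general intelligence (AGI) research: understanding and replicating the real world. Existing approaches such as world models aim to capture the fundamental principles of the physical world for more accurate simulation and meaningful interaction, but they usually treat different modalities (2D images, videos, 3D, and 4D representations) as independent domains, overlook their interdependencies, and do not systematically integrate the connections between dimensions of reality. The paper presents a unified survey of multimodal generative models that follows the progression of data dimensionality: from 2D generation (appearance), to video generation (appearance + dynamics) and 3D generation (appearance + geometry), and finally to 4D generation, which integrates all dimensions. To the authors' knowledge, it is the first attempt to unify the study of 2D, video, 3D, and 4D generation in a single framework. It also reviews datasets, evaluation metrics, and future directions, and offers insights for newcomers.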

Link: https://arxiv.org/abs/2503.04641
Authors: Yuqi Hu, Longguang Wang, Xian Liu, Ling-Hao Chen, Yuwei Guo, Yukai Shi, Ce Liu, Anyi Rao, Zeyu Wang, Hui Xiong
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Shenzhen Campus of Sun Yat-sen University; The Chinese University of Hong Kong; Tsinghua University; Shanghai Academy of AI for Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Repository for the related papers at this https URL

Abstract:Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this paper, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics and future directions, and offer insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.

[CV-10] Enhancing SAM with Efficient Prompting and Preference Optimization for Semi-supervised Medical Image Segmentation CVPR2025

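Quick read: This paper addresses the heavy reliance of foundation models such as the Segment Anything Model (SAM) on large annotated datasets or expert-supplied prompts in medical image segmentation. Conventional remedies such as active learning are limited in scope and still require continuous human involvement and complex domain knowledge for label refinement or for establishing reward ground truth. The paper proposes an enhanced SAM framework that uses annotation-efficient prompts generated in a fully unsupervised way, while still capturing semantic, location, and shape information through contrastive language-image pretraining and visual question answering. It then applies direct preference optimization to learn a policy that produces high-fidelity segmentations from simple ratings or rankings provided by a virtual annotator. Experiments on lung segmentation, breast tumor segmentation, and organ segmentation across modalities including X-ray, ultrasound, and abdominal CT show state-of-the-art performance in low-annotation settings.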

Link: https://arxiv.org/abs/2503.04639
Authors: Aishik Konwer, Zhijian Yang, Erhan Bas, Cao Xiao, Prateek Prasanna, Parminder Bhatia, Taha Kass-Hout
Affiliations: Stony Brook University; GE Healthcare
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025

Abstract:Foundational models such as the Segment Anything Model (SAM) are gaining traction in medical imaging segmentation, supporting multiple downstream tasks. However, such models are supervised in nature, still relying on large annotated datasets or prompts supplied by experts. Conventional techniques such as active learning to alleviate such limitations are limited in scope and still necessitate continuous human involvement and complex domain knowledge for label refinement or establishing reward ground truth. To address these challenges, we propose an enhanced Segment Anything Model (SAM) framework that utilizes annotation-efficient prompts generated in a fully unsupervised fashion, while still capturing essential semantic, location, and shape information through contrastive language-image pretraining and visual question answering. We adopt the direct preference optimization technique to design an optimal policy that enables the model to generate high-fidelity segmentations with simple ratings or rankings provided by a virtual annotator simulating the human annotation process. State-of-the-art performance of our framework in tasks such as lung segmentation, breast tumor segmentation, and organ segmentation across various modalities, including X-ray, ultrasound, and abdominal CT, justifies its effectiveness in low-annotation data scenarios.
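
Direct preference optimization trains a policy from pairs in which one output is preferred over another, using a frozen reference model as an anchor. The sketch below shows the standard per-pair DPO loss; the log-probability inputs are hypothetical placeholders, since the abstract does not say how candidate segmentations are scored or how the virtual annotator's ratings are turned into pairs.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    The margin measures how much more the policy favours the preferred output
    than the frozen reference model does.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))
```

When the policy matches the reference model the loss equals log 2, and it decreases as the policy shifts probability toward the preferred output.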

[CV-11] 3HANDS Dataset: Learning from Humans for Generating Naturalistic Handovers with Supernumerary Robotic Limbs

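Quick read: This paper addresses how supernumerary robotic limbs (SRLs) can hand objects over to humans naturally and effectively. Designing heuristic-based policies is time-consuming, hard to generalize across tasks, and yields less human-like motion, so the paper turns to generative models for producing more naturalistic handover trajectories. The key contribution is 3HANDS, a new dataset of object handover interactions between a participant performing a daily activity and another participant enacting a hip-mounted SRL in a naturalistic manner. Based on this dataset, the authors build three models: one that generates naturalistic handover trajectories, one that determines appropriate handover endpoints, and one that predicts when to initiate a handover. In a user study, the resulting handovers were perceived as more natural, less physically demanding, and more comfortable than a baseline.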

Link: https://arxiv.org/abs/2503.04635
Authors: Artin Saberpour Abadian, Yi-Chi Liao, Ata Otaran, Rishabh Dabral, Marie Muehlhaus, Christian Theobalt, Martin Schmitz, Jürgen Steimle
Affiliations: Saarland University, Saarland Informatics Campus, Saarbrücken, Germany; ETH Zürich, Zürich, Switzerland; Max Planck Institute for Informatics, Saarbrücken, Germany
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: CHI '25

Abstract:Supernumerary robotic limbs (SRLs) are robotic structures integrated closely with the user’s body, which augment human physical capabilities and necessitate seamless, naturalistic human-machine interaction. For effective assistance in physical tasks, enabling SRLs to hand over objects to humans is crucial. Yet, designing heuristic-based policies for robots is time-consuming, difficult to generalize across tasks, and results in less human-like motion. When trained with proper datasets, generative models are powerful alternatives for creating naturalistic handover motions. We introduce 3HANDS, a novel dataset of object handover interactions between a participant performing a daily activity and another participant enacting a hip-mounted SRL in a naturalistic manner. 3HANDS captures the unique characteristics of SRL interactions: operating in intimate personal space with asymmetric object origins, implicit motion synchronization, and the user’s engagement in a primary task during the handover. To demonstrate the effectiveness of our dataset, we present three models: one that generates naturalistic handover trajectories, another that determines the appropriate handover endpoints, and a third that predicts the moment to initiate a handover. In a user study (N=10), we compare the handover interaction performed with our method against a baseline. The findings show that our method was perceived as significantly more natural, less physically demanding, and more comfortable.

[CV-12] PathoPainter: Augmenting Histopathology Segmentation via Tumor-aware Inpainting

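Quick read: This paper addresses tumor segmentation in histopathology image analysis, where annotation is costly and fine-grained image-mask pairs are scarce. Existing data synthesis methods produce image-mask pairs with limited accuracy and diversity, which hurts segmentation training, particularly on small datasets and complex histopathology images. The proposed method, PathoPainter, recasts image-mask pair generation as a tumor inpainting task: it preserves the background while inpainting the tumor region, which keeps the generated image precisely aligned with its mask. A sampling mechanism that conditions inpainting on regional embeddings from a different image, together with a filtering strategy that excludes uncertain synthetic regions, further improves the quality and diversity of the synthetic data. Experiments show that data generated this way significantly improves segmentation performance and outperforms existing synthesis methods on CAMELYON16.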

Link: https://arxiv.org/abs/2503.04634
Authors: Hong Liu,Haosen Yang,Evi M.C. Huijben,Mark Schuiveling,Ruisheng Su,Josien P.W. Pluim,Mitko Veta
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures

Abstract:Tumor segmentation plays a critical role in histopathology, but it requires costly, fine-grained image-mask pairs annotated by pathologists. Thus, synthesizing histopathology data to expand the dataset is highly desirable. Previous works suffer from inaccuracies and limited diversity in image-mask pairs, both of which affect training segmentation, particularly in small-scale datasets and the inherently complex nature of histopathology images. To address this challenge, we propose PathoPainter, which reformulates image-mask pair generation as a tumor inpainting task. Specifically, our approach preserves the background while inpainting the tumor region, ensuring precise alignment between the generated image and its corresponding mask. To enhance dataset diversity while maintaining biological plausibility, we incorporate a sampling mechanism that conditions tumor inpainting on regional embeddings from a different image. Additionally, we introduce a filtering strategy to exclude uncertain synthetic regions, further improving the quality of the generated data. Our comprehensive evaluation spans multiple datasets featuring diverse tumor types and various training data scales. As a result, segmentation improved significantly with our synthetic data, surpassing existing segmentation data synthesis approaches, e.g., 75.69% - 77.69% on CAMELYON16. The code is available at this https URL.

[CV-13] A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning

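Quick read: This paper addresses two key challenges in Remote Sensing Image Captioning (RSIC): the scarcity of non-English caption datasets and the lack of multilingual capability evaluation for models. These limitations impede the progress and practical deployment of RSIC, particularly in the era of large vision-language models (VLMs). The paper introduces BRSIC (Bilingual Remote Sensing Image Captioning), a bilingual dataset that extends three existing English RSIC datasets with Chinese descriptions and covers 13,634 images with 68,170 bilingual captions. On top of this dataset it builds a systematic evaluation framework that resolves inconsistencies in existing evaluation protocols and assesses models rigorously through standardized retraining. It also reports an extensive empirical study of eight state-of-the-art large vision-language models (LVLMs) under zero-shot inference, supervised fine-tuning, and multilingual training. The evaluation exposes the strengths and limitations of current LVLMs on multilingual remote sensing tasks, and cross-dataset transfer experiments yield further findings.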

Link: https://arxiv.org/abs/2503.04592
Authors: Qing Zhou,Tao Yang,Junyu Gao,Weiping Ni,Junzheng Wu,Qi Wang
Affiliations: School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University; Department of Remote Sensing, Northwest Institute of Nuclear Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote Sensing Image Captioning (RSIC) is a cross-modal field bridging vision and language, aimed at automatically generating natural language descriptions of features and scenes in remote sensing imagery. Despite significant advances in developing sophisticated methods and large-scale datasets for training vision-language models (VLMs), two critical challenges persist: the scarcity of non-English descriptive datasets and the lack of multilingual capability evaluation for models. These limitations fundamentally impede the progress and practical deployment of RSIC, particularly in the era of large VLMs. To address these challenges, this paper presents several significant contributions to the field. First, we introduce and analyze BRSIC (Bilingual Remote Sensing Image Captioning), a comprehensive bilingual dataset that enriches three established English RSIC datasets with Chinese descriptions, encompassing 13,634 images paired with 68,170 bilingual captions. Building upon this foundation, we develop a systematic evaluation framework that addresses the prevalent inconsistency in evaluation protocols, enabling rigorous assessment of model performance through standardized retraining procedures on BRSIC. Furthermore, we present an extensive empirical study of eight state-of-the-art large vision-language models (LVLMs), examining their capabilities across multiple paradigms including zero-shot inference, supervised fine-tuning, and multi-lingual training. This comprehensive evaluation provides crucial insights into the strengths and limitations of current LVLMs in handling multilingual remote sensing tasks. Additionally, our cross-dataset transfer experiments reveal interesting findings. The code and data will be available at this https URL.

[CV-14] Omnidirectional Multi-Object Tracking CVPR2025

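Quick read: This paper addresses the challenges of using panoramic images for Multi-Object Tracking (MOT). Conventional MOT algorithms are designed for pinhole images with a limited field of view and adapt poorly to the wide field of view of panoramic images. Panoramic distortions such as resolution loss, geometric deformation, and uneven lighting further hinder direct transfer of existing methods and cause significant performance degradation. The proposed OmniTrack framework combines Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. To compensate for the lack of panoramic MOT datasets, the authors also build QuadTrack, a dataset collected by a quadruped robot that covers diverse, challenging scenes. OmniTrack reaches a HOTA score of 26.92% on JRDB, an improvement of 3.43%, and 23.45% on QuadTrack, surpassing the baseline by 6.81%.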

Link: https://arxiv.org/abs/2503.04565
Authors: Kai Luo,Hao Shi,Sheng Wu,Fei Teng,Mengfei Duan,Chang Huang,Yuhang Wang,Kaiwei Wang,Kailun Yang
Affiliations: Hunan University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: Accepted to CVPR 2025. The dataset and code will be made publicly available at this https URL

Abstract:Panoramic imagery, with its 360° field of view, offers comprehensive information to support Multi-Object Tracking (MOT) in capturing spatial and temporal relationships of surrounding objects. However, most MOT algorithms are tailored for pinhole images with limited views, impairing their effectiveness in panoramic settings. Additionally, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder direct adaptation of existing MOT methods, leading to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. This integration enables tracking in large field-of-view scenarios, even under rapid sensor motion. To mitigate the lack of panoramic MOT datasets, we introduce the QuadTrack dataset–a comprehensive panoramic dataset collected by a quadruped robot, featuring diverse challenges such as wide fields of view, intense motion, and complex environments. Extensive experiments on the public JRDB dataset and the newly introduced QuadTrack benchmark demonstrate the state-of-the-art performance of the proposed framework. OmniTrack achieves a HOTA score of 26.92% on JRDB, representing an improvement of 3.43%, and further achieves 23.45% on QuadTrack, surpassing the baseline by 6.81%. The dataset and code will be made publicly available at this https URL.

[CV-15] ViT-VS: On the Applicability of Pretrained Vision Transformer Features for Generalizable Visual Servoing

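Quick read: This paper addresses the limited robustness of classical visual servoing under occlusions and environmental variations, as well as the need of learning-based methods for large amounts of task- or object-specific training data. It proposes a visual servoing approach that uses pretrained vision transformers for semantic feature extraction, combining the general applicability of classical methods with the robustness of learning-based ones and generalizing without task- or object-specific training. The approach achieves full convergence in unperturbed scenarios, surpasses classical image-based visual servoing by up to 31.2% relative improvement in perturbed scenarios, and matches the convergence rates of learning-based methods.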

Link: https://arxiv.org/abs/2503.04545
Authors: Alessandro Scherl,Stefan Thalhammer,Bernhard Neuberger,Wilfried Wöber,José Gracía-Rodríguez
Affiliations: Department of Computer Technology, University of Alicante; Industrial Engineering Department, UAS Technikum Vienna
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual servoing enables robots to precisely position their end-effector relative to a target object. While classical methods rely on hand-crafted features and thus are universally applicable without task-specific training, they often struggle with occlusions and environmental variations, whereas learning-based approaches improve robustness but typically require extensive training. We present a visual servoing approach that leverages pretrained vision transformers for semantic feature extraction, combining the advantages of both paradigms while also being able to generalize beyond the provided sample. Our approach achieves full convergence in unperturbed scenarios and surpasses classical image-based visual servoing by up to 31.2% relative improvement in perturbed scenarios. Even the convergence rates of learning-based methods are matched despite requiring no task- or object-specific training. Real-world evaluations confirm robust performance in end-effector positioning, industrial box manipulation, and grasping of unseen objects using only a reference from the same category. Our code and simulation environment are available at: this https URL

[CV-16] In-Context Reverse Classification Accuracy: Efficient Estimation of Segmentation Quality without Ground-Truth

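Quick read: This paper addresses the difficulty of assessing automatic image segmentation quality in clinical practice, which stems mainly from the scarcity of ground-truth annotations. It proposes In-Context Reverse Classification Accuracy (In-Context RCA), a framework that estimates segmentation quality automatically when no ground truth is available. The key idea is to use recent in-context learning segmentation models together with retrieval augmentation to select the most relevant reference images, which enables efficient quality estimation from only a small amount of reference data. The method is robust and computationally efficient across several medical imaging modalities, making it a useful tool for automated quality control in clinical workflows. The code is publicly available.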

Link: https://arxiv.org/abs/2503.04522
Authors: Matias Cosarinsky,Ramiro Billot,Lucas Mansilla,Gabriel Gimenez,Nicolas Gaggión,Guanghui Fu,Enzo Ferrante
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Assessing the quality of automatic image segmentation is crucial in clinical practice, but often very challenging due to the limited availability of ground truth annotations. In this paper, we introduce In-Context Reverse Classification Accuracy (In-Context RCA), a novel framework for automatically estimating segmentation quality in the absence of ground-truth annotations. By leveraging recent in-context learning segmentation models and incorporating retrieval-augmentation techniques to select the most relevant reference images, our approach enables efficient quality estimation with minimal reference data. Validated across diverse medical imaging modalities, our method demonstrates robust performance and computational efficiency, offering a promising solution for automated quality control in clinical workflows, where fast and reliable segmentation assessment is essential. The code is available at this https URL.

[CV-17] A Novel Solution for Drone Photogrammetry with Low-overlap Aerial Images using Monocular Depth Estimation

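Quick read: This paper addresses the limitations of traditional photogrammetric methods on low-overlap aerial imagery, since these methods rely heavily on high image overlap to produce accurate and complete mapping products. It proposes a new workflow based on monocular depth estimation: tie points obtained from aerial triangulation are used to relate monocular depth to metric depth, the original depth map is converted into a metric depth map, and the resulting dense depth information supports comprehensive reconstruction of the scene.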

Link: https://arxiv.org/abs/2503.04513
Authors: Jiageng Zhong,Qi Zhou,Ming Li,Armin Gruen,Xuan Liao
Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University; School of Remote Sensing Information Engineering, Wuhan University; Institute of Geodesy and Photogrammetry, ETH Zurich; Department of Land Surveying and Geo-Informatics, The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-overlap aerial imagery poses significant challenges to traditional photogrammetric methods, which rely heavily on high image overlap to produce accurate and complete mapping products. In this study, we propose a novel workflow based on monocular depth estimation to address the limitations of conventional techniques. Our method leverages tie points obtained from aerial triangulation to establish a relationship between monocular depth and metric depth, thus transforming the original depth map into a metric depth map, enabling the generation of dense depth information and the comprehensive reconstruction of the scene. For the experiments, a high-overlap drone dataset containing 296 images is processed using Metashape to generate depth maps and DSMs as ground truth. Subsequently, we create a low-overlap dataset by selecting 20 images for experimental evaluation. Results demonstrate that while the recovered depth maps and resulting DSMs achieve meter-level accuracy, they provide significantly better completeness compared to traditional methods, particularly in regions covered by single images. This study showcases the potential of monocular depth estimation in low-overlap aerial photogrammetry.
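
The tie-point calibration described in this abstract can be illustrated with a minimal sketch. It assumes that relative (monocular) depth and metric depth are related by a single scale and shift fitted by ordinary least squares. The abstract does not state the functional form, so that choice is an assumption, and the names `fit_scale_shift` and `to_metric` as well as the sample values are hypothetical.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift at tie points."""
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two matched tie points")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("tie points must span more than one relative depth")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def to_metric(depth_map, scale, shift):
    """Apply the fitted scale and shift to every pixel of a relative depth map."""
    return [[scale * d + shift for d in row] for row in depth_map]


# Illustrative values only: relative depths at three tie points and the
# metric depths that aerial triangulation would supply for the same points.
tie_relative = [0.2, 0.5, 0.9]
tie_metric = [24.0, 30.0, 38.0]
scale, shift = fit_scale_shift(tie_relative, tie_metric)
metric_map = to_metric([[0.2, 0.5], [0.9, 0.7]], scale, shift)
```

A global scale and shift is the simplest possible model; a per-image or spatially varying fit would follow the same pattern with more parameters.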

[CV-18] AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM

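Quick read: This paper addresses the reliance of existing video anomaly detection (VAD) models on learned normal patterns, which makes them hard to apply to diverse environments. It proposes the customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD treats user-defined text as the abnormal event and detects the frames in a video that contain the specified event. AnyAnomaly implements this with context-aware visual question answering, without fine-tuning the large vision-language model. The authors construct C-VAD datasets to validate AnyAnomaly. The approach is also competitive on VAD benchmark datasets, achieving state-of-the-art results on UBnormal and outperforming other methods in generalization across all datasets.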

Link: https://arxiv.org/abs/2503.04504
Authors: Sunghyun Ahn,Youngwan Jo,Kijung Lee,Sein Kwon,Inpyo Hong,Sanghyun Park
Affiliations: Yonsei University, Seoul, Korea
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users should retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD considers user-defined text as an abnormal event and detects frames containing a specified event in a video. We effectively implemented AnyAnomaly using a context-aware visual question answering without fine-tuning the large vision language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive performance on VAD benchmark datasets, achieving state-of-the-art results on the UBnormal dataset and outperforming other methods in generalization across all datasets. Our code is available online at this http URL.

[CV-19] IMFine: 3D Inpainting via Geometry-guided Multi-view Refinement CVPR2025

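Quick read: This paper addresses a major limitation of current 3D inpainting and object removal methods: they are largely restricted to front-facing scenes and perform poorly on diverse unconstrained scenes, where camera orientation and trajectory are unrestricted. It proposes an approach that produces inpainted 3D scenes with consistent visual quality and coherent geometry in both front-facing and unconstrained scenes.

The solution has two key parts: (1) a robust 3D inpainting pipeline that incorporates geometric priors and a multi-view refinement network trained via test-time adaptation, built on a pre-trained image inpainting model; and (2) a new inpainting mask detection technique that derives targeted inpainting masks from object masks, which markedly improves handling of unconstrained scenes. Experiments on a benchmark spanning a wide range of scenes show that the method substantially outperforms existing state-of-the-art approaches.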

Link: https://arxiv.org/abs/2503.04501
Authors: Zhihao Shi,Dong Huo,Yuhongze Zhou,Kejia Yin,Yan Min,Juwei Lu,Xinxin Zuo
Affiliations: Huawei Canada Research Institute; University of Alberta; McMaster University; Concordia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025. Project page: this https URL

Abstract:Current 3D inpainting and object removal methods are largely limited to front-facing scenes, facing substantial challenges when applied to diverse, “unconstrained” scenes where the camera orientation and trajectory are unrestricted. To bridge this gap, we introduce a novel approach that produces inpainted 3D scenes with consistent visual quality and coherent underlying geometry across both front-facing and unconstrained scenes. Specifically, we propose a robust 3D inpainting pipeline that incorporates geometric priors and a multi-view refinement network trained via test-time adaptation, building on a pre-trained image inpainting model. Additionally, we develop a novel inpainting mask detection technique to derive targeted inpainting masks from object masks, boosting the performance in handling unconstrained scenes. To validate the efficacy of our approach, we create a challenging and diverse benchmark that spans a wide range of scenes. Comprehensive experiments demonstrate that our proposed method substantially outperforms existing state-of-the-art approaches.

[CV-20] ReynoldsFlow: Exquisite Flow Estimation via Reynolds Transport Theorem

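Quick read: This paper addresses the limited effectiveness of traditional optical flow methods in complex scenes, caused by restrictive assumptions such as brightness constancy and slow motion, and the heavy dependence of deep-learning approaches on large domain-specific datasets and computational resources. It also targets the nonlinear distortions and noise sensitivity introduced by visualizing optical flow in the HSV color space, which degrade motion representation accuracy and downstream model performance.

The key contribution is Reynolds flow, a training-free flow estimation method inspired by the Reynolds transport theorem that offers a principled way to model complex motion dynamics. Alongside the conventional HSV-based visualization, denoted ReynoldsFlow, the paper introduces an alternative representation, ReynoldsFlow+, designed to improve flow visualization. Experiments on three video benchmarks (UAVDB, Anti-UAV, and GolfDB) show that networks trained with ReynoldsFlow+ achieve state-of-the-art performance on tiny object detection, infrared object detection, and pose estimation, with improved robustness and efficiency.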

Link: https://arxiv.org/abs/2503.04500
Authors: Yu-Hsi Chen,Chin-Tien Wu
Affiliations: The University of Melbourne; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures, 3 tables

Abstract:Optical flow is a fundamental technique for motion estimation, widely applied in video stabilization, interpolation, and object tracking. Recent advancements in artificial intelligence (AI) have enabled deep learning models to leverage optical flow as an important feature for motion analysis. However, traditional optical flow methods rely on restrictive assumptions, such as brightness constancy and slow motion constraints, limiting their effectiveness in complex scenes. Deep learning-based approaches require extensive training on large domain-specific datasets, making them computationally demanding. Furthermore, optical flow is typically visualized in the HSV color space, which introduces nonlinear distortions when converted to RGB and is highly sensitive to noise, degrading motion representation accuracy. These limitations inherently constrain the performance of downstream models, potentially hindering object tracking and motion analysis tasks. To address these challenges, we propose Reynolds flow, a novel training-free flow estimation inspired by the Reynolds transport theorem, offering a principled approach to modeling complex motion dynamics. Beyond the conventional HSV-based visualization, denoted ReynoldsFlow, we introduce an alternative representation, ReynoldsFlow+, designed to improve flow visualization. We evaluate ReynoldsFlow and ReynoldsFlow+ across three video-based benchmarks: tiny object detection on UAVDB, infrared object detection on Anti-UAV, and pose estimation on GolfDB. Experimental results demonstrate that networks trained with ReynoldsFlow+ achieve state-of-the-art (SOTA) performance, exhibiting improved robustness and efficiency across all tasks.

[CV-21] Spatial regularisation for improved accuracy and interpretability in keypoint-based registration

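Quick read: This paper addresses a problem in registration methods based on unsupervised keypoint detection: the predicted feature maps are often spatially diffuse and hard to interpret, which undermines the purpose of keypoint-based registration. It proposes a three-fold loss that regularises the spatial distribution of the features. First, the KL divergence is used to model features as point spread functions, interpreted as probabilistic keypoints. Second, the spatial distributions of the features are sharpened to increase the precision of the detected landmarks. Third, a new repulsive loss across keypoints encourages spatial diversity. The loss considerably improves interpretability, so that features correspond to precise and anatomically meaningful landmarks. On foetal rigid motion tracking and brain MRI affine registration it outperforms state-of-the-art unsupervised strategies and narrows the gap with state-of-the-art supervised methods.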

Link: https://arxiv.org/abs/2503.04499
Authors: Benjamin Billot,Ramya Muthukrishnan,Esra Abaci-Turk,Ellen P. Grant,Nicholas Ayache,Hervé Delingette,Polina Golland
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review

Abstract:Unsupervised registration strategies bypass requirements in ground truth transforms or segmentations by optimising similarity metrics between fixed and moved volumes. Among these methods, a recent subclass of approaches based on unsupervised keypoint detection stand out as very promising for interpretability. Specifically, these methods train a network to predict feature maps for fixed and moving images, from which explainable centres of mass are computed to obtain point clouds, that are then aligned in closed-form. However, the features returned by the network often yield spatially diffuse patterns that are hard to interpret, thus undermining the purpose of keypoint-based registration. Here, we propose a three-fold loss to regularise the spatial distribution of the features. First, we use the KL divergence to model features as point spread functions that we interpret as probabilistic keypoints. Then, we sharpen the spatial distributions of these features to increase the precision of the detected landmarks. Finally, we introduce a new repulsive loss across keypoints to encourage spatial diversity. Overall, our loss considerably improves the interpretability of the features, which now correspond to precise and anatomically meaningful landmarks. We demonstrate our three-fold loss in foetal rigid motion tracking and brain MRI affine registration tasks, where it not only outperforms state-of-the-art unsupervised strategies, but also bridges the gap with state-of-the-art supervised methods. Our code is available at this https URL.
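
The centre-of-mass step mentioned in this abstract, and the reason diffuse feature maps are a problem, can be sketched as follows. This is an illustrative 2D toy rather than the authors' loss: `centre_of_mass` and `spatial_spread` are hypothetical helper names, and the spread measure is only a stand-in for whatever sharpness criterion the paper actually optimises.

```python
def centre_of_mass(feature_map):
    """Return the (row, col) centre of mass of a non-negative 2D map."""
    total = sum(sum(row) for row in feature_map)
    if total <= 0:
        raise ValueError("feature map must have positive mass")
    r = sum(i * v for i, row in enumerate(feature_map) for v in row) / total
    c = sum(j * v for row in feature_map for j, v in enumerate(row)) / total
    return r, c


def spatial_spread(feature_map):
    """Mass-weighted mean squared distance to the centre of mass."""
    total = sum(sum(row) for row in feature_map)
    r0, c0 = centre_of_mass(feature_map)
    return sum(
        v * ((i - r0) ** 2 + (j - c0) ** 2)
        for i, row in enumerate(feature_map)
        for j, v in enumerate(row)
    ) / total


# Two maps with the same keypoint location but very different sharpness.
sharp = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
diffuse = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

Both maps yield the same centre of mass, yet only the sharp one localises a landmark; a regulariser that penalises spread separates the two cases.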

[CV-22] Learning Object Placement Programs for Indoor Scene Synthesis with Iterative Self Training

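Quick read: This paper addresses the tendency of current data-driven autoregressive indoor scene synthesis systems to produce incomplete next-object location distributions. The key innovation is a Domain Specific Language (DSL) for specifying functional constraints. Programs in this language take a partial scene and an object to place as input and, on execution, predict possible object placements. The paper also proposes a generative model that writes these programs automatically and, because existing 3D scene datasets contain no programs to train on, a new program bootstrapping algorithm built on previous work in unsupervised program induction. To quantify the empirical observations, it introduces a new evaluation procedure that better captures how well a system models per-object location distributions. The resulting per-object location distributions are more consistent with human annotations, and performance degrades less than that of existing systems when training data is sparse.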

Link: https://arxiv.org/abs/2503.04496
Authors: Adrian Chang,Kai Wang,Yuanbo Li,Manolis Savva,Angel X. Chang,Daniel Ritchie
Affiliations: Vision Systems Inc.; Brown University; Simon Fraser University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 20 figures

Abstract:Data driven and autoregressive indoor scene synthesis systems generate indoor scenes automatically by suggesting and then placing objects one at a time. Empirical observations show that current systems tend to produce incomplete next object location distributions. We introduce a system which addresses this problem. We design a Domain Specific Language (DSL) that specifies functional constraints. Programs from our language take as input a partial scene and object to place. Upon execution they predict possible object placements. We design a generative model which writes these programs automatically. Available 3D scene datasets do not contain programs to train on, so we build upon previous work in unsupervised program induction to introduce a new program bootstrapping algorithm. In order to quantify our empirical observations we introduce a new evaluation procedure which captures how well a system models per-object location distributions. We ask human annotators to label all the possible places an object can go in a scene and show that our system produces per-object location distributions more consistent with human annotators. Our system also generates indoor scenes of comparable quality to previous systems and while previous systems degrade in performance when training data is sparse, our system does not degrade to the same degree.

[CV-23] Semantic Alignment of Unimodal Medical Text and Vision Representations

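Quick read: This paper addresses the underperformance of general-purpose AI models in specialised domains such as medical imaging. It explores semantic alignment as a way to perform model stitching across modalities and architectures, so that general models can draw on domain-specific knowledge and improve performance without additional training. It also proposes a new zero-shot classification approach based on semantic alignment, which gives unimodal vision encoders cross-modal reasoning ability.

The key is to estimate a transformation (at most affine) between anchors, which allows models from different training paradigms, architectures, and modalities to be integrated. The approach exploits the similarity of latent spaces that general models exhibit on semantically related data and uses an explicit alignment step to compensate for the fact that this alignment does not occur naturally, so that general models can incorporate specialised domain knowledge efficiently and flexibly.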

Link: https://arxiv.org/abs/2503.04478
Authors: Maxime Di Folco,Emily Chan,Marta Hasny,Cosmin I. Bercea,Julia A. Schnabel
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:General-purpose AI models, particularly those designed for text and vision, demonstrate impressive versatility across a wide range of deep-learning tasks. However, they often underperform in specialised domains like medical imaging, where domain-specific solutions or alternative knowledge transfer approaches are typically required. Recent studies have noted that general-purpose models can exhibit similar latent spaces when processing semantically related data, although this alignment does not occur naturally. Building on this insight, it has been shown that applying a simple transformation - at most affine - estimated from a subset of semantically corresponding samples, known as anchors, enables model stitching across diverse training paradigms, architectures, and modalities. In this paper, we explore how semantic alignment - estimating transformations between anchors - can bridge general-purpose AI with specialised medical knowledge. Using multiple public chest X-ray datasets, we demonstrate that model stitching across model architectures allows general models to integrate domain-specific knowledge without additional training, leading to improved performance on medical tasks. Furthermore, we introduce a novel zero-shot classification approach for unimodal vision encoders that leverages semantic alignment across modalities. Our results show that our method not only outperforms general multimodal models but also approaches the performance levels of fully trained, medical-specific multimodal solutions
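
The anchor-based alignment described in this abstract amounts to fitting an affine map between two embedding spaces. The sketch below shows one way to do that with ordinary least squares in NumPy. It is an assumption-laden illustration: the anchors are synthetic, `fit_affine` and `stitch` are hypothetical names, and the paper may estimate the transformation differently.

```python
import numpy as np


def fit_affine(source, target):
    """Least-squares affine map (W, b) such that source @ W + b ~= target."""
    ones = np.ones((source.shape[0], 1))
    augmented = np.hstack([source, ones])  # append a bias column
    solution, *_ = np.linalg.lstsq(augmented, target, rcond=None)
    return solution[:-1], solution[-1]


def stitch(embeddings, weights, bias):
    """Map embeddings from the source space into the target space."""
    return embeddings @ weights + bias


# Synthetic anchors: 50 paired samples embedded by two hypothetical encoders
# whose latent spaces differ by an exact affine map.
rng = np.random.default_rng(0)
anchors_general = rng.normal(size=(50, 4))
true_w = rng.normal(size=(4, 3))
true_b = rng.normal(size=3)
anchors_medical = anchors_general @ true_w + true_b

weights, bias = fit_affine(anchors_general, anchors_medical)
new_sample = rng.normal(size=(1, 4))
mapped = stitch(new_sample, weights, bias)
```

With real encoders the relation is only approximately affine, so the fit would leave a residual; the synthetic data here recovers the map exactly.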

[CV-24] ForestLPR: LiDAR Place Recognition in Forests Attentioning Multiple BEV Density Images CVPR2025

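Quick read: This paper addresses LiDAR-based place recognition in natural forest environments, which is difficult because forest scenes are highly self-similar and vegetation changes substantially over time. The proposed method, ForestLPR, rests on the hypothesis that bird's-eye-view (BEV) density images of horizontal slices of the point cloud at different heights contain the information needed to recognise a revisited place. ForestLPR uses a visual transformer as a shared backbone to produce local descriptors, introduces a multi-BEV interaction module that adaptively attends to information at different heights, and applies an aggregation layer that produces a rotation-invariant global place descriptor. The results validate the hypothesis.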

Link: https://arxiv.org/abs/2503.04475
Authors: Yanqing Shen,Turcan Tuna,Marco Hutter,Cesar Cadena,Nanning Zheng
Affiliations: Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University; Robotic Systems Lab, ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: accepted by CVPR2025

Abstract:Place recognition is essential to maintain global consistency in large-scale localization systems. While research in urban environments has progressed significantly using LiDARs or cameras, applications in natural forest-like environments remain largely under-explored. Furthermore, forests present particular challenges due to high self-similarity and substantial variations in vegetation growth over time. In this work, we propose a robust LiDAR-based place recognition method for natural forests, ForestLPR. We hypothesize that a set of cross-sectional images of the forest’s geometry at different heights contains the information needed to recognize revisiting a place. The cross-sectional images are represented by bird's-eye-view (BEV) density images of horizontal slices of the point cloud at different heights. Our approach utilizes a visual transformer as the shared backbone to produce sets of local descriptors and introduces a multi-BEV interaction module to attend to information at different heights adaptively. It is followed by an aggregation layer that produces a rotation-invariant place descriptor. We evaluated the efficacy of our method extensively on real-world data from public benchmarks as well as robotic datasets and compared it against the state-of-the-art (SOTA) methods. The results indicate that ForestLPR has consistently good performance on all evaluations and achieves an average increase of 7.38% and 9.11% on Recall@1 over the closest competitor on intra-sequence loop closure detection and inter-sequence re-localization, respectively, validating our hypothesis.
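
The input representation described in this abstract, BEV density images of horizontal slices, can be sketched in a few lines. The version below simply counts points per ground-plane cell within each height band. The grid layout, the slice boundaries, and the function name `bev_density_images` are assumptions for illustration, and the paper's normalisation and ground-height handling are not reproduced.

```python
def bev_density_images(points, height_edges, cell_size, grid_size):
    """Count points per ground-plane cell for each horizontal height slice.

    points: iterable of (x, y, z) tuples.
    The grid covers [0, grid_size * cell_size) along both x and y.
    """
    n_slices = len(height_edges) - 1
    images = [[[0] * grid_size for _ in range(grid_size)] for _ in range(n_slices)]
    for x, y, z in points:
        col = int(x // cell_size)
        row = int(y // cell_size)
        if not (0 <= row < grid_size and 0 <= col < grid_size):
            continue  # point falls outside the BEV grid
        for s in range(n_slices):
            if height_edges[s] <= z < height_edges[s + 1]:
                images[s][row][col] += 1
                break
    return images


# Toy cloud: two low points in one cell, one high point, one low point in
# another cell, and one point outside the grid.
cloud = [(0.2, 0.3, 0.5), (0.4, 0.1, 0.7), (1.5, 0.5, 2.5), (1.6, 1.7, 0.2), (5.0, 0.0, 1.0)]
slices = bev_density_images(cloud, height_edges=[0.0, 1.0, 3.0], cell_size=1.0, grid_size=2)
```

Each slice is one image channel; stacking several slices gives the set of cross-sectional images that the backbone would consume.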

[CV-25] Gate-Shift-Pose: Enhancing Action Recognition in Sports with Skeleton Information

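Quick read: This paper addresses fall classification for figure skaters by combining skeleton pose data with RGB frames. It proposes Gate-Shift-Pose, an enhanced version of Gate-Shift-Fuse networks, with two fusion strategies. Early-fusion combines RGB frames with Gaussian heatmaps of pose keypoints at the input stage, while late-fusion uses a multi-stream architecture with attention mechanisms to combine RGB and pose features. Both strategies significantly improve classification accuracy, and early-fusion reaches the highest accuracy, 98.08%, with ResNet50. The results point to the potential of multimodal architectures for sports action recognition and the importance of skeleton pose information for capturing complex motion patterns.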

Link: https://arxiv.org/abs/2503.04470
Authors: Edoardo Bianchi,Oswald Lanz
Affiliations: Free University of Bozen-Bolzano
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper introduces Gate-Shift-Pose, an enhanced version of Gate-Shift-Fuse networks, designed for athlete fall classification in figure skating by integrating skeleton pose data alongside RGB frames. We evaluate two fusion strategies: early-fusion, which combines RGB frames with Gaussian heatmaps of pose keypoints at the input stage, and late-fusion, which employs a multi-stream architecture with attention mechanisms to combine RGB and pose features. Experiments on the FR-FS dataset demonstrate that Gate-Shift-Pose significantly outperforms the RGB-only baseline, improving accuracy by up to 40% with ResNet18 and 20% with ResNet50. Early-fusion achieves the highest accuracy (98.08%) with ResNet50, leveraging the model’s capacity for effective multimodal integration, while late-fusion is better suited for lighter backbones like ResNet18. These results highlight the potential of multimodal architectures for sports action recognition and the critical role of skeleton pose information in capturing complex motion patterns.
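
The early-fusion input described in this abstract can be illustrated with a small sketch that renders one Gaussian heatmap per pose keypoint and appends the heatmaps to the RGB channels. Channel concatenation, one heatmap per keypoint, and the names `gaussian_heatmap` and `early_fusion` are assumptions; the abstract only says that heatmaps are combined with RGB frames at the input stage.

```python
import math


def gaussian_heatmap(height, width, keypoint, sigma=1.0):
    """2D Gaussian centred on keypoint = (row, col), with peak value 1."""
    kr, kc = keypoint
    return [
        [math.exp(-((r - kr) ** 2 + (c - kc) ** 2) / (2 * sigma ** 2)) for c in range(width)]
        for r in range(height)
    ]


def early_fusion(rgb, keypoints, sigma=1.0):
    """Stack RGB channels with one heatmap channel per keypoint (channels first)."""
    height, width = len(rgb[0]), len(rgb[0][0])
    heatmaps = [gaussian_heatmap(height, width, kp, sigma) for kp in keypoints]
    return list(rgb) + heatmaps


# A blank 3-channel frame of 3 rows by 4 columns with two keypoints.
rgb = [[[0.0] * 4 for _ in range(3)] for _ in range(3)]
fused = early_fusion(rgb, keypoints=[(1, 2), (0, 0)])
```

The fused tensor has five channels here, so the first convolution of the backbone would need its input width adjusted accordingly.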

[CV-26] Question-Aware Gaussian Experts for Audio-Visual Question Answering CVPR2025

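Quick read: This paper addresses two weaknesses of existing Audio-Visual Question Answering (AVQA) methods: they use question information only implicitly, which limits attention to question-specific details, and they rely on uniform frame sampling, which can miss key frames. Recent Top-K frame selection methods try to improve on this, but their discrete nature still fails to capture fine-grained temporal dynamics. The proposed framework, QA-TIGER, explicitly injects question information and uses Gaussian-based modeling to focus adaptively on both consecutive and non-consecutive frames, combined with progressive refinement. A Mixture of Experts (MoE) flexibly implements multiple Gaussian models by activating temporal experts tailored to the question. The method achieves state-of-the-art performance on multiple AVQA benchmarks.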

Link: https://arxiv.org/abs/2503.04459
Authors: Hongyeob Kim,Inyoung Jung,Dayoon Suh,Youjia Zhang,Sangmin Lee,Sungeun Hong
Affiliations: Sungkyunkwan University; Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project page at this https URL

Abstract:Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance. Code is available at this https URL.
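
The contrast between discrete Top-K selection and Gaussian-based temporal weighting can be made concrete with a small sketch. It mixes several Gaussians over frame indices into one soft weighting. The centres, widths, and gate values are fixed, hypothetical numbers here, whereas in the paper they would come from question-conditioned experts, and `gaussian_frame_weights` is an illustrative name rather than the authors' API.

```python
import math


def gaussian_frame_weights(num_frames, centres, widths, gates):
    """Mix several Gaussians over frame indices and normalise to sum to 1."""
    raw = [
        sum(
            g * math.exp(-((t - mu) ** 2) / (2 * sigma ** 2))
            for mu, sigma, g in zip(centres, widths, gates)
        )
        for t in range(num_frames)
    ]
    total = sum(raw)
    return [w / total for w in raw]


# Two experts attend to non-consecutive moments around frames 2 and 7.
weights = gaussian_frame_weights(
    num_frames=10, centres=[2.0, 7.0], widths=[1.0, 1.0], gates=[0.5, 0.5]
)
```

Unlike a hard Top-K mask, every frame keeps a non-zero weight, so neighbouring frames around each peak still contribute.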
zh
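The Gaussian-based temporal weighting described in this abstract can be sketched in a few lines. This is only an illustration in plain Python: the function names, the expert centers and widths, and the gate values are hypothetical, and in the paper they would presumably be predicted from the question rather than fixed by hand.

```python
import math

def gaussian_weights(num_frames, center, width):
    """Normalized Gaussian weights over frame indices 0..num_frames-1."""
    raw = [math.exp(-0.5 * ((t - center) / width) ** 2) for t in range(num_frames)]
    total = sum(raw)
    return [w / total for w in raw]

def mixture_weights(num_frames, experts, gates):
    """Gate-weighted sum of several Gaussian experts.

    experts: list of (center, width) pairs (hypothetical values).
    gates:   one non-negative weight per expert, summing to 1.
    """
    mixed = [0.0] * num_frames
    for (center, width), gate in zip(experts, gates):
        for t, w in enumerate(gaussian_weights(num_frames, center, width)):
            mixed[t] += gate * w
    return mixed

# Two experts focusing on two separate moments of a 10-frame clip.
weights = mixture_weights(10, experts=[(2.0, 1.0), (7.0, 1.5)], gates=[0.6, 0.4])
```

Each expert concentrates its weight around one moment of the clip, so the gated mixture can cover non-consecutive segments (here around frames 2 and 7) while still being a smooth, continuous weighting rather than a discrete Top-K selection.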

[CV-27] TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction

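[Quick read]: This paper targets the loss of reliability that hallucination causes when Vision-Language Models (VLMs) are used in high-stakes applications. Hallucination means the model overconfidently describes objects or attributes that are absent from the image, and it is aggravated by the tendency of VLMs to rely on linguistic priors. The proposed remedy is Cross-Temporal Prediction Connection (TPC), which connects logits across timesteps to strengthen their semantic consistency, amplifying information flow and improving coherence so that hallucination is reduced. Experiments show that TPC outperforms existing methods in both accuracy and efficiency while remaining robust, particularly on open-ended text generation tasks.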

Link: https://arxiv.org/abs/2503.04457
Authors: Chao Wang,Weiwei Fu,Yang Zhou
Affiliations: Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-language models (VLMs) have achieved remarkable advancements, capitalizing on the impressive capabilities of large language models (LLMs) across diverse tasks. Despite this, a critical challenge known as hallucination occurs when models overconfidently describe objects or attributes absent from the image, a problem exacerbated by the tendency of VLMs to rely on linguistic priors. This limitation reduces model reliability in high-stakes applications. In this work, we have observed the characteristic of logits’ continuity consistency enhancement and introduced a straightforward and efficient method, Cross-Temporal Prediction Connection (TPC), designed to enhance the semantic consistency of logits by connecting them temporally across timesteps. TPC amplifies information flow and improves coherence, effectively reducing hallucination. Extensive experiments show that TPC surpasses existing representatives, delivering superior performance in both accuracy and efficiency while maintaining robustness in open-ended text generation tasks.

[CV-28] A lightweight model FDM-YOLO for small target improvement based on YOLOv8

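[Quick read]: This paper addresses small-target detection under tight compute budgets, that is, how to raise small-target accuracy while keeping inference real-time. Large models are accurate but too slow at inference, while lightweight models fall short on accuracy. The paper proposes a new architecture, FDM-YOLO. Based on an analysis of the YOLOv8 detection head outputs, it adds a high-resolution layer to strengthen small-target feature extraction and removes the large-target detection layer. It also designs Fast-C2f, a lightweight PConv-based structure integrated into the model's PAN module. To offset the accuracy lost to lightweighting, it adopts dynamic upsampling (Dysample) and a lightweight EMA attention mechanism. On the Visdrone dataset, FDM-YOLO cuts the parameter count by 38% and raises mAP@0.5 from 38.4% to 42.5% at nearly the same inference speed, which supports its suitability for edge-device deployment.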

Link: https://arxiv.org/abs/2503.04452
Authors: Xuerui Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Small targets are particularly difficult to detect due to their low pixel count, complex backgrounds, and varying shooting angles, which make it hard for models to extract effective features. While some large-scale models offer high accuracy, their long inference times make them unsuitable for real-time deployment on edge devices. On the other hand, models designed for low computational power often suffer from poor detection accuracy. This paper focuses on small target detection and explores methods for object detection under low computational constraints. Building on the YOLOv8 model, we propose a new network architecture called FDM-YOLO. Our research includes the following key contributions: We introduce FDM-YOLO by analyzing the output of the YOLOv8 detection head. We add a high-resolution layer and remove the large target detection layer to better handle small targets. Based on PConv, we propose a lightweight network structure called Fast-C2f, which is integrated into the PAN module of the model. To mitigate the accuracy loss caused by model lightweighting, we employ dynamic upsampling (Dysample) and a lightweight EMA attention mechanism. The FDM-YOLO model was validated on the Visdrone dataset, achieving a 38% reduction in parameter count and improving the mAP@0.5 score from 38.4% to 42.5%, all while maintaining nearly the same inference speed. This demonstrates the effectiveness of our approach in balancing accuracy and efficiency for edge device deployment.

[CV-29] ToFu: Visual Tokens Reduction via Fusion for Multi-modal Multi-patch Multi-image Task

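[Quick read]: This paper addresses the heavy compute cost that Large Multimodal Models (LMMs) incur on high-resolution, multi-image tasks because encoding the visual input takes a large number of tokens. Existing token-reduction methods depend on a specific visual encoder architecture, usually require fine-tuning the Large Language Model (LLM) to maintain performance, and mostly consider single-image scenarios. To overcome these limits, the paper proposes ToFu, a visual encoder-agnostic, training-free Token Fusion strategy. ToFu examines visual tokens sequentially and decides whether to merge each one with similar tokens or keep it as a separate entity, which removes redundant tokens while preserving distinctive ones. The approach improves both efficiency and performance, as validated on the LLaVA-Interleave Bench and on the newly created ComPairs benchmark.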

Link: https://arxiv.org/abs/2503.04444
Authors: Vittorio Pippi,Matthieu Guillaumin,Silvia Cascianelli,Rita Cucchiara,Maximilian Jaritz,Loris Bazzani
Affiliations: University of Modena and Reggio Emilia; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Multimodal Models (LMMs) are powerful tools that are capable of reasoning and understanding multimodal information beyond text and language. Despite their entrenched impact, the development of LMMs is hindered by the higher computational requirements compared to their unimodal counterparts. One of the main causes of this is the large amount of tokens needed to encode the visual input, which is especially evident for multi-image multimodal tasks. Recent approaches to reduce visual tokens depend on the visual encoder architecture, require fine-tuning the LLM to maintain the performance, and only consider single-image scenarios. To address these limitations, we propose ToFu, a visual encoder-agnostic, training-free Token Fusion strategy that combines redundant visual tokens of LMMs for high-resolution, multi-image, tasks. The core intuition behind our method is straightforward yet effective: preserve distinctive tokens while combining similar ones. We achieve this by sequentially examining visual tokens and deciding whether to merge them with others or keep them as separate entities. We validate our approach on the well-established LLaVA-Interleave Bench, which covers challenging multi-image tasks. In addition, we push to the extreme our method by testing it on a newly-created benchmark, ComPairs, focused on multi-image comparisons where a larger amount of images and visual tokens are inputted to the LMMs. Our extensive analysis, considering several LMM architectures, demonstrates the benefits of our approach both in terms of efficiency and performance gain.
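The sequential merge-or-keep decision in the ToFu abstract can be illustrated with a minimal sketch. It assumes cosine similarity, a fixed threshold of 0.9, running-mean merging, and non-zero token vectors; none of these choices is stated in the abstract, and `fuse_tokens` is a hypothetical name, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fuse_tokens(tokens, threshold=0.9):
    """Visit tokens in order; merge each into its most similar kept token
    when the similarity reaches the threshold, otherwise keep it as new."""
    kept = []    # running mean vector of each fused group
    counts = []  # number of tokens merged into each group
    for tok in tokens:
        best, best_sim = -1, threshold
        for i, ref in enumerate(kept):
            sim = cosine(tok, ref)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best == -1:
            kept.append(list(tok))
            counts.append(1)
        else:
            n = counts[best]
            kept[best] = [(r * n + x) / (n + 1) for r, x in zip(kept[best], tok)]
            counts[best] = n + 1
    return kept

tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
fused = fuse_tokens(tokens)  # four input tokens collapse into two groups
```

Because the rule only compares embeddings, it needs no training and does not depend on which visual encoder produced the tokens, which matches the encoder-agnostic, training-free framing of the abstract.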

[CV-30] EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images IROS2025

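[Quick read]: This paper addresses scene understanding in unstructured environments, where existing mapping methods produce inconsistent map representations because depth sensing is sparse and noisy and semantic predictions are overconfident. The key solution is EvidMTL (evidential multi-task learning), a framework that uses evidential heads for depth estimation and semantic segmentation to enable uncertainty-aware inference from monocular RGB images. A new evidential depth loss jointly optimizes the belief strength of the depth prediction together with the evidential segmentation loss, which supports uncertainty-calibrated multi-task learning. Building on this, the paper presents EvidKimera, which uses the evidential depth and semantic predictions to improve 3D metric-semantic consistency. EvidMTL is trained and evaluated on NYUDepthV2 and tested zero-shot on ScanNetV2, where it shows better uncertainty estimation than conventional approaches while keeping comparable depth estimation and semantic segmentation performance.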

Link: https://arxiv.org/abs/2503.04441
Authors: Rohit Menon,Nils Dengler,Sicong Pan,Gokul Krishna Chenchani,Maren Bennewitz
Affiliations: Humanoid Robots Lab and the Center for Robotics, University of Bonn; Lamarr Institute
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IROS 2025 Conference

Abstract:For scene understanding in unstructured environments, an accurate and uncertainty-aware metric-semantic mapping is required to enable informed action selection by autonomous systems. Existing mapping methods often suffer from overconfident semantic predictions, and sparse and noisy depth sensing, leading to inconsistent map representations. In this paper, we therefore introduce EvidMTL, a multi-task learning framework that uses evidential heads for depth estimation and semantic segmentation, enabling uncertainty-aware inference from monocular RGB images. To enable uncertainty-calibrated evidential multi-task learning, we propose a novel evidential depth loss function that jointly optimizes the belief strength of the depth prediction in conjunction with evidential segmentation loss. Building on this, we present EvidKimera, an uncertainty-aware semantic surface mapping framework, which uses evidential depth and semantics prediction for improved 3D metric-semantic consistency. We train and evaluate EvidMTL on the NYUDepthV2 and assess its zero-shot performance on ScanNetV2, demonstrating superior uncertainty estimation compared to conventional approaches while maintaining comparable depth estimation and semantic segmentation. In zero-shot mapping tests on ScanNetV2, EvidKimera outperforms Kimera in semantic surface mapping accuracy and consistency, highlighting the benefits of uncertainty-aware mapping and underscoring its potential for real-world robotic applications.
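For readers unfamiliar with evidential heads, the sketch below shows the Dirichlet formulation commonly used in evidential deep learning for classification, where the network outputs non-negative evidence per class instead of softmax scores. Whether EvidMTL uses exactly this parameterization is an assumption; the function name and the evidence values are made up for illustration.

```python
def dirichlet_summary(evidence):
    """Turn per-class evidence into expected probabilities and an uncertainty.

    evidence: non-negative numbers, one per class (hypothetical head output).
    Returns (probs, uncertainty), with uncertainty in (0, 1].
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration
    strength = sum(alpha)                    # total evidence plus num_classes
    probs = [a / strength for a in alpha]    # expected class probabilities
    uncertainty = num_classes / strength     # high when evidence is scarce
    return probs, uncertainty

confident_probs, confident_u = dirichlet_summary([40.0, 1.0, 1.0])
unsure_probs, unsure_u = dirichlet_summary([0.2, 0.1, 0.1])
```

With little evidence the uncertainty stays close to 1 even though one class is nominally ahead, which is the behavior that lets a mapping system discount overconfident semantic predictions.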

[CV-31] PointsToWood: A deep learning framework for complete canopy leaf-wood segmentation of TLS data across diverse European forests

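[Quick read]: This paper addresses the accuracy and generality of point cloud semantic segmentation in complex forest ecosystems, in particular the need to separate wood from leaf reliably across the whole tree point cloud, from the tree base to the branch tips. Existing automated methods are mostly developed for a single ecosystem type and tend to perform less well inside the crown, especially across varied canopy structures and sensor characteristics.
The key to the solution is a new framework developed from PointNet and pointNEXT. Combined with meticulously labelled data, it uses voxel-based sampling, neighbourhood rescaling, and a novel gated reflectance integration module embedded throughout the feature extraction layers to segment wood and leaf accurately in high-density TLS point clouds. The training data cover diverse mature European forest types. Evaluation on open datasets from boreal, temperate, Mediterranean and tropical regions shows consistent outperformance against the most widely used PointNet-based approach and good transferability across ecosystem and sensor types.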

Link: https://arxiv.org/abs/2503.04420
Authors: Harry J. F. Owen,Matthew J. A. Allen,Stuart W. D. Grieve,Phill Wilkes,Emily R. Lines
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Point clouds from Terrestrial Laser Scanning (TLS) are an increasingly popular source of data for studying plant structure and function but typically require extensive manual processing to extract ecologically important information. One key task is the accurate semantic segmentation of different plant material within point clouds, particularly wood and leaves, which is required to understand plant productivity, architecture and physiology. Existing automated semantic segmentation methods are primarily developed for single ecosystem types, and whilst they show good accuracy for biomass assessment from the trunk and large branches, often perform less well within the crown. In this study, we demonstrate a new framework that uses a deep learning architecture newly developed from PointNet and pointNEXT for processing 3D point clouds to provide a reliable semantic segmentation of wood and leaf in TLS point clouds from the tree base to branch tips, trained on data from diverse mature European forests. Our model uses meticulously labelled data combined with voxel-based sampling, neighbourhood rescaling, and a novel gated reflectance integration module embedded throughout the feature extraction layers. We evaluate its performance across open datasets from boreal, temperate, Mediterranean and tropical regions, encompassing diverse ecosystem types and sensor characteristics. Our results show consistent outperformance against the most widely used PointNet based approach for leaf/wood segmentation on our high-density TLS dataset collected across diverse mixed forest plots across all major biomes in Europe. We also find consistently strong performance tested on other open data from China, Eastern Cameroon, Germany and Finland, collected using both time-of-flight and phase-shift sensors, showcasing the transferability of our model to a wide range of ecosystems and sensors.

[CV-32] Learning Transformer-based World Models with Contrastive Predictive Coding

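[Quick read]: This paper addresses the performance bottleneck of Transformer-based world models that stems from their limited ability to predict over long time horizons. Existing methods benefit from the training efficiency of Transformers, yet in complex environments they still trail RNN-based world models such as Dreamer. The key contribution is TWISTER (Transformer-based World model wIth contraSTivE Representations), which uses action-conditioned Contrastive Predictive Coding (CPC) to learn high-level temporal feature representations and extends world-model predictions to longer horizons. This improves agent performance on the Atari 100k benchmark, reaching a human-normalized mean score of 162%, a new record among state-of-the-art methods that do not employ look-ahead search.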

Link: https://arxiv.org/abs/2503.04416
Authors: Maxime Burchi,Radu Timofte
Affiliations: Computer Vision Lab, CAIDAS & IFI, University of Würzburg
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The DreamerV3 algorithm recently obtained remarkable performance across diverse environment domains by learning an accurate world model based on Recurrent Neural Networks (RNNs). Following the success of model-based reinforcement learning algorithms and the rapid adoption of the Transformer architecture for its superior training efficiency and favorable scaling properties, recent works such as STORM have proposed replacing RNN-based world models with Transformer-based world models using masked self-attention. However, despite the improved training efficiency of these methods, their impact on performance remains limited compared to the Dreamer algorithm, struggling to learn competitive Transformer-based world models. In this work, we show that the next state prediction objective adopted in previous approaches is insufficient to fully exploit the representation capabilities of Transformers. We propose to extend world model predictions to longer time horizons by introducing TWISTER (Transformer-based World model wIth contraSTivE Representations), a world model using action-conditioned Contrastive Predictive Coding to learn high-level temporal feature representations and improve the agent performance. TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search.
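Contrastive Predictive Coding is usually trained with an InfoNCE-style loss: a prediction of a future representation must score higher against the true future than against negatives. The sketch below shows only that loss, assuming dot-product similarity; it omits the action conditioning and the Transformer, and `info_nce` and the toy vectors are hypothetical rather than taken from TWISTER.

```python
import math

def info_nce(prediction, candidates, positive_index):
    """Cross-entropy of selecting the positive candidate by dot-product score.

    prediction:     predicted future feature vector.
    candidates:     list of feature vectors, one positive and the rest negatives.
    positive_index: position of the true future in `candidates`.
    """
    scores = [sum(p * c for p, c in zip(prediction, cand)) for cand in candidates]
    top = max(scores)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_denominator - scores[positive_index]

candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce([1.0, 0.0], candidates, positive_index=0)
loss_misaligned = info_nce([1.0, 0.0], candidates, positive_index=2)
```

The loss is small when the prediction points toward the true future and large when it points toward a negative, so minimizing it over several future steps pushes the representation to encode information that stays predictive over longer horizons.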

[CV-33] Scale-Invariant Adversarial Attack against Arbitrary-scale Super-resolution

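[Quick read]: This paper addresses how to evaluate the robustness of continuous representation-based arbitrary-scale super-resolution (SR) under adversarial attack. The vulnerability of fixed-scale SR has been studied, but the robustness of arbitrary-scale SR based on continuous representations remains underexplored. Adversarial attacks designed for fixed-scale SR are scale-dependent, so applying them directly to arbitrary-scale SR greatly increases computation and memory cost. The paper proposes a simple yet effective "scale-invariant" SR adversarial attack, termed SIAGT. Its key idea is to build resource-saving attacks from a finite set of discrete points of the continuous representation and to add a coordinate-dependent loss that strengthens cross-model transferability. The attack significantly degrades the super-resolved images while introducing only imperceptible distortion to the targeted low-resolution images.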

Link: https://arxiv.org/abs/2503.04385
Authors: Yihao Huang,Xin Luo,Qing Guo,Felix Juefei-Xu,Xiaojun Jia,Weikai Miao,Geguang Pu,Yang Liu
Affiliations: Nanyang Technological University, Singapore; East China Normal University, China; Agency for Science, Technology and Research (A*STAR), Singapore; New York University, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, accepted by TIFS 2025

Abstract:The advent of local continuous image function (LIIF) has garnered significant attention for arbitrary-scale super-resolution (SR) techniques. However, while the vulnerabilities of fixed-scale SR have been assessed, the robustness of continuous representation-based arbitrary-scale SR against adversarial attacks remains an area warranting further exploration. The elaborately designed adversarial attacks for fixed-scale SR are scale-dependent, which will cause time-consuming and memory-consuming problems when applied to arbitrary-scale SR. To address this concern, we propose a simple yet effective "scale-invariant" SR adversarial attack method with good transferability, termed SIAGT. Specifically, we propose to construct resource-saving attacks by exploiting finite discrete points of continuous representation. In addition, we formulate a coordinate-dependent loss to enhance the cross-model transferability of the attack. The attack can significantly deteriorate the SR images while introducing imperceptible distortion to the targeted low-resolution (LR) images. Experiments carried out on three popular LIIF-based SR approaches and four classical SR datasets show remarkable attack performance and transferability of SIAGT.

[CV-34] MIDAS: Modeling Ground-Truth Distributions with Dark Knowledge for Domain Generalized Stereo Matching

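[Quick read]: This paper addresses the domain-specific preferences that existing domain generalized stereo matching methods still show when transferring from synthetic to real domains, which limits their use in complex and diverse scenes. The key solution is to extract two kinds of "dark knowledge" from a pre-trained stereo network, namely similarity and uncertainty information, and use them to model intuitive multi-modal ground-truth distributions for both edge and non-edge regions. The method then adopts a network ensemble and distinguishes objective knowledge from biased knowledge in the Laplace parameter space. Finally, the objective knowledge and the original disparity labels are jointly modeled as a mixture of Laplacians, which provides fine-grained supervision for training the stereo network.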

Link: https://arxiv.org/abs/2503.04376
Authors: Peng Xu,Zhiyu Xiang,Jingyun Fu,Tianyu Pu,Hanzhi Zhong,Eryun Liu
Affiliations: College of Information Science and Electronic Engineering, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite the significant advances in domain generalized stereo matching, existing methods still exhibit domain-specific preferences when transferring from synthetic to real domains, hindering their practical applications in complex and diverse scenarios. The probability distributions predicted by the stereo network naturally encode rich similarity and uncertainty information. Inspired by this observation, we propose to extract these two types of dark knowledge from the pre-trained network to model intuitive multi-modal ground-truth distributions for both edge and non-edge regions. To mitigate the inherent domain preferences of a single network, we adopt network ensemble and further distinguish between objective and biased knowledge in the Laplace parameter space. Finally, the objective knowledge and the original disparity labels are jointly modeled as a mixture of Laplacians to provide fine-grained supervision for the stereo network training. Extensive experiments demonstrate that: 1) Our method is generic and effectively improves the generalization of existing networks. 2) PCWNet with our method achieves the state-of-the-art generalization performance on both KITTI 2015 and 2012 datasets. 3) Our method outperforms existing methods in comprehensive ranking across four popular real-world datasets.
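To make the "mixture of Laplacians" target concrete, here is a small sketch of such a density over disparity values. The weights, locations, and scales are invented for illustration (a pixel on an object edge with a foreground mode and a background mode); how MIDAS actually derives them from the ensemble is not shown here.

```python
import math

def laplace_pdf(x, mu, b):
    """Density of a Laplace distribution with location mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

def mixture_pdf(x, components):
    """Density of a Laplace mixture.

    components: list of (weight, mu, b) tuples whose weights sum to 1.
    """
    return sum(w * laplace_pdf(x, mu, b) for w, mu, b in components)

# Hypothetical edge pixel: labelled disparity 12, plausible background mode at 30.
edge_pixel = [(0.7, 12.0, 1.0), (0.3, 30.0, 2.0)]
density_at_label = mixture_pdf(12.0, edge_pixel)
density_at_background = mixture_pdf(30.0, edge_pixel)
density_between = mixture_pdf(21.0, edge_pixel)
```

A single-mode target would force the network to put all probability mass at the label, while a two-mode target like this one keeps mass on both plausible surfaces and almost none in between, which is the multi-modal supervision the abstract describes for edge regions.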

[CV-35] ObjMST: An Object-Focused Multimodal Style Transfer Framework

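[Quick read]: This paper addresses two main problems in existing image-text multimodal style transfer methods: (1) non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surrounding elements. The proposed ObjMST framework introduces a Style-Specific Masked Directional CLIP Loss to ensure consistent and aligned style representations for salient objects and their surroundings. It stylizes salient objects through a salient-to-key mapping mechanism and then applies image harmonization to blend the stylized objects seamlessly with their environment.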

Link: https://arxiv.org/abs/2503.04353
Authors: Chanda Grover Kamra,Indra Deep Mastan,Debayan Gupta
Affiliations: Ashoka University; Indian Institute of Technology Banaras Hindu University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 Figures, 3 Tables

Abstract:We propose ObjMST, an object-focused multimodal style transfer framework that provides separate style supervision for salient objects and surrounding elements while addressing alignment issues in multimodal representation learning. Existing image-text multimodal style transfer methods face the following challenges: (1) generating non-aligned and inconsistent multimodal style representations; and (2) content mismatch, where identical style patterns are applied to both salient objects and their surrounding elements. Our approach mitigates these issues by: (1) introducing a Style-Specific Masked Directional CLIP Loss, which ensures consistent and aligned style representations for both salient objects and their surroundings; and (2) incorporating a salient-to-key mapping mechanism for stylizing salient objects, followed by image harmonization to seamlessly blend the stylized objects with their environment. We validate the effectiveness of ObjMST through experiments, using both quantitative metrics and qualitative visual evaluations of the stylized outputs. Our code is available at: this https URL.

[CV-36] PLMP – Point-Line Minimal Problems for Projective SfM

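[Quick read]: This paper gives a complete classification of all minimal Structure-from-Motion (SfM) problems in which arrangements of points and lines are fully observed by multiple uncalibrated pinhole cameras. It finds 291 minimal problems, 73 of which have unique solutions and can be solved linearly. The key is the exploration of stabilizer subgroups of subarrangements, which yields a geometric and systematic way to factorize minimal problems into smaller ones, to identify minimal problems within underconstrained problems, and to formally prove non-minimality. This provides a theoretical basis and practical tools for assessing and solving these problems.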

Link: https://arxiv.org/abs/2503.04351
Authors: Kim Kiehn,Albin Ahlbäck,Kathlén Kohn
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
Comments:

Abstract:We completely classify all minimal problems for Structure-from-Motion (SfM) where arrangements of points and lines are fully observed by multiple uncalibrated pinhole cameras. We find 291 minimal problems, 73 of which have unique solutions and can thus be solved linearly. Two of the linear problems allow an arbitrary number of views, while all other minimal problems have at most 9 cameras. All minimal problems have at most 7 points and at most 12 lines. We compute the number of solutions of each minimal problem, as this gives a measurement of the problem’s intrinsic difficulty, and find that these numbers are relatively low (e.g., when comparing with minimal problems for calibrated cameras). Finally, by exploring stabilizer subgroups of subarrangements, we develop a geometric and systematic way to 1) factorize minimal problems into smaller problems, 2) identify minimal problems in underconstrained problems, and 3) formally prove non-minimality.

[CV-37] LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding

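[Quick read]: This paper addresses the difficulty Diffusion Transformers (DiTs) have in generating images at resolutions higher than their training resolution. The main obstacle is that explicit positional encodings such as RoPE must be extrapolated when the inference resolution differs from the training resolution, which degrades performance. The paper proposes LEDiT (Length-Extrapolatable Diffusion Transformer). Its key idea is to introduce causal attention, which implicitly gives tokens global positional information, while enhancing locality so that adjacent tokens can be distinguished precisely; this removes the need to extrapolate explicit positional encodings. Experiments on 256x256 and 512x512 ImageNet show that LEDiT can scale the inference resolution to 512x512 and 1024x1024, respectively, with better image quality than current state-of-the-art length extrapolation methods (NTK-aware, YaRN). It also reaches strong extrapolation performance after only 100K steps of fine-tuning on a pretrained DiT, which suggests it could be integrated into existing text-to-image DiTs.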

Link: https://arxiv.org/abs/2503.04344
Authors: Shen Zhang,Yaning Tan,Siyuan Liang,Linze Li,Ge Wu,Yuhao Chen,Shuheng Li,Zhenyu Zhao,Caihua Chen,Jiajun Liang,Yao Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolutions. The primary obstacle is that the explicit positional encodings (PE), such as RoPE, need extrapolation which degrades performance when the inference resolution differs from training. In this paper, we propose a Length-Extrapolatable Diffusion Transformer (LEDiT), a simple yet powerful architecture to overcome this limitation. LEDiT needs no explicit PEs, thereby avoiding extrapolation. The key innovations of LEDiT are introducing causal attention to implicitly impart global positional information to tokens, while enhancing locality to precisely distinguish adjacent tokens. Experiments on 256x256 and 512x512 ImageNet show that LEDiT can scale the inference resolution to 512x512 and 1024x1024, respectively, while achieving better image quality compared to current state-of-the-art length extrapolation methods (NTK-aware, YaRN). Moreover, LEDiT achieves strong extrapolation performance with just 100K steps of fine-tuning on a pretrained DiT, demonstrating its potential for integration into existing text-to-image DiTs.
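One way to see how causal attention can carry positional information without any explicit encoding is to look at the attention weights under a causal mask. The sketch below is a plain-Python illustration of that intuition, not LEDiT's architecture; `causal_attention_weights` is a hypothetical helper and the uniform scores are a deliberately extreme toy input.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with positions j > i masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                      # token i sees tokens 0..i
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)    # future tokens get zero weight
        weights.append([e / total for e in exps] + masked_tail)
    return weights

# Identical content everywhere: all raw attention scores are equal.
uniform_scores = [[0.0] * 4 for _ in range(4)]
w = causal_attention_weights(uniform_scores)
```

Even with identical scores, token `i` spreads its attention over `i + 1` positions, so each row differs and the output depends on where the token sits in the sequence. With full bidirectional attention every row would be the same, and position would have to be supplied separately.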

[CV-38] GaussianVideo: Efficient Video Representation and Compression by Gaussian Splatting

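[Quick read]: This paper addresses the slow encoding and decoding and the high memory consumption that Implicit Neural Representation for Videos (NeRV) suffers as model size grows, which limit its practical use. The paper proposes a new video representation and compression method based on 2D Gaussian Splatting. Its key component is deformable 2D Gaussian Splatting, which dynamically adapts the transformation of the 2D Gaussians at each frame and significantly reduces memory cost. Combined with a multi-plane-based spatiotemporal encoder and a lightweight decoder, the method uses temporal gradients to capture temporal redundancy at negligible cost, making the video representation more efficient. Compared with state-of-the-art NeRV methods, it reduces GPU memory usage by up to 78.4% and achieves 5.5x faster training and 12.5x faster decoding.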

Link: https://arxiv.org/abs/2503.04333
Authors: Inseo Lee,Youngyoon Choi,Joonseok Lee
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Implicit Neural Representation for Videos (NeRV) has introduced a novel paradigm for video representation and compression, outperforming traditional codecs. As model size grows, however, slow encoding and decoding speed and high memory consumption hinder its application in practice. To address these limitations, we propose a new video representation and compression method based on 2D Gaussian Splatting to efficiently handle video data. Our proposed deformable 2D Gaussian Splatting dynamically adapts the transformation of 2D Gaussians at each frame, significantly reducing memory cost. Equipped with a multi-plane-based spatiotemporal encoder and a lightweight decoder, it predicts changes in color, coordinates, and shape of initialized Gaussians, given the time step. By leveraging temporal gradients, our model effectively captures temporal redundancy at negligible cost, significantly enhancing video representation efficiency. Our method reduces GPU memory usage by up to 78.4%, and significantly expedites video processing, achieving 5.5x faster training and 12.5x faster decoding compared to the state-of-the-art NeRV methods.

[CV-39] A Modular Pipeline for 3D Object Tracking Using RGB Cameras

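[Quick read]: This paper addresses 3D trajectory tracking of multiple objects, focusing on the setting where several time-synced, stationary, off-the-shelf webcams record the objects. It presents a new modular pipeline that computes the 3D trajectories of multiple objects and adapts to different setups. The key is a robust pipeline that detects small objects accurately, determines camera poses, discriminates between nearby and overlapping objects, handles temporary occlusions, and finally computes a 3D trajectory from the right subset of an average of 11,12,456 pixel coordinates per 3-minute trial. The pipeline reports the covariance of the (x, y, z) position as a confidence metric and handles appearing and disappearing objects dynamically by instantiating new Extended Kalman Filters. It scales to hundreds of table-setting trials with very little human annotation, even when the camera poses of each trial are unknown. The code is publicly available.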

Link: https://arxiv.org/abs/2503.04322
Authors: Lars Bredereke,Yale Hartmann,Tanja Schultz
Affiliations: University Bremen; Cognitive Systems Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 11 figures, original paper not to be published anywhere else

Abstract:Object tracking is a key challenge of computer vision with various applications that all require different architectures. Most tracking systems have limitations such as constraining all movement to a 2D plane and they often track only one object. In this paper, we present a new modular pipeline that calculates 3D trajectories of multiple objects. It is adaptable to various settings where multiple time-synced and stationary cameras record moving objects, using off-the-shelf webcams. Our pipeline was tested on the Table Setting Dataset, where participants are recorded with various sensors as they set a table with tableware objects. We need to track these manipulated objects, using 6 RGB webcams. Challenges include: detecting small objects in 9,874,699 camera frames, determining camera poses, discriminating between nearby and overlapping objects, temporary occlusions, and finally calculating a 3D trajectory using the right subset of an average of 11.12.456 pixel coordinates per 3-minute trial. We implement a robust pipeline that results in accurate trajectories with covariance of x,y,z-position as a confidence metric. It deals dynamically with appearing and disappearing objects, instantiating new Extended Kalman Filters. It scales to hundreds of table-setting trials with very little human annotation input, even with the camera poses of each trial unknown. The code is available at this https URL
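The idea of reporting the filter covariance as a confidence metric can be shown with a much simpler filter than the one in the paper. The sketch below is a linear, one-dimensional, constant-velocity Kalman filter, not the Extended Kalman Filter over 3D positions that the abstract mentions; the noise values `q` and `r`, the initial covariance, and the synthetic measurements are all assumptions for illustration.

```python
def predict(state, cov, dt, q):
    """Constant-velocity prediction for state [position, velocity]."""
    p, v = state
    (a, b), (c, d) = cov
    new_state = [p + v * dt, v]
    # F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * identity
    new_cov = [[a + dt * (b + c) + dt * dt * d + q, b + dt * d],
               [c + dt * d, d + q]]
    return new_state, new_cov

def update(state, cov, z, r):
    """Fuse a position measurement z that has variance r."""
    p, v = state
    (a, b), (c, d) = cov
    s = a + r                  # innovation variance
    k0, k1 = a / s, c / s      # Kalman gain for position and velocity
    y = z - p                  # innovation
    new_state = [p + k0 * y, v + k1 * y]
    new_cov = [[(1 - k0) * a, (1 - k0) * b],
               [c - k1 * a, d - k1 * b]]
    return new_state, new_cov

# Track an object moving at 2 units per step from noise-free measurements.
state, cov = [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]
for step in range(1, 21):
    state, cov = predict(state, cov, dt=1.0, q=1e-4)
    state, cov = update(state, cov, z=2.0 * step, r=0.1)
```

After a few updates `cov[0][0]` drops well below its initial value, and it grows again whenever `predict` runs without a matching `update`, for example during an occlusion. That rise and fall is what makes the covariance usable as a per-frame confidence signal.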

[CV-40] S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting CVPR2025

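[Quick read]: This paper addresses the reconstruction of high-quality 3D scenes from sparse, low-resolution views, which suffer from both too few viewpoints and insufficient clarity. Existing methods handle either sparse views or low-resolution observations, but not this combined and more complicated setting. The paper proposes S2Gaussian, a novel Sparse-view Super-resolution 3D Gaussian Splatting framework that reconstructs structure-accurate and detail-faithful 3D scenes from sparse, low-resolution views alone. It works in two stages. The first stage optimizes a low-resolution Gaussian representation with depth regularization and densifies it through a tailored Gaussian Shuffle Split operation to initialize the high-resolution Gaussians. The second stage refines the high-resolution Gaussians with super-resolved images generated from both the original sparse views and pseudo-views rendered by the low-resolution Gaussians; a blur-free inconsistency modeling scheme and a 3D robust optimization strategy mitigate multi-view inconsistency and eliminate erroneous updates caused by imperfect supervision.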

Link: https://arxiv.org/abs/2503.04314
Authors: Yecong Wan,Mingwen Shao,Yuanshuo Cheng,Wangmeng Zuo
Affiliations: Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China); Faculty of Computing, Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:In this paper, we aim ambitiously for a realistic yet challenging problem, namely, how to reconstruct high-quality 3D scenes from sparse low-resolution views that simultaneously suffer from deficient perspectives and clarity. Whereas existing methods only deal with either sparse views or low-resolution observations, they fail to handle such hybrid and complicated scenarios. To this end, we propose a novel Sparse-view Super-resolution 3D Gaussian Splatting framework, dubbed S2Gaussian, that can reconstruct structure-accurate and detail-faithful 3D scenes with only sparse and low-resolution views. The S2Gaussian operates in a two-stage fashion. In the first stage, we initially optimize a low-resolution Gaussian representation with depth regularization and densify it to initialize the high-resolution Gaussians through a tailored Gaussian Shuffle Split operation. In the second stage, we refine the high-resolution Gaussians with the super-resolved images generated from both original sparse views and pseudo-views rendered by the low-resolution Gaussians. In which a customized blur-free inconsistency modeling scheme and a 3D robust optimization strategy are elaborately designed to mitigate multi-view inconsistency and eliminate erroneous updates caused by imperfect supervision. Extensive experiments demonstrate superior results and in particular establishing new state-of-the-art performances with more consistent geometry and finer details.

[CV-41] Shaken Not Stirred: A Novel Dataset for Visual Understanding of Glasses in Human-Robot Bartending Tasks IROS

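[Quick read]: This paper addresses the fact that existing object detection datasets do not distinguish subclasses of glasses (drinking glassware) well, mainly because the transparent and reflective properties of glass leave the datasets short on variety. The key solution is an automated method for acquiring real-world data with RGB-D sensors: an auto-labeling pipeline generates labels for every acquired frame from the depth measurements, which greatly reduces human effort. The paper also provides a new real-world glass object dataset, collected on the humanoid robot platform Neuro-Inspired COLlaborator (NICOL), with 7850 images recorded from five different cameras. The trained baseline model outperforms existing open-vocabulary approaches and, deployed in an embodied agent on the NICOL platform, achieves a success rate of 81%.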

Link: https://arxiv.org/abs/2503.04308
Authors: Lukáš Gajdošech,Hassan Ali,Jan-Gerrit Habekost,Martin Madaras,Matthias Kerzel,Stefan Wermter
Affiliations: Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia; Knowledge Technology Group, Department of Informatics, University of Hamburg, Hamburg, Germany
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025

Abstract:Datasets for object detection often do not account for enough variety of glasses, due to their transparent and reflective properties. Specifically, open-vocabulary object detectors, widely used in embodied robotic agents, fail to distinguish subclasses of glasses. This scientific gap poses an issue to robotic applications that suffer from accumulating errors between detection, planning, and action execution. The paper introduces a novel method for the acquisition of real-world data from RGB-D sensors that minimizes human effort. We propose an auto-labeling pipeline that generates labels for all the acquired frames based on the depth measurements. We provide a novel real-world glass object dataset that was collected on the Neuro-Inspired COLlaborator (NICOL), a humanoid robot platform. The data set consists of 7850 images recorded from five different cameras. We show that our trained baseline model outperforms state-of-the-art open-vocabulary approaches. In addition, we deploy our baseline model in an embodied agent approach to the NICOL platform, on which it achieves a success rate of 81% in a human-robot bartending scenario.

[CV-42] ControlFill: Spatially Adjustable Image Inpainting from Prompt Learning

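[Quick read]: This paper addresses controllability in image inpainting: either generating a plausible object inside a designated mask region (creation) or filling the region smoothly by extending the background (removal). The proposed framework, ControlFill, trains two separate prompts, one for creation and one for removal, and uses the learned embeddings to guide a diffusion network that needs no heavy text encoder. By adjusting the relative significance of the two prompts and applying classifier-free guidance, users can control the intensity of removal or creation, and assigning different scales to individual pixels makes the guidance strength spatially variable.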

Link: https://arxiv.org/abs/2503.04268
Authors: Boseong Jeon
Affiliations: Samsung Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this report, I present an inpainting framework named ControlFill, which involves training two distinct prompts: one for generating plausible objects within a designated mask (creation) and another for filling the region by extending the background (removal). During the inference stage, these learned embeddings guide a diffusion network that operates without requiring heavy text encoders. By adjusting the relative significance of the two prompts and employing classifier-free guidance, users can control the intensity of removal or creation. Furthermore, I introduce a method to spatially vary the intensity of guidance by assigning different scales to individual pixels.
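Spatially varying guidance can be sketched with the standard classifier-free guidance formula applied per pixel. This is a generic illustration on nested lists: the abstract does not give ControlFill's exact rule for combining the creation and removal prompts, so `guided_noise`, the two noise maps, and the scale map below are all hypothetical.

```python
def guided_noise(eps_uncond, eps_cond, scale_map):
    """Per-pixel classifier-free guidance: eps_u + s * (eps_c - eps_u).

    eps_uncond, eps_cond: 2D lists of predicted noise without and with the prompt.
    scale_map:            2D list of guidance scales, one per pixel.
    """
    return [
        [u + s * (c - u) for u, c, s in zip(row_u, row_c, row_s)]
        for row_u, row_c, row_s in zip(eps_uncond, eps_cond, scale_map)
    ]

eps_uncond = [[0.0, 0.0, 0.0]]
eps_cond = [[1.0, 1.0, 1.0]]
scale_map = [[0.0, 1.0, 2.0]]  # no guidance, plain conditional, amplified
result = guided_noise(eps_uncond, eps_cond, scale_map)
```

A scale of 0 ignores the prompt at that pixel, 1 reproduces the conditional prediction, and values above 1 push further in the prompt's direction, so a smooth scale map lets the effect fade in or out across the masked region.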

[CV-43] TAIL: Text-Audio Incremental Learning

【速读】：该论文试图解决多模态模型在面对新数据集时的泛化能力不足以及 catastrophic forgetting（灾难性遗忘）问题，同时探讨大模型参数对训练性能的影响。为了解决这些问题，论文提出了一个名为Text-Audio Incremental Learning (TAIL) 的新任务，并设计了一种名为PTAT (Prompt Tuning for Audio-Text incremental learning) 的方法。该方法的关键在于利用prompt tuning优化模型参数的同时，结合音频-文本相似性和特征蒸馏模块，有效缓解了catastrophic forgetting现象，从而提升了模型在旧数据集上的性能稳定性。

链接: https://arxiv.org/abs/2503.04258
作者: Yingfei Sun,Xu Gu,Wei Ji,Hanbin Zhao,Hao Fei,Yifang Yin,Roger Zimmermann
机构: National University of Singapore (新加坡国立大学); Renmin University of China (中国人民大学); Zhejiang University (浙江大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: 4 figures, 5 tables

点击查看摘要

Abstract:Many studies combine text and audio to capture multi-modal information, but they overlook the model’s generalization ability on new datasets. Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. Meanwhile, large model parameters can significantly impact training performance. To address these limitations, we introduce a novel task called Text-Audio Incremental Learning (TAIL) for text-audio retrieval, and propose a new method, PTAT (Prompt Tuning for Audio-Text incremental learning). This method utilizes prompt tuning to optimize the model parameters while incorporating an audio-text similarity and feature distillation module to effectively mitigate catastrophic forgetting. We benchmark our method and previous incremental learning methods on the AudioCaps, Clotho, BBC Sound Effects, and AudioSet datasets, and our method significantly outperforms previous methods, showing particularly strong resistance to forgetting on older datasets. Compared to the full-parameter Finetune (Sequential) method, our model requires only 2.42% of its parameters while achieving 4.46% higher performance.

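[CV-44] How to Move Your Dragon: Text-to-Motion Synthesis for Large-Vocabulary Objects

Quick read: This paper tackles two key challenges for motion synthesis across diverse object categories in 3D content creation: (1) the lack of comprehensive motion datasets that cover a wide range of high-quality motions with annotations, and (2) the absence of methods that handle heterogeneous skeletal templates from different objects. It makes three contributions. First, it annotates the Truebones Zoo dataset with detailed text descriptions, making it suitable for text-based motion synthesis. Second, it introduces rig augmentation techniques that generate diverse motion data while preserving consistent dynamics, so that models can adapt to different skeletal configurations. Third, it redesigns existing motion diffusion models to adapt dynamically to arbitrary skeletal templates, enabling motion synthesis for objects with varying structures. Together, text annotation, rig augmentation, and template-adaptive models enable high-fidelity motion generation across object categories and skeletal templates.

Link: https://arxiv.org/abs/2503.04257
Authors: Wonkwang Lee,Jongwon Jeong,Taehong Moon,Hyeon-Jong Kim,Jaehyeon Kim,Gunhee Kim,Byeong-Uk Lee
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: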

Abstract:Motion synthesis for diverse object categories holds great potential for 3D content creation but remains underexplored due to two key challenges: (1) the lack of comprehensive motion datasets that include a wide range of high-quality motions and annotations, and (2) the absence of methods capable of handling heterogeneous skeletal templates from diverse objects. To address these challenges, we contribute the following: First, we augment the Truebones Zoo dataset, a high-quality animal motion dataset covering over 70 species, by annotating it with detailed text descriptions, making it suitable for text-based motion synthesis. Second, we introduce rig augmentation techniques that generate diverse motion data while preserving consistent dynamics, enabling models to adapt to various skeletal configurations. Finally, we redesign existing motion diffusion models to dynamically adapt to arbitrary skeletal templates, enabling motion synthesis for a diverse range of objects with varying structures. Experiments show that our method learns to generate high-fidelity motions from textual descriptions for diverse and even unseen objects, setting a strong foundation for motion synthesis across diverse object categories and skeletal templates. Qualitative results are available on this link: this http URL

[CV-45] An Egocentric Vision-Language Model based Portable Real-time Smart Assistant

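Quick read: This paper addresses the problem of providing real-time, comprehensive vision-language (VL) AI assistance on portable devices. Existing systems often depend on specialized hardware, which limits where they can be used. The proposed system, Vinci, is built on EgoVideo-VL, a model that combines an egocentric vision foundation model with a large language model (LLM) to support scene understanding, temporal grounding, video summarization, and future planning. Vinci adds a memory module that processes long video streams while retaining contextual history, a generation module that produces visual action demonstrations, and a retrieval module that bridges egocentric and third-person views to surface relevant how-to videos for skill acquisition. Because it does not depend on specific hardware, Vinci can be deployed on a range of devices, including smartphones and wearable cameras.

Link: https://arxiv.org/abs/2503.04250
Authors: Yifei Huang,Jilan Xu,Baoqi Pei,Yuping He,Guo Chen,Mingfang Zhang,Lijin Yang,Zheng Nie,Jinyao Liu,Guoshun Fan,Dechen Lin,Fang Fang,Kunpeng Li,Chang Yuan,Xinyuan Chen,Yaohui Wang,Yali Wang,Yu Qiao,Limin Wang
Affiliation: The University of Tokyo (Tokyo, Japan); Fudan University (Shanghai, China); Zhejiang University (Hangzhou, China); Nanjing University (Nanjing, China); Shanghai AI Laboratory (Shanghai, China)
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: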

Abstract:We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. All code for Vinci, including the frontend, backend, and models, is available at this https URL.

[CV-46] Geometry-Constrained Monocular Scale Estimation Using Semantic Segmentation for Dynamic Scenes

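Quick read: This paper addresses the scale-estimation problem that arises when monocular visual localization estimates a vehicle's ego-motion without depth information, along with the difficulty conventional methods have in handling dynamic objects and balancing computational efficiency against accuracy. The key elements of the solution are: (1) a hybrid method built on the SegNeXt model for real-time ego-motion estimation and ground point selection; (2) dynamic object masks that remove unstable features, together with ground plane masks for accurate triangulation; (3) geometric constraints that delineate road regions for scale recovery; and (4) integration with the monocular version of ORB-SLAM3 to estimate an accurate road model, which completes the scale recovery. Experiments on the KITTI dataset show that the method outperforms existing monocular visual odometry algorithms and scale recovery methods.

Link: https://arxiv.org/abs/2503.04235
Authors: Hui Zhang,Zhiyang Wu,Qianqian Shangguan,Kang An
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: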

Abstract:Monocular visual localization plays a pivotal role in advanced driver assistance systems and autonomous driving by estimating a vehicle’s ego-motion from a single pinhole camera. Nevertheless, conventional monocular visual odometry encounters challenges in scale estimation due to the absence of depth information during projection. Previous methodologies, whether rooted in physical constraints or deep learning paradigms, contend with issues related to computational complexity and the management of dynamic objects. This study extends our prior research, presenting innovative strategies for ego-motion estimation and the selection of ground points. Striving for a nuanced equilibrium between computational efficiency and precision, we propose a hybrid method that leverages the SegNeXt model for real-time applications, encompassing both ego-motion estimation and ground point selection. Our methodology incorporates dynamic object masks to eliminate unstable features and employs ground plane masks for meticulous triangulation. Furthermore, we exploit Geometry-constraint to delineate road regions for scale recovery. The integration of this approach with the monocular version of ORB-SLAM3 culminates in the accurate estimation of a road model, a pivotal component in our scale recovery process. Rigorous experiments, conducted on the KITTI dataset, systematically compare our method with existing monocular visual odometry algorithms and contemporary scale recovery methodologies. The results undeniably confirm the superior effectiveness of our approach, surpassing state-of-the-art visual odometry algorithms. Our source code is available at https://git this http URL.
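
A common way to recover metric scale in monocular odometry, and one plausible reading of the road-model step above, is to fit a plane to triangulated ground points and compare the estimated camera height with the known mounting height. The sketch below assumes a camera at the origin, ground points expressed in the up-to-scale camera frame, and an illustrative mounting height of 1.65 m. The function names are hypothetical, and this is not the authors' implementation.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_ground_plane(points):
    """Least-squares fit of y = a*x + b*z + c to points (x, y, z)."""
    sxx = sxz = sx = szz = sz = sxy = szy = sy = 0.0
    for x, y, z in points:
        sxx += x * x
        sxz += x * z
        sx += x
        szz += z * z
        sz += z
        sxy += x * y
        szy += z * y
        sy += y
    n = float(len(points))
    normal_matrix = [[sxx, sxz, sx], [sxz, szz, sz], [sx, sz, n]]
    rhs = [sxy, szy, sy]
    d = det3(normal_matrix)
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        m = [row[:] for row in normal_matrix]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)  # (a, b, c)


def recover_scale(ground_points, true_camera_height):
    """Ratio between the known camera height and the up-to-scale estimate."""
    a, b, c = fit_ground_plane(ground_points)
    # Distance from the camera centre (origin) to the plane a*x - y + b*z + c = 0.
    estimated_height = abs(c) / math.sqrt(a * a + 1.0 + b * b)
    return true_camera_height / estimated_height


# Toy ground points on the plane y = 0.5 (camera y-axis pointing down).
ground_points = [(-1.0, 0.5, 2.0), (1.0, 0.5, 2.0), (-1.0, 0.5, 4.0),
                 (1.0, 0.5, 4.0), (0.0, 0.5, 3.0)]
scale = recover_scale(ground_points, true_camera_height=1.65)
```

Multiplying the up-to-scale translation by `scale` would then express it in metres. A real pipeline would first restrict the points to the segmented road region and reject outliers before fitting.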

[CV-47] Synthetic Data is an Elegant GIFT for Continual Vision-Language Models CVPR2025

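Quick read: This paper addresses two problems that pre-trained vision-language models (VLMs) face during continual learning (CL): forgetting knowledge learned from earlier downstream tasks (catastrophic forgetting), and degradation of pre-training knowledge, and therefore of generalization, because the original pre-training data is unavailable. It proposes GIFT, a continual fine-tuning method that uses synthetic data to mitigate both. GIFT uses a pre-trained diffusion model to recreate pre-training data and previously learned downstream task data, and it transfers and retains knowledge in feature space through a contrastive distillation loss combined with an image-text alignment constraint. To counter in-distribution overfitting and improve distillation with a limited amount of generated data, GIFT adds adaptive weight consolidation based on Fisher information from the synthetic image-text pairs, which gives a better stability-plasticity trade-off. Experiments show that the method outperforms prior state-of-the-art approaches across various settings.

Link: https://arxiv.org/abs/2503.04229
Authors: Bin Wu,Wuxuan Shi,Jinqiao Wang,Mang Ye
Affiliation: School of Computer Science, Wuhan University, Wuhan, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; Wuhan AI Research, Wuhan, China; Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This work is accepted by CVPR 2025. Modifications may be performed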

Abstract:Pre-trained Vision-Language Models (VLMs) require Continual Learning (CL) to efficiently update their knowledge and adapt to various downstream tasks without retraining from scratch. However, for VLMs, in addition to the loss of knowledge previously learned from downstream tasks, pre-training knowledge is also corrupted during continual fine-tuning. This issue is exacerbated by the unavailability of original pre-training data, leaving VLM’s generalization ability degrading. In this paper, we propose GIFT, a novel continual fine-tuning approach that utilizes synthetic data to overcome catastrophic forgetting in VLMs. Taking advantage of recent advances in text-to-image synthesis, we employ a pre-trained diffusion model to recreate both pre-training and learned downstream task data. In this way, the VLM can revisit previous knowledge through distillation on matching diffusion-generated images and corresponding text prompts. Leveraging the broad distribution and high alignment between synthetic image-text pairs in VLM’s feature space, we propose a contrastive distillation loss along with an image-text alignment constraint. To further combat in-distribution overfitting and enhance distillation performance with limited amount of generated data, we incorporate adaptive weight consolidation, utilizing Fisher information from these synthetic image-text pairs and achieving a better stability-plasticity balance. Extensive experiments demonstrate that our method consistently outperforms previous state-of-the-art approaches across various settings.
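
Fisher-based weight consolidation is usually written as a quadratic penalty weighted by a diagonal Fisher estimate, as in elastic weight consolidation. The sketch below shows that classical form only. GIFT's adaptive variant and its use of synthetic image-text pairs are not reproduced, and all names and numbers are illustrative.

```python
def diagonal_fisher(per_sample_grads):
    """Estimate the diagonal Fisher information as the mean squared gradient."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def consolidation_penalty(theta, theta_ref, fisher, strength=1.0):
    """Quadratic penalty that anchors high-Fisher weights to reference values."""
    return 0.5 * strength * sum(
        f * (t - r) ** 2 for f, t, r in zip(fisher, theta, theta_ref)
    )


# Two toy per-sample gradients: only the first weight ever receives gradient.
grads = [[1.0, 0.0], [3.0, 0.0]]
fisher = diagonal_fisher(grads)
# Moving the first (important) weight is penalized; moving the second is free.
penalty_important = consolidation_penalty([1.0, 0.0], [0.0, 0.0], fisher)
penalty_unimportant = consolidation_penalty([0.0, 1.0], [0.0, 0.0], fisher)
```

The penalty would be added to the fine-tuning loss, so weights with a large Fisher value stay close to their earlier values while the remaining weights stay free to adapt.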

[CV-48] Spiking Meets Attention: Efficient Remote Sensing Image Super-Resolution with Attention Spiking Neural Networks

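Quick read: This paper targets remote sensing image super-resolution (RSI-SR) with spiking neural networks (SNNs) as an energy-efficient alternative to traditional artificial neural networks (ANNs), focusing on the limited capacity and weak representation power of SNNs. Its key contribution is the spiking attention block (SAB), which jointly learns correlated features across the temporal and channel dimensions and uses global self-similar patterns to infer spatial attention weights, enabling efficient feature representation and accurate image reconstruction. The resulting model, SpikeSR, achieves state-of-the-art performance on remote sensing datasets such as AID, DOTA, and DIOR while remaining computationally efficient.

Link: https://arxiv.org/abs/2503.04223
Authors: Yi Xiao,Qiangqiang Yuan,Kui Jiang,Qiang Zhang,Tingting Zheng,Chia-Wen Lin,Liangpei Zhang
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: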

Abstract:Spiking neural networks (SNNs) are emerging as a promising alternative to traditional artificial neural networks (ANNs), offering biological plausibility and energy efficiency. Despite these merits, SNNs are frequently hampered by limited capacity and insufficient representation power, yet remain underexplored in remote sensing super-resolution (SR) tasks. In this paper, we first observe that spiking signals exhibit drastic intensity variations across diverse textures, highlighting an active learning state of the neurons. This observation motivates us to apply SNNs for efficient SR of RSIs. Inspired by the success of attention mechanisms in representing salient information, we devise the spiking attention block (SAB), a concise yet effective component that optimizes membrane potentials through inferred attention weights, which, in turn, regulates spiking activity for superior feature representation. Our key contributions include: 1) we bridge the independent modulation between temporal and channel dimensions, facilitating joint feature correlation learning, and 2) we access the global self-similar patterns in large-scale remote sensing imagery to infer spatial attention weights, incorporating effective priors for realistic and faithful reconstruction. Building upon SAB, we propose SpikeSR, which achieves state-of-the-art performance across various remote sensing benchmarks such as AID, DOTA, and DIOR, while maintaining high computational efficiency. The code of SpikeSR will be available upon paper acceptance.

[CV-49] Energy-Guided Optimization for Personalized Image Editing with Pretrained Text-to-Image Diffusion Models

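Quick read: This paper addresses the challenges of personalized content editing in images, particularly for arbitrary objects and complex scenes. Existing methods usually mistake the mask for an object shape prior and struggle to produce seamlessly integrated results, and the commonly used inversion-noise initialization hinders identity consistency with the target object. The paper proposes a training-free framework that formulates personalized content editing as optimization of the edited image in latent space, with diffusion models acting as energy-function guidance conditioned on reference text-image pairs. A coarse-to-fine strategy applies text energy guidance in the early stage to achieve a natural transition toward the target class, then applies point-to-point, feature-level image energy guidance for fine-grained appearance alignment with the target object. Latent-space content composition is added to improve overall identity consistency. Experiments show that the method is particularly effective at object replacement across large domain gaps.

Link: https://arxiv.org/abs/2503.04215
Authors: Rui Jiang,Xinghe Fu,Guangcong Zheng,Teng Li,Taiping Yao,Xi Li
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: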

Abstract:The rapid advancement of pretrained text-driven diffusion models has significantly enriched applications in image generation and editing. However, as the demand for personalized content editing increases, new challenges emerge, especially when dealing with arbitrary objects and complex scenes. Existing methods usually mistake the mask for an object shape prior and struggle to achieve seamless integration. The commonly used inversion noise initialization also hinders identity consistency with the target object. To address these challenges, we propose a novel training-free framework that formulates personalized content editing as the optimization of edited images in the latent space, using diffusion models as energy function guidance conditioned on reference text-image pairs. We propose a coarse-to-fine strategy that employs text energy guidance at the early stage to achieve a natural transition toward the target class and uses point-to-point feature-level image energy guidance to perform fine-grained appearance alignment with the target object. Additionally, we introduce latent space content composition to enhance overall identity consistency with the target. Extensive experiments demonstrate that our method excels in object replacement even with a large domain gap, highlighting its potential for high-quality, personalized image editing.

[CV-50] Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior

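Quick read: This paper addresses the mismatch between visual stimuli and brain signals, which it divides into a "System GAP" caused by processing in the human visual system and a "Random GAP" caused by perceptual dynamics, cognitive dynamics, and technical noise. These gaps make it hard for models trained on paired data to learn and generalize, and they lead to overfitting when paired data is limited. The paper proposes a simple and effective method, the Uncertainty-aware Blur Prior (UBP). UBP estimates the uncertainty in the paired data, which reflects the mismatch between brain signals and visual stimuli, and uses it to dynamically blur the high-frequency details of the original images, reducing the effect of the mismatch and improving alignment.

Link: https://arxiv.org/abs/2503.04207
Authors: Haitao Wu,Qing Li,Changqing Zhang,Zhen He,Xiaomin Ying
Affiliation: College of Intelligence and Computing, Tianjin University; Beijing Institute of Basic Medical Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: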

Abstract:Can our brain signals faithfully reflect the original visual stimuli, even including high-frequency details? Although human perceptual and cognitive capacities enable us to process and remember visual information, these abilities are constrained by several factors, such as limited attentional resources and the finite capacity of visual memory. When visual stimuli are processed by the human visual system into brain signals, some information is inevitably lost, leading to a discrepancy known as the System GAP. Additionally, perceptual and cognitive dynamics, along with technical noise in signal acquisition, degrade the fidelity of brain signals relative to the visual stimuli, known as the Random GAP. When encoded brain representations are directly aligned with the corresponding pretrained image features, the System GAP and Random GAP between paired data challenge the model, requiring it to bridge these gaps. However, in the context of limited paired data, these gaps are difficult for the model to learn, leading to overfitting and poor generalization to new data. To address these GAPs, we propose a simple yet effective approach called the Uncertainty-aware Blur Prior (UBP). It estimates the uncertainty within the paired data, reflecting the mismatch between brain signals and visual stimuli. Based on this uncertainty, UBP dynamically blurs the high-frequency details of the original images, reducing the impact of the mismatch and improving alignment. Our method achieves a top-1 accuracy of 50.9% and a top-5 accuracy of 79.7% on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by margins of 13.7% and 9.8%, respectively. Code is available on GitHub at this https URL.
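
The core idea, blurring more when the estimated uncertainty is higher, can be sketched on a 1-D signal. The mapping from uncertainty to blur width (`sigma = max_sigma * uncertainty`) and all names are assumptions for illustration. The paper operates on images, and its exact formulation may differ.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def uncertainty_blur(signal, uncertainty, max_sigma=2.0):
    """Blur a signal with a Gaussian whose width grows with the uncertainty."""
    sigma = max_sigma * uncertainty  # assumed linear mapping
    if sigma <= 0.0:
        return list(signal)  # no uncertainty: keep every detail
    radius = max(1, int(math.ceil(3.0 * sigma)))
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    blurred = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp at the borders
            acc += w * signal[j]
        blurred.append(acc)
    return blurred


# An impulse stands in for a high-frequency detail.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
low_uncertainty = uncertainty_blur(impulse, 0.25)
high_uncertainty = uncertainty_blur(impulse, 1.0)
```

The peak of the impulse shrinks as the uncertainty grows, which is the sense in which unreliable pairs would contribute fewer high-frequency details to alignment.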

[CV-51] Learning 3D Medical Image Models From Brain Functional Connectivity Network Supervision For Mental Disorder Diagnosis

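Quick read: This paper addresses the small size of annotated functional MRI (fMRI) datasets, which limits their wide application, and the fact that structural MRI (sMRI), such as 3D T1-weighted (T1w) MRI, is readily available in clinical settings yet often overlooked. To combine functional and structural information for more accurate mental disorder diagnosis, it proposes CINP (Contrastive Image-Network Pre-training), a framework based on contrastive learning between sMRI and functional connectivity networks (FCNs). Pre-training uses masked image modeling and network-image matching to strengthen visual representation learning and modality alignment. Network prompting then transfers knowledge from FCN to sMRI, so diagnosis needs only the sMRI of a suspected patient and a small number of FCNs from different patient classes, which is practical in real clinical settings.

Link: https://arxiv.org/abs/2503.04205
Authors: Xingcan Hu,Wei Wang,Li Xiao
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: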

Abstract:In MRI-based mental disorder diagnosis, most previous studies focus on the functional connectivity network (FCN) derived from functional MRI (fMRI). However, the small size of annotated fMRI datasets restricts its wide application. Meanwhile, structural MRIs (sMRIs), such as 3D T1-weighted (T1w) MRI, which are commonly used and readily accessible in clinical settings, are often overlooked. To integrate the complementary information from both function and structure for improved diagnostic accuracy, we propose CINP (Contrastive Image-Network Pre-training), a framework that employs contrastive learning between sMRI and FCN. During pre-training, we incorporate masked image modeling and network-image matching to enhance visual representation learning and modality alignment. Since CINP facilitates knowledge transfer from FCN to sMRI, we introduce network prompting. It utilizes only sMRI from suspected patients and a small number of FCNs from different patient classes for diagnosing mental disorders, which is practical in real-world clinical scenarios. The competitive performance on three mental disorder diagnosis tasks demonstrates the effectiveness of CINP in integrating multimodal MRI information, as well as the potential of incorporating sMRI into clinical diagnosis using network prompting.
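
Contrastive learning between two modalities is commonly implemented with a symmetric InfoNCE loss over paired embeddings. The sketch below assumes that form, with toy 2-D vectors standing in for sMRI and FCN embeddings. The encoders, the temperature value, and the function names are hypothetical, and CINP's actual loss may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def symmetric_info_nce(image_embs, network_embs, temperature=0.1):
    """Average of image-to-network and network-to-image InfoNCE losses.

    Row k of each list is assumed to come from the same subject.
    """
    n = len(image_embs)
    logits = [[cosine(i, g) / temperature for g in network_embs]
              for i in image_embs]
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_i2n = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    loss_n2i = sum(cross_entropy_row(columns[k], k) for k in range(n)) / n
    return 0.5 * (loss_i2n + loss_n2i)


# Correctly paired embeddings give a small loss; swapped pairs a large one.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each subject's image embedding toward its own network embedding and away from those of other subjects in the batch.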

[CV-52] FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization

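Quick read: This paper addresses the trade-off between performance and computational cost in first-order and second-order stochastic optimization methods. First-order methods such as SGD and Adam dominate deep learning but only guarantee convergence to a stationary point, while second-order methods are less widely used because of their computational cost in high-dimensional problems. The paper proposes a unified algorithmic framework, FUSE, and derives a practical version, FUSE-PV, which switches between first-order and second-order updates, along with several criteria for deciding when to switch. The theoretical analysis indicates that FUSE-PV has lower computational complexity than SGD and Adam. The scheme is evaluated with an ablation study on simple test functions and compared with baselines on benchmark datasets.

Link: https://arxiv.org/abs/2503.04204
Authors: Zhanhong Jiang,Md Zahid Hasan,Aditya Balu,Joshua R. Waite,Genyi Huang,Soumik Sarkar
Affiliation: Iowa State University; Oracle
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 6 pages, 7 figures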

Abstract:Stochastic optimization methods play a critical role in delivering good performance in modern machine learning algorithms. While numerous works have proposed and developed diverse approaches, first-order and second-order methods are in entirely different situations. The former dominate in deep learning but only guarantee convergence to a stationary point, whereas second-order methods are less popular because of their computational cost in high-dimensional problems. This paper presents a novel method that leverages both first-order and second-order methods in a unified algorithmic framework, termed FUSE, from which a practical version (PV) is derived. FUSE-PV is a simple yet efficient optimization method involving a switch-over between first and second orders. Additionally, we develop different criteria that determine when to switch. FUSE-PV provably has a smaller computational complexity than SGD and Adam. To validate our proposed scheme, we present an ablation study on several simple test functions and show a comparison with baselines on benchmark datasets.
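
A switch-over between first-order and second-order updates can be sketched on a 1-D test function. The example below is deterministic rather than stochastic, and its switching rule (take Newton steps once the gradient magnitude falls below a threshold) is an assumption chosen for illustration, not one of the paper's criteria. All names are hypothetical.

```python
import math


def grad(x):
    """Gradient of the test function f(x) = exp(x) - 2x."""
    return math.exp(x) - 2.0


def hess(x):
    """Second derivative of the same test function."""
    return math.exp(x)


def switched_descent(x0, lr=0.1, switch_threshold=0.5, tol=1e-12, max_steps=50):
    """Gradient steps far from the optimum, Newton steps close to it."""
    x = x0
    modes = []
    for _ in range(max_steps):
        g = grad(x)
        if abs(g) < tol:
            break
        if abs(g) < switch_threshold:
            x -= g / hess(x)  # second-order (Newton) update
            modes.append("second")
        else:
            x -= lr * g  # first-order (gradient descent) update
            modes.append("first")
    return x, modes


x_star, modes = switched_descent(3.0)
```

The minimizer of this test function is `ln 2`. The run begins with gradient steps and finishes with a handful of Newton steps, which is the qualitative behaviour a switch-over scheme aims for.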

[CV-53] MASTER: Multimodal Segmentation with Text Prompts

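Quick read: This paper addresses multimodal data fusion in challenging scenes with varying weather and lighting conditions. Conventional approaches rely on complex, purpose-built modules to fuse modalities; this paper instead uses large language models (LLMs) to build a structurally simple and highly adaptable fusion architecture. The proposed MASTER (MultimodAl Segmentation with TExt PRompts) architecture integrates an LLM into the fusion of RGB-thermal data and lets complex query text take part in the fusion. A dual-path structure extracts information from the different image modalities, the LLM serves as the core fusion module and generates learnable codebook tokens, and a lightweight image decoder produces the semantic segmentation. MASTER performs well on benchmarks covering various automated driving scenarios.

Link: https://arxiv.org/abs/2503.04199
Authors: Fuyang Liu,Shun Lu,Jilin Mei,Yu Hu
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: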

Abstract:RGB-Thermal fusion is a potential solution for various weather and light conditions in challenging scenarios. However, many studies focus on designing complex modules to fuse different modalities. With the widespread application of large language models (LLMs), valuable information can be more effectively extracted from natural language. Therefore, we aim to leverage the advantages of large language models to design a structurally simple and highly adaptable multimodal fusion model architecture. We propose the MultimodAl Segmentation with TExt PRompts (MASTER) architecture, which integrates an LLM into the fusion of RGB-Thermal multimodal data and allows complex query text to participate in the fusion process. Our model utilizes a dual-path structure to extract information from different modalities of images. Additionally, we employ the LLM as the core module for multimodal fusion, enabling the model to generate learnable codebook tokens from RGB images, thermal images, and textual information. A lightweight image decoder is used to obtain semantic segmentation results. The proposed MASTER performs exceptionally well in benchmark tests across various automated driving scenarios.

[CV-54] Conformal forecasting for surgical instrument trajectory

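Quick read: This paper addresses uncertainty quantification for forecasting surgical instrument trajectories and predicting the next surgical action in endoscopic surgery. Both tasks matter for automation and assistance, and their safety-critical nature makes reliable uncertainty estimates essential. The paper applies standard conformal prediction and conformalized quantile regression to forecasts of the direction and magnitude of future instrument motion. It analyzes and compares the coverage and size of the resulting prediction intervals, assesses the effect of multiple hypothesis testing and correction methods, and shows how to produce useful uncertainty heatmaps. According to the authors, this is the first application of conformal prediction to surgical guidance and a first step toward principled prediction intervals with formal coverage guarantees in this domain.

Link: https://arxiv.org/abs/2503.04191
Authors: Sara Sangalli,Gary Sarwin,Ertunc Erdil,Carlo Serra,Ender Konukoglu
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: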

Abstract:Forecasting surgical instrument trajectories and predicting the next surgical action recently started to attract attention from the research community. Both these tasks are crucial for automation and assistance in endoscopy surgery. Given the safety-critical nature of these tasks, reliable uncertainty quantification is essential. Conformal prediction is a fast-growing and widely recognized framework for uncertainty estimation in machine learning and computer vision, offering distribution-free, theoretically valid prediction intervals. In this work, we explore the application of standard conformal prediction and conformalized quantile regression to estimate uncertainty in forecasting surgical instrument motion, i.e., predicting direction and magnitude of surgical instruments’ future motion. We analyze and compare their coverage and interval sizes, assessing the impact of multiple hypothesis testing and correction methods. Additionally, we show how these techniques can be employed to produce useful uncertainty heatmaps. To the best of our knowledge, this is the first study applying conformal prediction to surgical guidance, marking an initial step toward constructing principled prediction intervals with formal coverage guarantees in this domain.
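
Standard split conformal prediction can be sketched in a few lines: absolute residuals on a held-out calibration set give a quantile that is added to and subtracted from each new point forecast. The calibration residuals and the forecast value below are made-up numbers for a single scalar output (for example, a motion magnitude), and the function names are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration residual."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return float("inf")
    return sorted(residuals)[rank - 1]


def prediction_interval(point_forecast, q):
    """Symmetric interval around a point forecast."""
    return (point_forecast - q, point_forecast + q)


# Absolute errors |y - y_hat| measured on a held-out calibration set.
calibration_residuals = [0.5, 0.1, 0.3, 0.2, 0.4, 0.7, 0.6]
q = conformal_quantile(calibration_residuals, alpha=0.25)
low, high = prediction_interval(2.0, q)
```

When calibration and test points are exchangeable, intervals built this way contain the true value with probability at least `1 - alpha`. Conformalized quantile regression follows the same recipe but calibrates the outputs of a quantile regressor, which lets the interval width vary with the input.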

[CV-55] DuCos: Duality Constrained Depth Super-Resolution via Foundation Model

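Quick read: This paper addresses the limited generalization of depth super-resolution models, in particular their performance drop across diverse scenes. It proposes DuCos, a framework grounded in Lagrangian duality theory. Its core idea is to embed prior knowledge into the optimization objective through prompts built from two components, Correlative Fusion (CF) and Gradient Regulation (GR). CF provides precise geometric alignment and feature fusion, while GR refines depth predictions by enforcing consistency with the sharp-edged depth maps produced by foundation models. Both components are embedded in the Lagrangian constraint term, giving a principled framework that improves accuracy, robustness, and generalization.

Link: https://arxiv.org/abs/2503.04171
Authors: Zhiqiang Yan,Zhengxue Wang,Haoye Dong,Jun Li,Jian Yang,Gim Hee Lee
Affiliation: National University of Singapore; Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: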

Abstract:We introduce DuCos, a novel depth super-resolution framework grounded in Lagrangian duality theory, offering a flexible integration of multiple constraints and reconstruction objectives to enhance accuracy and robustness. Our DuCos is the first to significantly improve generalization across diverse scenarios with foundation models as prompts. The prompt design consists of two key components: Correlative Fusion (CF) and Gradient Regulation (GR). CF facilitates precise geometric alignment and effective fusion between prompt and depth features, while GR refines depth predictions by enforcing consistency with sharp-edged depth maps derived from foundation models. Crucially, these prompts are seamlessly embedded into the Lagrangian constraint term, forming a synergistic and principled framework. Extensive experiments demonstrate that DuCos outperforms existing state-of-the-art methods, achieving superior accuracy, robustness, and generalization. The source codes and pre-trained models will be publicly available.

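[CV-56] The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights

Quick read: This paper examines why existing multimodal mathematical reasoning models fail to make full use of visual information, and it exposes the limits of current evaluation methods in measuring visual perception. It introduces a new dataset, HC-M3D, designed to require reliance on the image and to challenge models with similar but subtly different images that change the correct answer. Leading models fail to detect these subtle visual differences, which points to significant limitations in their visual perception. The paper also finds that combining different types of image encoders, a common way to improve general visual question answering (VQA), does not improve mathematical reasoning, which underlines how hard it is to strengthen visual reliance in math reasoning. The main contribution is a benchmark, with accompanying code, that evaluates the role of visual information in multimodal mathematical reasoning more thoroughly.

Link: https://arxiv.org/abs/2503.04167
Authors: Yufang Liu,Yao Du,Tao Ji,Jianing Wang,Yang Liu,Yuanbin Wu,Aimin Zhou,Mengdi Zhang,Xunliang Cai
Affiliation: School of Computer Science and Technology, East China Normal University; Meituan Inc.; Fudan University; Pazhou Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: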

Abstract:Recent research has increasingly focused on multimodal mathematical reasoning, particularly emphasizing the creation of relevant datasets and benchmarks. Despite this, the role of visual information in reasoning has been underexplored. Our findings show that existing multimodal mathematical models minimally leverage visual information, and model performance remains largely unaffected by changes to or removal of images in the dataset. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. To improve evaluation methods, we introduce the HC-M3D dataset, specifically designed to require image reliance for problem-solving and to challenge models with similar, yet distinct, images that change the correct answer. In testing leading models, their failure to detect these subtle visual differences suggests limitations in current visual perception capabilities. Additionally, we observe that the common approach of improving general VQA capabilities by combining various types of image encoders does not contribute to math reasoning performance. This finding also presents a challenge to enhancing visual reliance during math reasoning. Our benchmark and code will be available at this https URL_modality_role.

[CV-57] WeakSupCon: Weakly Supervised Contrastive Learning for Encoder Pre-training

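Quick read: This paper addresses feature representation in weakly supervised multiple instance learning (MIL), where only bag-level labels are available and each bag contains many instances. Existing methods typically take fixed patch features as input, generated by encoders pre-trained on ImageNet, by foundation encoders pre-trained on large datasets, or by self-supervised learning on local data. Externally pre-trained encoders are exposed to domain shift, and self-supervised pre-training ignores the bag-level labels, so patch features from different categories may cluster together and reduce MIL classification performance.

To address this, the paper proposes an encoder pre-training method for downstream MIL tasks called Weakly Supervised Contrastive Learning (WeakSupCon). The method uses bag-level labels through multi-task learning and defines distinct contrastive learning losses for samples with different bag labels. Experiments show that features generated with WeakSupCon significantly improve MIL classification performance on three datasets compared with self-supervised approaches.

Link: https://arxiv.org/abs/2503.04165
Authors: Bodong Zhang,Hamid Manoochehri,Beatrice S. Knudsen,Tolga Tasdizen
Affiliation: Department of Electrical and Computer Engineering, University of Utah; Scientific Computing and Imaging Institute, University of Utah; Department of Pathology, University of Utah
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: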

Abstract:Weakly supervised multiple instance learning (MIL) is a challenging task given that only bag-level labels are provided, while each bag typically contains multiple instances. This topic has been extensively studied in histopathological image analysis, where labels are usually available only at the whole slide image (WSI) level, while each whole slide image can be divided into thousands of small image patches for training. The dominant MIL approaches take fixed patch features as inputs to address computational constraints and ensure model stability. These features are commonly generated by encoders pre-trained on ImageNet, foundation encoders pre-trained on large datasets, or through self-supervised learning on local datasets. While the self-supervised encoder pre-training on the same dataset as downstream MIL tasks helps mitigate domain shift and generate better features, the bag-level labels are not utilized during the process, and the features of patches from different categories may cluster together, reducing classification performance on MIL tasks. Recently, pre-training with supervised contrastive learning (SupCon) has demonstrated superior performance compared to self-supervised contrastive learning and even end-to-end training on traditional image classification tasks. In this paper, we propose a novel encoder pre-training method for downstream MIL tasks called Weakly Supervised Contrastive Learning (WeakSupCon) that utilizes bag-level labels. In our method, we employ multi-task learning and define distinct contrastive learning losses for samples with different bag labels. Our experiments demonstrate that the features generated using WeakSupCon significantly enhance MIL classification performance compared to self-supervised approaches across three datasets.

[CV-58] CA-W3D: Leveraging Context-Aware Knowledge for Weakly Supervised Monocular 3D Detection

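【Quick read】: This paper addresses the difficulty weakly supervised monocular 3D object detection has in capturing the global context needed for reliable 3D reasoning. Conventional label-efficient methods focus on object-centric features and neglect the contextual semantic relationships that are critical in complex scenes. To address this limitation, the paper proposes CA-W3D, a context-aware weakly supervised method optimized in a two-stage training paradigm. The key to the solution is twofold. First, a pre-training stage based on Region-wise Object Contrastive Matching (ROCM) aligns a trainable monocular 3D encoder with a frozen open-vocabulary 2D visual grounding model, encouraging the encoder to discriminate scene-specific attributes and acquire richer contextual knowledge. Second, a pseudo-label training stage uses a Dual-to-One Distillation (D2OD) mechanism to transfer contextual priors into the monocular encoder while preserving spatial fidelity and keeping inference computationally efficient. Experiments show that the method outperforms the existing state of the art on the KITTI dataset, underscoring the importance of context-aware knowledge in weakly supervised monocular 3D detection.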

Link: https://arxiv.org/abs/2503.04154
Authors: Chupeng Liu, Runkai Zhao, Weidong Cai
Affiliations: School of Computer Science, Faculty of Engineering, The University of Sydney (悉尼大学), NSW, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The paper includes 8 pages, 6 figures and 4 tables

Abstract:Weakly supervised monocular 3D detection, while less annotation-intensive, often struggles to capture the global context required for reliable 3D reasoning. Conventional label-efficient methods focus on object-centric features, neglecting contextual semantic relationships that are critical in complex scenes. In this work, we propose a Context-Aware Weak Supervision for Monocular 3D object detection, namely CA-W3D, to address this limitation in a two-stage training paradigm. Specifically, we first introduce a pre-training stage employing Region-wise Object Contrastive Matching (ROCM), which aligns regional object embeddings derived from a trainable monocular 3D encoder and a frozen open-vocabulary 2D visual grounding model. This alignment encourages the monocular encoder to discriminate scene-specific attributes and acquire richer contextual knowledge. In the second stage, we incorporate a pseudo-label training process with a Dual-to-One Distillation (D2OD) mechanism, which effectively transfers contextual priors into the monocular encoder while preserving spatial fidelity and maintaining computational efficiency during inference. Extensive experiments conducted on the public KITTI benchmark demonstrate the effectiveness of our approach, surpassing the SoTA method over all metrics, highlighting the importance of contextual-aware knowledge in weakly-supervised monocular 3D detection.

[CV-59] Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation

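【Quick read】: This paper targets the fact that real-world multi-view datasets are usually heterogeneous and imperfect, which leaves multi-view learning (MVL) methods designed for specific view combinations with little application potential and limited effectiveness. To address this, the paper proposes a novel robust multi-view learning method (RML) whose key idea is to perform representation fusion and alignment simultaneously. Specifically, it introduces a simple yet effective multi-view Transformer fusion network that converts heterogeneous multi-view data into homogeneous word embeddings and integrates the views through a sample-level attention mechanism to obtain a fused representation. It also proposes a simulated-perturbation-based multi-view contrastive learning framework that dynamically generates noise and unusable perturbations to simulate imperfect data conditions. Contrastive learning aligns the two resulting fused representations, yielding discriminative and robust representations. RML is self-supervised and can also be applied to downstream tasks as a regularizer. Experiments show that RML performs well in unsupervised multi-view clustering, noise-label classification, and cross-modal hashing retrieval.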

Link: https://arxiv.org/abs/2503.04151
Authors: Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu
Affiliations: University of Electronic Science and Technology of China (电子科技大学), Chengdu, China; Singapore University of Technology and Design (新加坡科技设计大学), Singapore; Southeast University (东南大学), Nanjing, China; The University of Tokyo (东京大学), Tokyo, Japan; Hainan University (海南大学), Haikou, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recently, multi-view learning (MVL) has garnered significant attention due to its ability to fuse discriminative information from multiple views. However, real-world multi-view datasets are often heterogeneous and imperfect, which usually makes MVL methods designed for specific combinations of views lack application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Specifically, we introduce a simple yet effective multi-view transformer fusion network where we transform heterogeneous multi-view data into homogeneous word embeddings, and then integrate multiple views by the sample-level attention mechanism to obtain a fused representation. Furthermore, we propose a simulated perturbation based multi-view contrastive learning framework that dynamically generates the noise and unusable perturbations for simulating imperfect data conditions. The simulated noisy and unusable data obtain two distinct fused representations, and we utilize contrastive learning to align them for learning discriminative and robust representations. Our RML is self-supervised and can also be applied for downstream tasks as a regularization. In experiments, we employ it in unsupervised multi-view clustering, noise-label classification, and as a plug-and-play module for cross-modal hashing retrieval. Extensive comparison experiments and ablation studies validate the effectiveness of RML.

[CV-60] DM-Adapter: Domain-Aware Mixture-of-Adapters for Text-Based Person Retrieval AAAI2025

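【Quick read】: This paper addresses two main problems in text-based person retrieval (TPR): (1) conventional full-model fine-tuning is computationally expensive and prone to overfitting; (2) existing parameter-efficient transfer learning (PETL) for TPR lacks fine-grained feature extraction. To solve them, the paper proposes the Domain-Aware Mixture-of-Adapters (DM-Adapter). The key is to combine Mixture-of-Experts (MOE) with PETL, designing a sparse mixture-of-adapters structure that strengthens fine-grained feature representations while remaining computationally efficient. In addition, a Domain-Aware Router, built from a novel gating function and learnable domain-aware prompts, further improves the routing strategy so that domain information is exploited effectively and routing imbalance is alleviated. Experiments show that DM-Adapter outperforms existing methods.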

Link: https://arxiv.org/abs/2503.04144
Authors: Yating Liu, Zimo Liu, Xiangyuan Lan, Wenming Yang, Yaowei Li, Qingmin Liao
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, accepted by AAAI 2025

Abstract:Text-based person retrieval (TPR) has gained significant attention as a fine-grained and challenging task that closely aligns with practical applications. Tailoring CLIP to the person domain is now an emerging research topic due to the abundant knowledge of vision-language pretraining, but challenges still remain during fine-tuning: (i) Previous full-model fine-tuning in TPR is computationally expensive and prone to overfitting. (ii) Existing parameter-efficient transfer learning (PETL) for TPR lacks fine-grained feature extraction. To address these issues, we propose Domain-Aware Mixture-of-Adapters (DM-Adapter), which unifies Mixture-of-Experts (MOE) and PETL to enhance fine-grained feature representations while maintaining efficiency. Specifically, Sparse Mixture-of-Adapters is designed in parallel to MLP layers in both vision and language branches, where different experts specialize in distinct aspects of person knowledge to handle features more finely. To promote the router to exploit domain information effectively and alleviate the routing imbalance, Domain-Aware Router is then developed by building a novel gating function and injecting learnable domain-aware prompts. Extensive experiments show that our DM-Adapter achieves state-of-the-art performance, outperforming previous methods by a significant margin.
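
The sparse routing idea can be illustrated with a short sketch. It is not the authors' implementation: it shows plain top-k gating over a set of adapters, and the names `top_k_gate` and `sparse_adapter_output`, the default `k=2`, and the assumption that every adapter returns a vector of the same length as its input are all made up for illustration. The paper's Domain-Aware Router uses its own gating function and learnable prompts, which are not reproduced here.

```python
import math


def top_k_gate(logits, k=2):
    """Softmax over the k largest router logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def sparse_adapter_output(x, adapters, router_logits, k=2):
    """Gated sum of adapter outputs, meant to be added in parallel to an MLP branch."""
    gates = top_k_gate(router_logits, k)
    out = [0.0] * len(x)
    for gate, adapter in zip(gates, adapters):
        if gate > 0.0:  # unselected experts are never evaluated
            out = [o + gate * y for o, y in zip(out, adapter(x))]
    return out
```

Because only `k` adapters run per input, the cost stays close to that of a single adapter even when many experts are available.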

[CV-61] Robust Computer-Vision based Construction Site Detection for Assistive-Technology Applications

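【Quick read】: This paper addresses the challenges that blind and low-vision people face when navigating urban environments, particularly scenes with dynamic and unpredictable elements such as construction sites. Conventional navigation apps and detection tools struggle to recognize the complex, highly variable obstacles found at construction sites, which compromises safe passage. The paper therefore proposes a new computer-vision-based system whose key idea is to combine open-vocabulary object detection, a YOLO-based scaffolding-pole detection model, and an optical character recognition (OCR) module to comprehensively identify and interpret construction site elements in support of assistive navigation. In static tests across seven construction sites the system reached an overall accuracy of 88.56%, and at close range (2-4 m) it achieved a 100% detection rate at all tested angles, improving the reliability and practicality of accessible navigation.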

Link: https://arxiv.org/abs/2503.04139
Authors: Junchi Feng, Giles Hamilton-Fletcher, Nikhil Ballem, Michael Batavia, Yifei Wang, Jiuling Zhong, Maurizio Porfiri, John-Ross Rizzo
Affiliations: Beijing Institute of Technology (北京理工大学); New York University (纽约大学); Google (谷歌); University of California, San Diego (加州大学圣地亚哥分校); New York University (纽约大学); New York University (纽约大学); New York University (纽约大学); New York University (纽约大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Navigating urban environments poses significant challenges for people with disabilities, particularly those with blindness and low vision. Environments with dynamic and unpredictable elements like construction sites are especially challenging. Construction sites introduce hazards like uneven surfaces, obstructive barriers, hazardous materials, and excessive noise, and they can alter routing, complicating safe mobility. Existing assistive technologies are limited, as navigation apps do not account for construction sites during trip planning, and detection tools that attempt hazard recognition struggle to address the extreme variability of construction paraphernalia. This study introduces a novel computer vision-based system that integrates open-vocabulary object detection, a YOLO-based scaffolding-pole detection model, and an optical character recognition (OCR) module to comprehensively identify and interpret construction site elements for assistive navigation. In static testing across seven construction sites, the system achieved an overall accuracy of 88.56%, reliably detecting objects from 2m to 10m within a 0°–75° angular offset. At closer distances (2–4m), the detection rate was 100% at all tested angles. At

[CV-62] Real-time Spatial-temporal Traversability Assessment via Feature-based Sparse Gaussian Process

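【Quick read】: This paper addresses terrain traversability assessment for autonomous navigation of ground mobile robots over complex terrain, especially in outdoor unstructured environments. The key to the solution is a novel spatial-temporal traversability assessment method. First, sparse Gaussian processes (SGP) extract geometric features (curvature, gradient, elevation, and so on) directly from point cloud scans to build a high-resolution local traversability map. Second, a spatial-temporal Bayesian Gaussian kernel (BGK) inference method dynamically evaluates traversability scores, integrating historical and real-time data while accounting for slope, flatness, gradient, and uncertainty. Finally, GPU acceleration gives the feature extraction step real-time performance. Simulation experiments show that the method outperforms existing techniques in both accuracy and computational efficiency.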

Link: https://arxiv.org/abs/2503.04134
Authors: Senming Tan, Zhenyu Hou, Zhihao Zhang, Long Xu, Mengke Zhang, Zhaoqi He, Chao Xu, Fei Gao, Yanjun Cao
Affiliations: Huzhou Institute of Zhejiang University (湖州浙江大学研究所), Zhejiang University (浙江大学), Hangzhou 310027, China (中国)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 10 figures

Abstract:Terrain analysis is critical for the practical application of ground mobile robots in real-world tasks, especially in outdoor unstructured environments. In this paper, we propose a novel spatial-temporal traversability assessment method, which aims to enable autonomous robots to effectively navigate through complex terrains. Our approach utilizes sparse Gaussian processes (SGP) to extract geometric features (curvature, gradient, elevation, etc.) directly from point cloud scans. These features are then used to construct a high-resolution local traversability map. Then, we design a spatial-temporal Bayesian Gaussian kernel (BGK) inference method to dynamically evaluate traversability scores, integrating historical and real-time data while considering factors such as slope, flatness, gradient, and uncertainty metrics. GPU acceleration is applied in the feature extraction step, and the system achieves real-time performance. Extensive simulation experiments across diverse terrain scenarios demonstrate that our method outperforms SOTA approaches in both accuracy and computational efficiency. Additionally, we develop an autonomous navigation framework integrated with the traversability map and validate it with a differential driven vehicle in complex outdoor environments. Our code will be open-source for further research and development by the community, this https URL.

[CV-63] Q-PART: Quasi-Periodic Adaptive Regression with Test-time Training for Pediatric Left Ventricular Ejection Fraction Regression CVPR2025

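【Quick read】: This paper addresses the challenge of adaptive pediatric Left Ventricular Ejection Fraction (LVEF) assessment. Existing Test-time Training (TTT) methods are promising but have two main limitations: they are designed primarily for classification rather than continuous-value regression, and they lack mechanisms for handling the quasi-periodic nature of cardiac signals. To address these problems, the paper proposes a novel Quasi-Periodic Adaptive Regression with Test-time Training (Q-PART) framework. The key is that, during training, parameterized helix trajectories are combined with Neural Controlled Differential Equations to decompose the echocardiogram into periodic and aperiodic components in latent space; during inference, a variance minimization strategy is applied across image augmentations that simulate common ultrasound quality defects, with different adaptation rates for the periodic and aperiodic components. Theoretical analysis shows that this variance minimization objective effectively bounds the regression error under mild conditions. Experiments further show that Q-PART not only significantly outperforms existing methods but also achieves high mAUROC scores (up to 0.9747) and maintains gender fairness across all metrics, confirming its robustness and practical utility in pediatric echocardiography analysis.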

Link: https://arxiv.org/abs/2503.04131
Authors: Jie Liu, Tiexin Qin, Hui Liu, Yilei Shi, Lichao Mou, Xiao Xiang Zhu, Shiqi Wang, Haoliang Li
Affiliations: City University of Hong Kong (香港城市大学); MedAI Technology (Wuxi) Co. Ltd. (医智科技（无锡）有限公司); Technische Universität München (慕尼黑工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to CVPR 2025

Abstract:In this work, we address the challenge of adaptive pediatric Left Ventricular Ejection Fraction (LVEF) assessment. While Test-time Training (TTT) approaches show promise for this task, they suffer from two significant limitations. Existing TTT works are primarily designed for classification tasks rather than continuous value regression, and they lack mechanisms to handle the quasi-periodic nature of cardiac signals. To tackle these issues, we propose a novel Quasi-Periodic Adaptive Regression with Test-time Training (Q-PART) framework. In the training stage, the proposed Quasi-Period Network decomposes the echocardiogram into periodic and aperiodic components within latent space by combining parameterized helix trajectories with Neural Controlled Differential Equations. During inference, our framework further employs a variance minimization strategy across image augmentations that simulate common quality issues in echocardiogram acquisition, along with differential adaptation rates for periodic and aperiodic components. Theoretical analysis is provided to demonstrate that our variance minimization objective effectively bounds the regression error under mild conditions. Furthermore, extensive experiments across three pediatric age groups demonstrate that Q-PART not only significantly outperforms existing approaches in pediatric LVEF prediction, but also exhibits strong clinical screening capability with high mAUROC scores (up to 0.9747) and maintains gender-fair performance across all metrics, validating its robustness and practical utility in pediatric echocardiography analysis.
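
To make the test-time objective concrete, here is a minimal sketch of a variance-across-augmentations score. It only computes the quantity that would be minimized; the gradient-based adaptation, the periodic and aperiodic split, and the actual augmentations used in the paper are not shown. The function name `augmentation_variance` and the callables passed to it are hypothetical.

```python
def augmentation_variance(predict, x, augmentations):
    """Population variance of a scalar prediction over augmented copies of one input.

    predict:        callable mapping an input to a single float (e.g. an LVEF estimate)
    augmentations:  list of callables, each returning a perturbed copy of x
    """
    preds = [predict(augment(x)) for augment in augmentations]
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)
```

A low score means the regressor gives nearly the same answer however the input is degraded, which is the behaviour a test-time update would push toward.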

[CV-64] Token-Efficient Long Video Understanding for Multimodal LLMs

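【Quick read】: This paper addresses the lack of explicit temporal modeling in existing video-based multimodal large language models (Video-LLMs). Many existing methods process frames independently and therefore fail to capture dynamic patterns or handle long videos efficiently, which limits their video understanding ability. To solve this, the paper proposes a new architecture named STORM, whose key component is a dedicated temporal encoder placed between the image encoder and the large language model. The temporal encoder uses the Mamba state space model to inject temporal information into the image tokens, producing enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only improves video reasoning but also supports effective token reduction strategies, such as test-time sampling and training-based temporal and spatial pooling, which substantially lower the computational demands on the large language model while retaining key temporal information. By combining these techniques, STORM reduces both training and inference latency while improving performance, reaching state-of-the-art results on several long video understanding benchmarks, with computation costs reduced by up to 8× and decoding latency reduced by 2.4 to 2.9×.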

Link: https://arxiv.org/abs/2503.04130
Authors: Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
Affiliations: NVIDIA; Rutgers University; UC Berkeley; MIT; Nanjing University; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to 8× and the decoding latency by 2.4-2.9× for the fixed numbers of input frames. Project page is available at this https URL
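
The two token reduction strategies named above can be pictured with a small sketch. It assumes the video is stored as a nested list `frames[frame][token][channel]` and uses plain average pooling and strided sampling; the helper names, the pooling operator, and the strides are illustrative assumptions, not details taken from the paper.

```python
def temporal_avg_pool(frames, stride):
    """Average each token position over groups of `stride` consecutive frames."""
    pooled = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]  # the last group may be shorter
        n_tokens, dim = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(frame[t][d] for frame in group) / len(group) for d in range(dim)]
            for t in range(n_tokens)
        ])
    return pooled


def sample_frames(frames, step):
    """Test-time sampling: keep every `step`-th frame's tokens and drop the rest."""
    return frames[::step]
```

With `stride=2`, a 4-frame clip is reduced to 2 frames' worth of tokens, halving the sequence the LLM has to process.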

[CV-65] Diff-Reg v2: Diffusion-Based Matching Matrix Estimation for Image Matching and 3D Registration

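【Quick read】: This paper addresses the challenge of establishing reliable correspondences in cross-modal registration tasks (such as 2D image to 3D point cloud) and single-modal ones (between 2D images, or between 3D point clouds). The core difficulties are the ambiguous matches caused by scale inconsistency, symmetry, and large deformations. Conventional feature-based or correspondence-based methods usually rely on geometric or semantic features to generate initial candidate correspondences; many are further limited to geometric priors designed for one particular enhancement goal, and they easily fall into local optima in complex matching scenarios. The key innovation is a new paradigm that uses a diffusion model in matrix space for robust matching matrix estimation. Specifically, the method treats correspondence estimation as a denoising diffusion process in the matching matrix space, gradually refining the intermediate matching matrix toward the globally optimal one. In addition, the paper designs adaptive matching matrix embedding implementations for the different registration tasks (2D-2D, 3D-3D, 2D-3D) and adopts a lightweight denoising module for efficiency, performing multi-step denoising predictions through reverse sampling at inference time.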

Link: https://arxiv.org/abs/2503.04127
Authors: Qianliang Wu, Haobo Jiang, Yaqing Ding, Lei Luo, Jin Xie, Jian Yang
Affiliations: PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology (南京理工大学计算机科学与工程学院; 高维信息智能感知系统教育部重点实验室; PCA实验室); Nanyang Technological University (南洋理工大学); Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague (捷克布拉格捷克技术大学电气工程学院视觉识别小组); State Key Laboratory for Novel Software Technology, Nanjing University (南京大学新型软件技术国家重点实验室); School of Intelligence Science and Technology, Nanjing University, Suzhou, China (南京大学苏州智能科学与技术学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2403.19919

Abstract:Establishing reliable correspondences is crucial for all registration tasks, including 2D image registration, 3D point cloud registration, and 2D-3D image-to-point cloud registration. However, these tasks are often complicated by challenges such as scale inconsistencies, symmetry, and large deformations, which can lead to ambiguous matches. Previous feature-based and correspondence-based methods typically rely on geometric or semantic features to generate or polish initial potential correspondences. Some methods typically leverage specific geometric priors, such as topological preservation, to devise diverse and innovative strategies tailored to a given enhancement goal, which cannot be exhaustively enumerated. Additionally, many previous approaches rely on a single-step prediction head, which can struggle with local minima in complex matching scenarios. To address these challenges, we introduce an innovative paradigm that leverages a diffusion model in matrix space for robust matching matrix estimation. Our model treats correspondence estimation as a denoising diffusion process in the matching matrix space, gradually refining the intermediate matching matrix to the optimal one. Specifically, we apply the diffusion model in the doubly stochastic matrix space for 3D-3D and 2D-3D registration tasks. In the 2D image registration task, we deploy the diffusion model in a matrix subspace where dual-softmax projection regularization is applied. For all three registration tasks, we provide adaptive matching matrix embedding implementations tailored to the specific characteristics of each task while maintaining a consistent “match-to-warp” encoding pattern. Furthermore, we adopt a lightweight design for the denoising module. In inference, once points or image features are extracted and fixed, this module performs multi-step denoising predictions through reverse sampling.
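
The abstract mentions a dual-softmax projection for the 2D case without spelling it out. A common form in image matching multiplies a row-wise softmax and a column-wise softmax of the score matrix element by element; the sketch below assumes that form. The function names are hypothetical, and how the paper embeds this projection inside the diffusion process is not shown.

```python
import math


def softmax(values):
    """Numerically stable softmax of a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(scores):
    """Element-wise product of the row-wise and column-wise softmax of `scores`."""
    by_row = [softmax(row) for row in scores]
    by_col = [softmax(list(col)) for col in zip(*scores)]  # by_col[j][i]
    return [
        [by_row[i][j] * by_col[j][i] for j in range(len(scores[0]))]
        for i in range(len(scores))
    ]
```

An entry is close to 1 only when the pair is the best candidate along both its row and its column, which suppresses one-sided matches.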

[CV-66] DVM-SLAM: Decentralized Visual Monocular Simultaneous Localization and Mapping for Multi-Agent Systems

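【Quick read】: This paper addresses Cooperative Simultaneous Localization and Mapping (C-SLAM), in which multiple agents map an unknown environment while simultaneously localizing themselves. Conventional centralized approaches carry the risk of a single point of failure and suffer from computational bottlenecks. The key solution proposed here is Decentralized Visual Monocular SLAM (DVM-SLAM), the first open-source decentralized monocular visual C-SLAM system. By relying only on low-cost, lightweight monocular vision sensors, DVM-SLAM is well suited to small robots and micro aerial vehicles (MAVs), and through information sharing, drift reduction, and collective exploration it improves robustness, scalability, and accuracy. Experiments on real physical robots demonstrate the system's potential for real-time multi-agent autonomous navigation and show accuracy comparable to state-of-the-art centralized monocular C-SLAM methods.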

Link: https://arxiv.org/abs/2503.04126
Authors: Joshua Bird, Jan Blumenkamp, Amanda Prorok
Affiliations: Department of Computer Science and Technology, University of Cambridge (计算机科学与技术系，剑桥大学)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:

Abstract:Cooperative Simultaneous Localization and Mapping (C-SLAM) enables multiple agents to work together in mapping unknown environments while simultaneously estimating their own positions. This approach enhances robustness, scalability, and accuracy by sharing information between agents, reducing drift, and enabling collective exploration of larger areas. In this paper, we present Decentralized Visual Monocular SLAM (DVM-SLAM), the first open-source decentralized monocular C-SLAM system. By only utilizing low-cost and light-weight monocular vision sensors, our system is well suited for small robots and micro aerial vehicles (MAVs). DVM-SLAM’s real-world applicability is validated on physical robots with a custom collision avoidance framework, showcasing its potential in real-time multi-agent autonomous navigation scenarios. We also demonstrate comparable accuracy to state-of-the-art centralized monocular C-SLAM systems. We open-source our code and provide supplementary material online.

[CV-67] GAGrasp: Geometric Algebra Diffusion for Dexterous Grasping ICRA2025

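【Quick read】: This paper addresses equivariance to SE(3) transformations in dexterous grasp generation, while improving the data and parameter efficiency of the model and ensuring that generated grasps are physically plausible and stable. The key innovation is a new framework named GAGrasp, which uses geometric algebra representations to embed the SE(3) symmetry constraint directly into the architecture and thereby achieve equivariance to these transformations. A differentiable physics-informed refinement layer further improves the stability and real-world applicability of the generated grasps. Experiments show that the method is clearly better than existing techniques in generalization, stability, and adaptability.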

Link: https://arxiv.org/abs/2503.04123
Authors: Tao Zhong, Christine Allen-Blanchette
Affiliations: unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICRA 2025

Abstract:We propose GAGrasp, a novel framework for dexterous grasp generation that leverages geometric algebra representations to enforce equivariance to SE(3) transformations. By encoding the SE(3) symmetry constraint directly into the architecture, our method improves data and parameter efficiency while enabling robust grasp generation across diverse object poses. Additionally, we incorporate a differentiable physics-informed refinement layer, which ensures that generated grasps are physically plausible and stable. Extensive experiments demonstrate the model’s superior performance in generalization, stability, and adaptability compared to existing methods. Additional details at this https URL

[CV-68] Simple Self Organizing Map with Visual Transformer

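【Quick read】: This paper tries to solve the poor performance of Vision Transformers (ViTs) on small datasets, which stems from their lack of inductive biases. Current methods usually mitigate this limitation implicitly, through pretext tasks or by distilling knowledge from convolutional neural networks (CNNs). This paper takes a different route: it uses the inherent ability of Self-Organizing Maps (SOMs) to preserve topology and spatial organization to address the limitations of ViTs directly in small-sample settings. Despite this potential, combining SOMs with modern deep learning architectures has received little study. The work therefore explores how ViTs and SOMs can strengthen each other; the key is a framework that merges the advantages of both, so that they show significantly improved performance in both unsupervised and supervised tasks.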

Link: https://arxiv.org/abs/2503.04121
Authors: Alan Luo, Kaiwen Yuan
Affiliations: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Safari AI Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures. Submitted to IEEE. All experiments and code work were performed by the first author, with the second author serving in a PI/mentor role, guiding the progression of the work

Abstract:Vision Transformers (ViTs) have demonstrated exceptional performance in various vision tasks. However, they tend to underperform on smaller datasets due to their inherent lack of inductive biases. Current approaches address this limitation implicitly, often by pairing ViTs with pretext tasks or by distilling knowledge from convolutional neural networks (CNNs) to strengthen the prior. In contrast, Self-Organizing Maps (SOMs), a widely adopted self-supervised framework, are inherently structured to preserve topology and spatial organization, making them a promising candidate to directly address the limitations of ViTs in limited or small training datasets. Despite this potential, equipping SOMs with modern deep learning architectures remains largely unexplored. In this study, we conduct a novel exploration on how Vision Transformers (ViTs) and Self-Organizing Maps (SOMs) can empower each other, aiming to bridge this critical research gap. Our findings demonstrate that these architectures can synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks. Code will be publicly available.

[CV-69] SCSA: A Plug-and-Play Semantic Continuous-Sparse Attention for Arbitrary Semantic Style Transfer CVPR2025

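【Quick read】: This paper addresses the poor performance of existing attention-based arbitrary style transfer methods (CNN-based, Transformer-based, and diffusion-based) when the content image and the style image share the same semantics: the style of a semantic region in the generated stylized image is inconsistent with the style of the corresponding region in the style image. The paper argues that the root cause is the failure of these methods to consider the relationship between local regions and semantic regions. To solve this, it proposes a plug-and-play module called Semantic Continuous-Sparse Attention (SCSA). The key to SCSA is the combination of Semantic Continuous Attention and Semantic Sparse Attention: the former ensures that each query point fully attends to all the continuous key points in the same semantic region that reflect the region's overall style characteristics; the latter lets each query point focus on the most similar sparse key point in the same semantic region, capturing that region's specific stylistic texture. With this combination, SCSA aligns the overall style of corresponding semantic regions and also transfers their vivid textures. Qualitative and quantitative results confirm that SCSA enables attention-based arbitrary style transfer methods to generate high-quality semantic stylized images.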

Link: https://arxiv.org/abs/2503.04119
Authors: Chunnan Shang, Zhizhong Wang, Hongwei Wang, Xiangming Meng
Affiliations: Zhejiang University (浙江大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Attention-based arbitrary style transfer methods, including CNN-based, Transformer-based, and Diffusion-based, have flourished and produced high-quality stylized images. However, they perform poorly on the content and style images with the same semantics, i.e., the style of the corresponding semantic region of the generated stylized image is inconsistent with that of the style image. We argue that the root cause lies in their failure to consider the relationship between local regions and semantic regions. To address this issue, we propose a plug-and-play semantic continuous-sparse attention, dubbed SCSA, for arbitrary semantic style transfer – each query point considers certain key points in the corresponding semantic region. Specifically, semantic continuous attention ensures each query point fully attends to all the continuous key points in the same semantic region that reflect the overall style characteristics of that region; Semantic sparse attention allows each query point to focus on the most similar sparse key point in the same semantic region that exhibits the specific stylistic texture of that region. By combining the two modules, the resulting SCSA aligns the overall style of the corresponding semantic regions while transferring the vivid textures of these regions. Qualitative and quantitative results prove that SCSA enables attention-based arbitrary style transfer methods to produce high-quality semantic stylized images.
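
The two attention variants can be sketched for a single query point as follows. This is an illustration under simplifying assumptions, not the paper's module: semantic regions are given as one label per key, similarity is a plain dot product, at least one key is assumed to share the query's label, and the function name `semantic_attention` and its arguments are hypothetical. How SCSA fuses the two outputs is not shown.

```python
import math


def semantic_attention(query, keys, values, query_label, key_labels, sparse=False):
    """Attend only to keys that carry the same semantic label as the query.

    sparse=False: softmax over every same-label key (the "continuous" variant).
    sparse=True:  return the value of the single most similar same-label key.
    """
    same = [i for i, label in enumerate(key_labels) if label == query_label]
    scores = [sum(q * k for q, k in zip(query, keys[i])) for i in same]
    if sparse:
        best = same[max(range(len(same)), key=lambda n: scores[n])]
        return list(values[best])
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(weights[n] / total * values[same[n]][d] for n in range(len(same)))
        for d in range(dim)
    ]
```

Keys from other semantic regions never contribute, so a "sky" query cannot pick up style from a "tree" region regardless of how similar the raw features are.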

[CV-70] Fractional Correspondence Framework in Detection Transformer

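【Quick read】: This paper addresses the inability of DETR's strict one-to-one matching strategy to handle objects of varying densities and distributions, for example failing to deal with multiple detections of the same object or missing small objects. To solve this, the paper proposes a flexible matching strategy called the Regularized Transport Plan (RTP). The key to RTP is that the differentiable Sinkhorn algorithm produces soft, fractional matches instead of strict one-to-one assignments, capturing the cost between predicted boxes and ground-truth annotations more accurately and improving the correspondence between the two sets. This markedly improves the model's ability to handle different object densities and distributions. Experiments on the MS-COCO and VOC datasets confirm its effectiveness: RTP-DETR surpasses Deform-DETR and DINO-DETR, with absolute gains in mean average precision (mAP) of +3.8% and +1.7%, respectively.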

Link: https://arxiv.org/abs/2503.04107
Authors: Masoumeh Zareapoor, Pourya Shamsolmoali, Huiyu Zhou, Yue Lu, Salvador García
Affiliations: Shanghai Jiao Tong University (上海交通大学); East China Normal University (华东师范大学); University of Leicester (莱斯特大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Detection Transformer (DETR), by incorporating the Hungarian algorithm, has significantly simplified the matching process in object detection tasks. This algorithm facilitates optimal one-to-one matching of predicted bounding boxes to ground-truth annotations during training. While effective, this strict matching process does not inherently account for the varying densities and distributions of objects, leading to suboptimal correspondences such as failing to handle multiple detections of the same object or missing small objects. To address this, we propose the Regularized Transport Plan (RTP). RTP introduces a flexible matching strategy that captures the cost of aligning predictions with ground truths to find the most accurate correspondences between these sets. By utilizing the differentiable Sinkhorn algorithm, RTP allows for soft, fractional matching rather than strict one-to-one assignments. This approach enhances the model’s capability to manage varying object densities and distributions effectively. Our extensive evaluations on the MS-COCO and VOC benchmarks demonstrate the effectiveness of our approach. RTP-DETR surpasses the performance of Deform-DETR and the recently introduced DINO-DETR, achieving absolute gains in mAP of +3.8% and +1.7%, respectively.
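
For readers unfamiliar with Sinkhorn matching, the following is a minimal sketch of an entropy-regularized transport plan between predictions (rows) and ground-truth boxes (columns). It assumes uniform marginals, a fixed iteration count, and a hand-written cost matrix; the function name, the `epsilon` value, and these choices are illustrative assumptions, and the paper's actual cost terms and regularization are not reproduced.

```python
import math


def sinkhorn(cost, epsilon=0.5, n_iters=200):
    """Soft matching plan for a cost matrix, with uniform row and column marginals."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: a low matching cost becomes a high affinity.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_mass = [1.0 / n] * n  # each prediction carries the same total mass
    col_mass = [1.0 / m] * m  # each ground-truth box receives the same total mass
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

With three predictions and two ground-truth boxes, every entry of the returned plan is positive, so a ground-truth box can spread its mass over several predictions instead of being tied to exactly one, which is the "fractional" behaviour the abstract contrasts with Hungarian matching.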

[CV-71] WeakMedSAM: Weakly-Supervised Medical Image Segmentation via SAM with Sub-Class Exploration and Prompt Affinity Mining

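【Quick read】: This paper aims to reduce the annotation burden in medical image segmentation caused by the high cost of pixel-level labels, proposing WeakMedSAM, a novel weakly supervised segmentation model based on the Segment Anything Model (SAM). The key lies in two modules: 1) a Sub-class Exploration Module, which mitigates the severe co-occurrence problem in medical images and learns more accurate feature representations; 2) a Prompt Affinity Mining Module, which uses SAM's prompt capability to generate an affinity map for random-walk refinement, improving the quality of the class activation maps. The method can be adapted to any SAM-like backbone, and its effectiveness is validated on the BraTS 2019, AbdomenCT-1K, and MSD Cardiac datasets.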

Link: https://arxiv.org/abs/2503.04106
Authors: Haoran Wang, Lian Huai, Wenbin Li, Lei Qi, Xingqun Jiang, Yinghuan Shi
Affiliations: National Key Laboratory for Novel Software Technology (国家重点实验室); National Institute of Healthcare Data Science (国家健康数据科学研究院), Nanjing University (南京大学); Nanjing Drum Tower Hospital (南京鼓楼医院), Nanjing, Jiangsu, China; Pattern Learning and Mining (PALM) Lab (模式学习与挖掘实验室), School of Computer Science and Engineering (计算机科学与工程学院), Southeast University (东南大学); BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司), Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We have witnessed remarkable progress in foundation models in vision tasks. Several recent works have utilized the Segment Anything Model (SAM) to boost segmentation performance in medical images, where most of them focus on training an adaptor to fine-tune on a large amount of pixel-wise annotated medical images in a fully supervised manner. In this paper, to reduce the labeling cost, we investigate a novel weakly-supervised SAM-based segmentation model, namely WeakMedSAM. Specifically, our proposed WeakMedSAM contains two modules: 1) to mitigate severe co-occurrence in medical images, a sub-class exploration module is introduced to learn accurate feature representations. 2) to improve the quality of the class activation maps, our prompt affinity mining module utilizes the prompt capability of SAM to obtain an affinity map for random-walk refinement. Our method can be applied to any SAM-like backbone, and we conduct experiments with SAMUS and EfficientSAM. The experimental results on three widely used benchmark datasets, i.e., BraTS 2019, AbdomenCT-1K, and MSD Cardiac dataset, show the promising results of our proposed WeakMedSAM. Our code is available at this https URL.
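
The abstract mentions random-walk refinement of class activation maps with an affinity map but gives no formula. A minimal sketch, assuming the common formulation in which the affinity matrix is row-normalized into a transition matrix and applied to the flattened CAM a few times, could look like this. The function name, the step count, and the assumption that every row of `affinity` has a non-zero sum are illustrative, not taken from the paper.

```python
def random_walk_refine(cam, affinity, n_steps=2):
    """Propagate flattened CAM scores along pixel affinities.

    cam:      list of activation scores, one per pixel
    affinity: square matrix of non-negative pixel-to-pixel affinities
    """
    # Row-normalize so that each row is a probability distribution over neighbours.
    transition = [[a / sum(row) for a in row] for row in affinity]
    for _ in range(n_steps):
        cam = [
            sum(transition[i][j] * cam[j] for j in range(len(cam)))
            for i in range(len(cam))
        ]
    return cam
```

Pixels with high mutual affinity end up sharing their activation, while a pixel with no affinity to the activated region keeps a score of zero.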

[CV-72] Image-Based Relocalization and Alignment for Long-Term Monitoring of Dynamic Underwater Environments

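【Quick read】: This paper addresses the challenges of automated underwater ecosystem management, in particular the difficulty conventional visual localization methods have with complex underwater imagery. The key solution is an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on video-derived images, enabling robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes. The paper also introduces the SQUIDLE+ VPR benchmark dataset, the first large-scale underwater VPR benchmark, which draws on a large amount of unstructured data from multiple robotic platforms and spans time intervals from days to years. The dataset contains diverse trajectories, arbitrary overlap, and varied seafloor types captured under different environmental conditions, including variations in depth, lighting, and turbidity.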

Link: https://arxiv.org/abs/2503.04096
Authors: Beverley Gorry, Tobias Fischer, Michael Milford, Alejandro Fontan
Affiliations: QUT Centre for Robotics, School of Electrical Engineering and Robotics, Queensland University of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Effective monitoring of underwater ecosystems is crucial for tracking environmental changes, guiding conservation efforts, and ensuring long-term ecosystem health. However, automating underwater ecosystem management with robotic platforms remains challenging due to the complexities of underwater imagery, which pose significant difficulties for traditional visual localization methods. We propose an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on video-derived images. This method enables robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes. Furthermore, we introduce the SQUIDLE+ VPR Benchmark, the first large-scale underwater VPR benchmark designed to leverage an extensive collection of unstructured data from multiple robotic platforms, spanning time intervals from days to years. The dataset encompasses diverse trajectories, arbitrary overlap, and diverse seafloor types captured under varying environmental conditions, including differences in depth, lighting, and turbidity. Our code is available at: this https URL
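
To make "estimation of rigid transformations" concrete, here is a minimal sketch of a closed-form least-squares fit of a planar rotation and translation to matched keypoints. The abstract does not say which solver the pipeline uses, or whether the transform is 2D or 3D, so the planar setting and the name `estimate_rigid_2d` are assumptions for illustration only.

```python
import math


def estimate_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src onto dst.

    src, dst: equal-length lists of matched (x, y) keypoints.
    """
    n = len(src)
    cx_s = sum(p[0] for p in src) / n
    cy_s = sum(p[1] for p in src) / n
    cx_d = sum(q[0] for q in dst) / n
    cy_d = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - cx_s, py - cy_s   # centred source point
        bx, by = qx - cx_d, qy - cy_d   # centred target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)      # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    # Translation that carries the rotated source centroid onto the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, (tx, ty)


src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(2.0, 3.0), (2.0, 4.0), (1.0, 3.0)]  # src rotated by 90 degrees, then shifted by (2, 3)
theta, t = estimate_rigid_2d(src, dst)
```

With real feature matches one would typically wrap a fit like this in an outlier-rejection loop, since underwater matches are likely to be noisy.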

[CV-73] Brain Tumor Detection in MRI Based on Federated Learning with YOLOv11

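Quick read: This paper tackles the tension between accuracy and efficiency when detecting brain tumors with magnetic resonance imaging (MRI) in medical diagnostics, while overcoming two main limitations of existing machine learning (ML) methods: insufficient data privacy protection and high latency. The key to the solution is a federated learning architecture combined with the YOLOv11 algorithm for more accurate brain tumor detection. The federated approach trains a deep learning model collaboratively across multiple institutions while protecting sensitive medical data, avoiding the privacy risks of traditional centralized learning. In addition, the YOLOv11 model is adapted to handle MRI data and is trained and validated on a broad MRI dataset from several anonymous medical facilities to ensure robustness and generalizability.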

Link: https://arxiv.org/abs/2503.04087
Authors: Sheikh Moonwara Anjum Monisha, Ratun Rahman
Affiliations: Department of Computer Science, Virginia Tech; Department of Electrical and Computer Engineering, The University of Alabama in Huntsville
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:One of the primary challenges in medical diagnostics is the accurate and efficient use of magnetic resonance imaging (MRI) for the detection of brain tumors. However, current machine learning (ML) approaches have two major limitations: data privacy and high latency. To solve this problem, we propose a federated learning architecture that incorporates the YOLOv11 algorithm for more accurate brain tumor detection. In contrast to earlier centralized learning methods, our federated learning approach protects the underlying medical data while supporting cooperative deep learning model training across multiple institutions. We adapt the YOLOv11 model to handle MRI data so that it can locate and identify tumor areas. To ensure robustness and generalizability, the model is trained and tested on a wide range of MRI data collected from several anonymous medical facilities. The results indicate that our method maintains significantly higher accuracy than conventional approaches.
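
The cooperative training across institutions can be illustrated with a minimal aggregation sketch. It assumes the common FedAvg rule (a sample-count-weighted average of client parameters); the abstract does not state which aggregation rule the authors use, and `federated_average` and the toy numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by local sample counts.

    client_weights: list of equal-length parameter lists, one per institution.
    client_sizes:   number of local training samples at each institution.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    ]


# Three hypothetical hospitals holding different amounts of local MRI data.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]
global_weights = federated_average(weights, sizes)
```

Only parameter vectors leave each site in this scheme; the MRI scans themselves stay local, which is the privacy property the abstract emphasises.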

[CV-74] Instrument-Splatting: Controllable Photorealistic Reconstruction of Surgical Instruments Using Gaussian Splatting

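Quick read: This paper addresses high-fidelity 3D reconstruction of surgical instruments from monocular laparoscopic video while keeping the reconstruction geometrically controllable and manipulable. It proposes a new method, Instrument-Splatting, whose key idea is to use 3D Gaussian Splatting for fully controllable 3D reconstruction of surgical instruments. Geometry pre-training binds the Gaussian point clouds to part meshes to introduce accurate geometric priors, and a forward kinematics model is defined to control the Gaussians flexibly so that they match the articulation of real instruments. To handle unposed videos, the paper also designs an instrument pose tracking method based on semantics-embedded Gaussians that robustly refines per-frame instrument poses and joint states in a render-and-compare manner, enabling realistic texture learning and photorealistic rendering.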

Link: https://arxiv.org/abs/2503.04082
Authors: Shuojue Yang, Zijian Wu, Mingxuan Hong, Qian Li, Daiyun Shen, Septimiu E. Salcudean, Yueming Jin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 11 pages, 5 figures

Abstract:Real2Sim is becoming increasingly important with the rapid development of surgical artificial intelligence (AI) and autonomy. In this work, we propose a novel Real2Sim methodology, Instrument-Splatting, that leverages 3D Gaussian Splatting to provide fully controllable 3D reconstruction of surgical instruments from monocular surgical videos. To maintain both high visual fidelity and manipulability, we introduce a geometry pre-training stage that binds Gaussian point clouds to part meshes with accurate geometric priors, and we define a forward kinematics model so that the Gaussians can be controlled as flexibly as real instruments. Afterward, to handle unposed videos, we design a novel instrument pose tracking method leveraging semantics-embedded Gaussians to robustly refine per-frame instrument poses and joint states in a render-and-compare manner, which allows our instrument Gaussians to accurately learn textures and reach photorealistic rendering. We validated our method on 2 publicly released surgical videos and 4 videos collected on ex vivo tissues and green screens. Quantitative and qualitative evaluations demonstrate the effectiveness and superiority of the proposed method.
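
As a rough picture of how forward kinematics can drive Gaussians that are bound to an instrument part, the sketch below rotates a set of Gaussian centres about a joint pivot. It is a planar simplification under stated assumptions: the paper works in 3D with its own kinematic chain, and `articulate`, the pivot, and the angle are hypothetical.

```python
import math


def articulate(points, pivot, angle):
    """Rotate part-bound Gaussian centres about a joint pivot (planar sketch).

    points: list of (x, y) centres attached to one articulated part.
    pivot:  (x, y) position of the joint.
    angle:  joint angle in radians.
    """
    c, s = math.cos(angle), math.sin(angle)
    moved = []
    for x, y in points:
        dx, dy = x - pivot[0], y - pivot[1]
        moved.append((pivot[0] + c * dx - s * dy, pivot[1] + s * dx + c * dy))
    return moved


# Two centres on a gripper jaw, opened by 90 degrees about a joint at (1, 0).
jaw = articulate([(2.0, 0.0), (3.0, 0.0)], pivot=(1.0, 0.0), angle=math.pi / 2)
```

Because every centre bound to the part moves with the same joint transform, changing a single joint state repositions the whole part consistently.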

[CV-75] Surgical Gaussian Surfels: Highly Accurate Real-time Surgical Scene Rendering

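Quick read: This paper targets accurate geometric reconstruction of deformable tissues in monocular endoscopic video, a fundamental challenge in robot-assisted minimally invasive surgery. Existing methods based on neural radiance fields (NeRF) and 3D Gaussian primitives render surgical scenes efficiently but still struggle with artifact-free handling of tool occlusions and with preserving fine anatomical details. These problems stem from unconstrained Gaussian scaling and insufficient surface alignment constraints during reconstruction. To address them, the paper proposes Surgical Gaussian Surfels (SGS), which constrains the scale component of the Gaussian covariance matrix along the view-aligned axis, turning anisotropic point primitives into surface-aligned elliptical splats. The key to the solution is a lightweight multi-layer perceptron (MLP) with locality constraints that predicts accurate surfel motion fields to cope with complex tissue deformation, together with homodirectional view-space positional gradients that split Gaussian surfels in over-reconstructed regions to capture image detail. In addition, surface normals are defined as the direction of steepest density change within each Gaussian surfel primitive, which allows accurate normal estimation without monocular normal priors. Evaluation on two in-vivo surgical datasets shows that SGS outperforms current state-of-the-art methods in surface geometry, normal map quality, and rendering efficiency while remaining competitive in real-time rendering performance.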

Link: https://arxiv.org/abs/2503.04079
Authors: Idris O. Sunmola, Zhenjun Zhao, Samuel Schmidgall, Yumeng Wang, Paul Maria Scheikl, Axel Krieger
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate geometric reconstruction of deformable tissues in monocular endoscopic video remains a fundamental challenge in robot-assisted minimally invasive surgery. Although recent volumetric and point primitive methods based on neural radiance fields (NeRF) and 3D Gaussian primitives have efficiently rendered surgical scenes, they still struggle with handling artifact-free tool occlusions and preserving fine anatomical details. These limitations stem from unrestricted Gaussian scaling and insufficient surface alignment constraints during reconstruction. To address these issues, we introduce Surgical Gaussian Surfels (SGS), which transforms anisotropic point primitives into surface-aligned elliptical splats by constraining the scale component of the Gaussian covariance matrix along the view-aligned axis. We predict accurate surfel motion fields using a lightweight Multi-Layer Perceptron (MLP) coupled with locality constraints to handle complex tissue deformations. We use homodirectional view-space positional gradients to capture fine image details by splitting Gaussian Surfels in over-reconstructed regions. In addition, we define surface normals as the direction of the steepest density change within each Gaussian surfel primitive, enabling accurate normal estimation without requiring monocular normal priors. We evaluate our method on two in-vivo surgical datasets, where it outperforms current state-of-the-art methods in surface geometry, normal map quality, and rendering efficiency, while remaining competitive in real-time rendering performance. We make our code available at this https URL
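
One way to read "the direction of the steepest density change" is sketched below. For a Gaussian whose covariance is built from a rotation matrix and three per-axis scales, the density falls off fastest along the axis with the smallest scale, so that axis can serve as the surfel normal. This is an interpretation rather than the authors' code; `surfel_normal` and the example values are hypothetical.

```python
def surfel_normal(rotation, scales):
    """Return the principal axis along which the Gaussian is thinnest.

    rotation: 3x3 rotation matrix as nested lists; column k is principal axis k.
    scales:   three per-axis standard deviations.
    The sign of the normal is ambiguous and would need to be resolved
    separately, for example by orienting it towards the camera.
    """
    k = min(range(3), key=lambda i: scales[i])
    return [rotation[row][k] for row in range(3)]


# A flat surfel rotated by 90 degrees about the x-axis.
rotation = [[1.0, 0.0, 0.0],
            [0.0, 0.0, -1.0],
            [0.0, 1.0, 0.0]]
normal = surfel_normal(rotation, scales=(0.5, 0.3, 0.01))
```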

[CV-76] Spatial-Temporal Perception with Causal Inference for Naturalistic Driving Action Recognition

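Quick read: This paper addresses naturalistic driving action recognition against complex real-world backgrounds, where existing methods are of limited practical use because they struggle to observe subtle behavioral differences and to learn inter-frame features effectively. It proposes a novel Spatial-Temporal Perception (STP) architecture whose key idea is to emphasize both temporal information and the spatial relationships between key objects, with a causal decoder performing behavior recognition and temporal action localization. STP extracts temporal and spatial distance features directly from RGB video clips and integrates them at different scales to perceive subtle behavioral changes in challenging scenes. It also introduces a causal-aware module that explores the relationships between video frame features, markedly improving detection efficiency and performance. Experiments show that the framework achieves state-of-the-art performance on two public driver distraction detection benchmarks.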

Link: https://arxiv.org/abs/2503.04078
Authors: Qing Chang, Wei Dai, Zhihao Shuai, Limin Yu, Yutao Yue
Affiliations: School of Mechanical Engineering, Nanjing University of Science and Technology; School of Advanced Technology, Xi’an Jiaotong-Liverpool University; The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Naturalistic driving action recognition is essential for vehicle cabin monitoring systems. However, the complexity of real-world backgrounds presents significant challenges for this task, and previous approaches have struggled with practical implementation due to their limited ability to observe subtle behavioral differences and effectively learn inter-frame features from video. In this paper, we propose a novel Spatial-Temporal Perception (STP) architecture that emphasizes both temporal information and spatial relationships between key objects, incorporating a causal decoder to perform behavior recognition and temporal action localization. Without requiring multimodal input, STP directly extracts temporal and spatial distance features from RGB video clips. Subsequently, these dual features are jointly encoded by maximizing the expected likelihood across all possible permutations of the factorization order. By integrating temporal and spatial features at different scales, STP can perceive subtle behavioral changes in challenging scenarios. Additionally, we introduce a causal-aware module to explore relationships between video frame features, significantly enhancing detection efficiency and performance. We validate the effectiveness of our approach using two publicly available driver distraction detection benchmarks. The results demonstrate that our framework achieves state-of-the-art performance.

[CV-77] FREAK: Frequency-modulated High-fidelity and Real-time Audio-driven Talking Portrait Synthesis

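Quick read: This paper tackles high-fidelity lip-speech synchronization in audio-driven talking portrait synthesis. Current multi-stage pipelines and diffusion models produce high-quality results but are computationally expensive, while low-resource methods tailored to specific individuals are prone to mismatched lip movements. Most of these methods model the problem in the pixel domain, yet the authors find clear frequency-domain differences between synthesized and natural talking videos, an aspect that no prior work has addressed. The paper therefore proposes FREAK, a frequency-modulated, high-fidelity, real-time audio-driven talking portrait synthesis framework that models talking portraits from a frequency-domain perspective to improve the realism and detail of the synthesized images.

The key to the solution is two novel frequency-domain modules: 1) the Visual Encoding Frequency Modulator (VEFM), which couples multi-scale visual features in the frequency domain to better preserve visual frequency information and narrow the spectral gap between synthesized and natural frames; and 2) the Audio Visual Frequency Modulator (AVFM), which helps the model learn talking patterns in the frequency domain and thereby improves audio-visual synchronization. FREAK also jointly optimizes the model in the pixel and frequency domains and supports seamless switching between one-shot and video dubbing settings for greater flexibility. Experiments show that the method generates high-fidelity talking portraits with detailed facial textures and precise lip synchronization in real time, outperforming state-of-the-art methods.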

Link: https://arxiv.org/abs/2503.04067
Authors: Ziqi Ni, Ao Fu, Yi Zhou
Affiliations: School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Achieving high-fidelity lip-speech synchronization in audio-driven talking portrait synthesis remains challenging. While multi-stage pipelines or diffusion models yield high-quality results, they suffer from high computational costs. Some approaches perform well on specific individuals with low resources, yet still exhibit mismatched lip movements. The aforementioned methods are modeled in the pixel domain. We observed that there are noticeable discrepancies in the frequency domain between synthesized talking videos and natural videos. Currently, no research on talking portrait synthesis has considered this aspect. To address this, we propose a FREquency-modulated, high-fidelity, and real-time Audio-driven talKing portrait synthesis framework, named FREAK, which models talking portraits from the frequency domain perspective, enhancing the fidelity and naturalness of the synthesized portraits. FREAK introduces two novel frequency-based modules: 1) the Visual Encoding Frequency Modulator (VEFM), which couples multi-scale visual features in the frequency domain, better preserving visual frequency information and reducing the gap in the frequency spectrum between synthesized and natural frames; and 2) the Audio Visual Frequency Modulator (AVFM), which helps the model learn the talking pattern in the frequency domain and improves audio-visual synchronization. Additionally, we optimize the model in both the pixel domain and the frequency domain jointly. Furthermore, FREAK supports seamless switching between one-shot and video dubbing settings, offering enhanced flexibility. Due to its superior performance, it can simultaneously support high-resolution video results and real-time inference. Extensive experiments demonstrate that our method synthesizes high-fidelity talking portraits with detailed facial textures and precise lip synchronization in real time, outperforming state-of-the-art methods.
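
The observation that synthesized and natural frames can differ in the frequency domain even when they look similar on average can be shown with a toy example. The sketch below compares amplitude spectra of two 1-D signals standing in for image rows; it is not the VEFM or AVFM module, whose internals the abstract does not describe, and `amplitude_spectrum` and `spectral_gap` are hypothetical helper names.

```python
import cmath


def amplitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform of a 1-D signal."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def spectral_gap(a, b):
    """Mean absolute difference between two amplitude spectra."""
    sa, sb = amplitude_spectrum(a), amplitude_spectrum(b)
    return sum(abs(x - y) for x, y in zip(sa, sb)) / len(sa)


natural = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # fine, high-frequency detail
smoothed = [0.5] * 8                                 # same mean intensity, detail lost
gap = spectral_gap(natural, smoothed)
```

Both signals have the same mean, so a purely average-based pixel statistic would not separate them, while the spectral gap is clearly non-zero. A loss of this general kind is one plausible way to penalise missing high-frequency texture.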

[CV-78] H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision ICRA2025

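Quick read: This paper addresses the high computational cost of 3D occupancy prediction in autonomous driving and proposes a new method, H3O. Existing methods typically rely on attention-based 2D-to-3D transformation and complex 3D feature processing, which makes them expensive. H3O significantly reduces the computational cost through highly efficient architecture designs and, to compensate for the ambiguity in ground-truth 3D occupancy labels, uses auxiliary tasks to complement direct 3D supervision. Specifically, it integrates multi-camera depth estimation, semantic segmentation, and surface normal estimation via differentiable volume rendering, supervised by the corresponding 2D labels, which introduces rich and heterogeneous supervision signals. Experiments on the Occ3D-nuScenes and SemanticKITTI datasets demonstrate the effectiveness of the approach.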

Link: https://arxiv.org/abs/2503.04059
Authors: Yunxiao Shi, Hong Cai, Amin Ansari, Fatih Porikli
Affiliations: Qualcomm Technologies, Inc.; Qualcomm AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICRA 2025

Abstract:3D occupancy prediction has recently emerged as a new paradigm for holistic 3D scene understanding and provides valuable information for downstream planning in autonomous driving. Most existing methods, however, are computationally expensive, requiring costly attention-based 2D-3D transformation and 3D feature processing. In this paper, we present a novel 3D occupancy prediction approach, H3O, which features highly efficient architecture designs that incur a significantly lower computational cost than current state-of-the-art methods. In addition, to compensate for the ambiguity in ground-truth 3D occupancy labels, we advocate leveraging auxiliary tasks to complement the direct 3D supervision. In particular, we integrate multi-camera depth estimation, semantic segmentation, and surface normal estimation via differentiable volume rendering, supervised by corresponding 2D labels, which introduces rich and heterogeneous supervision signals. We conduct extensive experiments on the Occ3D-nuScenes and SemanticKITTI benchmarks that demonstrate the superiority of our proposed H3O.
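
The link between a 3D occupancy-like volume and a 2D depth label can be sketched with standard alpha compositing along a camera ray. The snippet assumes the usual volume rendering weights (transmittance times opacity); it is plain Python rather than an autodiff implementation, and the sampling scheme and the name `render_depth` are hypothetical, not taken from H3O.

```python
import math


def render_depth(densities, depths):
    """Expected ray depth from per-sample densities via alpha compositing.

    densities: non-negative density at each sample along the ray.
    depths:    increasing sample depths (at least two samples).
    """
    transmittance = 1.0
    expected = 0.0
    for i, sigma in enumerate(densities):
        # Spacing to the next sample; the last sample reuses the previous spacing.
        if i + 1 < len(depths):
            delta = depths[i + 1] - depths[i]
        else:
            delta = depths[i] - depths[i - 1]
        alpha = 1.0 - math.exp(-sigma * delta)        # opacity of this segment
        expected += transmittance * alpha * depths[i]
        transmittance *= 1.0 - alpha                  # light surviving past it
    return expected


# A ray that passes through free space and hits a dense surface at depth 3.
depth = render_depth([0.0, 0.0, 50.0, 0.0], [1.0, 2.0, 3.0, 4.0])
```

Because the rendered depth is a smooth function of the densities, a 2D depth label can supervise the 3D volume through it. The same compositing weights can be reused to render semantic classes or surface normals.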

[CV-79] EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models

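Quick read: This paper addresses two difficulties in video subtitle extraction: exploiting the temporal information of videos effectively and predicting accurate timestamps for subtitle text. Most existing methods use a multi-stage framework that processes each frame independently and therefore cannot fully exploit temporal context, and predicting accurate subtitle timestamps remains challenging even when using the OCR capability of Large Vision-Language Models (LVLMs). The paper proposes EVE, an end-to-end video subtitle extraction method whose key component is a novel adapter module, InterleavedVT, which interleaves the two modalities (visual and textual) to compress visual tokens effectively and combines the merits of average pooling and Q-Former for token compression. To account for the temporal information of videos, a sliding-window mechanism is introduced in the textual region compressor. To benchmark the task, the paper also builds ViSa, a large-scale dataset containing 2.5M videos. Experiments show that EVE outperforms existing open-source tools and LVLMs on ViSa.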

Link: https://arxiv.org/abs/2503.04058
Authors: Haiyang Yu, Jinghui Lu, Yanjie Wang, Yang Li, Han Wang, Can Huang, Bin Li
Affiliations: Shanghai Key Laboratory of Intelligent Information Processing; School of Computer Science, Fudan University; ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The advent of Large Vision-Language Models (LVLMs) has advanced video-based tasks such as video captioning and video understanding. Some previous research indicates that taking texts in videos as input can further improve the performance of video understanding. As an indispensable type of information in short videos or movies, subtitles can help LVLMs understand videos better. Most existing methods for video subtitle extraction are based on a multi-stage framework that handles each frame independently, so they can hardly exploit the temporal information of videos. Although some LVLMs exhibit robust OCR capability, predicting accurate timestamps for subtitle texts is still challenging. In this paper, we propose an End-to-end Video Subtitle Extraction method, called EVE, which consists of three modules: a vision encoder, an adapter module, and a large language model. To effectively compress the visual tokens from the vision encoder, we propose a novel adapter, InterleavedVT, to interleave two modalities. It contains a visual compressor and a textual region compressor. The proposed InterleavedVT exploits the merits of both average pooling and Q-Former in token compression. Taking the temporal information of videos into account, we introduce a sliding-window mechanism in the textual region compressor. To benchmark the video subtitle extraction task, we propose a large dataset, ViSa, including 2.5M videos. Extensive experiments on ViSa demonstrate that the proposed EVE can outperform existing open-source tools and LVLMs.
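
A minimal sketch of token compression by average pooling over a sliding window follows. It only illustrates the two ingredients named in the abstract; how InterleavedVT actually combines pooling, Q-Former, and the sliding window is not specified there, so the window and stride values and the name `average_pool_tokens` are hypothetical.

```python
def average_pool_tokens(tokens, window, stride):
    """Compress a token sequence by averaging over sliding windows.

    tokens: list of equal-length feature vectors, ordered in time.
    window: number of consecutive tokens merged into one output token.
    stride: step between window starts (stride < window gives overlap).
    """
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens) - window + 1, stride):
        chunk = tokens[start:start + window]
        pooled.append([sum(tok[d] for tok in chunk) / window for d in range(dim)])
    return pooled


# Eight toy tokens of dimension 2, compressed to three overlapping summaries.
tokens = [[float(i), float(2 * i)] for i in range(8)]
compressed = average_pool_tokens(tokens, window=4, stride=2)
```

Overlapping windows let each compressed token share context with its neighbours, which is one simple way to carry temporal information across frames.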

[CV-80] Underlying Semantic Diffusion for Effective and Efficient In-Context Learning

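Quick read: This paper addresses core challenges that diffusion models face in controllable image generation and dense prediction, including difficulty capturing underlying semantics (such as edges, textures, and shapes) and the resulting limits on contextual understanding and generation quality. Existing models also tend to have high computational cost and slow inference, which hinders real-time use. To address these challenges, the paper proposes Underlying Semantic Diffusion (US-Diffusion), an enhanced diffusion model that improves underlying semantics learning, computational efficiency, and in-context learning in multi-task scenarios.

The key to the approach is the Separate Gather Adapter (SGA), which decouples the input conditions of different tasks while sharing the architecture, enabling more efficient in-context learning and generalization across diverse visual domains. The paper also proposes a Feedback-Aided Learning (FAL) framework that uses feedback signals to guide the model in capturing semantic details and dynamically adapting to task-specific contextual cues. To optimize training and inference efficiency, it further designs a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at high-noise time steps. Experimental results show that US-Diffusion significantly outperforms existing methods on multiple tasks while substantially improving inference speed and training efficiency.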

Link: https://arxiv.org/abs/2503.04050
Authors: Zhong Ji, Weilong Cao, Yan Zhang, Yanwei Pang, Jungong Han, Xuelong Li
Affiliations: School of Electrical and Information Engineering, Tianjin Key Laboratory of Brain-inspired Intelligence Technology, Tianjin University; Department of Automation, Tsinghua University; Institute of Artificial Intelligence (TeleAI), China Telecom Corp Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have emerged as a powerful framework for tasks like controllable image generation and dense prediction. However, existing models often struggle to capture underlying semantics (e.g., edges, textures, shapes) and effectively utilize in-context learning, limiting their contextual understanding and image generation quality. Additionally, high computational costs and slow inference speeds hinder their real-time applicability. To address these challenges, we propose Underlying Semantic Diffusion (US-Diffusion), an enhanced diffusion model that boosts underlying semantics learning, computational efficiency, and in-context learning capabilities in multi-task scenarios. We introduce the Separate Gather Adapter (SGA), which decouples input conditions for different tasks while sharing the architecture, enabling better in-context learning and generalization across diverse visual domains. We also present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details and dynamically adapting to task-specific contextual cues. Furthermore, we propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels, which aims at optimizing training and inference efficiency while maintaining strong in-context learning performance. Experimental results demonstrate that US-Diffusion outperforms the state-of-the-art method, achieving an average reduction of 7.47 in FID on Map2Image tasks and an average reduction of 0.026 in RMSE on Image2Map tasks, while achieving approximately 9.45 times faster inference speed. Our method also demonstrates superior training efficiency and in-context learning capabilities, excelling in new datasets and tasks, highlighting its robustness and adaptability across diverse visual domains.

[CV-81] Beyond Existance: Fulfill 3D Reconstructed Scenes with Pseudo Details

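Quick read: This paper addresses the distortion of Gaussian primitives and the loss of detail in highly zoomed views that 3D Gaussian Splatting (3D-GS) suffers because of insufficient sampling during training. The key solution is a new method that combines diffusion models with multi-scale training on pseudo-ground-truth data, mitigating the distortion caused by zoom-in and zoom-out operations and markedly improving the quality of reconstructed scenes by supplementing precise details. The method achieves state-of-the-art performance on several benchmarks and extends the range of applications of 3D reconstruction.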

Link: https://arxiv.org/abs/2503.04037
Authors: Yifei Gao, Jun Huang, Lei Wang, Ruiting Dai, Jun Cheng
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The emergence of 3D Gaussian Splatting (3D-GS) has significantly advanced 3D reconstruction by providing high fidelity and fast training speeds across various scenarios. While recent efforts have mainly focused on improving model structures to compress data volume or reduce artifacts during zoom-in and zoom-out operations, they often overlook an underlying issue: training sampling deficiency. In zoomed-in views, Gaussian primitives can appear unregulated and distorted due to their dilation limitations and the insufficient availability of scale-specific training samples. Consequently, incorporating pseudo-details that ensure the completeness and alignment of the scene becomes essential. In this paper, we introduce a new training method that integrates diffusion models and multi-scale training using pseudo-ground-truth data. This approach not only notably mitigates the dilation and zoomed-in artifacts but also enriches reconstructed scenes with precise details out of existing scenarios. Our method achieves state-of-the-art performance across various benchmarks and extends the capabilities of 3D reconstruction beyond training datasets.

[CV-82] GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding

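Quick read: This paper addresses the low object segmentation accuracy and missing spatial reasoning capability of existing scene understanding methods based on 3D Gaussian Splatting (3DGS). Its key innovation is the GaussianGraph framework, which strengthens 3DGS scene understanding by introducing adaptive semantic clustering and scene graph generation. The core of the solution is a “Control-Follow” clustering strategy that adapts dynamically to scene scale and feature distribution, avoiding feature compression and significantly improving segmentation accuracy. The scene representation is enriched with object attributes and spatial relations extracted from 2D foundation models, and 3D correction modules filter implausible spatial relations through spatial consistency verification to ensure reliable scene graph construction. Experiments show that GaussianGraph outperforms state-of-the-art methods on both semantic segmentation and object grounding.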

Link: https://arxiv.org/abs/2503.04034
Authors: Xihan Wang, Dianyi Yang, Yu Gao, Yufeng Yue, Yi Yang, Mengyin Fu
Affiliations: Beijing Institute of Technology; National Key Lab of Autonomous Intelligent Unmanned Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have significantly improved semantic scene understanding, enabling natural language queries to localize objects within a scene. However, existing methods primarily focus on embedding compressed CLIP features into 3D Gaussians, suffering from low object segmentation accuracy and lacking spatial reasoning capabilities. To address these limitations, we propose GaussianGraph, a novel framework that enhances 3DGS-based scene understanding by integrating adaptive semantic clustering and scene graph generation. We introduce a “Control-Follow” clustering strategy, which dynamically adapts to scene scale and feature distribution, avoiding feature compression and significantly improving segmentation accuracy. Additionally, we enrich scene representation by integrating object attributes and spatial relations extracted from 2D foundation models. To address inaccuracies in spatial relationships, we propose 3D correction modules that filter implausible relations through spatial consistency verification, ensuring reliable scene graph construction. Extensive experiments on three datasets demonstrate that GaussianGraph outperforms state-of-the-art methods in both semantic segmentation and object grounding tasks, providing a robust solution for complex scene understanding and interaction.

[CV-83] Self-Supervised Large Scale Point Cloud Completion for Archaeological Site Restoration CVPR2025

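Quick read: This paper addresses the inability of existing self-supervised methods to produce high-fidelity point cloud completion for large objects with missing surfaces and an unbalanced point distribution. Its key contribution is a new method that completes large-scale point clouds using limited and imbalanced ground-truth data. Using rough boundary annotations of a region of interest, the original point cloud is projected into multiple-center-of-projection (MCOP) images, which turns point cloud completion into inpainting the missing pixels of the MCOP images. Because complete structures are lacking and the existing parts are unevenly distributed, the paper designs a self-supervised scheme that learns to fill the missing parts of the MCOP image in a way that resembles existing “complete” patches. Special loss functions further enhance the regularity and consistency of the completed MCOP images, which are mapped back to 3D space to form the final completion. Experiments confirm the method's advantage in completing more than 600 incomplete and unevenly distributed archaeological structures in Peru.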

Link: https://arxiv.org/abs/2503.04030
Authors: Aocheng Li, James R. Zimmer-Dauphinee, Rajesh Kalyanam, Ian Lindsay, Parker VanValkenburgh, Steven Wernke, Daniel Aliaga
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025

Abstract:Point cloud completion helps restore partial, incomplete point clouds that suffer from occlusions. Current self-supervised methods fail to give high-fidelity completion for large objects with missing surfaces and an unbalanced distribution of available points. In this paper, we present a novel method for restoring large-scale point clouds with limited and imbalanced ground truth. Using rough boundary annotations for a region of interest, we project the original point clouds into a multiple-center-of-projection (MCOP) image, where fragments are projected to images of 5 channels (RGB, depth, and rotation). Completion of the original point cloud is reduced to inpainting the missing pixels in the MCOP images. Due to the lack of complete structures and an unbalanced distribution of existing parts, we develop a self-supervised scheme which learns to infill the MCOP image with points resembling existing “complete” patches. Special losses are applied to further enhance the regularity and consistency of completed MCOP images, which are mapped back to 3D to form the final restoration. Extensive experiments demonstrate the superiority of our method in completing 600+ incomplete and unbalanced archaeological structures in Peru.

[CV-84] TextDoctor: Unified Document Image Inpainting via Patch Pyramid Diffusion Models

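Quick read: This paper addresses defects in digital versions of real-world text documents, such as environmental corrosion of the original document, low-quality scanning, and human interference. Existing document restoration and inpainting methods usually generalize poorly to unseen document styles and perform badly on high-resolution images. To meet these challenges, the paper introduces TextDoctor, a new unified document image inpainting method. Inspired by human reading behavior, TextDoctor restores fundamental text elements from patches and then applies diffusion models to entire document images, rather than training models for specific document types. To handle text of varying sizes and avoid the out-of-memory issues common with high-resolution documents, it uses structure pyramid prediction and patch pyramid diffusion models, which exploit multiscale inputs and pyramid patches to improve inpainting quality both globally and locally. Extensive qualitative and quantitative experiments confirm that TextDoctor outperforms state-of-the-art methods in restoring various types of high-resolution document images.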

Link: https://arxiv.org/abs/2503.04021
Authors: Wanglong Lu, Lingming Su, Jingjing Zheng, Vinícius Veloso de Melo, Farzaneh Shoeleh, John Hawkin, Terrence Tricco, Hanli Zhao, Xianta Jiang
Affiliations: College of Computer Science and Artificial Intelligence, Wenzhou University, China; Department of Mathematics, University of British Columbia, Canada; AI Analytics Team, Nasdaq; Department of Computer Science, Memorial University of Newfoundland, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, 25 figures

Abstract:Digital versions of real-world text documents often suffer from issues like environmental corrosion of the original document, low-quality scanning, or human interference. Existing document restoration and inpainting methods typically struggle with generalizing to unseen document styles and handling high-resolution images. To address these challenges, we introduce TextDoctor, a novel unified document image inpainting method. Inspired by human reading behavior, TextDoctor restores fundamental text elements from patches and then applies diffusion models to entire document images instead of training models on specific document types. To handle varying text sizes and avoid out-of-memory issues, common in high-resolution documents, we propose using structure pyramid prediction and patch pyramid diffusion models. These techniques leverage multiscale inputs and pyramid patches to enhance the quality of inpainting both globally and locally. Extensive qualitative and quantitative experiments on seven public datasets validated that TextDoctor outperforms state-of-the-art methods in restoring various types of high-resolution document images.

[CV-85] NsBM-GAT: A Non-stationary Block Maximum and Graph Attention Framework for General Traffic Crash Risk Prediction

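Quick read: This paper addresses two main challenges in traffic crash risk prediction. First, because real pre-crash data for individual vehicles is scarce, existing models usually rely on scenarios that researchers assume to be dangerous, which casts doubt on their applicability in practice. Second, crash risk prediction frameworks based on dashcam video capture the pre-crash behavior of a single vehicle but often lack key information about the motion of surrounding vehicles, even though the interaction between a vehicle and its surroundings strongly influences crash occurrence. To overcome these challenges, the paper proposes a novel non-stationary extreme value theory (EVT) model whose key idea is to optimize the covariate function nonlinearly with a graph attention network. The EVT component characterizes the stochastic nature of crashes through a probability distribution, which improves interpretability, while the nonlinear covariate function captures the interaction between the target vehicle and its multiple surrounding vehicles, enabling crash risk prediction across different driving tasks. The model is trained and tested on 100 sets of real pre-crash vehicle trajectories collected by drone over three years. The results confirm that it learns micro-level crash precursors, fits a more accurate probability distribution with the help of the nonlinear covariate function, and predicts rear-end and sideswipe crashes more accurately than existing models.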

Link: https://arxiv.org/abs/2503.04018
Authors: Kequan Chen, Pan Liu, Yuxuan Wang, David Z. W. Wang, Yifan Dai, Zhibin Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate prediction of traffic crash risks for individual vehicles is essential for enhancing vehicle safety. While significant attention has been given to traffic crash risk prediction, existing studies face two main challenges: First, due to the scarcity of individual vehicle data before crashes, most models rely on hypothetical scenarios deemed dangerous by researchers. This raises doubts about their applicability to actual pre-crash conditions. Second, some crash risk prediction frameworks were learned from dashcam videos. Although such videos capture the pre-crash behavior of individual vehicles, they often lack critical information about the movements of surrounding vehicles. However, the interaction between a vehicle and its surrounding vehicles is highly influential in crash occurrences. To overcome these challenges, we propose a novel non-stationary extreme value theory (EVT), where the covariate function is optimized in a nonlinear fashion using a graph attention network. The EVT component incorporates the stochastic nature of crashes through probability distribution, which enhances model interpretability. Notably, the nonlinear covariate function enables the model to capture the interactive behavior between the target vehicle and its multiple surrounding vehicles, facilitating crash risk prediction across different driving tasks. We train and test our model using 100 sets of vehicle trajectory data before real crashes, collected via drones over three years from merging and weaving segments. We demonstrate that our model successfully learns micro-level precursors of crashes and fits a more accurate distribution with the aid of the nonlinear covariate function. Our experiments on the testing dataset show that the proposed model outperforms existing models by providing more accurate predictions for both rear-end and sideswipe crashes simultaneously.
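
A minimal sketch of how a non-stationary block-maximum model turns covariates into a crash probability follows. It assumes the common traffic-conflict convention of modelling block maxima of a negated safety measure (for example negated time-to-collision) with a generalized extreme value (GEV) distribution, so that a value of zero or above means a crash. The linear `location` function is only a stand-in for the paper's graph-attention covariate function, and all parameter values are hypothetical.

```python
import math


def gev_cdf(x, mu, sigma, xi):
    """Cumulative distribution function of the GEV distribution."""
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:
        return math.exp(-math.exp(-z))      # Gumbel limit
    t = 1.0 + xi * z
    if t <= 0.0:
        # Outside the support: below the lower bound (xi > 0)
        # or above the upper bound (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


def location(covariates, coeffs, intercept):
    """Covariate-dependent location parameter (linear stand-in)."""
    return intercept + sum(c * x for c, x in zip(coeffs, covariates))


def crash_risk(mu, sigma, xi):
    """P(block maximum of the negated safety measure >= 0)."""
    return 1.0 - gev_cdf(0.0, mu, sigma, xi)


# One hypothetical interaction covariate, e.g. closing speed to a neighbour.
risk_calm = crash_risk(location([0.2], [1.0], -3.0), sigma=0.5, xi=-0.1)
risk_tense = crash_risk(location([2.5], [1.0], -3.0), sigma=0.5, xi=-0.1)
```

Making the location parameter depend on covariates is what makes the model non-stationary: the same distribution family yields a different crash probability for each traffic situation.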

[CV-86] DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation

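Quick read: This paper addresses the limited generalization of models in few-shot semantic segmentation (FSS), especially when the support images fail to capture the full appearance variability of the target class. Current methods are often held back by incomplete and biased feature representations. To meet this challenge, the paper proposes a new framework, DSV-LFS, whose key idea is to combine large language models (LLMs) with dense pixel-wise matching. Specifically, an additional token is introduced into the LLM vocabulary so that a “semantic prompt” can be generated from class descriptions, while a dense matching module identifies visual similarities between the query and support images to generate a “visual prompt”. The two prompts jointly guide a prompt-based decoder to segment the query image accurately, which significantly improves FSS performance, generalization to novel classes, and robustness across scenes.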

Link: https://arxiv.org/abs/2503.04006
Authors: Amin Karimi, Charalambos Poullis
Affiliations: Immersive and Creative Technologies Lab, Concordia University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel-wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning-based segmentation frameworks, our method, named DSV-LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a “semantic prompt” from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a “visual prompt”. These prompts are then jointly employed to guide the prompt-based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal-5^i and COCO-20^i demonstrate that our framework achieves state-of-the-art performance by a significant margin, with superior generalization to novel classes and robustness across diverse scenarios. The source code is available at this https URL
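
The dense pixel-wise matching idea can be sketched as a cosine-similarity search from every query pixel to the foreground pixels of a support image. This is a generic illustration under stated assumptions, not the DSV-LFS matching module itself; `dense_match` and the toy features are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def dense_match(query_feats, support_feats, support_mask):
    """Best similarity of each query pixel to any foreground support pixel.

    query_feats, support_feats: lists of per-pixel feature vectors.
    support_mask: 1 for pixels of the target class in the support image,
                  0 otherwise (at least one pixel must be foreground).
    """
    foreground = [f for f, m in zip(support_feats, support_mask) if m]
    return [max(cosine(q, f) for f in foreground) for q in query_feats]


query = [[1.0, 0.0], [0.0, 1.0]]
support = [[2.0, 0.0], [0.0, 3.0]]
scores = dense_match(query, support, support_mask=[1, 0])
```

The resulting per-pixel score map is the kind of signal that could be turned into a visual prompt: high values mark query pixels that resemble the annotated object.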

[CV-87] Enhancing Autonomous Driving Safety with Collision Scenario Integration

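[Quick read]: This paper addresses planning safety for deployed autonomous vehicles, in particular the difficulty existing methods have in exploiting collision data because they rely heavily on imitation learning. Collecting real collision or near-collision data is also inherently hard, since it involves risk and raises ethical and practical concerns. The core of the proposed solution is SafeFusion, a training framework that integrates safety-oriented metrics during training so that collision avoidance is learned rather than relying on imitation alone. To address the scarcity of collision data, the paper also proposes CollisionGen, a data generation pipeline that uses natural language prompts, generative models, and rule-based filtering to produce diverse, high-quality scenarios. Experiments show a 56% improvement in planning performance in collision-prone scenarios over previous state-of-the-art planners, while effectiveness in regular driving is maintained.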

Link: https://arxiv.org/abs/2503.03957
Authors: Zi Wang, Shiyi Lan, Xinglong Sun, Nadine Chang, Zhenxin Li, Zhiding Yu, Jose M. Alvarez
Affiliations: Carnegie Mellon University; NVIDIA; Fudan University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autonomous vehicle safety is crucial for the successful deployment of self-driving cars. However, most existing planning methods rely heavily on imitation learning, which limits their ability to leverage collision data effectively. Moreover, collecting collision or near-collision data is inherently challenging, as it involves risks and raises ethical and practical concerns. In this paper, we propose SafeFusion, a training framework to learn from collision data. Instead of over-relying on imitation learning, SafeFusion integrates safety-oriented metrics during training to enable collision avoidance learning. In addition, to address the scarcity of collision data, we propose CollisionGen, a scalable data generation pipeline to generate diverse, high-quality scenarios using natural language prompts, generative models, and rule-based filtering. Experimental results show that our approach improves planning performance in collision-prone scenarios by 56% over previous state-of-the-art planners while maintaining effectiveness in regular driving situations. Our work provides a scalable and effective solution for advancing the safety of autonomous driving systems.

[CV-88] COARSE: Collaborative Pseudo-Labeling with Coarse Real Labels for Off-Road Semantic Segmentation

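[Quick read]: This paper tackles cross-domain generalization in semantic segmentation for autonomous off-road navigation, where scarce densely labeled semantic data and the domain adaptation issues introduced by simulated data limit performance in diverse, unstructured environments. The key solution is COARSE, a semi-supervised domain adaptation framework that leverages sparse, coarse in-domain labels together with densely labeled out-of-domain data. Built on pretrained vision transformers, it uses complementary pixel-level and patch-level decoders, enhanced by a collaborative pseudo-labeling strategy on unlabeled data, to bridge the domain gap.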

Link: https://arxiv.org/abs/2503.03947
Authors: Aurelio Noca, Xianmei Lei, Jonathan Becktor, Jeffrey Edlund, Anna Sabel, Patrick Spieler, Curtis Padgett, Alexandre Alahi, Deegan Atha
Affiliations: Jet Propulsion Laboratory, California Institute of Technology; EPFL, Ecole Polytechnique Federale de Lausanne
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: preprint, 8 pages

Abstract:Autonomous off-road navigation faces challenges due to diverse, unstructured environments, requiring robust perception with both geometric and semantic understanding. However, scarce densely labeled semantic data limits generalization across domains. Simulated data helps, but introduces domain adaptation issues. We propose COARSE, a semi-supervised domain adaptation framework for off-road semantic segmentation, leveraging sparse, coarse in-domain labels and densely labeled out-of-domain data. Using pretrained vision transformers, we bridge domain gaps with complementary pixel-level and patch-level decoders, enhanced by a collaborative pseudo-labeling strategy on unlabeled data. Evaluations on RUGD and Rellis-3D datasets show significant improvements of 9.7% and 8.4% respectively, versus only using coarse data. Tests on real-world off-road vehicle data in a multi-biome setting further demonstrate COARSE’s applicability.
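
The abstract does not spell out how the two decoders collaborate on unlabeled data. As a minimal sketch of one common pseudo-labeling rule, assuming that a pixel receives a pseudo-label only when the pixel-level and patch-level decoders agree and both are confident, the idea could look like the following. The function names, the `IGNORE` marker, and the threshold are hypothetical and are not taken from the paper.

```python
IGNORE = -1  # hypothetical marker for pixels that receive no pseudo-label


def argmax(probs):
    """Index of the largest probability in a list."""
    best = 0
    for i in range(1, len(probs)):
        if probs[i] > probs[best]:
            best = i
    return best


def collaborative_pseudo_labels(pixel_probs, patch_probs, threshold=0.8):
    """Keep a label only where both decoders agree and both are confident.

    pixel_probs / patch_probs: one class-probability list per pixel,
    produced by the pixel-level and patch-level decoders respectively.
    """
    labels = []
    for p_pix, p_pat in zip(pixel_probs, patch_probs):
        c_pix, c_pat = argmax(p_pix), argmax(p_pat)
        agree = c_pix == c_pat
        confident = min(p_pix[c_pix], p_pat[c_pat]) >= threshold
        labels.append(c_pix if agree and confident else IGNORE)
    return labels


# Three pixels, two classes: agreement, disagreement, agreement.
pixel_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
patch_probs = [[0.85, 0.15], [0.3, 0.7], [0.1, 0.9]]
print(collaborative_pseudo_labels(pixel_probs, patch_probs))  # [0, -1, 1]
```

Pixels marked `IGNORE` would simply be excluded from the loss on unlabeled images; the actual COARSE strategy may weight or combine the decoders differently.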

[CV-89] GuardDoor: Safeguarding Against Malicious Diffusion Editing via Protective Backdoors

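[Quick read]: This paper addresses the misuse of diffusion models for image editing, in particular unauthorized modifications that can lead to misinformation and plagiarism. Existing defenses rely mainly on adversarial perturbations that disrupt diffusion model outputs, but these are easily neutralized by simple image preprocessing such as compression and noise addition. The key solution is GuardDoor, a robust protection mechanism built on collaboration between image owners and model providers. Specifically, the model provider fine-tunes the image encoder to embed a protective backdoor, and image owners can then request that imperceptible triggers be attached to their images. When an unauthorized user tries to edit a protected image with that diffusion model, the model produces meaningless output, reducing the risk of malicious editing. The mechanism shows enhanced robustness to image preprocessing and is scalable for large-scale deployment.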

Link: https://arxiv.org/abs/2503.03944
Authors: Yaopei Zeng, Yuanpu Cao, Lu Lin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The growing accessibility of diffusion models has revolutionized image editing but also raised significant concerns about unauthorized modifications, such as misinformation and plagiarism. Existing countermeasures largely rely on adversarial perturbations designed to disrupt diffusion model outputs. However, these approaches are found to be easily neutralized by simple image preprocessing techniques, such as compression and noise addition. To address this limitation, we propose GuardDoor, a novel and robust protection mechanism that fosters collaboration between image owners and model providers. Specifically, the model provider participating in the mechanism fine-tunes the image encoder to embed a protective backdoor, allowing image owners to request the attachment of imperceptible triggers to their images. When unauthorized users attempt to edit these protected images with this diffusion model, the model produces meaningless outputs, reducing the risk of malicious image editing. Our method demonstrates enhanced robustness against image preprocessing operations and is scalable for large-scale deployment. This work underscores the potential of cooperative frameworks between model providers and image owners to safeguard digital content in the era of generative AI.

[CV-90] SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection

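[Quick read]: This paper evaluates SAM 2 (Segment Anything Model 2) for semantic segmentation of organs and tissues in surgical scene understanding, both zero-shot and after fine-tuning. The key approach is to fine-tune SAM 2 on five public datasets to segment anatomical tissues in surgical videos and images. Fine-tuning targets the image encoder and mask decoder, and training subsets are limited to between 50 and 400 samples per class to reflect real-world data acquisition constraints. Increasing the number of prompt points (from 1 to 10) and the training data scale (from 50 to 400 samples per class) improves performance, with a best weighted mean Dice coefficient (WMDC) of 0.92 on the validation subset. On the test subset, the fine-tuned SurgiSAM 2 model outperforms prior state-of-the-art methods in 24 of 30 classes (80%), and it reaches state-of-the-art results on 7 of 9 (77.8%) unseen organ classes. This suggests that SAM 2 has considerable potential for automated or semi-automated annotation pipelines that reduce the manual annotation burden for surgical applications.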

Link: https://arxiv.org/abs/2503.03942
Authors: Devanish N. Kamtam, Joseph B. Shrager, Satya Deepya Malla, Xiaohan Wang, Nicole Lin, Juan J. Cardona, Serena Yeung-Levy, Clarence Hu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Background: We evaluate SAM 2 for surgical scene understanding by examining its semantic segmentation capabilities for organs/tissues both in zero-shot scenarios and after fine-tuning. Methods: We utilized five public datasets to evaluate and fine-tune SAM 2 for segmenting anatomical tissues in surgical videos/images. Fine-tuning was applied to the image encoder and mask decoder. We limited training subsets from 50 to 400 samples per class to better model real-world constraints with data acquisition. The impact of dataset size on fine-tuning performance was evaluated with weighted mean Dice coefficient (WMDC), and the results were also compared against previously reported state-of-the-art (SOTA) results. Results: SurgiSAM 2, a fine-tuned SAM 2 model, demonstrated significant improvements in segmentation performance, achieving a 17.9% relative WMDC gain compared to the baseline SAM 2. Increasing prompt points from 1 to 10 and training data scale from 50/class to 400/class enhanced performance; the best WMDC of 0.92 on the validation subset was achieved with 10 prompt points and 400 samples per class. On the test subset, this model outperformed prior SOTA methods in 24/30 (80%) of the classes with a WMDC of 0.91 using 10-point prompts. Notably, SurgiSAM 2 generalized effectively to unseen organ classes, achieving SOTA on 7/9 (77.8%) of them. Conclusion: SAM 2 achieves remarkable zero-shot and fine-tuned performance for surgical scene segmentation, surpassing prior SOTA models across several organ classes of diverse datasets. This suggests immense potential for enabling automated/semi-automated annotation pipelines, thereby decreasing the annotation burden and facilitating several surgical applications.
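
For readers unfamiliar with the metric, here is a minimal sketch of a Dice coefficient and a weighted mean over classes. The abstract does not state how WMDC is weighted, so weighting each class by its sample count is an assumption, and the class names and numbers below are purely illustrative.

```python
def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary (0/1) masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def weighted_mean_dice(per_class_dice, per_class_weight):
    """Mean of per-class Dice scores, weighted by a per-class weight.

    Assumption: the weight is the number of evaluation samples per class.
    """
    total_weight = sum(per_class_weight.values())
    weighted_sum = sum(per_class_dice[c] * per_class_weight[c] for c in per_class_dice)
    return weighted_sum / total_weight


# Illustrative values only, not results from the paper.
scores = {"liver": 0.9, "gallbladder": 0.8}
weights = {"liver": 300, "gallbladder": 100}
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
print(weighted_mean_dice(scores, weights))
```

With this weighting, frequent classes dominate the summary score, which is worth keeping in mind when comparing a WMDC against an unweighted mean Dice from another paper.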

[CV-91] CREStE: Scalable Mapless Navigation with Internet Scale Priors and Counterfactual Guidance

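[Quick read]: This paper addresses long-horizon mapless navigation: enabling robots to traverse novel environments without high-definition maps or precise waypoints. It identifies two main challenges: learning robust, generalizable perceptual representations of the environment while coping with unforeseen navigation factors and perceptual aliasing, and using those representations to plan human-aligned navigation paths. Existing methods generalize poorly because they rely on hand-curated object lists, end-to-end learning of navigation features from scarce robot datasets, and handcrafted reward functions that scale poorly. To overcome these limitations, the paper proposes CREStE, which, without large-scale robot datasets or manually curated features, uses visual foundation models to learn continuous bird's-eye-view representations that capture elevation, semantics, and instance-level features. To use these representations for planning, it introduces a counterfactual-based loss and an active learning procedure that focuses on the most salient perceptual cues by querying humans for counterfactual trajectory annotations in challenging scenes. In kilometer-scale navigation tasks across six distinct urban environments, CREStE outperforms state-of-the-art approaches with 70% fewer human interventions per mission, including a 2-kilometer mission in an unseen environment that required only one intervention.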

Link: https://arxiv.org/abs/2503.03921
Authors: Arthur Zhang, Harshit Sikchi, Amy Zhang, Joydeep Biswas
Affiliations: The University of Texas at Austin
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures, 5 tables

Abstract:We address the long-horizon mapless navigation problem: enabling robots to traverse novel environments without relying on high-definition maps or precise waypoints that specify exactly where to navigate. Achieving this requires overcoming two major challenges – learning robust, generalizable perceptual representations of the environment without pre-enumerating all possible navigation factors and forms of perceptual aliasing and utilizing these learned representations to plan human-aligned navigation paths. Existing solutions struggle to generalize due to their reliance on hand-curated object lists that overlook unforeseen factors, end-to-end learning of navigation features from scarce large-scale robot datasets, and handcrafted reward functions that scale poorly to diverse scenarios. To overcome these limitations, we propose CREStE, the first method that learns representations and rewards for addressing the full mapless navigation problem without relying on large-scale robot datasets or manually curated features. CREStE leverages visual foundation models trained on internet-scale data to learn continuous bird’s-eye-view representations capturing elevation, semantics, and instance-level features. To utilize learned representations for planning, we propose a counterfactual-based loss and active learning procedure that focuses on the most salient perceptual cues by querying humans for counterfactual trajectory annotations in challenging scenes. We evaluate CREStE in kilometer-scale navigation tasks across six distinct urban environments. CREStE significantly outperforms all state-of-the-art approaches with 70% fewer human interventions per mission, including a 2-kilometer mission in an unseen environment with just 1 intervention; showcasing its robustness and effectiveness for long-horizon mapless navigation. For videos and additional materials, see this https URL .

[CV-92] Neural Descriptors: Self-Supervised Learning of Robust Local Surface Descriptors Using Polynomial Patches

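[Quick read]: This paper addresses the sensitivity of classical shape descriptors, such as Heat Kernel Signature (HKS), Wave Kernel Signature (WKS), and Signature of Histograms of OrienTations (SHOT), to mesh connectivity, sampling patterns, and topological noise. Differential invariants from differential geometry are robust shape descriptors in theory, but computing them on discrete meshes often yields unstable numerical approximations, which limits their practical use. The key solution is a self-supervised learning approach for extracting geometric features from 3D surfaces, combining synthetic data generation with a neural architecture designed to learn sampling-invariant features. Integrated into existing shape correspondence frameworks, these features improve performance on standard benchmarks including FAUST, SCAPE, TOPKIDS, and SHREC'16, with particular robustness to topological noise and partial shapes.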

Link: https://arxiv.org/abs/2503.03907
Authors: Gal Yona, Roy Velich, Ron Kimmel, Ehud Rivlin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Classical shape descriptors such as Heat Kernel Signature (HKS), Wave Kernel Signature (WKS), and Signature of Histograms of OrienTations (SHOT), while widely used in shape analysis, exhibit sensitivity to mesh connectivity, sampling patterns, and topological noise. While differential geometry offers a promising alternative through its theory of differential invariants, which are theoretically guaranteed to be robust shape descriptors, the computation of these invariants on discrete meshes often leads to unstable numerical approximations, limiting their practical utility. We present a self-supervised learning approach for extracting geometric features from 3D surfaces. Our method combines synthetic data generation with a neural architecture designed to learn sampling-invariant features. By integrating our features into existing shape correspondence frameworks, we demonstrate improved performance on standard benchmarks including FAUST, SCAPE, TOPKIDS, and SHREC’16, showing particular robustness to topological noise and partial shapes.

[CV-93] IC-Mapper: Instance-Centric Spatio-Temporal Modeling for Online Vectorized Map Construction

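[Quick read]: This paper addresses the lack of spatial scalability in online map construction from visual data, which arises because existing methods treat the task as local-range perception. It proposes IC-Mapper, an instance-centric online mapping framework with two core modules: 1) an instance-centric temporal association module, which compares detection queries from adjacent frames in both feature and geometric dimensions to match instances across frames; and 2) an instance-centric spatial fusion module, which samples points from the historical global map and fuses them with the detections of the corresponding instances in the current frame, enabling real-time expansion and update of the map. Experiments on the nuScenes dataset show that IC-Mapper outperforms state-of-the-art methods on detection, tracking, and global mapping metrics.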

Link: https://arxiv.org/abs/2503.03882
Authors: Jiangtong Zhu, Zhao Yang, Yinan Shi, Jianwu Fang, Jianru Xue
Affiliations: Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Online vector map construction based on visual data can bypass the processes of data collection, post-processing, and manual annotation required by traditional map construction, which significantly enhances map-building efficiency. However, existing work treats the online mapping task as a local range perception task, overlooking the spatial scalability required for map construction. We propose IC-Mapper, an instance-centric online mapping framework, which comprises two primary components: 1) Instance-centric temporal association module: For the detection queries of adjacent frames, we measure them in both feature and geometric dimensions to obtain the matching correspondence between instances across frames. 2) Instance-centric spatial fusion module: We perform point sampling on the historical global map from a spatial dimension and integrate it with the detection results of instances corresponding to the current frame to achieve real-time expansion and update of the map. Based on the nuScenes dataset, we evaluate our approach on detection, tracking, and global mapping metrics. Experimental results demonstrate the superiority of IC-Mapper against other state-of-the-art methods. Code will be released on this https URL.

[CV-94] Nexar Dashcam Collision Prediction Dataset and Challenge

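[Quick read]: This paper supports research on traffic event analysis, collision prediction, and autonomous vehicle safety by releasing the Nexar Dashcam Collision Prediction Dataset and Challenge. The key contribution is a dataset of 1,500 annotated video clips covering a diverse range of real-world traffic scenarios, labeled with event type (collision/near-collision vs. normal driving), environmental conditions (lighting and weather), and scene type (urban, rural, highway, etc.). For collision and near-collision cases, temporal labels mark the precise moment of the event and the alert time at which the collision first becomes predictable. The paper also introduces a public competition on top of the dataset, in which participants develop machine learning models that predict an imminent collision; performance is evaluated with average precision (AP) computed across multiple intervals before the event (500 ms, 1000 ms, and 1500 ms), emphasizing early and reliable prediction.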

Link: https://arxiv.org/abs/2503.03848
Authors: Daniel C. Moura, Shizhan Zhu, Orly Zvitia
Affiliations: Nexar Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents the Nexar Dashcam Collision Prediction Dataset and Challenge, designed to support research in traffic event analysis, collision prediction, and autonomous vehicle safety. The dataset consists of 1,500 annotated video clips, each approximately 40 seconds long, capturing a diverse range of real-world traffic scenarios. Videos are labeled with event type (collision/near-collision vs. normal driving), environmental conditions (lighting conditions and weather), and scene type (urban, rural, highway, etc.). For collision and near-collision cases, additional temporal labels are provided, including the precise moment of the event and the alert time, marking when the collision first becomes predictable. To advance research on accident prediction, we introduce the Nexar Dashcam Collision Prediction Challenge, a public competition on top of this dataset. Participants are tasked with developing machine learning models that predict the likelihood of an imminent collision, given an input video. Model performance is evaluated using the average precision (AP) computed across multiple intervals before the accident (i.e. 500 ms, 1000 ms, and 1500 ms prior to the event), emphasizing the importance of early and reliable predictions. The dataset is released under an open license with restrictions on unethical use, ensuring responsible research and innovation.
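
To make the evaluation protocol concrete, the sketch below computes average precision for each lead time and then averages across lead times. It assumes the common ranking-based definition of AP and an unweighted mean over the three intervals; the challenge's official scoring code may differ, and all scores and labels here are made up.

```python
def average_precision(scores, labels):
    """Ranking-based AP: mean of precision values at each positive, by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_ap_over_intervals(scores_by_interval, labels):
    """Average the per-interval AP values (assumed unweighted)."""
    aps = [average_precision(s, labels) for s in scores_by_interval.values()]
    return sum(aps) / len(aps)


# Hypothetical model outputs for four clips (1 = collision, 0 = normal driving),
# scored on video truncated 500, 1000, and 1500 ms before the event.
labels = [1, 0, 1, 0]
scores_by_interval = {
    500: [0.9, 0.2, 0.8, 0.1],
    1000: [0.9, 0.85, 0.8, 0.1],
    1500: [0.4, 0.6, 0.7, 0.1],
}
print(mean_ap_over_intervals(scores_by_interval, labels))
```

Averaging over lead times rewards models that stay well ranked as less of the pre-collision video is visible, which matches the abstract's emphasis on early prediction.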

[CV-95] Task-Agnostic Attacks Against Vision Foundation Models

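[Quick read]: This paper addresses the security of vision foundation models in machine learning, in particular the largely unexplored question of attacks that apply across downstream tasks. It studies how to craft adversarial examples that are independent of any specific downstream task by disrupting the feature representation extracted by the foundation model, and it evaluates the impact of such attacks on multiple downstream tasks and their transferability between models. The key to the solution is a general framework that forges task-agnostic adversarial examples by maximally disrupting the foundation model's feature representation, together with a systematic measurement of downstream impact and cross-model transferability.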

Link: https://arxiv.org/abs/2503.03842
Authors: Brian Pulfer, Yury Belousov, Vitaliy Kinakh, Teddy Furon, Slava Voloshynovskiy
Affiliations: University of Geneva; University of Rennes, Inria, CNRS, IRISA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:The study of security in machine learning mainly focuses on downstream task-specific attacks, where the adversarial example is obtained by optimizing a loss function specific to the downstream task. At the same time, it has become standard practice for machine learning practitioners to adopt publicly available pre-trained vision foundation models, effectively sharing a common backbone architecture across a multitude of applications such as classification, segmentation, depth estimation, retrieval, question-answering and more. The study of attacks on such foundation models and their impact to multiple downstream tasks remains vastly unexplored. This work proposes a general framework that forges task-agnostic adversarial examples by maximally disrupting the feature representation obtained with foundation models. We extensively evaluate the security of the feature representations obtained by popular vision foundation models by measuring the impact of this attack on multiple downstream tasks and its transferability between models.
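
One way to picture "maximally disrupting the feature representation" is projected sign-gradient ascent on the distance between clean and perturbed features. The toy sketch below does this for a linear feature extractor f(x) = Wx, where the gradient can be written by hand. The linear model, the L-infinity budget, the random start, and every name here are assumptions for illustration; the paper targets deep vision foundation models and its exact objective is not given in the abstract.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def feature_shift(W, delta):
    """Squared feature displacement ||f(x + delta) - f(x)||^2.

    For a linear extractor f(x) = W x this equals ||W delta||^2 for any x.
    """
    return sum(v * v for v in matvec(W, delta))


def task_agnostic_pgd(W, eps=0.1, step=0.02, iters=20, seed=0):
    """Maximize the feature shift under an L-infinity budget eps."""
    dim = len(W[0])
    rng = random.Random(seed)
    # Random start: the gradient is exactly zero at delta = 0.
    delta = [rng.uniform(-eps, eps) for _ in range(dim)]
    for _ in range(iters):
        Wd = matvec(W, delta)
        # Gradient of ||W delta||^2 with respect to delta is 2 W^T (W delta).
        grad = [2.0 * sum(W[i][j] * Wd[i] for i in range(len(W))) for j in range(dim)]
        stepped = [d + step * ((g > 0) - (g < 0)) for d, g in zip(delta, grad)]
        # Project back onto the L-infinity ball of radius eps.
        delta = [max(-eps, min(eps, d)) for d in stepped]
    return delta


W = [[1.0, 2.0], [0.0, 1.0]]  # hypothetical 2x2 "feature extractor"
delta = task_agnostic_pgd(W)
print(max(abs(d) for d in delta), feature_shift(W, delta))
```

Because the objective never references a classifier head or task loss, the same perturbation can then be evaluated against any downstream model that consumes these features.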

[CV-96] Decoupling the components of geometric understanding in Vision Language Models

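[Quick read]: This paper asks whether state-of-the-art vision language models (VLMs) can understand simple geometric concepts, and how their geometric understanding compares with that of humans. Using an experimental paradigm from cognitive science, it isolates visual understanding of geometry from other capabilities such as reasoning and world knowledge, so that the geometric understanding of VLMs can be measured more accurately. It compares VLMs with adults from the USA and with prior research on adults without formal education from an Amazonian indigenous group.

The key to the approach is an experimental paradigm that isolates geometric understanding, combined with tasks (such as those requiring mental rotation) that expose differences between VLMs and humans. The study finds that VLMs consistently underperform both groups of human adults, although they succeed with some concepts more than others, and that their geometric understanding is more brittle than human understanding, in particular when tasks require mental rotation. The authors note that the origin of geometric understanding may differ between humans and machines, for example printed materials used in formal education versus interaction with the physical world, or a combination of the two.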

Link: https://arxiv.org/abs/2503.03840
Authors: Eliza Kosoy, Annya Dahmani, Andrew K. Lampinen, Iulia M. Comsa, Soojin Jeong, Ishita Dasgupta, Kelsey Allen
Affiliations: Google DeepMind; UC Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages

Abstract:Understanding geometry relies heavily on vision. In this work, we evaluate whether state-of-the-art vision language models (VLMs) can understand simple geometric concepts. We use a paradigm from cognitive science that isolates visual understanding of simple geometry from the many other capabilities it is often conflated with such as reasoning and world knowledge. We compare model performance with human adults from the USA, as well as with prior research on human adults without formal education from an Amazonian indigenous group. We find that VLMs consistently underperform both groups of human adults, although they succeed with some concepts more than others. We also find that VLM geometric understanding is more brittle than human understanding, and is not robust when tasks require mental rotation. This work highlights interesting differences in the origin of geometric understanding in humans and machines – e.g. from printed materials used in formal education vs. interactions with the physical world or a combination of the two – and a small step toward understanding these differences.

[CV-97] EgoLife: Towards Egocentric Life Assistant CVPR2025

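[Quick read]: This paper introduces EgoLife, a project to develop an AI-based egocentric life assistant that improves personal efficiency through AI-powered wearable glasses. Its core problem is addressing the key technical challenges of processing and applying egocentric data in order to provide practical everyday assistance. To this end, the authors build the EgoLife Dataset, 300 hours of multimodal, multiview egocentric daily-life data, and on top of it propose EgoLifeQA, a suite of long-context, life-oriented question-answering tasks. The technical challenges addressed are (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) supporting long-context question answering over extensive temporal information. The proposed EgoButler system combines EgoGPT, an omni-modal model trained on egocentric datasets that achieves state-of-the-art egocentric video understanding, with EgoRAG, a retrieval-based component for ultra-long-context question answering.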

Link: https://arxiv.org/abs/2503.03803
Authors: Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu
Affiliations: EgoLife Team
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025. Project Page: this https URL. Code: this https URL

Abstract:We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for this assistant, we conducted a comprehensive data collection study where six participants lived together for one week, continuously recording their daily activities - including discussions, shopping, cooking, socializing, and entertainment - using AI glasses for multimodal egocentric video capture, along with synchronized third-person-view video references. This effort resulted in the EgoLife Dataset, a comprehensive 300-hour egocentric, interpersonal, multiview, and multimodal daily life dataset with intensive annotation. Leveraging this dataset, we introduce EgoLifeQA, a suite of long-context, life-oriented question-answering tasks designed to provide meaningful assistance in daily life by addressing practical questions such as recalling past relevant events, monitoring health habits, and offering personalized recommendations. To address the key technical challenges of (1) developing robust visual-audio models for egocentric data, (2) enabling identity recognition, and (3) facilitating long-context question answering over extensive temporal information, we introduce EgoButler, an integrated system comprising EgoGPT and EgoRAG. EgoGPT is an omni-modal model trained on egocentric datasets, achieving state-of-the-art performance on egocentric video understanding. EgoRAG is a retrieval-based component that supports answering ultra-long-context questions. Our experimental studies verify their working mechanisms and reveal critical factors and bottlenecks, guiding future improvements. By releasing our datasets, models, and benchmarks, we aim to stimulate further research in egocentric AI assistants.

[CV-98] BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

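[Quick read]: This paper addresses the integration and use of multimodal information in biodiversity monitoring and species classification by building the BIOSCAN-5M Insect dataset and proposing several benchmark tasks. The key is to exploit a dataset that combines specimen images, DNA barcode sequences, taxonomic labels, and geographical and size information, and to demonstrate the impact of these modalities on classification and clustering accuracy through three experiments. First, a masked language model is pretrained on the DNA barcode sequences to improve species- and genus-level classification. Second, a zero-shot transfer learning task on images and DNA barcodes evaluates whether feature embeddings from self-supervised learning form meaningful clusters. Third, contrastive learning over DNA barcodes, image data, and taxonomic information yields a general shared embedding space that supports taxonomic classification from multiple types of information and modalities.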

Link: https://arxiv.org/abs/2406.12723
Authors: Zahra Gharaee, Scott C. Lowe, ZeMing Gong, Pablo Millan Arias, Nicholas Pellegrino, Austin T. Wang, Joakim Bruslund Haurum, Iuliia Zarubiieva, Lila Kari, Dirk Steinke, Graham W. Taylor, Paul Fieguth, Angel X. Chang
Affiliations: Centre for Biodiversity Genomics; University of Guelph; University of Waterloo; Simon Fraser University; Vector Institute; Alberta Machine Intelligence Institute (Amii); Aalborg University and Pioneer Centre for AI
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Populations and Evolution (q-bio.PE)
Comments:

Abstract:As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, geographical, and size information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the BIOSCAN-5M dataset, and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task applied to images and DNA barcodes to cluster feature embeddings obtained from self-supervised learning, to investigate whether meaningful clusters can be derived from these representation embeddings. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space enabling taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at this https URL.

[CV-99] Adaptive Prototype Learning for Multimodal Cancer Survival Analysis

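[Quick read]: This paper addresses the degradation in model performance caused by redundancy when multimodal data, in particular whole-slide histology images and transcriptomic profiles, are used for cancer survival prediction. It proposes Adaptive Prototype Learning (APL), whose key idea is to learn representative prototypes adaptively in a data-driven manner, reducing redundancy while preserving critical information. APL uses two sets of learnable query vectors as a bridge between high-dimensional representations and survival prediction to capture task-relevant features, and introduces a multimodal mixed self-attention mechanism to enable cross-modal interaction and further strengthen information fusion. Experiments on five benchmark cancer datasets show that it outperforms existing methods.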

Link: https://arxiv.org/abs/2503.04643
Authors: Hong Liu, Haosen Yang, Federica Eduati, Josien P.W. Pluim, Mitko Veta
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures

Abstract:Leveraging multimodal data, particularly the integration of whole-slide histology images (WSIs) and transcriptomic profiles, holds great promise for improving cancer survival prediction. However, excessive redundancy in multimodal data can degrade model performance. In this paper, we propose Adaptive Prototype Learning (APL), a novel and effective approach for multimodal cancer survival analysis. APL adaptively learns representative prototypes in a data-driven manner, reducing redundancy while preserving critical information. Our method employs two sets of learnable query vectors that serve as a bridge between high-dimensional representations and survival prediction, capturing task-relevant features. Additionally, we introduce a multimodal mixed self-attention mechanism to enable cross-modal interactions, further enhancing information fusion. Extensive experiments on five benchmark cancer datasets demonstrate the superiority of our approach over existing methods. The code is available at this https URL.

[CV-100] GBT-SAM: A Parameter-Efficient Depth-Aware Model for Generalizable Brain tumour Segmentation on mp-MRI

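[Quick read]: This paper addresses shortcomings of existing automatic glioma segmentation methods in generalization, computational cost, and use of multi-parametric MRI (mp-MRI) data. It proposes GBT-SAM, a framework that extends the Segment Anything Model (SAM) to brain tumour segmentation through a two-step training protocol: first, the patch embedding layer is fine-tuned to process all mp-MRI modalities; second, parameter-efficient LoRA blocks and a Depth-Condition block are added to the Vision Transformer (ViT) to capture inter-slice correlations. The approach achieves state-of-the-art performance on the Adult Glioma dataset (Dice score of 93.54), generalizes robustly to other brain tumour types (Meningioma, Pediatric Glioma, and Sub-Saharan Glioma), and uses fewer than 6.5M trainable parameters.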

Link: https://arxiv.org/abs/2503.04325
Authors: Cecilia Diana-Albelda, Roberto Alcover-Couso, Álvaro García-Martín, Jesus Bescos, Marcos Escudero-Viñolo
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gliomas are brain tumours that stand out for their highly lethal and aggressive nature, which demands a precise approach in their diagnosis. Medical image segmentation plays a crucial role in the evaluation and follow-up of these tumours, allowing specialists to analyse their morphology. However, existing methods for automatic glioma segmentation often lack generalization capability across other brain tumour domains, require extensive computational resources, or fail to fully utilize the multi-parametric MRI (mp-MRI) data used to delineate them. In this work, we introduce GBT-SAM, a novel Generalizable Brain Tumour (GBT) framework that extends the Segment Anything Model (SAM) to brain tumour segmentation tasks. Our method employs a two-step training protocol: first, fine-tuning the patch embedding layer to process the entire mp-MRI modalities, and second, incorporating parameter-efficient LoRA blocks and a Depth-Condition block into the Vision Transformer (ViT) to capture inter-slice correlations. GBT-SAM achieves state-of-the-art performance on the Adult Glioma dataset (Dice Score of 93.54) while demonstrating robust generalization across Meningioma, Pediatric Glioma, and Sub-Saharan Glioma datasets. Furthermore, GBT-SAM uses less than 6.5M trainable parameters, thus offering an efficient solution for brain tumour segmentation. Our code and models are available at this https URL.

Artificial Intelligence

[AI-0] Predictable Scale: Part I – Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

Link: https://arxiv.org/abs/2503.04715
Authors: Houyi Li, Wenzheng Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Yangshijie Xu, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages

Abstract:The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes. Our analysis reveals a convex optimization landscape for hyperparameters under fixed models and data size conditions. This convexity implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community. Its estimated values on the test set are merely 0.07% away from the globally optimal LLM performance found via an exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To the best of our knowledge, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, as well as establishes optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demands substantial computational resources, utilizing nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch and consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository this https URL.
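
The relationships described in the abstract can be written schematically as power laws. The symbols below (N for model parameters, D for data size, and the constants c, α, β, γ) are notation assumed here for illustration; the abstract gives neither the functional form in symbols nor any fitted values.

```latex
% Schematic form only; all constants are placeholders, not values from the paper.
\eta_{\mathrm{opt}}(N, D) = c_{\eta}\, N^{\alpha} D^{\beta},
\qquad
B_{\mathrm{opt}}(D) = c_{B}\, D^{\gamma}

% Taking logarithms makes each law linear, so the exponents could be
% estimated by ordinary least squares on grid-search optima:
\log \eta_{\mathrm{opt}} = \log c_{\eta} + \alpha \log N + \beta \log D,
\qquad
\log B_{\mathrm{opt}} = \log c_{B} + \gamma \log D
```

Under this reading, "batch size scales primarily with data sizes" corresponds to the absence of a significant N term in the second law.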

[AI-1] Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning INTERSPEECH2024

Link: https://arxiv.org/abs/2503.04710
Authors: Lucas Block Medin, Thomas Pellegrini, Lucile Gelin
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: This paper was originally published in the Proceedings of Interspeech 2024. DOI: https://doi.org/10.21437/Interspeech.2024-1095

Abstract:Child speech recognition is still an underdeveloped area of research due to the lack of data (especially on non-English languages) and the specific difficulties of this task. Having explored various architectures for child speech recognition in previous work, in this article we tackle recent self-supervised models. We first compare wav2vec 2.0, HuBERT and WavLM models adapted to phoneme recognition in French child speech, and continue our experiments with the best of them, WavLM base+. We then further adapt it by unfreezing its transformer blocks during fine-tuning on child speech, which greatly improves its performance and makes it significantly outperform our base model, a Transformer+CTC. Finally, we study in detail the behaviour of these two models under the real conditions of our application, and show that WavLM base+ is more robust to various reading tasks and noise levels. Index Terms: speech recognition, child speech, self-supervised learning

[AI-2] Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size

Link: https://arxiv.org/abs/2503.04704
Authors: Alireza Behtash, Marijan Fofonjka, Ethan Baird, Tyler Mauer, Hossein Moghimifam, David Stout, Joel Dennison
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 7 figures, 14 tables; Comments are welcome

Abstract: We present a novel approach to selective model quantization that transcends the limitations of architecture-specific and size-dependent compression methods for Large Language Models (LLMs) using Entropy-Weighted Quantization (EWQ). By analyzing the entropy distribution across transformer blocks, EWQ determines which blocks can be safely quantized without causing significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy scores within 0.5% of unquantized models while reducing memory usage by up to 18%. We demonstrate the effectiveness of EWQ across multiple architectures, from 1.6B to 70B parameters, showcasing consistent improvements in the quality-compression trade-off regardless of model scale or architectural design. A surprising finding of EWQ is its ability to reduce perplexity compared to unquantized models, suggesting the presence of beneficial regularization through selective precision reduction. This improvement holds across different model families, indicating a fundamental relationship between layer-level entropy and optimal precision requirements. Additionally, we introduce FastEWQ, a rapid method for entropy distribution analysis that eliminates the need for loading model weights. This technique leverages universal characteristics of entropy distribution that persist across various architectures and scales, enabling near-instantaneous quantization decisions while maintaining 80% classification accuracy with full entropy analysis. Our results demonstrate that effective quantization strategies can be developed independently of specific architectural choices or model sizes, opening new possibilities for efficient LLM deployment.

[AI-3] Matrix Factorization for Inferring Associations and Missing Links

Link: https://arxiv.org/abs/2503.04680
Authors: Ryan Barron, Maksim E. Eren, Duc P. Truong, Cynthia Matuszek, James Wendelberger, Mary F. Dorn, Boian Alexandrov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 35 pages, 14 figures, 3 tables, 1 algorithm

Abstract:Missing link prediction is a method for network analysis, with applications in recommender systems, biology, social sciences, cybersecurity, information retrieval, and Artificial Intelligence (AI) reasoning in Knowledge Graphs. Missing link prediction identifies unseen but potentially existing connections in a network by analyzing the observed patterns and relationships. In proliferation detection, this supports efforts to identify and characterize attempts by state and non-state actors to acquire nuclear weapons or associated technology - a notoriously challenging but vital mission for global security. Dimensionality reduction techniques like Non-Negative Matrix Factorization (NMF) and Logistic Matrix Factorization (LMF) are effective but require selection of the matrix rank parameter, that is, of the number of hidden features, k, to avoid over/under-fitting. We introduce novel Weighted (WNMFk), Boolean (BNMFk), and Recommender (RNMFk) matrix factorization methods, along with ensemble variants incorporating logistic factorization, for link prediction. Our methods integrate automatic model determination for rank estimation by evaluating stability and accuracy using a modified bootstrap methodology and uncertainty quantification (UQ), assessing prediction reliability under random perturbations. We incorporate Otsu threshold selection and k-means clustering for Boolean matrix factorization, comparing them to coordinate descent-based Boolean thresholding. Our experiments highlight the impact of rank k selection, evaluate model performance under varying test-set sizes, and demonstrate the benefits of UQ for reliable predictions using abstention. We validate our methods on three synthetic datasets (Boolean and uniformly distributed) and benchmark them against LMF and symmetric LMF (symLMF) on five real-world protein-protein interaction networks, showcasing an improved prediction performance.

[AI-4] Multi-Agent Inverse Q-Learning from Demonstrations ICRA

Link: https://arxiv.org/abs/2503.04679
Authors: Nathaniel Haynam, Adam Khoja, Dhruv Kumar, Vivek Myers, Erdem Bıyık
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 4 figures, 2 tables. Published at the International Conference on Robotics and Automation (ICRA) 2025

Abstract:When reward functions are hand-designed, deep reinforcement learning algorithms often suffer from reward misspecification, causing them to learn suboptimal policies in terms of the intended task objectives. In the single-agent case, inverse reinforcement learning (IRL) techniques attempt to address this issue by inferring the reward function from expert demonstrations. However, in multi-agent problems, misalignment between the learned and true objectives is exacerbated due to increased environment non-stationarity and variance that scales with multiple agents. As such, in multi-agent general-sum games, multi-agent IRL algorithms have difficulty balancing cooperative and competitive objectives. To address these issues, we propose Multi-Agent Marginal Q-Learning from Demonstrations (MAMQL), a novel sample-efficient framework for multi-agent IRL. For each agent, MAMQL learns a critic marginalized over the other agents’ policies, allowing for a well-motivated use of Boltzmann policies in the multi-agent context. We identify a connection between optimal marginalized critics and single-agent soft-Q IRL, allowing us to apply a direct, simple optimization criterion from the single-agent domain. Across our experiments on three different simulated domains, MAMQL significantly outperforms previous multi-agent methods in average reward, sample efficiency, and reward recovery by often more than 2-5x. We make our code available at this https URL .
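
The Boltzmann policies mentioned in the abstract can be illustrated with a minimal sketch. This is the standard softmax-over-Q definition, not the MAMQL implementation: the marginalization over other agents' policies is not shown, and the function name `boltzmann_policy` and the `temperature` parameter are assumptions introduced here for illustration.

```python
import math


def boltzmann_policy(q_values, temperature=1.0):
    """Return action probabilities proportional to exp(Q / temperature).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [q / temperature for q in q_values]
    # Subtract the maximum before exponentiating for numerical stability.
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probs = boltzmann_policy([1.0, 2.0, 3.0])
```

Higher-valued actions receive more probability mass, and a lower temperature makes the policy closer to greedy.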

[AI-5] IDInit: A Universal and Stable Initialization Method for Neural Network Training ICLR2025

Link: https://arxiv.org/abs/2503.04626
Authors: Yu Pan, Chaozheng Wang, Zekai Wu, Qifan Wang, Min Zhang, Zenglin Xu
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted in ICLR 2025

Abstract:Deep neural networks have achieved remarkable accomplishments in practice. The success of these networks hinges on effective initialization methods, which are vital for ensuring stable and rapid convergence during training. Recently, initialization methods that maintain identity transition within layers have shown good efficiency in network training. These techniques (e.g., Fixup) set specific weights to zero to achieve identity control. However, settings of remaining weight (e.g., Fixup uses random values to initialize non-zero weights) will affect the inductive bias that is achieved only by a zero weight, which may be harmful to training. Addressing this concern, we introduce fully identical initialization (IDInit), a novel method that preserves identity in both the main and sub-stem layers of residual networks. IDInit employs a padded identity-like matrix to overcome rank constraints in non-square weight matrices. Furthermore, we show the convergence problem of an identity matrix can be solved by stochastic gradient descent. Additionally, we enhance the universality of IDInit by processing higher-order weights and addressing dead neuron problems. IDInit is a straightforward yet effective initialization method, with improved convergence, stability, and performance across various settings, including large-scale datasets and deep models.

[AI-6] The Next Frontier of LLM Applications: Open Ecosystems and Hardware Synergy

Link: https://arxiv.org/abs/2503.04596
Authors: Xinyi Hou, Yanjie Zhao, Haoyu Wang
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:Large Language Model (LLM) applications, including LLM app stores and autonomous agents, are shaping the future of AI ecosystems. However, platform silos, fragmented hardware integration, and the absence of standardized interfaces limit scalability, interoperability, and resource efficiency. While LLM app stores democratize AI, their closed ecosystems restrict modular AI reuse and cross-platform portability. Meanwhile, agent-based frameworks offer flexibility but often lack seamless integration across diverse environments. This paper envisions the future of LLM applications and proposes a three-layer decoupled architecture grounded in software engineering principles such as layered system design, service-oriented architectures, and hardware-software co-design. This architecture separates application logic, communication protocols, and hardware execution, enhancing modularity, efficiency, and cross-platform compatibility. Beyond architecture, we highlight key security and privacy challenges for safe, scalable AI deployment and outline research directions in software and security engineering. This vision aims to foster open, secure, and interoperable LLM ecosystems, guiding future advancements in AI applications.

[AI-7] ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making

Link: https://arxiv.org/abs/2503.04569
Authors: Yitong Luo, Hou Hei Lam, Ziang Chen, Zhenliang Zhang, Xue Feng
Subjects: Artificial Intelligence (cs.AI)

Abstract:Despite recent advances in artificial intelligence (AI), it poses challenges to ensure personalized decision-making in tasks that are not considered in training datasets. To address this issue, we propose ValuePilot, a two-phase value-driven decision-making framework comprising a dataset generation toolkit DGT and a decision-making module DMM trained on the generated data. DGT is capable of generating scenarios based on value dimensions and closely mirroring real-world tasks, with automated filtering techniques and human curation to ensure the validity of the dataset. In the generated dataset, DMM learns to recognize the inherent values of scenarios, computes action feasibility and navigates the trade-offs between multiple value dimensions to make personalized decisions. Extensive experiments demonstrate that, given human value preferences, our DMM most closely aligns with human decisions, outperforming Claude-3.5-Sonnet, Gemini-2-flash, Llama-3.1-405b and GPT-4o. This research is a preliminary exploration of value-driven decision-making. We hope it will stimulate interest in value-driven decision-making and personalized decision-making within the community.

[AI-8] Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association

Link: https://arxiv.org/abs/2503.04564
Authors: Xiang Zhang, Zhou Li, Kai Wan, Hua Sun, Mingyue Ji, Giuseppe Caire
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract: Secure aggregation is motivated by federated learning (FL) where a cloud server aims to compute an averaged model (i.e., weights of deep neural networks) of the locally-trained models of numerous clients, while adhering to data security requirements. Hierarchical secure aggregation (HSA) extends this concept to a three-layer network, where clustered users communicate with the server through an intermediate layer of relays. In HSA, beyond conventional server security, relay security is also enforced to ensure that the relays remain oblivious to the users’ inputs (an abstraction of the local models in FL). Existing work on HSA assumes that each user is associated with only one relay, limiting opportunities for coding across inter-cluster users to achieve efficient communication and key generation. In this paper, we consider HSA with a cyclic association pattern where each user is connected to B consecutive relays in a wrap-around manner. We propose an efficient aggregation scheme which includes a message design for the inputs inspired by gradient coding, a well-known technique for efficient communication in distributed computing, along with a highly nontrivial security key design. We also derive novel converse bounds on the minimum achievable communication and key rates using information-theoretic arguments.
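
The cyclic association pattern described above (each user connected to B consecutive relays, wrapping around) can be sketched as follows. The indexing convention, in which user `u` starts at relay `u mod num_relays`, is an assumption made here for illustration; the paper's own user-to-relay indexing may differ, and the function name is hypothetical.

```python
def cyclic_association(num_users: int, num_relays: int, b: int) -> dict:
    """Map each user index to a list of b consecutive relay indices.

    Relay indices wrap around modulo num_relays. The starting relay for
    user u is assumed (not taken from the paper) to be u % num_relays.
    """
    if not 1 <= b <= num_relays:
        raise ValueError("b must be between 1 and num_relays")
    return {
        u: [(u + j) % num_relays for j in range(b)]
        for u in range(num_users)
    }


association = cyclic_association(num_users=4, num_relays=4, b=2)
```

With `b=1` this reduces to the single-relay association that the abstract attributes to earlier work.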

[AI-9] Benchmarking Reasoning Robustness in Large Language Models

Link: https://arxiv.org/abs/2503.04550
Authors: Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, Dacheng Tao
Subjects: Artificial Intelligence (cs.AI)

Abstract: Despite the recent success of large language models (LLMs) in reasoning such as DeepSeek, we for the first time identify a key dilemma in reasoning robustness and generalization: significant performance degradation on novel or incomplete data, suggesting a reliance on memorized patterns rather than systematic reasoning. Our closer examination reveals four key unique limitations underlying this issue: (1) Positional bias: models favor earlier queries in multi-query inputs but answer the wrong one in the latter (e.g., GPT-4o’s accuracy drops from 75.8 percent to 72.8 percent); (2) Instruction sensitivity: performance declines by 5.0 to 7.5 percent in the Qwen2.5 Series and by 5.0 percent in DeepSeek-V3 with auxiliary guidance; (3) Numerical fragility: value substitution sharply reduces accuracy (e.g., GPT-4o drops from 97.5 percent to 82.5 percent, GPT-o1-mini drops from 97.5 percent to 92.5 percent); and (4) Memory dependence: models resort to guesswork when missing critical data. These findings further highlight the reliance on heuristic recall over rigorous logical inference, demonstrating challenges in reasoning robustness. To comprehensively investigate these robustness challenges, this paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. This is achieved by an instruction-based approach to generate diverse datasets that closely resemble training distributions, facilitating a holistic robustness assessment and advancing the development of more robust reasoning frameworks.

[AI-10] SOLAR: Scalable Optimization of Large-scale Architecture for Reasoning

Link: https://arxiv.org/abs/2503.04530
Authors: Chen Li, Yinyi Luo, Anudeep Bolimera, Marios Savvides
Subjects: Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) excel in reasoning but remain constrained by their Chain-of-Thought (CoT) approach, which struggles with complex tasks requiring more nuanced topological reasoning. We introduce SOLAR, Scalable Optimization of Large-scale Architecture for Reasoning, a framework that dynamically optimizes various reasoning topologies to enhance accuracy and efficiency. Our Topological Annotation Generation (TAG) system automates topological dataset creation and segmentation, improving post-training and evaluation. Additionally, we propose Topological-Scaling, a reward-driven framework that aligns training and inference scaling, equipping LLMs with adaptive, task-aware reasoning. SOLAR achieves substantial gains on MATH and GSM8K: +5% accuracy with Topological Tuning, +9% with Topological Reward, and +10.02% with Hybrid Scaling. It also reduces response length by over 5% for complex problems, lowering inference latency. To foster the reward system, we train a multi-task Topological Reward Model (M-TRM), which autonomously selects the best reasoning topology and answer in a single pass, eliminating the need for training and inference on multiple single-task TRMs (S-TRMs), thus reducing both training cost and inference latency. In addition, in terms of performance, M-TRM surpasses all S-TRMs, improving accuracy by +10% and rank correlation by +9%. To the best of our knowledge, SOLAR sets a new benchmark for scalable, high-precision LLM reasoning while introducing an automated annotation process and a dynamic reasoning topology competition mechanism. 

[AI-11] Dynamic Pricing for On-Demand DNN Inference in the Edge-AI Market

Link: https://arxiv.org/abs/2503.04521
Authors: Songyuan Li, Jia Hu, Geyong Min, Haojun Huang, Jiwei Huang
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
Comments: Index Terms: Edge-AI, DNN Inference Offloading, Resource Management, Dynamic Pricing, Auction Mechanism

Abstract:The convergence of edge computing and AI gives rise to Edge-AI, which enables the deployment of real-time AI applications and services at the network edge. One of the fundamental research issues in Edge-AI is edge inference acceleration, which aims to realize low-latency high-accuracy DNN inference services by leveraging the fine-grained offloading of partitioned inference tasks from end devices to edge servers. However, existing research has yet to adopt a practical Edge-AI market perspective, which would systematically explore the personalized inference needs of AI users (e.g., inference accuracy, latency, and task complexity), the revenue incentives for AI service providers that offer edge inference services, and multi-stakeholder governance within a market-oriented context. To bridge this gap, we propose an Auction-based Edge Inference Pricing Mechanism (AERIA) for revenue maximization to tackle the multi-dimensional optimization problem of DNN model partition, edge inference pricing, and resource allocation. We investigate the multi-exit device-edge synergistic inference scheme for on-demand DNN inference acceleration, and analyse the auction dynamics amongst the AI service providers, AI users and edge infrastructure provider. Owing to the strategic mechanism design via randomized consensus estimate and cost sharing techniques, the Edge-AI market attains several desirable properties, including competitiveness in revenue maximization, incentive compatibility, and envy-freeness, which are crucial to maintain the effectiveness, truthfulness, and fairness of our auction outcomes. The extensive simulation experiments based on four representative DNN inference workloads demonstrate that our AERIA mechanism significantly outperforms several state-of-the-art approaches in revenue maximization, demonstrating the efficacy of AERIA for on-demand DNN inference in the Edge-AI market.

[AI-12] STX-Search: Explanation Search for Continuous Dynamic Spatio-Temporal Models

Link: https://arxiv.org/abs/2503.04509
Authors: Saif Anwar, Nathan Griffiths, Thomas Popham, Abhir Bhalerao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Recent improvements in the expressive power of spatio-temporal models have led to performance gains in many real-world applications, such as traffic forecasting and social network modelling. However, understanding the predictions from a model is crucial to ensure reliability and trustworthiness, particularly for high-risk applications, such as healthcare and transport. Few existing methods are able to generate explanations for models trained on continuous-time dynamic graph data and, of these, the computational complexity and lack of suitable explanation objectives pose challenges. In this paper, we propose Spatio-Temporal EXplanation Search (STX-Search), a novel method for generating instance-level explanations that is applicable to static and dynamic temporal graph structures. We introduce a novel search strategy and objective function, to find explanations that are highly faithful and interpretable. When compared with existing methods, STX-Search produces explanations of higher fidelity whilst optimising explanation size to maintain interpretability.

[AI-13] Multi-modal Summarization in Model-Based Engineering: Automotive Software Development Case Study

Link: https://arxiv.org/abs/2503.04506
Authors: Nenad Petrovic, Yurui Zhang, Moaad Maaroufi, Kuo-Yi Chao, Lukasz Mazur, Fengjunjie Pan, Vahid Zolfaghari, Alois Knoll
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Conference paper accepted for IntelliSys2025

Abstract:Multimodal summarization integrating information from diverse data modalities presents a promising solution to aid the understanding of information within various processes. However, the application and advantages of multimodal summarization have not received much attention in model-based engineering (MBE), where it has become a cornerstone in the design and development of complex systems, leveraging formal models to improve understanding, validation and automation throughout the engineering lifecycle. UML and EMF diagrams in model-based engineering contain a large amount of multimodal information and intricate relational data. Hence, our study explores the application of multimodal large language models within the domain of model-based engineering to evaluate their capacity for understanding and identifying relationships, features, and functionalities embedded in UML and EMF diagrams. We aim to demonstrate the transformative potential benefits and limitations of multimodal summarization in improving productivity and accuracy in MBE practices. The proposed approach is evaluated within the context of automotive software development, while many promising state-of-art models were taken into account.

[AI-14] ToolFuzz – Automated Agent Tool Testing

Link: https://arxiv.org/abs/2503.04479
Authors: Ivan Milev, Mislav Balunović, Maximilian Baader, Martin Vechev
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Abstract:Large Language Model (LLM) Agents leverage the advanced reasoning capabilities of LLMs in real-world applications. To interface with an environment, these agents often rely on tools, such as web search or database APIs. As the agent provides the LLM with tool documentation along the user query, the completeness and correctness of this documentation is critical. However, tool documentation is often over-, under-, or ill-specified, impeding the agent’s accuracy. Standard software testing approaches struggle to identify these errors as they are expressed in natural language. Thus, despite its importance, there currently exists no automated method to test the tool documentation for agents. To address this issue, we present ToolFuzz, the first method for automated testing of tool documentations. ToolFuzz is designed to discover two types of errors: (1) user queries leading to tool runtime errors and (2) user queries that lead to incorrect agent responses. ToolFuzz can generate a large and diverse set of natural inputs, effectively finding tool description errors at a low false positive rate. Further, we present two straightforward prompt-engineering approaches. We evaluate all three tool testing approaches on 32 common LangChain tools and 35 newly created custom tools and 2 novel benchmarks to further strengthen the assessment. We find that many publicly available tools suffer from underspecification. Specifically, we show that ToolFuzz identifies 20x more erroneous inputs compared to the prompt-engineering approaches, making it a key component for building reliable AI agents.

[AI-15] DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models

Link: https://arxiv.org/abs/2503.04472
Authors: Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Shiguo Lian
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Work in progress

Abstract: Recent advancements in slow-thinking reasoning models have shown exceptional performance in complex reasoning tasks. However, these models often exhibit overthinking, generating redundant reasoning steps for simple problems, leading to excessive computational resource usage. While current mitigation strategies uniformly reduce reasoning tokens, they risk degrading performance on challenging tasks that require extended reasoning. This paper introduces Difficulty-Adaptive Slow-Thinking (DAST), a novel framework that enables models to autonomously adjust the length of Chain-of-Thought (CoT) based on problem difficulty. We first propose a Token Length Budget (TLB) metric to quantify difficulty, then leverage length-aware reward shaping and length preference optimization to implement DAST. DAST penalizes overlong responses for simple tasks while incentivizing sufficient reasoning for complex problems. Experiments on diverse datasets and model scales demonstrate that DAST effectively mitigates overthinking (reducing token usage by over 30% on average) while preserving reasoning accuracy on complex problems.

[AI-16] Privacy Preserving and Robust Aggregation for Cross-Silo Federated Learning in Non-IID Settings

Link: https://arxiv.org/abs/2503.04451
Authors: Marco Arazzi, Mert Cihangiroglu, Antonino Nocera
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Federated Averaging remains the most widely used aggregation strategy in federated learning due to its simplicity and scalability. However, its performance degrades significantly in non-IID data settings, where client distributions are highly imbalanced or skewed. Additionally, it relies on clients transmitting metadata, specifically the number of training samples, which introduces privacy risks and may conflict with regulatory frameworks like the European GDPR. In this paper, we propose a novel aggregation strategy that addresses these challenges by introducing class-aware gradient masking. Unlike traditional approaches, our method relies solely on gradient updates, eliminating the need for any additional client metadata, thereby enhancing privacy protection. Furthermore, our approach validates and dynamically weights client contributions based on class-specific importance, ensuring robustness against non-IID distributions, convergence prevention, and backdoor attacks. Extensive experiments on benchmark datasets demonstrate that our method not only outperforms FedAvg and other widely accepted aggregation strategies in non-IID settings but also preserves model integrity in adversarial scenarios. Our results establish the effectiveness of gradient masking as a practical and secure solution for federated learning.
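
For context, the sample-count dependence that the abstract criticizes can be seen in a minimal sketch of standard Federated Averaging, where each client's parameters are weighted by its number of training samples. This illustrates the baseline only, not the paper's class-aware gradient masking; the function name `fedavg` and the flat-list parameter representation are assumptions made here.

```python
def fedavg(client_weights, num_samples):
    """Sample-weighted average of client parameter vectors (plain FedAvg).

    client_weights: list of equal-length lists of floats, one per client.
    num_samples: per-client training-set sizes, the metadata that clients
    must disclose to the server in this baseline.
    """
    if len(client_weights) != len(num_samples):
        raise ValueError("one sample count is required per client")
    total = sum(num_samples)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, num_samples)) / total
        for i in range(dim)
    ]


global_model = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

The server cannot compute this average without `num_samples`, which is the metadata disclosure the proposed method is said to avoid.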

[AI-17] Activation Space Interventions Can Be Transferred Between Large Language Models

Link: https://arxiv.org/abs/2503.04429
Authors: Narmeen Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amirali Abdullah
Subjects: Artificial Intelligence (cs.AI)
Comments: 68 pages

Abstract: The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, the practical applications of representation universality remain largely unexplored. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models’ outputs in a predictable way. Additionally, we propose a new task, corrupted capabilities, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across Llama, Qwen and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable “lightweight safety switches”, allowing dynamic toggling between model behaviors.

[AI-18] PDX: A Data Layout for Vector Similarity Search SIGMOD’25

Link: https://arxiv.org/abs/2503.04422
Authors: Leonardo Kuffo,Elena Krippner,Peter Boncz
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: To be published in Proceedings of The 2025 International Conference on Management of Data (SIGMOD '25). For associated code, see this https URL

Abstract:We propose Partition Dimensions Across (PDX), a data layout for vectors (e.g., embeddings) that, similar to PAX [6], stores multiple vectors in one block, using a vertical layout for the dimensions (Figure 1). PDX accelerates exact and approximate similarity search thanks to its dimension-by-dimension search strategy that operates on multiple-vectors-at-a-time in tight loops. It beats SIMD-optimized distance kernels on standard horizontal vector storage (avg 40% faster), only relying on scalar code that gets auto-vectorized. We combined the PDX layout with recent dimension-pruning algorithms ADSampling [19] and BSA [52] that accelerate approximate vector search. We found that these algorithms on the horizontal vector layout can lose to SIMD-optimized linear scans, even if they are SIMD-optimized. However, when used on PDX, their benefit is restored to 2-7x. We find that search on PDX is especially fast if a limited number of dimensions has to be scanned fully, which is what the dimension-pruning approaches do. We finally introduce PDX-BOND, an even more flexible dimension-pruning strategy, with good performance on exact search and reasonable performance on approximate search. Unlike previous pruning algorithms, it can work on vector data “as-is” without preprocessing; making it attractive for vector databases with frequent updates.
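
The dimension-by-dimension strategy can be pictured with a small sketch. It assumes a block is stored as one list per dimension (the vertical layout) and accumulates squared L2 distances for all vectors in the block one dimension at a time. The helper names `to_pdx_block` and `l2_distances_pdx` are hypothetical, and pure Python only shows the access pattern, not the auto-vectorized speedups reported in the abstract.

```python
def to_pdx_block(vectors):
    """Transpose row-major vectors into a dimension-major (vertical) block."""
    return [list(column) for column in zip(*vectors)]


def l2_distances_pdx(block, query):
    """Squared L2 distance from `query` to every vector in the block."""
    n_vectors = len(block[0])
    distances = [0.0] * n_vectors
    for dim_values, q in zip(block, query):      # one dimension at a time
        for i, value in enumerate(dim_values):   # tight loop over all vectors
            diff = value - q
            distances[i] += diff * diff
    return distances


vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
block = to_pdx_block(vectors)
print(l2_distances_pdx(block, [1.0, 2.0]))  # [0.0, 8.0, 5.0]
```

Because partial distances for every vector are available after each dimension, a pruning rule can discard candidates before all dimensions are scanned, which is the property the dimension-pruning algorithms in the abstract rely on.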

[AI-19] From Idea to CAD: A Language Model-Driven Multi-Agent System for Collaborative Design

Link: https://arxiv.org/abs/2503.04417
Authors: Felix Ocker,Stefan Menzel,Ahmed Sadik,Thiago Rios
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 11 pages, 3 figures

Abstract:Creating digital models using Computer Aided Design (CAD) is a process that requires in-depth expertise. In industrial product development, this process typically involves entire teams of engineers, spanning requirements engineering, CAD itself, and quality assurance. We present an approach that mirrors this team structure with a Vision Language Model (VLM)-based Multi Agent System, with access to parametric CAD tooling and tool documentation. Combining agents for requirements engineering, CAD engineering, and vision-based quality assurance, a model is generated automatically from sketches and/or textual descriptions. The resulting model can be refined collaboratively in an iterative validation loop with the user. Our approach has the potential to increase the effectiveness of design processes, both for industry experts and for hobbyists who create models for 3D printing. We demonstrate the potential of the architecture on various design tasks and provide several ablations that show the benefits of the architecture’s individual components.

[AI-20] Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search ICLR2025

Link: https://arxiv.org/abs/2503.04412
Authors: Kou Misaki,Yuichi Inoue,Yuki Imajuku,So Kuroki,Taishi Nakamura,Takuya Akiba
Categories: Artificial Intelligence (cs.AI)
Comments: To appear at ICLR 2025 Workshop on Foundation Models in the Wild

Abstract:Recent advances demonstrate that increasing inference-time computation can significantly boost the reasoning capabilities of large language models (LLMs). Although repeated sampling (i.e., generating multiple candidate outputs) is a highly effective strategy, it does not leverage external feedback signals for refinement, which are often available in tasks like coding. In this work, we propose Adaptive Branching Monte Carlo Tree Search (AB-MCTS), a novel inference-time framework that generalizes repeated sampling with principled multi-turn exploration and exploitation. At each node in the search tree, AB-MCTS dynamically decides whether to “go wider” by expanding new candidate responses or “go deeper” by revisiting existing ones based on external feedback signals. We evaluate our method on complex coding and engineering tasks using frontier models. Empirical results show that AB-MCTS consistently outperforms both repeated sampling and standard MCTS, underscoring the importance of combining the response diversity of LLMs with multi-turn solution refinement for effective inference-time scaling.
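
The wider-or-deeper decision at a node can be pictured as a bandit choice between generating a new candidate and refining an existing one. The sketch below uses Thompson sampling over Beta posteriors, which is an assumption made for illustration: the abstract does not state the selection rule, and `choose_branch`, its score format, and the uniform prior are all hypothetical.

```python
import random


def choose_branch(child_scores, rng=random):
    """Decide whether to widen or deepen at one search-tree node.

    child_scores[i] lists feedback scores in [0, 1] (for example, the
    fraction of tests passed) observed so far under existing child i.
    Returns "wider" to generate a new candidate, or the index of the
    child to refine further.
    """
    def posterior_sample(scores):
        # Beta posterior with a uniform Beta(1, 1) prior; scores act as
        # fractional successes. The prior is an illustrative assumption.
        successes = sum(scores)
        failures = len(scores) - successes
        return rng.betavariate(1.0 + successes, 1.0 + failures)

    # The "wider" arm has no observations, so it is sampled from the prior.
    best_choice, best_value = "wider", posterior_sample([])
    for index, scores in enumerate(child_scores):
        value = posterior_sample(scores)
        if value > best_value:
            best_choice, best_value = index, value
    return best_choice


# One decision at a node with two existing children.
print(choose_branch([[0.9, 0.8], [0.1]], random.Random(0)))
```

With no children the function always returns "wider", which reduces to plain repeated sampling; as feedback accumulates, promising children are revisited more often.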

[AI-21] Training-Free Graph Filtering via Multimodal Feature Refinement for Extremely Fast Multimodal Recommendation

Link: https://arxiv.org/abs/2503.04406
Authors: Yu-Seung Roh,Joo-Young Kim,Jin-Duk Park,Won-Yong Shin
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 10 pages, 6 figures, 6 tables

Abstract:Multimodal recommender systems improve the performance of canonical recommender systems with no item features by utilizing diverse content types such as text, images, and videos, while alleviating inherent sparsity of user-item interactions and accelerating user engagement. However, current neural network-based models often incur significant computational overhead due to the complex training process required to learn and integrate information from multiple modalities. To overcome this limitation, we propose MultiModal-Graph Filtering (MM-GF), a training-free method based on the notion of graph filtering (GF) for efficient and accurate multimodal recommendations. Specifically, MM-GF first constructs multiple similarity graphs through nontrivial multimodal feature refinement such as robust scaling and vector shifting by addressing the heterogeneous characteristics across modalities. Then, MM-GF optimally fuses multimodal information using linear low-pass filters across different modalities. Extensive experiments on real-world benchmark datasets demonstrate that MM-GF not only improves recommendation accuracy by up to 13.35% compared to the best competitor but also dramatically reduces computational costs by achieving a runtime of less than 10 seconds.

[AI-22] Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling

Link: https://arxiv.org/abs/2503.04398
Authors: Yan Li,Pengfei Zheng,Shuang Chen,Zewei Xu,Yunfei Du,Zhengang Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:MoE (Mixture of Experts) prevails as a neural architecture that can scale modern transformer-based LLMs (Large Language Models) to unprecedented scales. Nevertheless, large MoEs’ great demands of computing power, memory capacity and memory bandwidth make scalable serving a fundamental challenge and efficient parallel inference has become a requisite to attain adequate throughput under latency constraints. DeepSpeed-MoE, one state-of-the-art MoE inference framework, adopts a 3D-parallel paradigm including EP (Expert Parallelism), TP (Tensor Parallel) and DP (Data Parallelism). However, our analysis shows DeepSpeed-MoE’s inference efficiency is largely bottlenecked by EP, which is implemented with costly all-to-all collectives to route token activation. Our work aims to boost DeepSpeed-MoE by strategically reducing EP’s communication overhead with a technique named Speculative MoE. Speculative MoE has two speculative parallelization schemes, speculative token shuffling and speculative expert grouping, which predict outstanding tokens’ expert routing paths and pre-schedule tokens and experts across devices to losslessly trim EP’s communication volume. Besides DeepSpeed-MoE, we also build Speculative MoE into a prevailing MoE inference engine SGLang. Experiments show Speculative MoE can significantly boost state-of-the-art MoE inference frameworks on fast homogeneous and slow heterogeneous interconnects.

[AI-23] AgentSafe: Safeguarding Large Language Model-based Multi-agent Systems via Hierarchical Data Management

Link: https://arxiv.org/abs/2503.04392
Authors: Junyuan Mao,Fanci Meng,Yifan Duan,Miao Yu,Xiaojun Jia,Junfeng Fang,Yuxuan Liang,Kun Wang,Qingsong Wen
Categories: Artificial Intelligence (cs.AI)

Abstract:Large Language Model based multi-agent systems are revolutionizing autonomous communication and collaboration, yet they remain vulnerable to security threats like unauthorized access and data breaches. To address this, we introduce AgentSafe, a novel framework that enhances multi-agent system (MAS) security through hierarchical information management and memory protection. AgentSafe classifies information by security levels, restricting sensitive data access to authorized agents. AgentSafe incorporates two components: ThreatSieve, which secures communication by verifying information authority and preventing impersonation, and HierarCache, an adaptive memory management system that defends against unauthorized access and malicious poisoning, representing the first systematic defense for agent memory. Experiments across various LLMs show that AgentSafe significantly boosts system resilience, achieving defense success rates above 80% under adversarial conditions. Additionally, AgentSafe demonstrates scalability, maintaining robust performance as agent numbers and information complexity grow. Results underscore the effectiveness of AgentSafe in securing MAS and its potential for real-world application.
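
The idea of classifying information by security level and restricting access to authorized agents can be sketched in a few lines. This is a minimal illustration under assumed names (`Message`, `visible_messages`, integer clearance levels); it does not reproduce ThreatSieve or HierarCache, whose internals the abstract does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    content: str
    level: int  # higher value means more sensitive (assumed convention)


def visible_messages(messages, clearance):
    """Return only the messages an agent with this clearance may read."""
    return [m for m in messages if m.level <= clearance]


inbox = [
    Message("planner", "meeting at 10:00", level=0),
    Message("vault", "database credentials", level=2),
]
# An agent with clearance 1 sees the public note but not the credentials.
print([m.content for m in visible_messages(inbox, clearance=1)])  # ['meeting at 10:00']
```

Filtering at read time keeps sensitive content out of a lower-clearance agent's context entirely, rather than relying on the agent to refuse to use it.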

[AI-24] Causally Reliable Concept Bottleneck Models

Link: https://arxiv.org/abs/2503.04363
Authors: Giovanni De Felice,Arianna Casanova Flores,Francesco De Santis,Silvia Santini,Johannes Schneider,Pietro Barbiero,Alberto Termine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Concept-based models are an emerging paradigm in deep learning that constrains the inference process to operate through human-interpretable concepts, facilitating explainability and human interaction. However, these architectures, on par with popular opaque neural models, fail to account for the true causal mechanisms underlying the target phenomena represented in the data. This hampers their ability to support causal reasoning tasks, limits out-of-distribution generalization, and hinders the implementation of fairness constraints. To overcome these issues, we propose Causally reliable Concept Bottleneck Models (C$^2$BMs), a class of concept-based architectures that enforce reasoning through a bottleneck of concepts structured according to a model of the real-world causal mechanisms. We also introduce a pipeline to automatically learn this structure from observational data and unstructured background knowledge (e.g., scientific literature). Experimental evidence suggests that C$^2$BMs are more interpretable, causally reliable, and improve responsiveness to interventions w.r.t. standard opaque and concept-based models, while maintaining their accuracy.

[AI-25] A Generalist Cross-Domain Molecular Learning Framework for Structure-Based Drug Discovery

Link: https://arxiv.org/abs/2503.04362
Authors: Yiheng Zhu,Mingyang Li,Junlong Liu,Kun Fu,Jiansheng Wu,Qiuyi Li,Mingze Yin,Jieping Ye,Jian Wu,Zheng Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)

Abstract:Structure-based drug discovery (SBDD) is a systematic scientific process that develops new drugs by leveraging the detailed physical structure of the target protein. Recent advancements in pre-trained models for biomolecules have demonstrated remarkable success across various biochemical applications, including drug discovery and protein engineering. However, in most approaches, the pre-trained models primarily focus on the characteristics of either small molecules or proteins, without delving into their binding interactions which are essential cross-domain relationships pivotal to SBDD. To fill this gap, we propose a general-purpose foundation model named BIT (an abbreviation for Biomolecular Interaction Transformer), which is capable of encoding a range of biochemical entities, including small molecules, proteins, and protein-ligand complexes, as well as various data formats, encompassing both 2D and 3D structures. Specifically, we introduce Mixture-of-Domain-Experts (MoDE) to handle the biomolecules from diverse biochemical domains and Mixture-of-Structure-Experts (MoSE) to capture positional dependencies in the molecular structures. The proposed mixture-of-experts approach enables BIT to achieve both deep fusion and domain-specific encoding, effectively capturing fine-grained molecular interactions within protein-ligand complexes. Then, we perform cross-domain pre-training on the shared Transformer backbone via several unified self-supervised denoising tasks. Experimental results on various benchmarks demonstrate that BIT achieves exceptional performance in downstream tasks, including binding affinity prediction, structure-based virtual screening, and molecular property prediction.

[AI-26] scDD: Latent Codes Based scRNA-seq Dataset Distillation with Foundation Model Knowledge

Link: https://arxiv.org/abs/2503.04357
Authors: Zhen Yu,Jianan Han,Yang Liu,Qingchao Chen
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Single-cell RNA sequencing (scRNA-seq) technology has profiled hundreds of millions of human cells across organs, diseases, development and perturbations to date. However, the high-dimensional sparsity, batch effect noise, category imbalance, and ever-increasing data scale of the original sequencing data pose significant challenges for multi-center knowledge transfer, data fusion, and cross-validation between scRNA-seq datasets. To address these barriers, (1) we first propose a latent codes-based scRNA-seq dataset distillation framework named scDD, which transfers and distills foundation model knowledge and original dataset information into a compact latent space and generates a synthetic scRNA-seq dataset with a generator to replace the original dataset. Then, (2) we propose a single-step conditional diffusion generator named SCDG, which performs single-step gradient back-propagation to help scDD optimize distillation quality and avoid gradient decay caused by multi-step back-propagation. Meanwhile, SCDG ensures the scRNA-seq data characteristics and inter-class discriminability of the synthetic dataset through flexible conditional control and generation quality assurance. Finally, we propose a comprehensive benchmark to evaluate the performance of scRNA-seq dataset distillation in different data analysis tasks. It is validated that our proposed method can achieve 7.61% absolute and 15.70% relative improvement over previous state-of-the-art methods on average across tasks.

[AI-27] Talking Back – human input and explanations to interactive AI systems

Link: https://arxiv.org/abs/2503.04343
Authors: Alan Dix,Tommaso Turchi,Ben Wilson,Anna Monreale,Matt Roach
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:While XAI focuses on providing AI explanations to humans, can the reverse - humans explaining their judgments to AI - foster richer, synergistic human-AI systems? This paper explores various forms of human inputs to AI and examines how human explanations can guide machine learning models toward automated judgments and explanations that align more closely with human concepts.

[AI-28] Provable Robust Overfitting Mitigation in Wasserstein Distributionally Robust Optimization

Link: https://arxiv.org/abs/2503.04315
Authors: Shuang Liu,Yihan Wang,Yifan Zhu,Yibo Miao,Xiao-Shan Gao
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)

Abstract:Wasserstein distributionally robust optimization (WDRO) optimizes against worst-case distributional shifts within a specified uncertainty set, leading to enhanced generalization on unseen adversarial examples, compared to standard adversarial training which focuses on pointwise adversarial perturbations. However, WDRO still suffers fundamentally from the robust overfitting problem, as it does not consider statistical error. We address this gap by proposing a novel robust optimization framework under a new uncertainty set for adversarial noise via Wasserstein distance and statistical error via Kullback-Leibler divergence, called the Statistically Robust WDRO. We establish a robust generalization bound for the new optimization framework, implying that out-of-distribution adversarial performance is at least as good as the statistically robust training loss with high probability. Furthermore, we derive conditions under which Stackelberg and Nash equilibria exist between the learner and the adversary, giving an optimal robust model in a certain sense. Finally, through extensive experiments, we demonstrate that our method significantly mitigates robust overfitting and enhances robustness within the framework of WDRO.
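
For reference, the standard WDRO objective that the abstract builds on is usually written as below, where $\hat{P}_n$ is the empirical distribution, $W$ is a Wasserstein distance, $\varepsilon$ is the radius of the uncertainty set, and $\ell$ is the loss. The notation is an assumption rather than the paper's own. The Statistically Robust variant additionally accounts for statistical error through a Kullback-Leibler term whose exact form the abstract does not give, so it is not shown here.

```latex
\min_{\theta} \; \sup_{Q \,:\, W(Q, \hat{P}_n) \le \varepsilon} \;
\mathbb{E}_{(x, y) \sim Q}\big[\ell(\theta; x, y)\big]
```

The inner supremum ranges over whole distributions near the data, in contrast to standard adversarial training, which perturbs each training point separately.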

[AI-29] Malware Detection at the Edge with Lightweight LLMs: A Performance Evaluation

Link: https://arxiv.org/abs/2503.04302
Authors: Christian Rondanini,Barbara Carminati,Elena Ferrari,Antonio Gaudiano,Ashish Kundu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:The rapid evolution of malware attacks calls for the development of innovative detection methods, especially in resource-constrained edge computing. Traditional detection techniques struggle to keep up with modern malware’s sophistication and adaptability, prompting a shift towards advanced methodologies like those leveraging Large Language Models (LLMs) for enhanced malware detection. However, deploying LLMs for malware detection directly at edge devices raises several challenges, including ensuring accuracy in constrained environments and addressing edge devices’ energy and computational limits. To tackle these challenges, this paper proposes an architecture leveraging lightweight LLMs’ strengths while addressing limitations like reduced accuracy and insufficient computational power. To evaluate the effectiveness of the proposed lightweight LLM-based approach for edge computing, we perform an extensive experimental evaluation using several state-of-the-art lightweight LLMs. We test them with several publicly available datasets specifically designed for edge and IoT scenarios and different edge nodes with varying computational power and characteristics.

[AI-30] Mapping AI Benchmark Data to Quantitative Risk Estimates Through Expert Elicitation

Link: https://arxiv.org/abs/2503.04299
Authors: Malcolm Murray,Henry Papadatos,Otter Quarks,Pierre-François Gimenez,Simeon Campos
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 4 figures

Abstract:The literature and multiple experts point to many potential risks from large language models (LLMs), but there are still very few direct measurements of the actual harms posed. AI risk assessment has so far focused on measuring the models’ capabilities, but the capabilities of models are only indicators of risk, not measures of risk. Better modeling and quantification of AI risk scenarios can help bridge this disconnect and link the capabilities of LLMs to tangible real-world harm. This paper makes an early contribution to this field by demonstrating how existing AI benchmarks can be used to facilitate the creation of risk estimates. We describe the results of a pilot study in which experts use information from Cybench, an AI benchmark, to generate probability estimates. We show that the methodology seems promising for this purpose, while noting improvements that can be made to further strengthen its application in quantitative AI risk assessment.

[AI-31] MathMistake Checker: A Comprehensive Demonstration for Step-by-Step Math Problem Mistake Finding by Prompt-Guided LLMs AAAI2025

Link: https://arxiv.org/abs/2503.04291
Authors: Tianyang Zhang,Zhuoxuan Jiang,Haotian Zhang,Lin Lin,Shaohua Zhang
Categories: Artificial Intelligence (cs.AI)
Comments: Published in AAAI 2025

Abstract:We propose a novel system, MathMistake Checker, designed to automate step-by-step mistake finding in mathematical problems with lengthy answers through a two-stage process. The system aims to simplify grading, increase efficiency, and enhance learning experiences from a pedagogical perspective. It integrates advanced technologies, including computer vision and the chain-of-thought capabilities of the latest large language models (LLMs). Our system supports open-ended grading without reference answers and promotes personalized learning by providing targeted feedback. We demonstrate its effectiveness across various types of math problems, such as calculation and word problems.

[AI-32] How Do Hackathons Foster Creativity? Towards AI Collaborative Evaluation of Creativity at Scale

Link: https://arxiv.org/abs/2503.04290
Authors: Jeanette Falk,Yiyi Chen,Janet Rafner,Mike Zhang,Johannes Bjerva,Alexander Nolte
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted in Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems

Abstract:Hackathons have become popular collaborative events for accelerating the development of creative ideas and prototypes. There are several case studies showcasing creative outcomes across domains such as industry, education, and research. However, there are no large-scale studies on creativity in hackathons which can advance theory on how hackathon formats lead to creative outcomes. We conducted a computational analysis of 193,353 hackathon projects. By operationalizing creativity through usefulness and novelty, we refined our dataset to 10,363 projects, allowing us to analyze how participant characteristics, collaboration patterns, and hackathon setups influence the development of creative projects. The contribution of our paper is twofold: We identified means for organizers to foster creativity in hackathons. We also explore the use of large language models (LLMs) to augment the evaluation of creative outcomes and discuss challenges and opportunities of doing this, which has implications for creativity research at large.

[AI-33] Explainable AI in Time-Sensitive Scenarios: Prefetched Offline Explanation Model

Link: https://arxiv.org/abs/2503.04283
Authors: Fabio Michele Russo,Carlo Metta,Anna Monreale,Salvatore Rinzivillo,Fabio Pinelli
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:As predictive machine learning models become increasingly adopted and advanced, their role has evolved from merely predicting outcomes to actively shaping them. This evolution has underscored the importance of Trustworthy AI, highlighting the necessity to extend our focus beyond mere accuracy and toward a comprehensive understanding of these models’ behaviors within the specific contexts of their applications. To further progress in explainability, we introduce Poem, Prefetched Offline Explanation Model, a model-agnostic, local explainability algorithm for image data. The algorithm generates exemplars, counterexemplars and saliency maps to provide quick and effective explanations suitable for time-sensitive scenarios. Leveraging an existing local algorithm, Poem infers factual and counterfactual rules from data to create illustrative examples and opposite scenarios with an enhanced stability by design. A novel mechanism then matches incoming test points with an explanation base and produces diverse exemplars, informative saliency maps and believable counterexemplars. Experimental results indicate that Poem outperforms its predecessor Abele in speed and ability to generate more nuanced and varied exemplars alongside more insightful saliency maps and valuable counterexemplars.

[AI-34] Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models

Link: https://arxiv.org/abs/2503.04280
Authors: Niccolò Turcato,Matteo Iovino,Aris Synodinos,Alberto Dalla Libera,Ruggero Carli,Pietro Falco
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advancements in Large Language Models (LLMs) and Visual Language Models (VLMs) have significantly impacted robotics, enabling high-level semantic motion planning applications. Reinforcement Learning (RL), a complementary paradigm, enables agents to autonomously optimize complex behaviors through interaction and reward signals. However, designing effective reward functions for RL remains challenging, especially in real-world tasks where sparse rewards are insufficient and dense rewards require elaborate design. In this work, we propose Autonomous Reinforcement learning for Complex HumanInformed Environments (ARCHIE), an unsupervised pipeline leveraging GPT-4, a pre-trained LLM, to generate reward functions directly from natural language task descriptions. The rewards are used to train RL agents in simulated environments, where we formalize the reward generation process to enhance feasibility. Additionally, GPT-4 automates the coding of task success criteria, creating a fully automated, one-shot procedure for translating human-readable text into deployable robot skills. Our approach is validated through extensive simulated experiments on single-arm and bi-manual manipulation tasks using an ABB YuMi collaborative robot, highlighting its practicality and effectiveness. Tasks are demonstrated on the real robot setup.

[AI-35] Prompt Programming: A Platform for Dialogue-based Computational Problem Solving with Generative AI Models ITiCSE’25

Link: https://arxiv.org/abs/2503.04267
Authors: Victor-Alexandru Pădurean,Paul Denny,Alkis Gotovos,Adish Singla
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Preprint of the ITiCSE’25 paper

Abstract:Computing students increasingly rely on generative AI tools for programming assistance, often without formal instruction or guidance. This highlights a need to teach students how to effectively interact with AI models, particularly through natural language prompts, to generate and critically evaluate code for solving computational tasks. To address this, we developed a novel platform for prompt programming that enables authentic dialogue-based interactions, supports problems involving multiple interdependent functions, and offers on-request execution of generated code. Data analysis from over 900 students in an introductory programming course revealed high engagement, with the majority of prompts occurring within multi-turn dialogues. Problems with multiple interdependent functions encouraged iterative refinement, with progression graphs highlighting several common strategies. Students were highly selective about the code they chose to test, suggesting that on-request execution of generated code promoted critical thinking. Given the growing importance of learning dialogue-based programming with AI, we provide this tool as a publicly accessible resource, accompanied by a corpus of programming problems for educational use.

[AI-36] Guidelines for Applying RL and MARL in Cybersecurity Applications

Link: https://arxiv.org/abs/2503.04262
Authors: Vasilios Mavroudis,Gregory Palmer,Sara Farmer,Kez Smithson Whitehead,David Foster,Adam Price,Ian Miles,Alberto Caron,Stephen Pasteris
Categories: Artificial Intelligence (cs.AI)

Abstract:Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL) have emerged as promising methodologies for addressing challenges in automated cyber defence (ACD). These techniques offer adaptive decision-making capabilities in high-dimensional, adversarial environments. This report provides a structured set of guidelines for cybersecurity professionals and researchers to assess the suitability of RL and MARL for specific use cases, considering factors such as explainability, exploration needs, and the complexity of multi-agent coordination. It also discusses key algorithmic approaches, implementation challenges, and real-world constraints, such as data scarcity and adversarial interference. The report further outlines open research questions, including policy optimality, agent cooperation levels, and the integration of MARL systems into operational cybersecurity frameworks. By bridging theoretical advancements and practical deployment, these guidelines aim to enhance the effectiveness of AI-driven cyber defence strategies.

[AI-37] VirtualXAI: A User-Centric Framework for Explainability Assessment Leveraging GPT-Generated Personas

Link: https://arxiv.org/abs/2503.04261
Authors: Georgios Makridis,Vasileios Koukos,Georgios Fatouros,Dimosthenis Kyriazis
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 6 figures

Abstract:In today’s data-driven era, computational systems generate vast amounts of data that drive the digital transformation of industries, where Artificial Intelligence (AI) plays a key role. Currently, the demand for eXplainable AI (XAI) has increased to enhance the interpretability, transparency, and trustworthiness of AI models. However, evaluating XAI methods remains challenging: existing evaluation frameworks typically focus on quantitative properties such as fidelity, consistency, and stability without taking into account qualitative characteristics such as satisfaction and interpretability. In addition, practitioners face a lack of guidance in selecting appropriate datasets, AI models, and XAI methods, a major hurdle in human-AI collaboration. To address these gaps, we propose a framework that integrates quantitative benchmarking with qualitative user assessments through virtual personas based on the “Anthology” of backstories of the Large Language Model (LLM). Our framework also incorporates a content-based recommender system that leverages dataset-specific characteristics to match new input data with a repository of benchmarked datasets. This yields an estimated XAI score and provides tailored recommendations for both the optimal AI model and the XAI method for a given scenario.

[AI-38] Knowledge Retention for Continual Model-Based Reinforcement Learning

Link: https://arxiv.org/abs/2503.04256
Authors: Yixiang Sun,Haotian Fu,Michael Littman,George Konidaris
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We propose DRAGO, a novel approach for continual model-based reinforcement learning aimed at improving the incremental development of world models across a sequence of tasks that differ in their reward functions but not the state space or dynamics. DRAGO comprises two key components: Synthetic Experience Rehearsal, which leverages generative models to create synthetic experiences from past tasks, allowing the agent to reinforce previously learned dynamics without storing data, and Regaining Memories Through Exploration, which introduces an intrinsic reward mechanism to guide the agent toward revisiting relevant states from prior tasks. Together, these components enable the agent to maintain a comprehensive and continually developing world model, facilitating more effective learning and adaptation across diverse environments. Empirical evaluations demonstrate that DRAGO is able to preserve knowledge across tasks, achieving superior performance in various continual learning scenarios.

[AI-39] How to Mitigate Overfitting in Weak-to-strong Generalization?

Link: https://arxiv.org/abs/2503.04249
Authors: Junhao Shi,Qinyuan Cheng,Zhaoye Fei,Yining Zheng,Qipeng Guo,Xipeng Qiu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Aligning powerful AI models on tasks that surpass human evaluation capabilities is the central problem of superalignment. To address this problem, weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors and ensure that the behavior of strong models aligns with the intentions of weak supervisors without unsafe behaviors such as deception. Although weak-to-strong generalization exhibits certain generalization capabilities, strong models exhibit significant overfitting in weak-to-strong generalization: due to the strong fitting ability of strong models, erroneous labels from weak supervisors may lead to overfitting in strong models. In addition, simply filtering out incorrect labels may lead to a degeneration in question quality, resulting in a weak generalization ability of strong models on hard questions. To mitigate overfitting in weak-to-strong generalization, we propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions. Experimental results in three series of large language models and two mathematical benchmarks demonstrate that our framework significantly improves PGR compared to naive weak-to-strong generalization, even achieving up to 100% PGR on some models.
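
The abstract reports results in PGR without defining it. Assuming it refers to performance gap recovered, the metric used in earlier weak-to-strong generalization work, it measures how much of the gap between the weak supervisor and the strong model's ceiling is closed by weak-to-strong training. The function name and the example scores below are illustrative, not taken from the paper.

```python
def performance_gap_recovered(weak, weak_to_strong, strong_ceiling):
    """Fraction of the weak-to-ceiling performance gap that is recovered.

    weak:           score of the weak supervisor
    weak_to_strong: score of the strong model trained on weak labels
    strong_ceiling: score of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap == 0:
        raise ValueError("strong ceiling equals weak performance; PGR is undefined")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies in percent: 15 of the 20-point gap is recovered.
print(performance_gap_recovered(60, 75, 80))  # 0.75
```

Under this definition, the 100% PGR mentioned in the abstract corresponds to the weakly supervised strong model matching its ground-truth-trained ceiling.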

[AI-40] One-Shot Clustering for Federated Learning

链接: https://arxiv.org/abs/2503.04231
作者: Maciej Krzysztof Zuziak,Roberto Pellungrini,Salvatore Rinzivillo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a widespread and well-adopted paradigm of decentralized learning that allows training one model from multiple sources without the need to directly transfer data between participating clients. Since its inception in 2015, it has been divided into numerous sub-fields that deal with application-specific issues, be it data heterogeneity or resource allocation. One such sub-field, Clustered Federated Learning (CFL), deals with the problem of clustering the population of clients into separate cohorts to deliver personalized models. Although a few remarkable works have been published in this domain, the problem is still largely unexplored, as its basic assumptions and settings are slightly different from standard FL. In this work, we present One-Shot Clustered Federated Learning (OCFL), a clustering-agnostic algorithm that can automatically detect the earliest suitable moment for clustering. Our algorithm is based on the computation of cosine similarity between gradients of the clients and a temperature measure that detects when the federated model starts to converge. We empirically evaluate our methodology by testing various one-shot clustering algorithms for over thirty different tasks on three benchmark datasets. Our experiments showcase the good performance of our approach when used to perform CFL in an automated manner without the need to adjust hyperparameters.
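
The abstract says the algorithm builds on cosine similarity between client gradients. A minimal sketch of that one ingredient, assuming each client's gradient has been flattened into a plain list of floats. The temperature measure and the clustering step are not specified in the abstract and are not reproduced here; all names and values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero gradients
    return dot / (norm_u * norm_v)


def similarity_matrix(client_grads):
    """Pairwise cosine similarities between all client gradients."""
    n = len(client_grads)
    return [[cosine_similarity(client_grads[i], client_grads[j]) for j in range(n)]
            for i in range(n)]


# Three hypothetical clients: the first two point the same way, the third is orthogonal.
grads = [[1.0, 2.0], [2.0, 4.0], [-2.0, 1.0]]
for row in similarity_matrix(grads):
    print([round(x, 3) for x in row])
```

A matrix like this could then be handed to any off-the-shelf clustering routine, which is in the spirit of the "clustering-agnostic" claim.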

[AI-41] Quantum-Inspired Reinforcement Learning in the Presence of Epistemic Ambivalence

链接: https://arxiv.org/abs/2503.04219
作者: Alireza Habibi,Saeed Ghoorchian,Setareh Maghsudi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:The complexity of online decision-making under uncertainty stems from the requirement of finding a balance between exploiting known strategies and exploring new possibilities. Naturally, the uncertainty type plays a crucial role in developing decision-making strategies that manage complexity effectively. In this paper, we focus on a specific form of uncertainty known as epistemic ambivalence (EA), which emerges from conflicting pieces of evidence or contradictory experiences. It creates a delicate interplay between uncertainty and confidence, distinguishing it from epistemic uncertainty that typically diminishes with new information. Indeed, ambivalence can persist even after additional knowledge is acquired. To address this phenomenon, we propose a novel framework, called the epistemically ambivalent Markov decision process (EA-MDP), aiming to understand and control EA in decision-making processes. This framework incorporates the concept of a quantum state from the quantum mechanics formalism, and its core is to assess the probability and reward of every possible outcome. We calculate the reward function using quantum measurement techniques and prove the existence of an optimal policy and an optimal value function in the EA-MDP framework. We also propose the EA-epsilon-greedy Q-learning algorithm. To evaluate the impact of EA on decision-making and the expedience of our framework, we study two distinct experimental setups, namely the two-state problem and the lattice problem. Our results show that using our methods, the agent converges to the optimal policy in the presence of EA.
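
For orientation, here is a sketch of the standard tabular epsilon-greedy Q-learning step that the proposed EA-epsilon-greedy variant presumably extends. This is the classical algorithm only: the quantum-measurement-based reward of the EA-MDP framework is not described in enough detail in the abstract to reproduce, so nothing below should be read as the authors' method. State and action indices and all numbers are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Classical Q-learning update: move Q(s, a) toward the bootstrapped target."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


# Two states, two actions, all values hypothetical.
rng = random.Random(0)
q = [[0.0, 0.0], [1.0, 2.0]]
action = epsilon_greedy(q[1], epsilon=0.1, rng=rng)
q_update(q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=0.9)
print(action, q)
```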

[AI-42] CrowdHMTware: A Cross-level Co-adaptation Middleware for Context-aware Mobile DL Deployment

链接: https://arxiv.org/abs/2503.04183
作者: Sicong Liu,Bin Guo,Shiyan Luo,Yuzhan Wang,Hao Luo,Cheng Fang,Yuan Xu,Ke Ma,Yao Li,Zhiwen Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper is accepted by IEEE Transactions on Mobile Computing

点击查看摘要

Abstract:There are many deep learning (DL) powered mobile and wearable applications today continuously and unobtrusively sensing the ambient surroundings to enhance all aspects of human […]. To enable robust and private mobile sensing, DL models are often deployed locally on resource-constrained mobile devices using techniques such as model compression or […]. However, existing methods, either front-end algorithm level (i.e. DL model compression/partitioning) or back-end scheduling level (i.e. operator/resource scheduling), cannot be locally online because they require offline retraining to ensure accuracy or rely on manually pre-defined strategies, struggle with dynamic […]. The primary challenge lies in feeding back runtime performance from the back-end level to the front-end level optimization decision. Moreover, the adaptive mobile DL model porting middleware with cross-level co-adaptation is less explored, particularly in mobile environments with diversity and dynamics. In response, we introduce CrowdHMTware, a dynamic context-adaptive DL model deployment middleware for heterogeneous mobile devices. It establishes an automated adaptation loop between cross-level functional components, i.e. elastic inference, scalable offloading, and model-adaptive engine, enhancing scalability and adaptability. Experiments with four typical tasks across 15 platforms and a real-world case study demonstrate that CrowdHMTware can effectively scale DL model, offloading, and engine actions across diverse platforms and tasks. It hides run-time system issues from developers, reducing the required developer expertise.

[AI-43] Towards Intelligent Transportation with Pedestrians and Vehicles In-the-Loop: A Surveillance Video-Assisted Federated Digital Twin Framework

链接: https://arxiv.org/abs/2503.04170
作者: Xiaolong Li,Jianhao Wei,Haidong Wang,Li Dong,Ruoyang Chen,Changyan Yi,Jun Cai,Dusit Niyato,Xuemin(Sherman)Shen
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In intelligent transportation systems (ITSs), incorporating pedestrians and vehicles in-the-loop is crucial for developing realistic and safe traffic management solutions. However, existing work falls short of simulating complex real-world ITS scenarios, primarily due to the lack of a digital twin implementation framework for characterizing interactions between pedestrians and vehicles at different locations in different traffic environments. In this article, we propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop. Specifically, SV-FDT builds comprehensive pedestrian-vehicle interaction models by leveraging multi-source traffic surveillance videos. Its architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time. We analyze key design requirements and challenges and present core guidelines for SV-FDT’s system implementation. A testbed evaluation demonstrates its effectiveness in optimizing traffic management. Comparisons with traditional terminal-server frameworks highlight SV-FDT’s advantages in mirroring delays, recognition accuracy, and subjective evaluation. Finally, we identify some open challenges and discuss future research directions.

[AI-44] Semantic Retrieval Augmented Contrastive Learning for Sequential Recommendation

链接: https://arxiv.org/abs/2503.04162
作者: Ziqiang Cui,Yunpeng Weng,Xing Tang,Xiaokun Zhang,Dugang Liu,Shiwei Li,Peiyang Liu,Bowei He,Weihong Luo,Xiuqiang He,Chen Ma
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sequential recommendation aims to model user preferences based on historical behavior sequences, which is crucial for various online platforms. Data sparsity remains a significant challenge in this area as most users have limited interactions and many items receive little attention. To mitigate this issue, contrastive learning has been widely adopted. By constructing positive sample pairs from the data itself and maximizing their agreement in the embedding space, it can leverage available data more effectively. Constructing reasonable positive sample pairs is crucial for the success of contrastive learning. However, current approaches struggle to generate reliable positive pairs as they either rely on representations learned from inherently sparse collaborative signals or use random perturbations which introduce significant uncertainty. To address these limitations, we propose a novel approach named Semantic Retrieval Augmented Contrastive Learning (SRA-CL), which leverages semantic information to improve the reliability of contrastive samples. SRA-CL comprises two main components: (1) Cross-Sequence Contrastive Learning via User Semantic Retrieval, which utilizes large language models (LLMs) to understand diverse user preferences and retrieve semantically similar users to form reliable positive samples through a learnable sample synthesis method; and (2) Intra-Sequence Contrastive Learning via Item Semantic Retrieval, which employs LLMs to comprehend items and retrieve similar items to perform semantic-based item substitution, thereby creating semantically consistent augmented views for contrastive learning. SRA-CL is plug-and-play and can be integrated into standard sequential recommendation models. Extensive experiments on four public datasets demonstrate the effectiveness and generalizability of the proposed approach.

[AI-45] Unseen Fake News Detection Through Casual Debiasing

链接: https://arxiv.org/abs/2503.04160
作者: Shuzhi Gong,Richard Sinnott,Jianzhong Qi,Cecile Paris
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: 2025 The Web Conference, 6 pages, 4 figures

点击查看摘要

Abstract:The widespread dissemination of fake news on social media poses significant risks, necessitating timely and accurate detection. However, existing methods struggle with unseen news due to their reliance on training data from past events and domains, leaving the challenge of detecting novel fake news largely unresolved. To address this, we identify biases in training data tied to specific domains and propose a debiasing solution FNDCD. Originating from causal analysis, FNDCD employs a reweighting strategy based on classification confidence and propagation structure regularization to reduce the influence of domain-specific biases, enhancing the detection of unseen fake news. Experiments on real-world datasets with non-overlapping news domains demonstrate FNDCD’s effectiveness in improving generalization across domains.

[AI-46] KidneyTalk-open: No-code Deployment of a Private Large Language Model with Medical Documentation-Enhanced Knowledge Database for Kidney Disease

链接: https://arxiv.org/abs/2503.04153
作者: Yongchao Long(1 and 2),Chao Yang(4 and 7),Gongzheng Tang(2),Jinwei Wang(4),Zhun Sui(9),Yuxi Zhou(1, 3),Shenda Hong(2, 8),Luxia Zhang(2, 4, 5, 6, 7) ((1) Department of Computer Science, Tianjin University of Technology, Tianjin, China, (2) National Institute of Health Data Science, Peking University, Beijing, China, (3) Institute of Internet Industry, Tsinghua University, Beijing, China, (4) Renal Division, Department of Medicine, Peking University First Hospital, Beijing, China, (5) Research Units of Diagnosis and Treatment of Immune-Mediated Kidney Diseases, Chinese Academy of Medical Sciences, Beijing, China, (6) State Key Laboratory of Vascular Homeostasis and Remodeling, Peking University, Beijing, China, (7) Center for Digital Health and Artificial Intelligence, Peking University First Hospital, Beijing, China, (8) Department of Emergency Medicine, Peking University First Hospital, Beijing, China, (9) Renal Department, Peking University People’s Hospital, Beijing, China)
类目: Artificial Intelligence (cs.AI)
*备注: Corresponding authors: zhanglx@bjmu. this http URL ; joy_yuxi@pku. this http URL ; hongshenda@pku. this http URL

点击查看摘要

Abstract:Privacy-preserving medical decision support for kidney disease requires localized deployment of large language models (LLMs) while maintaining clinical reasoning capabilities. Current solutions face three challenges: 1) Cloud-based LLMs pose data security risks; 2) Local model deployment demands technical expertise; 3) General LLMs lack mechanisms to integrate medical knowledge. Retrieval-augmented systems also struggle with medical document processing and clinical usability. We developed KidneyTalk-open, a desktop system integrating three technical components: 1) No-code deployment of state-of-the-art (SOTA) open-source LLMs (such as DeepSeek-r1, Qwen2.5) via a local inference engine; 2) A medical document processing pipeline combining context-aware chunking and intelligent filtering; 3) An Adaptive Retrieval and Augmentation Pipeline (AddRep) employing agent collaboration to improve the recall rate of medical documents. A graphical interface was designed to enable clinicians to manage medical documents and conduct AI-powered consultations without technical expertise. Experimental validation on 1,455 challenging nephrology exam questions demonstrates AddRep’s effectiveness: achieving 29.1% accuracy (+8.1% over baseline) with intelligent knowledge integration, while maintaining robustness through a 4.9% rejection rate to suppress hallucinations. Comparative case studies with mainstream products (AnythingLLM, Chatbox, GPT4ALL) demonstrate KidneyTalk-open’s superior performance on real clinical queries. KidneyTalk-open represents the first no-code medical LLM system enabling secure documentation-enhanced medical QA on desktop. Its design establishes a new framework for privacy-sensitive clinical AI applications. The system significantly lowers technical barriers while improving evidence traceability, enabling more medical staff or patients to use SOTA open-source LLMs conveniently.

[AI-47] MTS: A Deep Reinforcement Learning Portfolio Management Framework with Time-Awareness and Short-Selling

链接: https://arxiv.org/abs/2503.04143
作者: Fengchen Gu,Zhengyong Jiang,Ángel F. García-Fernández,Angelos Stefanidis,Jionglong Su,Huakang Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Portfolio management remains a crucial challenge in finance, with traditional methods often falling short in complex and volatile market environments. While deep reinforcement learning approaches have shown promise, they still face limitations in dynamic risk management, exploitation of temporal markets, and incorporation of complex trading strategies such as short-selling. These limitations can lead to suboptimal portfolio performance, increased vulnerability to market volatility, and missed opportunities in capturing potential returns from diverse market conditions. This paper introduces a Deep Reinforcement Learning Portfolio Management Framework with Time-Awareness and Short-Selling (MTS), offering a robust and adaptive strategy for sustainable investment performance. This framework utilizes a novel encoder-attention mechanism to address the limitations by incorporating temporal market characteristics, a parallel strategy for automated short-selling based on market trends, and risk management through innovative Incremental Conditional Value at Risk, enhancing adaptability and performance. Experimental validation on five diverse datasets from 2019 to 2023 demonstrates MTS’s superiority over traditional algorithms and advanced machine learning techniques. MTS consistently achieves higher cumulative returns, Sharpe, Omega, and Sortino ratios, underscoring its effectiveness in balancing risk and return while adapting to market dynamics. MTS demonstrates an average relative increase of 30.67% in cumulative returns and 29.33% in Sharpe ratio compared to the next best-performing strategies across various datasets.
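
Two of the quantities named in the abstract can be illustrated with textbook definitions. The sketch below computes a per-period Sharpe ratio and a plain historical Conditional Value at Risk (the mean of the worst losses). It is not the paper's "Incremental" CVaR, whose definition the abstract does not give; the function names, the `tail` parameter, and the return series are hypothetical.

```python
import statistics


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation (not annualized)."""
    excess = [r - risk_free for r in returns]
    spread = statistics.stdev(excess)
    if spread == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    return statistics.mean(excess) / spread


def historical_cvar(returns, tail=0.2):
    """Average loss over the worst `tail` fraction of periods (losses are positive)."""
    losses = sorted((-r for r in returns), reverse=True)
    k = max(1, int(len(losses) * tail))  # number of worst periods to average
    return sum(losses[:k]) / k


# Hypothetical per-period portfolio returns.
rets = [0.01, -0.02, 0.03, -0.04, 0.02, 0.0, 0.01, -0.01, 0.02, 0.01]
print(f"Sharpe: {sharpe_ratio(rets):.3f}")
print(f"CVaR (worst 20%): {historical_cvar(rets):.3f}")
```

A higher Sharpe ratio and a lower CVaR both indicate a better risk-adjusted profile, which is the sense in which the abstract speaks of balancing risk and return.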

[AI-48] Artificial Intelligence in Pronunciation Teaching: Use and Beliefs of Foreign Language Teachers

链接: https://arxiv.org/abs/2503.04128
作者: Georgios P. Georgiou
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pronunciation instruction in foreign language classrooms has often been an overlooked area of focus. With the widespread adoption of Artificial Intelligence (AI) and its potential benefits, investigating how AI is utilized in pronunciation teaching and understanding the beliefs of teachers about this tool is essential for improving learning outcomes. This study aims to examine how AI use for pronunciation instruction varies across different demographic and professional factors among teachers, and how these factors, including AI use, influence the beliefs of teachers about AI. The study involved 117 English as a Foreign Language (EFL) in-service teachers working in Cyprus, who completed an online survey designed to assess their beliefs about the effectiveness of AI, its drawbacks, and their willingness to integrate AI into their teaching practices. The results revealed that teachers were significantly more likely to agree on the perceived effectiveness of AI and their willingness to adopt it, compared to their concerns about its use. Furthermore, teachers working in higher education and adult education, as well as those who had received more extensive training, reported using AI more frequently in their teaching. Teachers who utilized AI more often expressed stronger agreement with its effectiveness, while those who had received more training were less likely to express concerns about its integration. Given the limited training that many teachers currently receive, these findings demonstrate the need for tailored training sessions that address the specific needs and concerns of educators, ultimately fostering the adoption of AI in pronunciation instruction.

[AI-49] Generalizability of Neural Networks Minimizing Empirical Risk Based on Expressive Ability

链接: https://arxiv.org/abs/2503.04111
作者: Lijia Yu,Yibo Miao,Yifan Zhu,Xiao-Shan Gao,Lijun Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:The primary objective of learning methods is generalization. Classic uniform generalization bounds, which rely on VC-dimension or Rademacher complexity, fail to explain the notable fact that over-parameterized deep learning models generalize well. On the other hand, algorithm-dependent generalization bounds, like stability bounds, often rely on strict assumptions. To establish generalizability under less stringent assumptions, this paper investigates the generalizability of neural networks that minimize or approximately minimize empirical risk. We establish a lower bound for population accuracy based on the expressiveness of these networks, which indicates that with an adequately large number of training samples and adequately large network sizes, these networks, including over-parameterized ones, can generalize effectively. Additionally, we provide a necessary condition for generalization, demonstrating that, for certain data distributions, the quantity of training data required to ensure generalization exceeds the network size needed to represent the corresponding data distribution. Finally, we provide theoretical insights into several phenomena in deep learning, including robust generalization, importance of over-parameterization, and effect of loss function on generalization.

[AI-50] InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions

链接: https://arxiv.org/abs/2503.04110
作者: Juntong Chen,Jiang Wu,Jiajing Guo,Vikram Mohanty,Xueming Li,Jorge Piazentin Ono,Wenbin He,Liu Ren,Dongyu Liu
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Manuscript submitted to EuroVis 2025

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) and generative visual analytics systems has transformed data-driven insights, yet significant challenges persist in accurately interpreting users’ analytical and interaction intents. While language inputs offer flexibility, they often lack precision, making the expression of complex intents inefficient, error-prone, and time-intensive. To address these limitations, we investigate the design space of multimodal interactions for generative visual analytics through a literature review and pilot brainstorming sessions. Building on these insights, we introduce a highly extensible workflow that integrates multiple LLM agents for intent inference and visualization generation. We develop InterChat, a generative visual analytics system that combines direct manipulation of visual elements with natural language inputs. This integration enables precise intent communication and supports progressive, visually driven exploratory data analyses. By employing effective prompt engineering, and contextual interaction linking, alongside intuitive visualization and interaction designs, InterChat bridges the gap between user interactions and LLM-driven visualizations, enhancing both interpretability and usability. Extensive evaluations, including two usage scenarios, a user study, and expert feedback, demonstrate the effectiveness of InterChat. Results show significant improvements in the accuracy and efficiency of handling complex visual analytics tasks, highlighting the potential of multimodal interactions to redefine user engagement and analytical depth in generative visual analytics.

[AI-51] SED2AM: Solving Multi-Trip Time-Dependent Vehicle Routing Problem using Deep Reinforcement Learning KDD

链接: https://arxiv.org/abs/2503.04085
作者: Arash Mozhdehi,Yunli Wang,Sun Sun,Xin Wang
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by ACM TKDD: this https URL

点击查看摘要

Abstract:Deep reinforcement learning (DRL)-based frameworks, featuring Transformer-style policy networks, have demonstrated their efficacy across various vehicle routing problem (VRP) variants. However, the application of these methods to the multi-trip time-dependent vehicle routing problem (MTTDVRP) with maximum working hours constraints – a pivotal element of urban logistics – remains largely unexplored. This paper introduces a DRL-based method called the Simultaneous Encoder and Dual Decoder Attention Model (SED2AM), tailored for the MTTDVRP with maximum working hours constraints. The proposed method introduces a temporal locality inductive bias to the encoding module of the policy networks, enabling it to effectively account for the time-dependency in travel distance or time. The decoding module of SED2AM includes a vehicle selection decoder that selects a vehicle from the fleet, effectively associating trips with vehicles for functional multi-trip routing. Additionally, this decoding module is equipped with a trip construction decoder leveraged for constructing trips for the vehicles. This policy model is equipped with two classes of state representations, fleet state and routing state, providing the information needed for effective route construction in the presence of maximum working hours constraints. Experimental results using real-world datasets from two major Canadian cities not only show that SED2AM outperforms the current state-of-the-art DRL-based and metaheuristic-based baselines but also demonstrate its generalizability to solve larger-scale problems.

[AI-52] Can We Optimize Deep RL Policy Weights as Trajectory Modeling? ICLR2025

链接: https://arxiv.org/abs/2503.04074
作者: Hongyao Tang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted as an extended abstract to ICLR 2025 Workshop on Weight Space Learning (WSL)

点击查看摘要

Abstract:Learning the optimal policy from a random network initialization is the theme of deep Reinforcement Learning (RL). As the scale of DRL training increases, treating DRL policy network weights as a new data modality and exploring the potential becomes appealing and possible. In this work, we focus on the policy learning path in deep RL, represented by the trajectory of network weights of historical policies, which reflects the evolvement of the policy learning process. Taking the idea of trajectory modeling with Transformer, we propose Transformer as Implicit Policy Learner (TIPL), which processes policy network weights in an autoregressive manner. We collect the policy learning path data by running independent RL training trials, with which we then train our TIPL model. In the experiments, we demonstrate that TIPL is able to fit the implicit dynamics of policy learning and perform the optimization of policy network by inference.

[AI-53] Continual Optimization with Symmetry Teleportation for Multi-Task Learning

链接: https://arxiv.org/abs/2503.04046
作者: Zhipeng Zhou,Ziqiao Meng,Pengcheng Wu,Peilin Zhao,Chunyan Miao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages,8 figures

点击查看摘要

Abstract:Multi-task learning (MTL) is a widely explored paradigm that enables the simultaneous learning of multiple tasks using a single model. Despite numerous solutions, the key issues of optimization conflict and task imbalance remain under-addressed, limiting performance. Unlike existing optimization-based approaches that typically reweight task losses or gradients to mitigate conflicts or promote progress, we propose a novel approach based on Continual Optimization with Symmetry Teleportation (COST). During MTL optimization, when an optimization conflict arises, we seek an alternative loss-equivalent point on the loss landscape to reduce conflict. Specifically, we utilize a low-rank adapter (LoRA) to facilitate this practical teleportation by designing convergent, loss-invariant objectives. Additionally, we introduce a historical trajectory reuse strategy to continually leverage the benefits of advanced optimizers. Extensive experiments on multiple mainstream datasets demonstrate the effectiveness of our approach. COST is a plug-and-play solution that enhances a wide range of existing MTL methods. When integrated with state-of-the-art methods, COST achieves superior performance.

[AI-54] Subgraph Federated Learning for Local Generalization ICLR2025

链接: https://arxiv.org/abs/2503.03995
作者: Sungwon Kim,Yoonho Lee,Yunhak Oh,Namkyeong Lee,Sukwon Yun,Junseok Lee,Sein Kim,Carl Yang,Chanyoung Park
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICLR 2025 (oral)

点击查看摘要

Abstract:Federated Learning (FL) on graphs enables collaborative model training to enhance performance without compromising the privacy of each client. However, existing methods often overlook the mutable nature of graph data, which frequently introduces new nodes and leads to shifts in label distribution. Since they focus solely on performing well on each client’s local data, they are prone to overfitting to their local distributions (i.e., local overfitting), which hinders their ability to generalize to unseen data with diverse label distributions. In contrast, our proposed method, FedLoG, effectively tackles this issue by mitigating local overfitting. Our model generates global synthetic data by condensing the reliable information from each class representation and its structural information across clients. Using these synthetic data as a training set, we alleviate the local overfitting problem by adaptively generalizing the absent knowledge within each local dataset. This enhances the generalization capabilities of local models, enabling them to handle unseen data effectively. Our model outperforms baselines in our proposed experimental settings, which are designed to measure generalization power to unseen data in practical scenarios. Our code is available at this https URL

[AI-55] Training neural networks faster with minimal tuning using pre-computed lists of hyperparameters for NAdamW

链接: https://arxiv.org/abs/2503.03986
作者: Sourabh Medapati,Priya Kasimbeg,Shankar Krishnan,Naman Agarwal,George Dahl
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Good defaults for the NAdamW optimizer that generalize well to unseen problems

点击查看摘要

Abstract:If we want to train a neural network using any of the most popular optimization algorithms, we are immediately faced with a dilemma: how to set the various optimization and regularization hyperparameters? When computational resources are abundant, there are a variety of methods for finding good hyperparameter settings, but when resources are limited the only realistic choices are using standard default values of uncertain quality and provenance, or tuning only a couple of the most important hyperparameters via extremely limited hand-designed sweeps. Extending the idea of default settings to a modest tuning budget, Metz et al. (2020) proposed using ordered lists of well-performing hyperparameter settings, derived from a broad hyperparameter search on a large library of training workloads. However, to date, no practical and performant hyperparameter lists that generalize to representative deep learning workloads have been demonstrated. In this paper, we present hyperparameter lists for NAdamW derived from extensive experiments on the realistic workloads in the AlgoPerf: Training Algorithms benchmark. Our hyperparameter lists also include values for basic regularization techniques (i.e. weight decay, label smoothing, and dropout). In particular, our best NAdamW hyperparameter list performs well on AlgoPerf held-out workloads not used to construct it, and represents a compelling turn-key approach to tuning when restricted to five or fewer trials. It also outperforms basic learning rate/weight decay sweeps and an off-the-shelf Bayesian optimization tool when restricted to the same budget.
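
The way an ordered hyperparameter list is meant to be consumed can be sketched as follows: with a budget of k trials, train with the first k settings in list order and keep the best. This is only an illustration of the idea described in the abstract. The settings shown are made-up placeholders, not the paper's published values, and `train_and_eval` stands in for a real training run that returns a validation loss.

```python
def tune_with_list(hparam_list, train_and_eval, budget=5):
    """Try the first `budget` settings in list order and return the best one.

    `train_and_eval` maps a settings dict to a validation loss (lower is better).
    """
    trials = hparam_list[:budget]
    if not trials:
        raise ValueError("need at least one hyperparameter setting and budget >= 1")
    scores = [train_and_eval(hp) for hp in trials]
    best = min(range(len(trials)), key=lambda i: scores[i])
    return trials[best], scores[best]


# Placeholder list; a real list would come from a broad offline search.
HPARAM_LIST = [
    {"learning_rate": 1e-2, "weight_decay": 0.1, "dropout": 0.0},
    {"learning_rate": 1e-3, "weight_decay": 0.05, "dropout": 0.1},
    {"learning_rate": 1e-4, "weight_decay": 0.01, "dropout": 0.1},
]


def toy_objective(hp):
    """Stand-in for training: pretends the ideal learning rate is 1e-3."""
    return abs(hp["learning_rate"] - 1e-3)


best_hp, best_loss = tune_with_list(HPARAM_LIST, toy_objective, budget=2)
print(best_hp, best_loss)
```

Because the list is ordered, a smaller budget simply uses a prefix of it, so a single list serves every budget up to its length.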

[AI-56] All-atom Diffusion Transformers: Unified generative modelling of molecules and materials

链接: https://arxiv.org/abs/2503.03965
作者: Chaitanya K. Joshi,Xiang Fu,Yi-Lun Liao,Vahe Gharakhanyan,Benjamin Kurt Miller,Anuroop Sriram,Zachary W. Ulissi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models are the standard toolkit for generative modelling of 3D atomic systems. However, for different types of atomic systems - such as molecules and materials - the generative processes are usually highly specific to the target system despite the underlying physics being the same. We introduce the All-atom Diffusion Transformer (ADiT), a unified latent diffusion framework for jointly generating both periodic materials and non-periodic molecular systems using the same model: (1) An autoencoder maps a unified, all-atom representation of molecules and materials to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent embeddings that the autoencoder can decode to sample new molecules or materials. Experiments on QM9 and MP20 datasets demonstrate that jointly trained ADiT generates realistic and valid molecules as well as materials, exceeding state-of-the-art results from molecule and crystal-specific models. ADiT uses standard Transformers for both the autoencoder and diffusion model, resulting in significant speedups during training and inference compared to equivariant diffusion models. Scaling ADiT up to half a billion parameters predictably improves performance, representing a step towards broadly generalizable foundation models for generative chemistry. Open source code: this https URL

[AI-57] WIP: Assessing the Effectiveness of ChatGPT in Preparatory Testing Activities

链接: https://arxiv.org/abs/2503.03951
作者: Susmita Haldar,Mary Pierce,Luiz Fernando Capretz
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: 5 pages

点击查看摘要

Abstract:This innovative practice WIP paper describes a research study that explores the integration of ChatGPT into the software testing curriculum and evaluates its effectiveness compared to human-generated testing artifacts. In a Capstone Project course, students were tasked with generating preparatory testing artifacts using ChatGPT prompts, which they had previously created manually. Their understanding and the effectiveness of the AI-generated artifacts were assessed through targeted questions. The results, drawn from this in-class assignment at a North American community college, indicate that while ChatGPT can automate many testing preparation tasks, it cannot fully replace human expertise. However, students, already familiar with Information Technology at the postgraduate level, found the integration of ChatGPT into their workflow to be straightforward. The study suggests that AI can be gradually introduced into software testing education to keep pace with technological advancements.

[AI-58] GlucoLens: Explainable Postprandial Blood Glucose Prediction from Diet and Physical Activity

Link: https://arxiv.org/abs/2503.03935
Authors: Abdullah Mamun,Asiful Arefeen,Susan B. Racette,Dorothy D. Sears,Corrie M. Whisner,Matthew P. Buman,Hassan Ghasemzadeh
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:Postprandial hyperglycemia, marked by the blood glucose level exceeding the normal range after meals, is a critical indicator of progression toward type 2 diabetes in prediabetic and healthy individuals. A key metric for understanding blood glucose dynamics after eating is the postprandial area under the curve (PAUC). Predicting PAUC in advance based on a person’s diet and activity level and explaining what affects postprandial blood glucose could allow an individual to adjust their lifestyle accordingly to maintain normal glucose levels. In this paper, we propose GlucoLens, an explainable machine learning approach to predict PAUC and hyperglycemia from diet, activity, and recent glucose patterns. We conducted a five-week user study with 10 full-time working individuals to develop and evaluate the computational model. Our machine learning model takes multimodal data including fasting glucose, recent glucose, recent activity, and macronutrient amounts, and provides an interpretable prediction of the postprandial glucose pattern. Our extensive analyses of the collected data revealed that the trained model achieves a normalized root mean squared error (NRMSE) of 0.123. On average, GlucoLens with a Random Forest backbone provides a 16% better result than the baseline models. Additionally, GlucoLens predicts hyperglycemia with an accuracy of 74% and recommends different options to help avoid hyperglycemia through diverse counterfactual explanations. Code available: this https URL.
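
The two quantities named in this abstract, PAUC and NRMSE, can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, the area is computed with the plain trapezoidal rule, and the RMSE is normalized by the range of the true values, which is only one common convention (the paper may normalize differently).

```python
import math
from typing import Sequence


def trapezoid_auc(times_min: Sequence[float], glucose: Sequence[float]) -> float:
    """Area under a glucose curve using the trapezoidal rule.

    times_min: sample times in minutes (increasing).
    glucose:   glucose readings at those times (e.g. mg/dL).
    """
    if len(times_min) != len(glucose) or len(times_min) < 2:
        raise ValueError("need at least two (time, glucose) samples")
    area = 0.0
    for i in range(1, len(times_min)):
        dt = times_min[i] - times_min[i - 1]
        area += 0.5 * dt * (glucose[i] + glucose[i - 1])
    return area


def nrmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """RMSE divided by the range of y_true (assumed normalization)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    span = max(y_true) - min(y_true)
    if span == 0:
        raise ValueError("y_true has zero range; NRMSE is undefined")
    return math.sqrt(mse) / span


# Hypothetical readings at 0, 30 and 60 minutes after a meal.
pauc = trapezoid_auc([0, 30, 60], [100, 140, 120])
```

Under this convention an NRMSE of 0.123 would mean the typical prediction error is about 12% of the observed PAUC range.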

[AI-59] “Impressively Scary:” Exploring User Perceptions and Reactions to Unraveling Machine Learning Models in Social Media Applications

Link: https://arxiv.org/abs/2503.03927
Authors: Jack West,Bengisu Cagiltay,Shirley Zhang,Jingjie Li,Kassem Fawaz,Suman Banerjee
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 21 pages, 2 figures, to appear at CHI 2025

Abstract:Machine learning models deployed locally on social media applications are used for features, such as face filters which read faces in real time, and they expose sensitive attributes to the apps. However, the deployment of machine learning models, e.g., when, where, and how they are used, in social media applications is opaque to users. We aim to address this inconsistency and investigate how social media user perceptions and behaviors change once exposed to these models. We conducted user studies (N=21) and found that participants were unaware of both what the models output and when the models were used in Instagram and TikTok, two major social media platforms. In response to being exposed to the models’ functionality, we observed long-term behavior changes in 8 participants. Our analysis uncovers the challenges and opportunities in providing transparency for machine learning models that interact with local user data.

[AI-60] De-skilling, Cognitive Offloading, and Misplaced Responsibilities: Potential Ironies of AI-Assisted Design

Link: https://arxiv.org/abs/2503.03924
Authors: Prakash Shukla,Phuong Bui,Sean S Levy,Max Kowalski,Ali Baigelenov,Paul Parsons
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:The rapid adoption of generative AI (GenAI) in design has sparked discussions about its benefits and unintended consequences. While AI is often framed as a tool for enhancing productivity by automating routine tasks, historical research on automation warns of paradoxical effects, such as de-skilling and misplaced responsibilities. To assess UX practitioners’ perceptions of AI, we analyzed over 120 articles and discussions from UX-focused subreddits. Our findings indicate that while practitioners express optimism about AI reducing repetitive work and augmenting creativity, they also highlight concerns about over-reliance, cognitive offloading, and the erosion of critical design skills. Drawing from human-automation interaction literature, we discuss how these perspectives align with well-documented automation ironies and function allocation challenges. We argue that UX professionals should critically evaluate AI’s role beyond immediate productivity gains and consider its long-term implications for creative autonomy and expertise. This study contributes empirical insights into practitioners’ perspectives and links them to broader debates on automation in design.

[AI-61] Learning to Negotiate via Voluntary Commitment AISTATS2025

Link: https://arxiv.org/abs/2503.03866
Authors: Shuhui Zhu,Baoxiang Wang,Sriram Ganapathi Subramanian,Pascal Poupart
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: AISTATS 2025

Abstract:The partial alignment and conflict of autonomous agents lead to mixed-motive scenarios in many real-world applications. However, agents may fail to cooperate in practice even when cooperation yields a better outcome. One well known reason for this failure comes from non-credible commitments. To facilitate commitments among agents for better cooperation, we define Markov Commitment Games (MCGs), a variant of commitment games, where agents can voluntarily commit to their proposed future plans. Based on MCGs, we propose a learnable commitment protocol via policy gradients. We further propose incentive-compatible learning to accelerate convergence to equilibria with better social welfare. Experimental results in challenging mixed-motive tasks demonstrate faster empirical convergence and higher returns for our method compared with its counterparts. Our code is available at this https URL.

[AI-62] RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction ALT

Link: https://arxiv.org/abs/2503.03802
Authors: Fenglin Liu,Jinge Wu,Hongjian Zhou,Xiao Gu,Soheila Molaei,Anshul Thakur,Lei Clifton,Honghan Wu,David A. Clifton
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 18 pages, 6 figures, 4 tables, code is available at this https URL

Abstract:The application of Large Language Models (LLMs) to various clinical applications has attracted growing research attention. However, real-world clinical decision-making differs significantly from the standardized, exam-style scenarios commonly used in current efforts. In this paper, we present the RiskAgent system to perform a broad range of medical risk predictions, covering over 387 risk scenarios across diverse complex diseases, e.g., cardiovascular disease and cancer. RiskAgent is designed to collaborate with hundreds of clinical decision tools, i.e., risk calculators and scoring systems that are supported by evidence-based medicine. To evaluate our method, we have built the first benchmark MedRisk specialized for risk prediction, including 12,352 questions spanning 154 diseases, 86 symptoms, 50 specialties, and 24 organ systems. The results show that our RiskAgent, with 8 billion model parameters, achieves 76.33% accuracy, outperforming the most recent commercial LLMs, o1, o3-mini, and GPT-4.5, and doubling the 38.39% accuracy of GPT-4o. On rare diseases, e.g., Idiopathic Pulmonary Fibrosis (IPF), RiskAgent outperforms o1 and GPT-4.5 by 27.27% and 45.46% accuracy, respectively. Finally, we further conduct a generalization evaluation on an external evidence-based diagnosis benchmark and show that our RiskAgent achieves the best results. These encouraging results demonstrate the great potential of our solution for diverse diagnosis domains. To improve the adaptability of our model in different scenarios, we have built and open-sourced a family of models ranging from 1 billion to 70 billion parameters. Our code, data, and models are all available at this https URL.

[AI-63] VoiceGRPO: Modern MoE Transformers with Group Relative Policy Optimization (GRPO) for AI Voice Health Care Applications on Voice Pathology Detection

Link: https://arxiv.org/abs/2503.03797
Authors: Enkhtogtokh Togootogtokh,Christian Klasen
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:This research introduces novel AI techniques, Mixture-of-Experts Transformers with Group Relative Policy Optimization (GRPO), for voice health care applications on voice pathology detection. Along with the architectural innovations, we adopt advanced training paradigms inspired by reinforcement learning, namely Proximal Policy Optimization (PPO) and Group-wise Regularized Policy Optimization (GRPO), to enhance model stability and performance. Experiments conducted on a synthetically generated voice pathology dataset demonstrate that our proposed models significantly improve diagnostic accuracy, F1 score, and ROC-AUC compared to conventional approaches. These findings underscore the potential of integrating transformer architectures with novel training strategies to advance automated voice pathology detection and ultimately contribute to more effective healthcare delivery. The code we used to train and evaluate our models is available at this https URL

[AI-64] Human Implicit Preference-Based Policy Fine-tuning for Multi-Agent Reinforcement Learning in USV Swarm

Link: https://arxiv.org/abs/2503.03796
Authors: Hyeonjun Kim,Kanghoon Lee,Junho Park,Jiachen Li,Jinkyoo Park
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 7 pages, 4 figures

Abstract:Multi-Agent Reinforcement Learning (MARL) has shown promise in solving complex problems involving cooperation and competition among agents, such as an Unmanned Surface Vehicle (USV) swarm used in search and rescue, surveillance, and vessel protection. However, aligning system behavior with user preferences is challenging due to the difficulty of encoding expert intuition into reward functions. To address the issue, we propose a Reinforcement Learning with Human Feedback (RLHF) approach for MARL that resolves credit-assignment challenges through an Agent-Level Feedback system categorizing feedback into intra-agent, inter-agent, and intra-team types. To overcome the challenges of direct human feedback, we employ a Large Language Model (LLM) evaluator to validate our approach using feedback scenarios such as region constraints, collision avoidance, and task allocation. Our method effectively refines USV swarm policies, addressing key challenges in multi-agent systems while maintaining fairness and performance consistency.

[AI-65] Synthetic Data Augmentation for Enhancing Harmful Algal Bloom Detection with Machine Learning

Link: https://arxiv.org/abs/2503.03794
Authors: Tianyi Huang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted Paper at the 2025 IEEE Conference on Technologies for Sustainability (SusTech)

Abstract:Harmful Algal Blooms (HABs) pose severe threats to aquatic ecosystems and public health, resulting in substantial economic losses globally. Early detection is crucial but often hindered by the scarcity of high-quality datasets necessary for training reliable machine learning (ML) models. This study investigates the use of synthetic data augmentation using Gaussian Copulas to enhance ML-based HAB detection systems. Synthetic datasets of varying sizes (100-1,000 samples) were generated using relevant environmental features (water temperature, salinity, and UVB radiation) with corrected Chlorophyll-a concentration as the target variable. Experimental results demonstrate that moderate synthetic augmentation significantly improves model performance (RMSE reduced from 0.4706 to 0.1850; p < 0.001). However, excessive synthetic data introduces noise and reduces predictive accuracy, emphasizing the need for a balanced approach to data augmentation. These findings highlight the potential of synthetic data to enhance HAB monitoring systems, offering a scalable and cost-effective method for early detection and mitigation of ecological and public health risks.
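
To make the Gaussian copula idea concrete, here is a minimal standard-library sketch of one way such a sampler can work. It is an illustration under stated assumptions, not the study's pipeline: all function names are hypothetical, marginals are modeled by empirical quantiles, and dependence is captured by the correlation of rank-based normal scores.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def normal_scores(column):
    """Map each value to a normal score via its rank (rank / (n + 1))."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(a, b):
    """Pearson correlation; assumes neither column is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def cholesky(m):
    """Lower-triangular Cholesky factor of a small correlation matrix."""
    d = len(m)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(max(m[i][i] - s, 0.0))
            else:
                low[i][j] = (m[i][j] - s) / low[j][j] if low[j][j] > 0 else 0.0
    return low


def empirical_quantile(sorted_col, u):
    """Linear-interpolated quantile of sorted data for u in [0, 1]."""
    pos = u * (len(sorted_col) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_col) - 1)
    frac = pos - lo
    return sorted_col[lo] * (1 - frac) + sorted_col[hi] * frac


def sample_gaussian_copula(columns, n_samples, seed=0):
    """Draw synthetic rows that keep each marginal and the rank dependence."""
    rng = random.Random(seed)
    d = len(columns)
    z = [normal_scores(c) for c in columns]
    corr = [[1.0 if i == j else correlation(z[i], z[j]) for j in range(d)]
            for i in range(d)]
    low = cholesky(corr)
    sorted_cols = [sorted(c) for c in columns]
    rows = []
    for _ in range(n_samples):
        e = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Correlated normal draw, then back to each variable's own scale.
        zc = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(d)]
        rows.append([empirical_quantile(sorted_cols[i], STD_NORMAL.cdf(zc[i]))
                     for i in range(d)])
    return rows
```

Because synthetic values are interpolated from the observed marginals, they never leave the observed range. That is one reason very large synthetic sets add little new information, consistent with the abstract's caution about excessive augmentation.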

[AI-66] Rebalanced Multimodal Learning with Data-aware Unimodal Sampling

Link: https://arxiv.org/abs/2503.03792
Authors: Qingyuan Jiang,Zhouyang Chi,Xiao Ma,Qirong Mao,Yang Yang,Jinhui Tang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:To address the modality learning degeneration caused by modality imbalance, existing multimodal learning (MML) approaches primarily attempt to balance the optimization process of each modality from the perspective of model learning. However, almost all existing methods ignore the modality imbalance caused by unimodal data sampling, i.e., equal unimodal data sampling often results in discrepancies in informational content, leading to modality imbalance. Therefore, in this paper, we propose a novel MML approach called Data-aware Unimodal Sampling (DUS), which aims to dynamically alleviate the modality imbalance caused by sampling. Specifically, we first propose a novel cumulative modality discrepancy to monitor the multimodal learning process. Based on the learning status, we propose a heuristic and a reinforcement learning (RL)-based data-aware unimodal sampling approach to adaptively determine the quantity of sampled data at each iteration, thus alleviating the modality imbalance from the perspective of sampling. Meanwhile, our method can be seamlessly incorporated into almost all existing multimodal learning approaches as a plugin. Experiments demonstrate that DUS can achieve the best performance by comparing with diverse state-of-the-art (SOTA) baselines.

[AI-67] Predicting Team Performance from Communications in Simulated Search-and-Rescue

Link: https://arxiv.org/abs/2503.03791
Authors: Ali Jalal-Kamali,Nikolos Gurney,David Pynadath
Subjects: Artificial Intelligence (cs.AI)

Abstract:Understanding how individual traits influence team performance is valuable, but these traits are not always directly observable. Prior research has inferred traits like trust from behavioral data. We analyze conversational data to identify team traits and their correlation with teaming outcomes. Using transcripts from a Minecraft-based search-and-rescue experiment, we apply topic modeling and clustering to uncover key interaction patterns. Our findings show that variations in teaming outcomes can be explained through these inferences, with different levels of predictive power derived from individual traits and team dynamics.

[AI-68] Positive-Unlabeled Diffusion Models for Preventing Sensitive Data Generation ICLR2025 IROS

Link: https://arxiv.org/abs/2503.03789
Authors: Hiroshi Takahashi,Tomoharu Iwata,Atsutoshi Kumagai,Yuuki Yamanaka,Tomoya Yamashita
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at ICLR2025. Code is available at this https URL

Abstract:Diffusion models are powerful generative models but often generate sensitive data that are unwanted by users, mainly because the unlabeled training data frequently contain such sensitive data. Since labeling all sensitive data in the large-scale unlabeled training data is impractical, we address this problem by using a small amount of labeled sensitive data. In this paper, we propose positive-unlabeled diffusion models, which prevent the generation of sensitive data using unlabeled and sensitive data. Our approach can approximate the evidence lower bound (ELBO) for normal (negative) data using only unlabeled and sensitive (positive) data. Therefore, even without labeled normal data, we can maximize the ELBO for normal data and minimize it for labeled sensitive data, ensuring the generation of only normal data. Through experiments across various datasets and settings, we demonstrated that our approach can prevent the generation of sensitive images without compromising image quality.
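
The approximation described in this abstract rests on a standard positive-unlabeled identity, sketched below. The notation is introduced here for illustration and is not taken from the paper: $\beta$ is the assumed fraction of sensitive data in the unlabeled set, and $\ell(x)$ stands for any per-sample objective such as the ELBO.

```latex
% Unlabeled data as a mixture of sensitive (P) and normal (N) data
p_U(x) = \beta\, p_P(x) + (1-\beta)\, p_N(x)
\quad\Longrightarrow\quad
\mathbb{E}_{p_N}\!\left[\ell(x)\right]
  = \frac{\mathbb{E}_{p_U}\!\left[\ell(x)\right] - \beta\, \mathbb{E}_{p_P}\!\left[\ell(x)\right]}{1-\beta}
```

The right-hand side uses only unlabeled and labeled sensitive samples, which is how an objective for normal data can be estimated without any labeled normal data. The paper's exact estimator may differ from this generic form.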

[AI-69] Accelerating Focal Search in Multi-Agent Path Finding with Tighter Lower Bounds

Link: https://arxiv.org/abs/2503.03779
Authors: Yimin Tang,Zhenghong Yu,Jiaoyang Li,Sven Koenig
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 7 pages

Abstract:Multi-Agent Path Finding (MAPF) involves finding collision-free paths for multiple agents while minimizing a cost function, an NP-hard problem. Bounded suboptimal methods like Enhanced Conflict-Based Search (ECBS) and Explicit Estimation CBS (EECBS) balance solution quality with computational efficiency using focal search mechanisms. While effective, traditional focal search faces a limitation: the lower bound (LB) value determining which nodes enter the FOCAL list often increases slowly in early search stages, resulting in a constrained search space that delays finding valid solutions. In this paper, we propose a novel bounded suboptimal algorithm, double-ECBS (DECBS), to address this issue by first determining the maximum LB value and then employing a best-first search guided by this LB to find a collision-free path. Experimental results demonstrate that DECBS outperforms ECBS in most test cases and is compatible with existing optimization techniques. DECBS can reduce high-level CT nodes by nearly 30% and low-level focal search nodes by 50%. When agent density is moderate to high, DECBS achieves a 23.5% average runtime improvement over ECBS with identical suboptimality bounds and optimizations.
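
The FOCAL-list mechanism that this abstract refers to can be sketched generically as follows. This shows plain focal search selection only, not DECBS itself; the `Node` fields and function names are hypothetical, and the suboptimality factor `w` is assumed to be at least 1.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    f: float        # lower-bound estimate of the solution cost via this node
    conflicts: int  # secondary heuristic, e.g. number of path collisions
    name: str


def focal_list(open_nodes: List[Node], w: float) -> Tuple[float, List[Node]]:
    """Return the current lower bound LB and the nodes with f <= w * LB."""
    if not open_nodes:
        raise ValueError("OPEN is empty")
    if w < 1.0:
        raise ValueError("suboptimality factor w must be >= 1")
    lb = min(n.f for n in open_nodes)
    return lb, [n for n in open_nodes if n.f <= w * lb]


def pop_from_focal(open_nodes: List[Node], w: float) -> Node:
    """Expand the FOCAL node with the fewest conflicts (ties broken by f)."""
    _, focal = focal_list(open_nodes, w)
    best = min(focal, key=lambda n: (n.conflicts, n.f))
    open_nodes.remove(best)
    return best


# Hypothetical OPEN list: node "c" is conflict-free but outside FOCAL.
open_nodes = [Node(10.0, 3, "a"), Node(11.0, 0, "b"), Node(13.0, 0, "c")]
lb, focal = focal_list(open_nodes, w=1.2)
```

In this toy OPEN list the lower bound is 10, so with `w = 1.2` only nodes with `f <= 12` are eligible. Node `c` stays locked out until the lower bound rises to about 10.83, which illustrates the slow-growth problem the abstract describes.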

[AI-70] FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference

Link: https://arxiv.org/abs/2503.03777
Authors: Hongchao Du,Shangyu Wu,Arina Kharlamova,Nan Guan,Chun Jason Xue
Subjects: Operating Systems (cs.OS); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, to be published in EuroMLSys '25

Abstract:Large Language Models (LLMs) face challenges for on-device inference due to high memory demands. Traditional methods to reduce memory usage often compromise performance and lack adaptability. We propose FlexInfer, an optimized offloading framework for on-device inference, addressing these issues with techniques like asynchronous prefetching, balanced memory locking, and flexible tensor preservation. These strategies enhance memory efficiency and mitigate I/O bottlenecks, ensuring high performance within user-specified resource constraints. Experiments demonstrate that FlexInfer significantly improves throughput under limited resources, achieving up to 12.5 times better performance than existing methods and facilitating the deployment of large models on resource-constrained devices.

[AI-71] BotUmc: An Uncertainty-Aware Twitter Bot Detection with Multi-view Causal Inference

Link: https://arxiv.org/abs/2503.03775
Authors: Tao Yang,Yang Hu,Feihong Lu,Ziwei Zhang,Qingyun Sun,Jianxin Li
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:Social bots have become widely known by users of social platforms. To prevent social bots from spreading harmful speech, many novel bot detection methods have been proposed. However, with the evolution of social bots, detection methods struggle to give high-confidence answers for samples. This motivates us to quantify the uncertainty of the outputs, informing the confidence of the results. Therefore, we propose an uncertainty-aware bot detection method to inform the confidence and use the uncertainty score to pick a high-confidence decision from multiple views of a social network under different environments. Specifically, our proposed BotUmc uses an LLM to extract information from tweets. Then, we construct a graph based on the extracted information, the original user information, and the user relationship and generate multiple views of the graph by causal interference. Lastly, an uncertainty loss is used to force the model to quantify the uncertainty of results and select the result with low uncertainty in one view as the final decision. Extensive experiments show the superiority of our method.

[AI-72] Fair Play in the Fast Lane: Integrating Sportsmanship into Autonomous Racing Systems

Link: https://arxiv.org/abs/2503.03774
Authors: Zhenmin Huang,Ce Hao,Wei Zhan,Jun Ma,Masayoshi Tomizuka
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Robotics (cs.RO); Systems and Control (eess.SY)

Abstract:Autonomous racing has gained significant attention as a platform for high-speed decision-making and motion control. While existing methods primarily focus on trajectory planning and overtaking strategies, the role of sportsmanship in ensuring fair competition remains largely unexplored. In human racing, rules such as the one-motion rule and the enough-space rule prevent dangerous and unsportsmanlike behavior. However, autonomous racing systems often lack mechanisms to enforce these principles, potentially leading to unsafe maneuvers. This paper introduces a bi-level game-theoretic framework to integrate sportsmanship (SPS) into versus racing. At the high level, we model racing intentions using a Stackelberg game, where Monte Carlo Tree Search (MCTS) is employed to derive optimal strategies. At the low level, vehicle interactions are formulated as a Generalized Nash Equilibrium Problem (GNEP), ensuring that all agents follow sportsmanship constraints while optimizing their trajectories. Simulation results demonstrate the effectiveness of the proposed approach in enforcing sportsmanship rules while maintaining competitive performance. We analyze different scenarios where attackers and defenders adhere to or disregard sportsmanship rules and show how knowledge of these constraints influences strategic decision-making. This work highlights the importance of balancing competition and fairness in autonomous racing and provides a foundation for developing ethical and safe AI-driven racing systems.

[AI-73] Efficient Finetuning for Dimensional Speech Emotion Recognition in the Age of Transformers ICASSP2025 ICASSP

Link: https://arxiv.org/abs/2503.03756
Authors: Aneesha Sampath,James Tavernor,Emily Mower Provost
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Abstract:Accurate speech emotion recognition is essential for developing human-facing systems. Recent advancements have included finetuning large, pretrained transformer models like Wav2Vec 2.0. However, the finetuning process requires substantial computational resources, including high-memory GPUs and significant processing time. As the demand for accurate emotion recognition continues to grow, efficient finetuning approaches are needed to reduce the computational burden. Our study focuses on dimensional emotion recognition, predicting attributes such as activation (calm to excited) and valence (negative to positive). We present various finetuning techniques, including full finetuning, partial finetuning of transformer layers, finetuning with mixed precision, partial finetuning with caching, and low-rank adaptation (LoRA) on the Wav2Vec 2.0 base model. We find that partial finetuning with mixed precision achieves performance comparable to full finetuning while increasing training speed by 67%. Caching intermediate representations further boosts efficiency, yielding an 88% speedup and a 71% reduction in learnable parameters. We recommend finetuning the final three transformer layers in mixed precision to balance performance and training efficiency, and adding intermediate representation caching for optimal speed with minimal performance trade-offs. These findings lower the barriers to finetuning speech emotion recognition systems, making accurate emotion recognition more accessible to a broader range of researchers and practitioners.

[AI-74] Generative Diffusion Model-based Compression of MIMO CSI

Link: https://arxiv.org/abs/2503.03753
Authors: Heasung Kim,Taekyun Lee,Hyeji Kim,Gustavo De Veciana,Mohamed Amine Arfaoui,Asil Koc,Phil Pietraski,Guodong Zhang,John Kaewell
Subjects: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 6 pages

Abstract:While neural lossy compression techniques have markedly advanced the efficiency of Channel State Information (CSI) compression and reconstruction for feedback in MIMO communications, efficient algorithms for more challenging and practical tasks, such as CSI compression for future channel prediction and reconstruction with relevant side information, remain underexplored, often resulting in suboptimal performance when existing methods are extended to these scenarios. To that end, we propose a novel framework for compression with side information, featuring an encoding process with fixed-rate compression using a trainable codebook for codeword quantization, and a decoding procedure modeled as a backward diffusion process conditioned on both the codeword and the side information. Experimental results show that our method significantly outperforms existing CSI compression algorithms, often yielding over twofold performance improvement by achieving comparable distortion at less than half the data rate of competing methods in certain scenarios. These findings underscore the potential of diffusion-based compression for practical deployment in communication systems.

[AI-75] Interpretable Transformation and Analysis of Timelines through Learning via Surprisability

Link: https://arxiv.org/abs/2503.04502
Authors: Osnat Mokryn,Teddy Lazebnik,Hagit Ben Shoshan
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Information Theory (cs.IT)

Abstract:The analysis of high-dimensional timeline data and the identification of outliers and anomalies is critical across diverse domains, including sensor readings, biological and medical data, historical records, and global statistics. However, conventional analysis techniques often struggle with challenges such as high dimensionality, complex distributions, and sparsity. These limitations hinder the ability to extract meaningful insights from complex temporal datasets, making it difficult to identify trending features, outliers, and anomalies effectively. Inspired by surprisability, a cognitive science concept describing how humans instinctively focus on unexpected deviations, we propose Learning via Surprisability (LvS), a novel approach for transforming high-dimensional timeline data. LvS quantifies and prioritizes anomalies in time-series data by formalizing deviations from expected behavior. LvS bridges cognitive theories of attention with computational methods, enabling the detection of anomalies and shifts in a way that preserves critical context, offering a new lens for interpreting complex datasets. We demonstrate the usefulness of LvS on three high-dimensional timeline use cases: a time series of sensor data, a global dataset of mortality causes over multiple years, and a textual corpus containing over two centuries of State of the Union Addresses by U.S. presidents. Our results show that the LvS transformation enables efficient and interpretable identification of outliers, anomalies, and the most variable features along the timeline.

[AI-76] Passive Heart Rate Monitoring During Smartphone Use in Everyday Life

Link: https://arxiv.org/abs/2503.03783
Authors: Shun Liao,Paolo Di Achille,Jiang Wu,Silviu Borac,Jonathan Wang,Xin Liu,Eric Teasley,Lawrence Cai,Yun Liu,Daniel McDuff,Hao-Wei Su,Brent Winslow,Anupam Pathak,Shwetak Patel,Jameson K. Rogers,Ming-Zher Poh
Subjects: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Resting heart rate (RHR) is an important biomarker of cardiovascular health and mortality, but tracking it longitudinally generally requires a wearable device, limiting its availability. We present PHRM, a deep learning system for passive heart rate (HR) and RHR measurements during everyday smartphone use, using facial video-based photoplethysmography. Our system was developed using 225,773 videos from 495 participants and validated on 185,970 videos from 205 participants in laboratory and free-living conditions, representing the largest validation study of its kind. Compared to reference electrocardiogram, PHRM achieved a mean absolute percentage error (MAPE) < 10% for HR measurements across three skin tone groups of light, medium and dark pigmentation; MAPE for each skin tone group was non-inferior versus the others. Daily RHR measured by PHRM had a mean absolute error < 5 bpm compared to a wearable HR tracker, and was associated with known risk factors. These results highlight the potential of smartphones to enable passive and equitable heart health monitoring.

[AI-77] Multimodal AI predicts clinical outcomes of drug combinations from preclinical data

Link: https://arxiv.org/abs/2503.02781
Authors: Yepeng Huang,Xiaorui Su,Varun Ullanat,Ivy Liang,Lindsay Clegg,Damilola Olabode,Nicholas Ho,Bino John,Megan Gibbs,Marinka Zitnik
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Predicting clinical outcomes from preclinical data is essential for identifying safe and effective drug combinations. Current models rely on structural or target-based features to identify high-efficacy, low-toxicity drug combinations. However, these approaches fail to incorporate the multimodal data necessary for accurate, clinically-relevant predictions. Here, we introduce MADRIGAL, a multimodal AI model that learns from structural, pathway, cell viability, and transcriptomic data to predict drug combination effects across 953 clinical outcomes and 21842 compounds, including combinations of approved drugs and novel compounds in development. MADRIGAL uses a transformer bottleneck module to unify preclinical drug data modalities while handling missing data during training and inference–a major challenge in multimodal learning. It outperforms single-modality methods and state-of-the-art models in predicting adverse drug interactions. MADRIGAL performs virtual screening of anticancer drug combinations and supports polypharmacy management for type II diabetes and metabolic dysfunction-associated steatohepatitis (MASH). It identifies transporter-mediated drug interactions. MADRIGAL predicts resmetirom, the first and only FDA-approved drug for MASH, among therapies with the most favorable safety profile. It supports personalized cancer therapy by integrating genomic profiles from cancer patients. Using primary acute myeloid leukemia samples and patient-derived xenograft models, it predicts the efficacy of personalized drug combinations. Integrating MADRIGAL with a large language model allows users to describe clinical outcomes in natural language, improving safety assessment by identifying potential adverse interactions and toxicity risks. MADRIGAL provides a multimodal approach for designing combination therapies with improved predictive accuracy and clinical relevance.

Machine Learning

[LG-0] Sample-Optimal Agnostic Boosting with Unlabeled Data

Link: https://arxiv.org/abs/2503.04706
Authors: Udaya Ghai,Karan Singh
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Boosting provides a practical and provably effective framework for constructing accurate learning algorithms from inaccurate rules of thumb. It extends the promise of sample-efficient learning to settings where direct Empirical Risk Minimization (ERM) may not be implementable efficiently. In the realizable setting, boosting is known to offer this computational reprieve without compromising on sample efficiency. However, in the agnostic case, existing boosting algorithms fall short of achieving the optimal sample complexity. This paper highlights an unexpected and previously unexplored avenue of improvement: unlabeled samples. We design a computationally efficient agnostic boosting algorithm that matches the sample complexity of ERM, given polynomially many additional unlabeled samples. In fact, we show that the total number of samples needed, unlabeled and labeled inclusive, is never more than that for the best known agnostic boosting algorithm (so this result is never worse), while only a vanishing fraction of these need to be labeled for the algorithm to succeed. This is particularly fortuitous for learning-theoretic applications of agnostic boosting, which often take place in the distribution-specific setting, where unlabeled samples can be availed for free. We detail other applications of this result in reinforcement learning.

[LG-1] Compositional World Knowledge leads to High Utility Synthetic data

Link: https://arxiv.org/abs/2503.04687
Authors: Sachit Gaudi, Gautam Sreekumar, Vishnu Boddeti
Categories: Machine Learning (cs.LG)

Abstract:Machine learning systems struggle with robustness under subpopulation shifts. This problem becomes especially pronounced in scenarios where only a subset of attribute combinations is observed during training, a severe form of subpopulation shift referred to as compositional shift. To address this problem, we ask the following question: Can we improve robustness by training on synthetic data spanning all possible attribute combinations? We first show that training conditional diffusion models on limited data leads to an incorrect underlying distribution. Therefore, synthetic data sampled from such models will be unfaithful and will not improve the performance of downstream machine learning systems. To address this problem, we propose CoInD to reflect the compositional nature of the world by enforcing conditional independence through minimizing Fisher’s divergence between joint and marginal distributions. We demonstrate that synthetic data generated by CoInD is faithful, and this translates to state-of-the-art worst-group accuracy on compositional shift tasks on CelebA.

[LG-2] CLDyB: Towards Dynamic Benchmarking for Continual Learning with Pre-trained Models

Link: https://arxiv.org/abs/2503.04655
Authors: Shengzhuang Chen, Yikai Liao, Xiaoxiao Sun, Kede Ma, Ying Wei
Categories: Machine Learning (cs.LG)

Abstract:The advent of the foundation model era has sparked significant research interest in leveraging pre-trained representations for continual learning (CL), yielding a series of top-performing CL methods on standard evaluation benchmarks. Nonetheless, there are growing concerns regarding potential data contamination during the pre-training stage. Furthermore, standard evaluation benchmarks, which are typically static, fail to capture the complexities of real-world CL scenarios, resulting in saturated performance. To address these issues, we describe CL on dynamic benchmarks (CLDyB), a general computational framework based on Markov decision processes for evaluating CL methods reliably. CLDyB dynamically identifies inherently difficult and algorithm-dependent tasks for the given CL methods, and determines challenging task orders using Monte Carlo tree search. Leveraging CLDyB, we first conduct a joint evaluation of multiple state-of-the-art CL methods, leading to a set of commonly challenging and generalizable task sequences where existing CL methods tend to perform poorly. We then conduct separate evaluations of individual CL methods using CLDyB, discovering their respective strengths and weaknesses. The source code and generated task sequences are publicly accessible at this https URL.

[LG-3] Joint Masked Reconstruction and Contrastive Learning for Mining Interactions Between Proteins

Link: https://arxiv.org/abs/2503.04650
Authors: Jiang Li, Xiaoping Wang
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments: Submitted

Abstract:Protein-protein interaction (PPI) prediction is an instrumental means in elucidating the mechanisms underlying cellular operations, holding significant practical implications for the realms of pharmaceutical development and clinical treatment. Presently, the majority of research methods primarily concentrate on the analysis of amino acid sequences, while investigations predicated on protein structures remain in the nascent stages of exploration. Despite the emergence of several structure-based algorithms in recent years, these are still confronted with inherent challenges: (1) the extraction of intrinsic structural information of proteins typically necessitates the expenditure of substantial computational resources; (2) these models are overly reliant on seen protein data, struggling to effectively unearth interaction cues between unknown proteins. To further propel advancements in this domain, this paper introduces a novel PPI prediction method that joins masked reconstruction and contrastive learning, termed JmcPPI. This methodology dissects the PPI prediction task into two distinct phases: during the residue structure encoding phase, JmcPPI devises two feature reconstruction tasks and employs a graph attention mechanism to capture structural information between residues; during the protein interaction inference phase, JmcPPI perturbs the original PPI graph and employs a multi-graph contrastive learning strategy to thoroughly mine extrinsic interaction information of novel proteins. Extensive experiments conducted on three widely utilized PPI datasets demonstrate that JmcPPI surpasses existing optimal baseline models across various data partition schemes. The associated code can be accessed via this https URL.

[LG-4] No Forgetting Learning: Memory-free Continual Learning ICCV2025

Link: https://arxiv.org/abs/2503.04638
Authors: Mohammad Ali Vahedifar, Qi Zhang
Categories: Machine Learning (cs.LG)
Comments: This paper is submitted to ICCV 2025

Abstract:Continual Learning (CL) remains a central challenge in deep learning, where models must sequentially acquire new knowledge while mitigating Catastrophic Forgetting (CF) of prior tasks. Existing approaches often struggle with efficiency and scalability, requiring extensive memory or model buffers. This work introduces "No Forgetting Learning" (NFL), a memory-free CL framework that leverages knowledge distillation to maintain stability while preserving plasticity. Memory-free means that NFL does not rely on any memory buffer. Through extensive evaluations on three benchmark datasets, we demonstrate that NFL achieves competitive performance while utilizing approximately 14.75 times less memory than state-of-the-art methods. Furthermore, we introduce a new metric to better assess CL’s plasticity-stability trade-off.
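
Editor's sketch: the abstract says NFL relies on knowledge distillation but does not state its loss. As a rough illustration of the general idea only, the following computes a generic temperature-softened distillation loss (KL divergence from teacher to student). The function names and the temperature value are assumptions for illustration, not NFL's actual objective.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions.

    Hypothetical helper: a generic distillation term, not the loss used in the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In a continual-learning setting, the teacher would typically be the model frozen after the previous task and the student the model being trained on the new task, so a small loss means old predictions are preserved.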

[LG-5] Advancing Solutions for the Three-Body Problem Through Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2503.04585
Authors: Manuel Santos Pereira, Luís Tripa, Nélson Lima, Francisco Caldas, Cláudia Soares
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 25 figures, 3 tables. 75th International Astronautical Congress (IAC), Milan, Italy, 14-18 October

Abstract:First formulated by Sir Isaac Newton in his work “Philosophiae Naturalis Principia Mathematica”, the concept of the Three-Body Problem was put forth as a study of the motion of the three celestial bodies within the Earth-Sun-Moon system. In a generalized definition, it seeks to predict the motion for an isolated system composed of three point masses freely interacting under Newton’s law of universal attraction. This proves to be analogous to a multitude of interactions between celestial bodies, and thus, the problem finds applicability within the studies of celestial mechanics. Despite numerous attempts by renowned physicists to solve it throughout the last three centuries, no general closed-form solutions have been reached due to its inherently chaotic nature for most initial conditions. Current state-of-the-art solutions are based on two approaches, either numerical high-precision integration or machine learning-based. Notwithstanding the breakthroughs of neural networks, these present a significant limitation, which is their ignorance of any prior knowledge of the chaotic systems presented. Thus, in this work, we propose a novel method that utilizes Physics-Informed Neural Networks (PINNs). These deep neural networks are able to incorporate any prior system knowledge expressible as an Ordinary Differential Equation (ODE) into their learning processes as a regularizing agent. Our findings showcase that PINNs surpass current state-of-the-art machine learning methods with comparable prediction quality. Despite a better prediction quality, the usability of numerical integrators suffers due to their prohibitively high computational cost. These findings confirm that PINNs are both effective and time-efficient open-form solvers of the Three-Body Problem that capitalize on the extensive knowledge we hold of classical mechanics.
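
Editor's sketch: a PINN adds a penalty that measures how far the network's output is from satisfying the governing ODE. The following is a minimal illustration of such a penalty for three point masses in 2D under Newtonian gravity. It assumes the predicted accelerations are supplied directly (in a real PINN they would come from differentiating the network output twice with respect to time); the function names and the choice G = 1 are illustrative assumptions, not the paper's setup.

```python
def accelerations(positions, masses, G=1.0):
    """Newtonian gravitational acceleration on each point mass (2D)."""
    acc = []
    for i, (xi, yi) in enumerate(positions):
        ax = ay = 0.0
        for j, (xj, yj) in enumerate(positions):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            r3 = (dx * dx + dy * dy) ** 1.5  # |r_ij|^3
            ax += G * masses[j] * dx / r3
            ay += G * masses[j] * dy / r3
        acc.append((ax, ay))
    return acc


def physics_residual(positions, predicted_acc, masses, G=1.0):
    """Mean squared gap between predicted accelerations and Newton's law.

    Hypothetical regularizer: it is zero exactly when the prediction obeys the ODE.
    """
    true_acc = accelerations(positions, masses, G)
    sq = 0.0
    for (px, py), (tx, ty) in zip(predicted_acc, true_acc):
        sq += (px - tx) ** 2 + (py - ty) ** 2
    return sq / len(positions)
```

Adding this residual to a data-fitting loss is what lets prior knowledge of classical mechanics act as a regularizing term during training.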

[LG-6] PSDNorm: Test-Time Temporal Normalization for Deep Learning on EEG Signals

Link: https://arxiv.org/abs/2503.04582
Authors: Théo Gnassounou, Antoine Collas, Rémi Flamary, Alexandre Gramfort
Categories: Machine Learning (cs.LG)

Abstract:Distribution shift poses a significant challenge in machine learning, particularly in biomedical applications such as EEG signals collected across different subjects, institutions, and recording devices. While existing normalization layers (BatchNorm, LayerNorm, and InstanceNorm) help address distribution shifts, they fail to capture the temporal dependencies inherent in temporal signals. In this paper, we propose PSDNorm, a layer that leverages Monge mapping and temporal context to normalize feature maps in deep learning models. Notably, the proposed method operates as a test-time domain adaptation technique, addressing distribution shifts without additional training. Evaluations on 10 sleep staging datasets using the U-Time model demonstrate that PSDNorm achieves state-of-the-art performance at test time on datasets not seen during training while being 4x more data-efficient than the best baseline. Additionally, PSDNorm provides a significant improvement in robustness, achieving markedly higher F1 scores for the 20% hardest subjects.

[LG-7] Data-augmented Learning of Geodesic Distances in Irregular Domains through Soner Boundary Conditions

Link: https://arxiv.org/abs/2503.04579
Authors: Rafael I. Cabral Muchacho, Florian T. Pokorny
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Geodesic distances play a fundamental role in robotics, as they efficiently encode global geometric information of the domain. Recent methods use neural networks to approximate geodesic distances by solving the Eikonal equation through physics-informed approaches. While effective, these approaches often suffer from unstable convergence during training in complex environments. We propose a framework to learn geodesic distances in irregular domains by using the Soner boundary condition, and systematically evaluate the impact of data losses on training stability and solution accuracy. Our experiments demonstrate that incorporating data losses significantly improves convergence robustness, reducing training instabilities and sensitivity to initialization. These findings suggest that hybrid data-physics approaches can effectively enhance the reliability of learning-based geodesic distance solvers with sparse data.
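
Editor's sketch: with unit speed, the Eikonal equation requires the distance field u to satisfy |∇u| = 1 away from the source. The following checks that condition numerically with central finite differences for the plain Euclidean distance in an obstacle-free plane. It is only meant to show what a physics (PDE) residual looks like; it does not implement the Soner boundary condition or the data losses studied in the paper, and the function names are illustrative.

```python
import math


def eikonal_residual(u, x, y, h=1e-5):
    """Return | |grad u(x, y)| - 1 | using central finite differences."""
    ux = (u(x + h, y) - u(x - h, y)) / (2 * h)
    uy = (u(x, y + h) - u(x, y - h)) / (2 * h)
    return abs(math.hypot(ux, uy) - 1.0)


def euclidean_distance(x, y):
    """Distance to the origin: the exact Eikonal solution in free space."""
    return math.hypot(x, y)
```

A physics-informed loss would average this residual over sampled points; a data loss would additionally penalize the gap between u and known distance values at a sparse set of points.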

[LG-8] Meta Learning not to Learn: Robustly Informing Meta-Learning under Nuisance-Varying Families

Link: https://arxiv.org/abs/2503.04570
Authors: Louis McConnell
Categories: Machine Learning (cs.LG)

Abstract:In settings where both spurious and causal predictors are available, standard neural networks trained under the objective of empirical risk minimization (ERM) with no additional inductive biases tend to have a dependence on a spurious feature. As a result, it is necessary to integrate additional inductive biases in order to guide the network toward generalizable hypotheses. Often these spurious features are shared across related tasks, such as estimating disease prognoses from image scans coming from different hospitals, making the challenge of generalization more difficult. In these settings, it is important that methods are able to integrate the proper inductive biases to generalize across both nuisance-varying families as well as task families. Motivated by this setting, we present RIME (Robustly Informed Meta lEarning), a new method for meta learning under the presence of both positive and negative inductive biases (what to learn and what not to learn). We first develop a theoretical causal framework showing why existing approaches at knowledge integration can lead to worse performance on distributionally robust objectives. We then show that RIME is able to simultaneously integrate both biases, reaching state of the art performance under distributionally robust objectives in informed meta-learning settings under nuisance-varying families.

[LG-9] Federated Dynamic Modeling and Learning for Spatiotemporal Data Forecasting

Link: https://arxiv.org/abs/2503.04528
Authors: Thien Pham, Angelo Furno, Faïcel Chamroukhi, Latifa Oukhellou
Categories: Machine Learning (cs.LG)

Abstract:This paper presents an advanced Federated Learning (FL) framework for forecasting complex spatiotemporal data, improving upon recent state-of-the-art models. In the proposed approach, the original Gated Recurrent Unit (GRU) module within previous Dynamic Spatial–Temporal Graph Convolutional Recurrent Network (DSTGCRN) modeling is first replaced with a Long Short-Term Memory (LSTM) network, enabling the resulting model to more effectively capture long-term dependencies inherent to time series data. The resulting architecture significantly improves the model’s capacity to handle complex temporal patterns in diverse forecasting applications. Furthermore, the proposed FL framework integrates a novel Client-Side Validation (CSV) mechanism, introducing a critical validation step at the client level before incorporating aggregated parameters from the central server into local models. This ensures that only the most effective updates are adopted, improving both the robustness and accuracy of the forecasting model across clients. The efficiency of our approach is demonstrated through extensive experiments on real-world applications, including public datasets for multimodal transport demand forecasting and private datasets for Origin-Destination (OD) matrix forecasting in urban areas. The results demonstrate substantial improvements over conventional methods, highlighting the framework’s ability to capture complex spatiotemporal dependencies while preserving data privacy. This work not only provides a scalable and privacy-preserving solution for real-time, region-specific forecasting and management but also underscores the potential of leveraging distributed data sources in a FL context. We provide our algorithms as open-source on GitHub.
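
Editor's sketch: the abstract describes Client-Side Validation as a check performed on each client before it adopts the server's aggregated parameters. One plausible reading is sketched here with a toy linear model. The acceptance rule (keep the aggregate only if it does not increase local validation loss) and all names are assumptions for illustration; the paper's actual criterion may differ.

```python
def mse(weights, data):
    """Mean squared error of the linear model y = w0 + w1 * x."""
    w0, w1 = weights
    return sum((w0 + w1 * x - y) ** 2 for x, y in data) / len(data)


def client_side_validation(local_weights, aggregated_weights, val_data):
    """Adopt the server's weights only if they do not hurt local validation loss.

    Hypothetical acceptance rule; the abstract does not specify the exact test.
    """
    local_loss = mse(local_weights, val_data)
    aggregated_loss = mse(aggregated_weights, val_data)
    if aggregated_loss <= local_loss:
        return aggregated_weights
    return local_weights
```

The design intent is that a client whose local data differ strongly from the global average is not forced to accept an update that degrades its own forecasts.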

[LG-10] Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges ICLR’25

Link: https://arxiv.org/abs/2503.04474
Authors: Francisco Eiras, Eliott Zemour, Eric Lin, Vaikkunth Mugunthan
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Accepted to the ICBINB Workshop at ICLR’25

Abstract:Large Language Model (LLM) based judges form the underpinnings of key safety evaluation processes such as offline benchmarking, automated red-teaming, and online guardrailing. This widespread requirement raises the crucial question: can we trust the evaluations of these evaluators? In this paper, we highlight two critical challenges that are typically overlooked: (i) evaluations in the wild where factors like prompt sensitivity and distribution shifts can affect performance and (ii) adversarial attacks that target the judge. We highlight the importance of these through a study of commonly used safety judges, showing that small changes such as the style of the model output can lead to jumps of up to 0.24 in the false negative rate on the same dataset, whereas adversarial attacks on the model generation can fool some judges into misclassifying 100% of harmful generations as safe ones. These findings reveal gaps in commonly used meta-evaluation benchmarks and weaknesses in the robustness of current LLM judges, indicating that low attack success under certain judges could create a false sense of security.
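
Editor's sketch: the headline numbers in this abstract are false negative rates, that is, the share of truly harmful generations that a judge labels safe. The following shows how that quantity is computed so that a change of 0.24 can be read concretely. The labels below are made up for illustration and are not data from the paper.

```python
def false_negative_rate(is_harmful, judged_harmful):
    """Fraction of truly harmful generations that the judge labels safe."""
    harmful_total = sum(1 for h in is_harmful if h)
    if harmful_total == 0:
        raise ValueError("no harmful examples to evaluate")
    missed = sum(1 for h, j in zip(is_harmful, judged_harmful) if h and not j)
    return missed / harmful_total


# Illustrative labels only: four harmful generations and one benign one.
truth = [True, True, True, True, False]
judge_original_style = [True, True, True, False, False]
judge_restyled_output = [True, False, False, False, False]
```

Comparing `false_negative_rate(truth, judge_original_style)` with `false_negative_rate(truth, judge_restyled_output)` mirrors the paper's style-sensitivity comparison on a fixed dataset.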

[LG-11] PALo: Learning Posture-Aware Locomotion for Quadruped Robots

Link: https://arxiv.org/abs/2503.04462
Authors: Xiangyu Miao, Jun Sun, Hang Lai, Xinpeng Di, Jiahang Cao, Yong Yu, Weinan Zhang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:With the rapid development of embodied intelligence, locomotion control of quadruped robots on complex terrains has become a research hotspot. Unlike traditional locomotion control approaches focusing solely on velocity tracking, we aim to balance the agility and robustness of quadruped robots on diverse and complex terrains. To this end, we propose an end-to-end deep reinforcement learning framework for posture-aware locomotion named PALo, which handles simultaneous linear and angular velocity tracking and real-time adjustments of body height, pitch, and roll angles. In PALo, the locomotion control problem is formulated as a partially observable Markov decision process, and an asymmetric actor-critic architecture is adopted to overcome the sim-to-real challenge. Further, by incorporating customized training curricula, PALo achieves agile posture-aware locomotion control in simulated environments and successfully transfers to real-world settings without fine-tuning, allowing real-time control of the quadruped robot’s locomotion and body posture across challenging terrains. Through in-depth experimental analysis, we identify the key components of PALo that contribute to its performance, further validating the effectiveness of the proposed method. The results of this study provide new possibilities for the low-level locomotion control of quadruped robots in higher dimensional command spaces and lay the foundation for future research on upper-level modules for embodied intelligence.

[LG-12] FORTALESA: Fault-Tolerant Reconfigurable Systolic Array for DNN Inference

Link: https://arxiv.org/abs/2503.04426
Authors: Natalia Cherezova, Artur Jutman, Maksim Jenihhin
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 11 pages, 15 figures

Abstract:The emergence of Deep Neural Networks (DNNs) in mission- and safety-critical applications brings their reliability to the forefront. High performance demands of DNNs require the use of specialized hardware accelerators. Systolic array architecture is widely used in DNN accelerators due to its parallelism and regular structure. This work presents a run-time reconfigurable systolic array architecture with three execution modes and four implementation options. All four implementations are evaluated in terms of resource utilization, throughput, and fault tolerance improvement. The proposed architecture is used for reliability enhancement of DNN inference on systolic arrays through heterogeneous mapping of different network layers to different execution modes. The approach is supported by a novel reliability assessment method based on fault propagation analysis. It is used for the exploration of the appropriate execution mode-layer mapping for DNN inference. The proposed architecture efficiently protects registers and MAC units of systolic array PEs from transient and permanent faults. The reconfigurability feature enables a speedup of up to 3×, depending on layer vulnerability. Furthermore, it requires 6× fewer resources compared to static redundancy and 2.5× fewer resources compared to the previously proposed solution for transient faults.

[LG-13] AOLO: Analysis and Optimization For Low-Carbon Oriented Wireless Large Language Model Services

Link: https://arxiv.org/abs/2503.04418
Authors: Xiaoqi Wang, Hongyang Du, Yuehong Gao, Dong In Kim
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:Recent advancements in large language models (LLMs) have led to their widespread adoption and large-scale deployment across various domains. However, their environmental impact, particularly during inference, has become a growing concern due to their substantial energy consumption and carbon footprint. Existing research has focused on inference computation alone, overlooking the analysis and optimization of carbon footprint in network-aided LLM service systems. To address this gap, we propose AOLO, a framework for analysis and optimization for low-carbon oriented wireless LLM services. AOLO introduces a comprehensive carbon footprint model that quantifies greenhouse gas emissions across the entire LLM service chain, including computational inference and wireless communication. Furthermore, we formulate an optimization problem aimed at minimizing the overall carbon footprint, which is solved through joint optimization of inference outputs and transmit power under quality-of-experience and system performance constraints. To achieve this joint optimization, we leverage the energy efficiency of spiking neural networks (SNNs) by adopting SNN as the actor network and propose a low-carbon-oriented optimization algorithm, i.e., SNN-based deep reinforcement learning (SDRL). Comprehensive simulations demonstrate that SDRL algorithm significantly reduces overall carbon footprint, achieving an 18.77% reduction compared to the benchmark soft actor-critic, highlighting its potential for enabling more sustainable LLM inference services.

[LG-14] Temporal Analysis of NetFlow Datasets for Network Intrusion Detection Systems

Link: https://arxiv.org/abs/2503.04404
Authors: Majed Luay, Siamak Layeghy, Seyedehfaezeh Hosseininoorbin, Mohanad Sarhan, Nour Moustafa, Marius Portmann
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)

Abstract:This paper investigates the temporal analysis of NetFlow datasets for machine learning (ML)-based network intrusion detection systems (NIDS). Although many previous studies have highlighted the critical role of temporal features, such as inter-packet arrival time and flow length/duration, in NIDS, the currently available NetFlow datasets for NIDS lack these temporal features. This study addresses this gap by creating and making publicly available a set of NetFlow datasets that incorporate these temporal features [1]. With these temporal features, we provide a comprehensive temporal analysis of NetFlow datasets by examining the distribution of various features over time and presenting time-series representations of NetFlow features. This temporal analysis has not been previously provided in the existing literature. We also borrowed an idea from signal processing, time frequency analysis, and tested it to see how different the time frequency signal presentations (TFSPs) are for various attacks. The results indicate that many attacks have unique patterns, which could help ML models to identify them more easily.
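
Editor's sketch: the temporal features named in the abstract (inter-packet arrival time and flow duration) can be derived from per-packet timestamps. The following is a minimal illustration for a single flow. The input format (a list of timestamps in seconds) and the output field names are assumptions, not the schema of the released datasets.

```python
def temporal_features(packet_times):
    """Flow duration and inter-packet arrival times from packet timestamps (seconds)."""
    times = sorted(packet_times)
    # Gap between each packet and the one before it.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    duration = times[-1] - times[0] if times else 0.0
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "duration": duration,
        "inter_arrival": gaps,
        "mean_inter_arrival": mean_gap,
    }
```

Flows with a single packet have no gaps, so the sketch reports a duration and mean inter-arrival time of zero for them.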

[LG-15] How can representation dimension dominate structurally pruned LLMs? ICLR2025

Link: https://arxiv.org/abs/2503.04377
Authors: Mingxue Xu, Lisa Alazraki, Danilo P. Mandic
Categories: Machine Learning (cs.LG)
Comments: ICLR 2025 Workshop on Sparsity in LLMs (SLLM)

Abstract:Pruning assumes a subnetwork exists in the original deep neural network, which can achieve comparative model performance with less computation than the original. However, it is unclear how the model performance varies with the different subnetwork extractions. In this paper, we choose the representation dimension (or embedding dimension, model dimension, the dimension of the residual stream in the relevant literature) as the entry point to this issue. We investigate the linear transformations in the LLM transformer blocks and consider a specific structured pruning approach, SliceGPT, to extract the subnetworks of different representation dimensions. We mechanistically analyse the activation flow during the model forward passes, and find the representation dimension dominates the linear transformations, model predictions, and, finally, the model performance. Explicit analytical relations are given to calculate the pruned model performance (perplexity and accuracy) without actual evaluation, and are empirically validated with Llama-3-8B-Instruct and Phi-3-mini-4k-Instruct.

[LG-16] FILM: Framework for Imbalanced Learning Machines based on a new unbiased performance measure and a new ensemble-based technique

Link: https://arxiv.org/abs/2503.04370
Authors: Antonio Guillén-Teruel(1),Marcos Caracena(1),Jose A. Pardo(1),Fernando de-la-Gándara(1),José Palma(1),Juan A. Botía(1,2) ((1) Departamento de Ingeniería de la Información y Las Comunicaciones, Universidad de Murcia, Murcia, 30100, Murcia, Spain, (2) Department of Neurodegenerative Disease, Institute of Neurology, University College London, London, WC1N 3BG, UK.)
Categories: Machine Learning (cs.LG)

Abstract:This research addresses the challenges of handling unbalanced datasets for binary classification tasks. In such scenarios, standard evaluation metrics are often biased by the disproportionate representation of the minority class. Conducting experiments across seven datasets, we uncovered inconsistencies in evaluation metrics when determining the model that outperforms others for each binary classification problem. This justifies the need for a metric that provides a more consistent and unbiased evaluation across unbalanced datasets, thereby supporting robust model selection. To mitigate this problem, we propose a novel metric, the Unbiased Integration Coefficients (UIC), which exhibits significantly reduced bias (p < 10^-4) towards the minority class compared to conventional metrics. The UIC is constructed by aggregating existing metrics while penalising those more prone to imbalance. In addition, we introduce the Identical Partitions for Imbalance Problems (IPIP) algorithm for imbalanced ML problems, an ensemble-based approach. Our experimental results show that IPIP outperforms other baseline imbalance-aware approaches using Random Forest and Logistic Regression models in three out of seven datasets as assessed by the UIC metric, demonstrating its effectiveness in addressing imbalanced data challenges in binary classification tasks. This new framework for dealing with imbalanced datasets is materialized in the FILM (Framework for Imbalanced Learning Machines) R Package, accessible at this https URL.
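
Editor's sketch: to make the claim about biased metrics concrete, the following compares plain accuracy with balanced accuracy (mean per-class recall) for a classifier that always predicts the majority class. This is a standard textbook contrast and not the UIC metric, whose formula the abstract does not give; the 95/5 class split is made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels 0 and 1.

    Assumes both classes appear at least once in y_true.
    """
    recalls = []
    for cls in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / 2


# Illustrative data: 95 majority-class samples, 5 minority-class samples,
# and a degenerate classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
```

Here accuracy looks strong even though no minority sample is ever detected, while balanced accuracy drops to chance level, which is the kind of inconsistency that motivates imbalance-aware metrics.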

[LG-17] EDCA – An Evolutionary Data-Centric AutoML Framework for Efficient Pipelines

Link: https://arxiv.org/abs/2503.04350
Authors: Joana Simões, João Correia
Categories: Machine Learning (cs.LG)

Abstract:Automated Machine Learning (AutoML) gained popularity due to the increased demand for Machine Learning (ML) specialists, allowing them to apply ML techniques effortlessly and quickly. AutoML implementations use optimisation methods to identify the most effective ML solution for a given dataset, aiming to improve one or more predefined metrics. However, most implementations focus on model selection and hyperparameter tuning. Despite being an important factor in obtaining high-performance ML systems, data quality is usually an overlooked part of AutoML and continues to be a manual and time-consuming task. This work presents EDCA, an Evolutionary Data Centric AutoML framework. In addition to the traditional tasks such as selecting the best models and hyperparameters, EDCA enhances the given data by optimising data processing tasks such as data reduction and cleaning according to the problems’ needs. All these steps create an ML pipeline that is optimised by an evolutionary algorithm. To assess its effectiveness, EDCA was compared to FLAML and TPOT, two frameworks at the top of the AutoML benchmarks. The frameworks were evaluated in the same conditions using datasets from AMLB classification benchmarks. EDCA achieved statistically similar results in performance to FLAML and TPOT but used significantly less data to train the final solutions. Moreover, EDCA experimental results reveal that good performance can be achieved using less data and efficient ML algorithms, aspects that align with Green AutoML guidelines.

[LG-18] Large Language Models for Zero-shot Inference of Causal Structures in Biology ICLR2025

Link: https://arxiv.org/abs/2503.04347
Authors: Izzy Newsham, Luka Kovačević, Richard Moulange, Nan Rosemary Ke, Sach Mukherjee
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: ICLR 2025 Workshop on Machine Learning for Genomics Explorations

Abstract:Genes, proteins and other biological entities influence one another via causal molecular networks. Causal relationships in such networks are mediated by complex and diverse mechanisms, through latent variables, and are often specific to cellular context. It remains challenging to characterise such networks in practice. Here, we present a novel framework to evaluate large language models (LLMs) for zero-shot inference of causal relationships in biology. In particular, we systematically evaluate causal claims obtained from an LLM using real-world interventional data. This is done over one hundred variables and thousands of causal hypotheses. Furthermore, we consider several prompting and retrieval-augmentation strategies, including large, and potentially conflicting, collections of scientific articles. Our results show that with tailored augmentation and prompting, even relatively small LLMs can capture meaningful aspects of causal structure in biological systems. This supports the notion that LLMs could act as orchestration tools in biological discovery, by helping to distil current knowledge in ways amenable to downstream analysis. Our approach to assessing LLMs with respect to experimental data is relevant for a broad range of problems at the intersection of causal learning, LLMs and scientific discovery.

[LG-19] The Challenge of Identifying the Origin of Black-Box Large Language Models

Link: https://arxiv.org/abs/2503.04332
Authors: Ziqing Yang, Yixin Wu, Yun Shen, Wei Dai, Michael Backes, Yang Zhang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The tremendous commercial potential of large language models (LLMs) has heightened concerns about their unauthorized use. Third parties can customize LLMs through fine-tuning and offer only black-box API access, effectively concealing unauthorized usage and complicating external auditing processes. This practice not only exacerbates unfair competition, but also violates licensing agreements. In response, identifying the origin of black-box LLMs is an intrinsic solution to this issue. In this paper, we first reveal the limitations of state-of-the-art passive and proactive identification methods with experiments on 30 LLMs and two real-world black-box APIs. Then, we propose the proactive technique, PlugAE, which optimizes adversarial token embeddings in a continuous space and proactively plugs them into the LLM for tracing and identification. The experiments show that PlugAE can achieve substantial improvement in identifying fine-tuned derivatives. We further advocate for legal frameworks and regulations to better address the challenges posed by the unauthorized use of LLMs.

[LG-20] InFL-UX: A Toolkit for Web-Based Interactive Federated Learning

Link: https://arxiv.org/abs/2503.04318
Authors: Tim Maurer, Abdulrahman Mohamed Selim, Hasan Md Tusfiqur Alam, Matthias Eiletz, Michael Barz, Daniel Sonntag
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)

Abstract:This paper presents InFL-UX, an interactive, proof-of-concept browser-based Federated Learning (FL) toolkit designed to integrate user contributions seamlessly into the machine learning (ML) workflow. InFL-UX enables users across multiple devices to upload datasets, define classes, and collaboratively train classification models directly in the browser using modern web technologies. Unlike traditional FL toolkits, which often focus on backend simulations, InFL-UX provides a simple user interface for researchers to explore how users interact with and contribute to FL systems in real-world, interactive settings. By prioritising usability and decentralised model training, InFL-UX bridges the gap between FL and Interactive Machine Learning (IML), empowering non-technical users to actively participate in ML classification tasks.

[LG-21] A General Framework for Scalable UE-AP Association in User-Centric Cell-Free Massive MIMO based on Recurrent Neural Networks

Link: https://arxiv.org/abs/2503.04278
Authors: Giovanni Di Gennaro, Amedeo Buonanno, Gianmarco Romano, Stefano Buzzi, Francesco A. N. Palmieri
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: submitted to IEEE journal

Abstract:This study addresses the challenge of access point (AP) and user equipment (UE) association in cell-free massive MIMO networks. It introduces a deep learning algorithm leveraging Bidirectional Long Short-Term Memory cells and a hybrid probabilistic methodology for weight updating. This approach enhances scalability by adapting to variations in the number of UEs without requiring retraining. Additionally, the study presents a training methodology that improves scalability not only with respect to the number of UEs but also to the number of APs. Furthermore, a variant of the proposed AP-UE algorithm ensures robustness against pilot contamination effects, a critical issue arising from pilot reuse in channel estimation. Extensive numerical results validate the effectiveness and adaptability of the proposed methods, demonstrating their superiority over widely used heuristic alternatives.

[LG-22] Frequency Hopping Synchronization by Reinforcement Learning for Satellite Communication System

Link: https://arxiv.org/abs/2503.04266
Authors: Inkyu Kim, Sangkeum Lee, Haechan Jeong, Sarvar Hussain Nengroo, Dongsoo Har
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 5 figures

Abstract:Satellite communication systems (SCSs) used for tactical purposes require robust security and anti-jamming capabilities, making frequency hopping (FH) a powerful option. However, current FH systems face challenges due to significant interference from other devices and the considerable path loss inherent in satellite communication. The resulting misalignment makes synchronization, which is crucial for maintaining reliable communication, inefficient. Traditional methods, such as those employing long short-term memory (LSTM) networks, have made improvements, but they still struggle in the dynamic conditions of satellite environments. This paper presents a novel method for synchronizing FH signals in tactical SCSs by combining serial search and reinforcement learning to achieve coarse and fine acquisition, respectively. The mathematical analysis and simulation results demonstrate that the proposed method reduces the average number of hops required for synchronization by 58.17% and the mean squared error (MSE) of the uplink hop timing estimation by 76.95%, as compared to the conventional serial search method. Compared with the early-late gate synchronization method based on serial search and an LSTM network, the average number of hops for synchronization is reduced by 12.24% and the MSE by 18.5%.

[LG-23] Bi-Lipschitz Ansatz for Anti-Symmetric Functions

Link: https://arxiv.org/abs/2503.04263
Authors: Nadav Dym, Jianfeng Lu, Matan Mizrachi
Subjects: Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA); Quantum Physics (quant-ph)

Abstract:Motivated by applications in simulating quantum many-body functions, we propose a new universal ansatz for approximating anti-symmetric functions. The main advantage of this ansatz over previous alternatives is that it is bi-Lipschitz with respect to a naturally defined metric. As a result, we are able to obtain quantitative results for the approximation of Lipschitz continuous anti-symmetric functions. Moreover, we provide preliminary experimental evidence of the improved performance of this ansatz for learning anti-symmetric functions.

[LG-24] RCRank: Multimodal Ranking of Root Causes of Slow Queries in Cloud Database Systems VLDB2025

Link: https://arxiv.org/abs/2503.04252
Authors: Biao Ouyang, Yingying Zhang, Hanyin Cheng, Yang Shu, Chenjuan Guo, Bin Yang, Qingsong Wen, Lunting Fan, Christian S. Jensen
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments: Accepted by VLDB 2025

Abstract:With the continued migration of storage to cloud database systems, the impact of slow queries in such systems on services and user experience is increasing. Root-cause diagnosis plays an indispensable role in facilitating slow-query detection and revision. This paper proposes a method capable of both identifying possible root cause types for slow queries and ranking these according to their potential for accelerating slow queries. This enables prioritizing root causes with the highest impact, in turn improving slow-query revision effectiveness. To enable more accurate and detailed diagnoses, we propose the multimodal Ranking for the Root Causes of slow queries (RCRank) framework, which formulates root cause analysis as a multimodal machine learning problem and leverages multimodal information from query statements, execution plans, execution logs, and key performance indicators. To obtain expressive embeddings from its heterogeneous multimodal input, RCRank integrates self-supervised pre-training that enhances cross-modal alignment and task relevance. Next, the framework integrates root-cause-adaptive cross Transformers that enable adaptive fusion of multimodal features with varying characteristics. Finally, the framework offers a unified model that features an impact-aware training objective for identifying and ranking root causes. We report on experiments on real and synthetic datasets, finding that RCRank is capable of consistently outperforming the state-of-the-art methods at root cause identification and ranking according to a range of metrics.

[LG-25] Incorporating Surrogate Gradient Norm to Improve Offline Optimization Techniques

Link: https://arxiv.org/abs/2503.04242
Authors: Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
Subjects: Machine Learning (cs.LG)

Abstract:Offline optimization has recently emerged as an increasingly popular approach to mitigate the prohibitively expensive cost of online experimentation. The key idea is to learn a surrogate of the black-box function that underlies the target experiment using a static (offline) dataset of its previous input-output queries. Such an approach is, however, fraught with an out-of-distribution issue where the learned surrogate becomes inaccurate outside the offline data regimes. To mitigate this, existing offline optimizers have proposed numerous conditioning techniques to prevent the learned surrogate from being too erratic. Nonetheless, such conditioning strategies are often specific to particular surrogate or search models, which might not generalize to a different model choice. This motivates us to develop a model-agnostic approach instead, which incorporates a notion of model sharpness into the training loss of the surrogate as a regularizer. Our approach is supported by a new theoretical analysis demonstrating that reducing surrogate sharpness on the offline dataset provably reduces its generalized sharpness on unseen data. Our analysis extends existing theories from bounding generalized prediction loss (on unseen data) with loss sharpness to bounding the worst-case generalized surrogate sharpness with its empirical estimate on training data, providing a new perspective on sharpness regularization. Our extensive experimentation on a diverse range of optimization tasks also shows that reducing surrogate sharpness often leads to significant improvement, with a performance boost of up to 9.6%. Our code is publicly available at this https URL

[LG-26] ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions

Link: https://arxiv.org/abs/2503.04241
Authors: Julian Aron Prenner, Romain Robbes
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)

Abstract:Modern Large Language Models (LLMs) have shown astounding capabilities of code understanding and synthesis. In order to assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code synthesis from natural language instructions. Hence, such benchmarks do not test for other forms of code understanding. Moreover, there have been concerns about contamination and leakage. That is, benchmark problems (or closely related problems) may appear in the training set, strongly biasing benchmark results. In this work, we investigate whether large language models can correctly predict runtime program behavior. To this end, we introduce ThrowBench, a benchmark consisting of over 2,400 short user-written programs written in four different programming languages. The majority of these programs throw an exception during runtime (due to a bug). LLMs are asked to predict whether a presented program throws an exception and, if so, which one. Evaluating our benchmark on six state-of-the-art code LLMs, we see modest performance ranging from 19 to 38% (F1 score). Benchmarking a wider set of code capabilities could improve the assessment of code LLMs and help identify weak points in current models. Moreover, as ground-truth answers have been determined through program execution, leakage is not a concern. We release ThrowBench as well as all of our results together with this work.

[LG-27] Geometric Re-Analysis of Classical MDP Solving Algorithms

Link: https://arxiv.org/abs/2503.04203
Authors: Arsenii Mustafin, Aleksei Pakharev, Alex Olshevsky, Ioannis Ch. Paschalidis
Subjects: Machine Learning (cs.LG)

Abstract:We build on a recently introduced geometric interpretation of Markov Decision Processes (MDPs) to analyze classical MDP-solving algorithms: Value Iteration (VI) and Policy Iteration (PI). First, we develop a geometry-based analytical apparatus, including a transformation that modifies the discount factor $\gamma$, to improve convergence guarantees for these algorithms in several settings. In particular, one of our results identifies a rotation component in the VI method, and as a consequence shows that when a Markov Reward Process (MRP) induced by the optimal policy is irreducible and aperiodic, the asymptotic convergence rate of value iteration is strictly smaller than $\gamma$.
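
For orientation, here is a minimal sketch of textbook Value Iteration on a tiny, made-up two-state MDP. The transition table `P`, the rewards `R`, and the function names are illustrative assumptions, not taken from the paper. The sketch only shows the standard Bellman update, whose worst-case contraction factor is the discount factor `gamma`; this is the baseline rate that the abstract's sharper asymptotic result is compared against.

```python
# Textbook Value Iteration on a hypothetical 2-state, 2-action MDP.
# P[s][a] is a list of (probability, next_state); R[s][a] is the reward.
P = {
    0: {0: [(0.9, 0), (0.1, 1)], 1: [(0.2, 0), (0.8, 1)]},
    1: {0: [(0.5, 0), (0.5, 1)], 1: [(1.0, 1)]},
}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: 0.5}}


def bellman_backup(v, gamma):
    """Apply the Bellman optimality operator once to the value vector v."""
    return [
        max(R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]) for a in P[s])
        for s in sorted(P)
    ]


def value_iteration(gamma=0.9, tol=1e-6, max_iters=10_000):
    """Iterate until the sup-norm change drops below tol.

    Returns the final value vector and the list of per-iteration changes.
    """
    v = [0.0] * len(P)
    deltas = []
    for _ in range(max_iters):
        new_v = bellman_backup(v, gamma)
        deltas.append(max(abs(x - y) for x, y in zip(new_v, v)))
        v = new_v
        if deltas[-1] < tol:
            break
    return v, deltas


values, deltas = value_iteration()
# Successive changes shrink by a factor of at most gamma (up to rounding).
ratios = [b / a for a, b in zip(deltas, deltas[1:]) if a > 0]
print(values, max(ratios))
```

The ratio of successive sup-norm changes is what the classical contraction argument bounds by `gamma`; the paper's geometric analysis concerns when the asymptotic rate is strictly better than that bound.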

[LG-28] Computational Intractability of Strategizing against Online Learners

Link: https://arxiv.org/abs/2503.04202
Authors: Angelos Assos, Yuval Dagan, Nived Rajaraman
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 32 pages

Abstract:Online learning algorithms are widely used in strategic multi-agent settings, including repeated auctions, contract design, and pricing competitions, where agents adapt their strategies over time. A key question in such environments is how an optimizing agent can best respond to a learning agent to improve its own long-term outcomes. While prior work has developed efficient algorithms for the optimizer in special cases, such as structured auction settings or contract design, no general efficient algorithm is known. In this paper, we establish a strong computational hardness result: unless $\mathsf{P} = \mathsf{NP}$, no polynomial-time optimizer can compute a near-optimal strategy against a learner using a standard no-regret algorithm, specifically Multiplicative Weights Update (MWU). Our result proves an $\Omega(T)$ hardness bound, significantly strengthening previous work that only showed an additive $\Theta(1)$ impossibility result. Furthermore, while the prior hardness result focused on learners using fictitious play, an algorithm that is not no-regret, we prove intractability for a widely used no-regret learning algorithm. This establishes a fundamental computational barrier to finding optimal strategies in general game-theoretic settings.
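
The learner in this hardness result runs Multiplicative Weights Update. As a reminder of what that learner does, below is a minimal sketch of the standard exponential-weights (Hedge) form of MWU. The function name `mwu`, the learning rate `eta`, and the example loss sequence are illustrative assumptions; the paper's exact variant and parameters may differ.

```python
import math


def mwu(loss_rounds, eta=0.1):
    """Multiplicative Weights Update (Hedge form) over a fixed action set.

    loss_rounds: sequence of per-round loss vectors with entries in [0, 1].
    Returns the mixed strategy played in each round (chosen before that
    round's losses are revealed).
    """
    loss_rounds = list(loss_rounds)
    n = len(loss_rounds[0])
    weights = [1.0] * n
    played = []
    for losses in loss_rounds:
        total = sum(weights)
        played.append([w / total for w in weights])
        # Exponentially down-weight each action by the loss it just incurred.
        weights = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    return played


# Hypothetical example: action 0 never incurs loss, action 1 always does.
strategies = mwu([[0.0, 1.0]] * 100)
print(strategies[0], strategies[-1])
```

The learner starts uniform and shifts probability toward the action with lower cumulative loss. The optimizer studied in the paper is the opponent who chooses the loss sequence strategically against such a learner.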

[LG-29] Boosting Offline Optimizers with Surrogate Sensitivity

Link: https://arxiv.org/abs/2503.04181
Authors: Manh Cuong Dao, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang
Subjects: Machine Learning (cs.LG)

Abstract:Offline optimization is an important task in numerous material engineering domains where online experimentation to collect data is too expensive and needs to be replaced by an in silico maximization of a surrogate of the black-box function. Although such a surrogate can be learned from offline data, its prediction might not be reliable outside the offline data regime, which happens when the surrogate has a narrow prediction margin and is (therefore) sensitive to small perturbations of its parameterization. This raises the following questions: (1) how to regulate the sensitivity of a surrogate model; and (2) whether conditioning an offline optimizer with such a less sensitive surrogate will lead to better optimization performance. To address these questions, we develop an optimizable sensitivity measurement for the surrogate model, which then inspires a sensitivity-informed regularizer that is applicable to a wide range of offline optimizers. This development is both orthogonal and synergistic to prior research on offline optimization, which is demonstrated in our extensive experiment benchmark.

[LG-30] Unsupervised anomaly detection on cybersecurity data streams: a case with BETH dataset

Link: https://arxiv.org/abs/2503.04178
Authors: Evgeniy Eremin
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:In the modern world, the importance of cybersecurity for various systems increases from year to year. The number of information security events generated by information security tools grows with the development of IT infrastructure. At the same time, the cyber threat landscape does not remain constant, and monitoring should take into account both already known attack indicators and those for which no signature rules yet exist in information security products of various classes. Detecting anomalies in large cybersecurity data streams is a complex task that, if properly addressed, can allow for timely response to atypical and previously unknown cyber threats. The use of offline algorithms may be limited for a number of reasons related to training time and retraining frequency. Stream learning algorithms can provide near-real-time data processing for this task. This article examines the results of ten algorithms from three Python stream machine-learning libraries on the BETH dataset of cybersecurity events, which contains information about the creation, cloning, and destruction of operating system processes collected using extended eBPF. The ROC-AUC metric and the total processing time of these algorithms are presented. Several combinations of features and event orderings are considered. In conclusion, the most promising algorithms are noted and possible directions for further research are outlined.

[LG-31] UniNet: A Unified Multi-granular Traffic Modeling Framework for Network Security

Link: https://arxiv.org/abs/2503.04174
Authors: Binghui Wu, Dinil Mon Divakaran, Mohan Gurusamy
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 21 pages, 6 figures, 15 tables

Abstract:As modern networks grow increasingly complex–driven by diverse devices, encrypted protocols, and evolving threats–network traffic analysis has become critically important. Existing machine learning models often rely only on a single representation of packets or flows, limiting their ability to capture the contextual relationships essential for robust analysis. Furthermore, task-specific architectures for supervised, semi-supervised, and unsupervised learning lead to inefficiencies in adapting to varying data formats and security tasks. To address these gaps, we propose UniNet, a unified framework that introduces a novel multi-granular traffic representation (T-Matrix), integrating session, flow, and packet-level features to provide comprehensive contextual information. Combined with T-Attent, a lightweight attention-based model, UniNet efficiently learns latent embeddings for diverse security tasks. Extensive evaluations across four key network security and privacy problems–anomaly detection, attack classification, IoT device identification, and encrypted website fingerprinting–demonstrate UniNet’s significant performance gain over state-of-the-art methods, achieving higher accuracy, lower false positive rates, and improved scalability. By addressing the limitations of single-level models and unifying traffic analysis paradigms, UniNet sets a new benchmark for modern network security.

[LG-32] Ecomap: Sustainability-Driven Optimization of Multi-Tenant DNN Execution on Edge Servers

Link: https://arxiv.org/abs/2503.04148
Authors: Varatheepan Paramanayakam, Andreas Karatzas, Dimitrios Stamoulis, Iraklis Anagnostopoulos
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: 12 pages, 9 figures, 3 tables

Abstract:Edge computing systems struggle to efficiently manage multiple concurrent deep neural network (DNN) workloads while meeting strict latency requirements, minimizing power consumption, and maintaining environmental sustainability. This paper introduces Ecomap, a sustainability-driven framework that dynamically adjusts the maximum power threshold of edge devices based on real-time carbon intensity. Ecomap incorporates the innovative use of mixed-quality models, allowing it to dynamically replace computationally heavy DNNs with lighter alternatives when latency constraints are violated, ensuring service responsiveness with minimal accuracy loss. Additionally, it employs a transformer-based estimator to guide efficient workload mappings. Experimental results using NVIDIA Jetson AGX Xavier demonstrate that Ecomap reduces carbon emissions by an average of 30% and achieves a 25% lower carbon delay product (CDP) compared to state-of-the-art methods, while maintaining comparable or better latency and power efficiency.

[LG-33] Mixed Likelihood Variational Gaussian Processes

Link: https://arxiv.org/abs/2503.04138
Authors: Kaiwen Wu, Craig Sanders, Benjamin Letham, Phillip Guan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 16 pages

Abstract:Gaussian processes (GPs) are powerful models for human-in-the-loop experiments due to their flexibility and well-calibrated uncertainty. However, GPs modeling human responses typically ignore auxiliary information, including a priori domain expertise and non-task performance information like user confidence ratings. We propose mixed likelihood variational GPs to leverage auxiliary information, which combine multiple likelihoods in a single evidence lower bound to model multiple types of data. We demonstrate the benefits of mixing likelihoods in three real-world experiments with human participants. First, we use mixed likelihood training to impose prior knowledge constraints in GP classifiers, which accelerates active learning in a visual perception task where users are asked to identify geometric errors resulting from camera position errors in virtual reality. Second, we show that leveraging Likert scale confidence ratings by mixed likelihood training improves model fitting for haptic perception of surface roughness. Lastly, we show that Likert scale confidence ratings improve human preference learning in robot gait optimization. The modeling performance improvements found using our framework across this diverse set of applications illustrate the benefits of incorporating auxiliary information into active learning and preference learning by using mixed likelihoods to jointly model multiple inputs.

[LG-34] A Comparative Study of Diabetes Prediction Based on Lifestyle Factors Using Machine Learning

Link: https://arxiv.org/abs/2503.04137
Authors: Bruce Nguyen, Yan Zhang
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, submitted to CSCSU 2025

Abstract:Diabetes is a prevalent chronic disease with significant health and economic burdens worldwide. Early prediction and diagnosis can aid in effective management and prevention of complications. This study explores the use of machine learning models to predict diabetes based on lifestyle factors using data from the Behavioral Risk Factor Surveillance System (BRFSS) 2015 survey. The dataset consists of 21 lifestyle and health-related features, capturing aspects such as physical activity, diet, mental health, and socioeconomic status. Three classification models, Decision Tree, K-Nearest Neighbors (KNN), and Logistic Regression, are implemented and evaluated to determine their predictive performance. The models are trained and tested using a balanced dataset, and their performances are assessed based on accuracy, precision, recall, and F1-score. The results indicate that the Decision Tree, KNN, and Logistic Regression achieve an accuracy of 0.74, 0.72, and 0.75, respectively, with varying strengths in precision and recall. The findings highlight the potential of machine learning in diabetes prediction and suggest future improvements through feature selection and ensemble learning techniques.
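
The four reported metrics follow directly from the confusion matrix of a binary classifier. A minimal sketch of their standard definitions is given below; the function name `classification_metrics` and the toy labels are illustrative assumptions and are not the study's data or code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = diabetes, 0 = no diabetes.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))
```

On a balanced dataset such as the one described, accuracy is a reasonable headline number, while precision and recall show whether errors are mostly false alarms or missed cases.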

[LG-35] TimeFound: A Foundation Model for Time Series Forecasting

Link: https://arxiv.org/abs/2503.04118
Authors: Congxi Xiao, Jingbo Zhou, Yixiong Xiao, Xinjiang Lu, Le Zhang, Hui Xiong
Subjects: Machine Learning (cs.LG)

Abstract:We present TimeFound, an encoder-decoder transformer-based time series foundation model for out-of-the-box zero-shot forecasting. To handle time series data from various domains, TimeFound employs a multi-resolution patching strategy to capture complex temporal patterns at multiple scales. We pre-train our model with two sizes (200M and 710M parameters) on a large time-series corpus comprising both real-world and synthetic datasets. Over a collection of unseen datasets across diverse domains and forecasting horizons, our empirical evaluations suggest that TimeFound can achieve superior or competitive zero-shot forecasting performance, compared to state-of-the-art time series foundation models.

[LG-36] PokéChamp: an Expert-level Minimax Language Agent

Link: https://arxiv.org/abs/2503.04094
Authors: Seth Karten, Andy Luu Nguyen, Chi Jin
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 24 pages, 13 figures

Abstract:We introduce PokéChamp, a minimax agent powered by Large Language Models (LLMs) for Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp leverages the generalist capabilities of LLMs to enhance minimax tree search. Specifically, LLMs replace three key modules: (1) player action sampling, (2) opponent modeling, and (3) value function estimation, enabling the agent to effectively utilize gameplay history and human knowledge to reduce the search space and address partial observability. Notably, our framework requires no additional LLM training. We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot, demonstrating its superior performance. Even with an open-source 8-billion-parameter Llama 3.1 model, PokéChamp consistently outperforms the previous best LLM-based bot, Pokéllmon powered by GPT-4o, with a 64% win rate. PokéChamp attains a projected Elo of 1300-1500 on the Pokémon Showdown online ladder, placing it among the top 30% to 10% of human players. In addition, this work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. We hope this work fosters further research that leverages Pokémon battles as a benchmark to integrate LLM technologies with game-theoretic algorithms addressing general multiagent problems. Videos, code, and dataset available at this https URL.
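
To make the "pluggable modules" idea concrete, here is a minimal sketch of depth-limited minimax in which action proposal and value estimation are passed in as callables. In the paper those roles are filled by LLM calls; here they are plain Python functions on a toy take-away game (remove 1 to 3 stones, whoever takes the last stone wins). All names and the game itself are illustrative assumptions, and the sketch uses alternating moves rather than the simultaneous moves of Pokémon battles.

```python
def minimax(state, depth, maximizing, propose_actions, apply_action,
            is_terminal, estimate_value):
    """Depth-limited minimax; returns (value, best_action).

    propose_actions and estimate_value are the pluggable hooks: any callable
    (a heuristic, a learned model, an LLM wrapper) can be supplied.
    """
    if is_terminal(state) or depth == 0:
        return estimate_value(state, maximizing), None
    best_value = float("-inf") if maximizing else float("inf")
    best_action = None
    for action in propose_actions(state, maximizing):
        value, _ = minimax(apply_action(state, action), depth - 1,
                           not maximizing, propose_actions, apply_action,
                           is_terminal, estimate_value)
        improved = (value > best_value) if maximizing else (value < best_value)
        if improved:
            best_value, best_action = value, action
    return best_value, best_action


# Toy game: state is the number of stones left.
def propose_actions(pile, maximizing):
    return [a for a in (1, 2, 3) if a <= pile]


def apply_action(pile, action):
    return pile - action


def is_terminal(pile):
    return pile == 0


def estimate_value(pile, maximizing):
    if pile == 0:
        # The player who just moved took the last stone and won.
        return -1.0 if maximizing else 1.0
    return 0.0  # neutral guess when the search is cut off early


print(minimax(7, 10, True, propose_actions, apply_action,
              is_terminal, estimate_value))
```

Restricting `propose_actions` to a few promising candidates is what shrinks the search tree, and a better `estimate_value` is what lets the search stop at a shallow depth.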

[LG-37] Cloud Computing Energy Consumption Prediction Based on Kernel Extreme Learning Machine Algorithm Improved by Vector Weighted Average Algorithm

Link: https://arxiv.org/abs/2503.04088
Authors: Yuqing Wang, Xiao Yang
Subjects: Machine Learning (cs.LG); Performance (cs.PF)

Abstract:With the rapid expansion of cloud computing infrastructure, energy consumption has become a critical challenge, driving the need for accurate and efficient prediction models. This study proposes a novel Vector Weighted Average Kernel Extreme Learning Machine (VWAA-KELM) model to enhance energy consumption prediction in cloud computing environments. By integrating a vector weighted average algorithm (VWAA) with kernel extreme learning machine (KELM), the proposed model dynamically adjusts feature weights and optimizes kernel functions, significantly improving prediction accuracy and generalization. Experimental results demonstrate the superior performance of VWAA-KELM: 94.7% of test set prediction errors fall within [0, 50] units, with only three cases exceeding 100 units, indicating strong stability. The model achieves a coefficient of determination (R2) of 0.987 in the training set (RMSE = 28.108, RPD = 8.872) and maintains excellent generalization with R2 = 0.973 in the test set (RMSE = 43.227, RPD = 6.202). Visual analysis confirms that predicted values closely align with actual energy consumption trends, avoiding overfitting while capturing nonlinear dependencies. A key innovation of this study is the introduction of adaptive feature weighting, allowing the model to dynamically assign importance to different input parameters, thereby enhancing high-dimensional data processing. This advancement provides a scalable and efficient approach for optimizing cloud data center energy consumption. Beyond cloud computing, the proposed hybrid framework has broader applications in Internet of Things (IoT) and edge computing, supporting real-time energy management and intelligent resource allocation.
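
For readers unfamiliar with the three reported regression metrics, the sketch below computes RMSE, the coefficient of determination (R2), and RPD from paired observations. RPD is taken here as the sample standard deviation of the observed values divided by the RMSE, which is a common definition; whether the paper uses exactly this convention is an assumption, and the function name and toy data are illustrative only.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, R^2, and RPD (SD of observations divided by RMSE)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    # Assumed convention: sample standard deviation (n - 1 denominator).
    sd = math.sqrt(ss_tot / (n - 1))
    rpd = sd / rmse if rmse > 0 else math.inf
    return {"rmse": rmse, "r2": r2, "rpd": rpd}


# Hypothetical observed and predicted energy values.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(regression_metrics(observed, predicted))
```

Under this definition RPD is roughly 1 / sqrt(1 - R2), so the reported pairs (R2 = 0.987 with RPD = 8.872, and R2 = 0.973 with RPD = 6.202) are approximately consistent with each other.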

[LG-38] Controlled privacy leakage propagation throughout overlapping grouped learning

Link: https://arxiv.org/abs/2503.04054
Authors: Shahrzad Kiani, Franziska Boenisch, Stark C. Draper
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: This paper was presented in part at the 2024 IEEE International Symposium on Information Theory (ISIT), Athens, Greece, 2024, pp. 386-391, doi: https://doi.org/10.1109/ISIT57864.2024.10619521 . The paper was published in the IEEE Journal on Selected Areas in Information Theory (JSAIT)

Abstract:Federated Learning (FL) is the standard protocol for collaborative learning. In FL, multiple workers jointly train a shared model. They exchange model updates calculated on their data, while keeping the raw data itself local. Since workers naturally form groups based on common interests and privacy policies, we are motivated to extend standard FL to reflect a setting with multiple, potentially overlapping groups. In this setup, where workers can belong and contribute to more than one group at a time, complexities arise in understanding privacy leakage and in adhering to privacy policies. To address these challenges, we propose differentially private overlapping grouped learning (DP-OGL), a novel method to implement privacy guarantees within overlapping groups. Under the honest-but-curious threat model, we derive novel privacy guarantees between arbitrary pairs of workers. These privacy guarantees describe and quantify two key effects of privacy leakage in DP-OGL: propagation delay, i.e., the fact that information from one group will leak to other groups only with a temporal offset through the common workers, and information degradation, i.e., the fact that noise addition over model updates limits information leakage between workers. Our experiments show that applying DP-OGL enhances utility while maintaining strong privacy compared to standard FL setups.

[LG-39] The Impact Analysis of Delays in Asynchronous Federated Learning with Data Heterogeneity for Edge Intelligence

Link: https://arxiv.org/abs/2503.04052
Authors: Ziruo Hao, Zhenhua Cui, Tao Yang, Bo Hu, Xiaofeng Wu, Hui Feng
Subjects: Machine Learning (cs.LG)

Abstract:Federated learning (FL) has provided a new methodology for coordinating a group of clients to train a machine learning model collaboratively, bringing an efficient paradigm in edge intelligence. Despite its promise, FL faces several critical challenges in practical applications involving edge devices, such as data heterogeneity and delays stemming from communication and computation constraints. This paper examines the impact of unknown causes of delay on training performance in an Asynchronous Federated Learning (AFL) system with data heterogeneity. Initially, an asynchronous error definition is proposed, based on which the adverse impact of data heterogeneity alone is theoretically analyzed within the traditional Synchronous Federated Learning (SFL) framework. Furthermore, Asynchronous Updates with Delayed Gradients (AUDG), a conventional AFL scheme, is discussed. Investigation into AUDG reveals that the negative influence of data heterogeneity is correlated with delays, while a shorter average delay from a specific client does not consistently enhance training performance. To compensate for scenarios to which AUDG is not well suited, Pseudo-synchronous Updates by Reusing Delayed Gradients (PSURDG) is proposed, and its theoretical convergence is analyzed. In both AUDG and PSURDG, only a random set of clients successfully transmits their updated results to the central server in each iteration. The critical difference between them lies in whether the delayed information is reused. Finally, both schemes are validated and compared through theoretical analysis and simulations, demonstrating more intuitively that discarding outdated information due to time delays is not always the best approach.

[LG-40] Neural Network Surrogate Model for Junction Temperature and Hotspot Position in 3D Multi-Layer High Bandwidth Memory (HBM) Chiplets under Varying Thermal Conditions

Link: https://arxiv.org/abs/2503.04049
Authors: Chengxin Zhang, Yujie Liu, Quan Chen
Subjects: Machine Learning (cs.LG)

Abstract:As the demand for computational power increases, high-bandwidth memory (HBM) has become a critical technology for next-generation computing systems. However, the widespread adoption of HBM presents significant thermal management challenges, particularly in multilayer through-silicon-via (TSV) stacked structures under varying thermal conditions, where accurate prediction of junction temperature and hotspot position is essential during the early design. This work develops a data-driven neural network model for the fast prediction of junction temperature and hotspot position in 3D HBM chiplets. The model, trained with a data set of 13,494 different combinations of thermal condition parameters, sampled from a vast parameter space characterized by high-dimensional combination (up to $3^{27}$), can accurately and quickly infer the junction temperature and hotspot position for any thermal conditions in the parameter space. Moreover, it shows good generalizability for other thermal conditions not considered in the parameter space. The data set is constructed using accurate finite element solvers. This method not only minimizes the reliance on costly experimental tests and extensive computational resources for finite element analysis but also accelerates the design and optimization of complex HBM systems, making it a valuable tool for improving thermal management and performance in high-performance computing applications.

[LG-41] An optimal Petrov-Galerkin framework for operator networks

Link: https://arxiv.org/abs/2503.04024
Authors: Philip Charles, Deep Ray, Yue Yu, Joost Prins, Hugo Melchers, Michael R. A. Abdelmalik, Jeffrey Cochran, Assad A. Oberai, Thomas J. R. Hughes, Mats G. Larson
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 39 pages, 22 figures, 5 tables

Abstract:The optimal Petrov-Galerkin formulation to solve partial differential equations (PDEs) recovers the best approximation in a specified finite-dimensional (trial) space with respect to a suitable norm. However, the recovery of this optimal solution is contingent on being able to construct the optimal weighting functions associated with the trial basis. While explicit constructions are available for simple one- and two-dimensional problems, such constructions for a general multidimensional problem remain elusive. In the present work, we revisit the optimal Petrov-Galerkin formulation through the lens of deep learning. We propose an operator network framework called Petrov-Galerkin Variationally Mimetic Operator Network (PG-VarMiON), which emulates the optimal Petrov-Galerkin weak form of the underlying PDE. The PG-VarMiON is trained in a supervised manner using a labeled dataset comprising the PDE data and the corresponding PDE solution, with the training loss depending on the choice of the optimal norm. The special architecture of the PG-VarMiON allows it to implicitly learn the optimal weighting functions, thus endowing the proposed operator network with the ability to generalize well beyond the training set. We derive approximation error estimates for PG-VarMiON, highlighting the contributions of various error sources, particularly the error in learning the true weighting functions. Several numerical results are presented for the advection-diffusion equation to demonstrate the efficacy of the proposed method. By embedding the Petrov-Galerkin structure into the network architecture, PG-VarMiON exhibits greater robustness and improved generalization compared to other popular deep operator frameworks, particularly when the training data is limited.

[LG-42] Greedy Algorithm for Structured Bandits: A Sharp Characterization of Asymptotic Success / Failure

Link: https://arxiv.org/abs/2503.04010
Authors: Aleksandrs Slivkins, Yunzong Xu, Shiliang Zuo
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)

Abstract:We study the greedy (exploitation-only) algorithm in bandit problems with a known reward structure. We allow arbitrary finite reward structures, while prior work focused on a few specific ones. We fully characterize when the greedy algorithm asymptotically succeeds or fails, in the sense of sublinear vs. linear regret as a function of time. Our characterization identifies a partial identifiability property of the problem instance as the necessary and sufficient condition for the asymptotic success. Notably, once this property holds, the problem becomes easy – any algorithm will succeed (in the same sense as above), provided it satisfies a mild non-degeneracy condition. We further extend our characterization to contextual bandits and interactive decision-making with arbitrary feedback, and demonstrate its broad applicability across various examples.
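As a rough illustration of the exploitation-only setting described above, the following sketch runs a greedy learner over a finite set of candidate reward models. The function name `greedy_regret`, the least-squares model selection, and the noise-free rewards are assumptions made for this example and are not taken from the paper.

```python
def greedy_regret(models, true_means, horizon):
    """Cumulative regret of a greedy (exploitation-only) learner.

    models: candidate mean-reward vectors (the known reward structure).
    true_means: actual mean rewards; rewards are noise-free here (assumption).
    """
    n_arms = len(true_means)
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    best = max(true_means)
    regret = 0.0

    def fit_error(model):
        # Weighted squared error between empirical means and the model,
        # taken over arms that have been pulled at least once.
        return sum(
            counts[a] * (sums[a] / counts[a] - model[a]) ** 2
            for a in range(n_arms)
            if counts[a] > 0
        )

    for _ in range(horizon):
        model = min(models, key=fit_error)  # ties go to the earliest model
        arm = max(range(n_arms), key=lambda a: model[a])
        reward = true_means[arm]
        sums[arm] += reward
        counts[arm] += 1
        regret += best - reward
    return regret


# Arm 0 looks identical under both models, so greedy never learns it is wrong.
stuck = greedy_regret([(0.5, 0.1), (0.5, 0.9)], (0.5, 0.9), horizon=100)
# Arm 0 distinguishes the models, so greedy corrects itself after one pull.
recovers = greedy_regret([(0.3, 0.1), (0.5, 0.9)], (0.5, 0.9), horizon=100)
```

In the first call the regret grows linearly with the horizon, while in the second it stops growing after the first round. This toy contrast is only meant to convey why identifiability of the instance matters for greedy play.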

[LG-43] Data-driven identification of nonlinear dynamical systems with LSTM autoencoders and Normalizing Flows

Link: https://arxiv.org/abs/2503.03977
Authors: Abdolvahhab Rostamijavanani, Shanwu Li, Yongchao Yang
Subjects: Machine Learning (cs.LG)

Abstract:While linear systems have been useful in solving problems across different fields, the need for improved performance and efficiency has prompted them to operate in nonlinear modes. As a result, nonlinear models are now essential for the design and control of these systems. However, identifying a nonlinear system is more complicated than identifying a linear one. Therefore, modeling and identifying nonlinear systems are crucial for the design, manufacturing, and testing of complex systems. This study presents advanced deep learning-based nonlinear methods for system identification. Two deep neural network models, LSTM autoencoder and Normalizing Flows, are explored for their potential to extract temporal features from time series data and relate them to system parameters, respectively. The presented framework offers a nonlinear approach to system identification, enabling it to handle complex systems. As case studies, we consider Duffing and Lorenz systems, as well as fluid flows such as flows over a cylinder and the 2-D lid-driven cavity problem. The results indicate that the presented framework is capable of capturing features and effectively relating them to system parameters, satisfying the identification requirements of nonlinear systems.

[LG-44] Generative Learning of Densities on Manifolds

Link: https://arxiv.org/abs/2503.03963
Authors: Dimitris G. Giovanis, Ellis Crabtree, Roger G. Ghanem, Ioannis G. Kevrekidis
Subjects: Machine Learning (cs.LG)

Abstract:A generative modeling framework is proposed that combines diffusion models and manifold learning to efficiently sample data densities on manifolds. The approach utilizes Diffusion Maps to uncover possible low-dimensional underlying (latent) spaces in the high-dimensional data (ambient) space. Two approaches for sampling from the latent data density are described. The first is a score-based diffusion model, which is trained to map a standard normal distribution to the latent data distribution using a neural network. The second one involves solving an Itô stochastic differential equation in the latent space. Additional realizations of the data are generated by lifting the samples back to the ambient space using Double Diffusion Maps, a recently introduced technique typically employed in studying dynamical system reduction; here the focus lies in sampling densities rather than system dynamics. The proposed approaches enable sampling high dimensional data densities restricted to low-dimensional, a priori unknown manifolds. The efficacy of the proposed framework is demonstrated through a benchmark problem and a material with multiscale structure.

[LG-45] A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

Link: https://arxiv.org/abs/2503.03961
Authors: William Merrill, Ashish Sabharwal
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC)
Comments: Preprint

Abstract:Recent theoretical results show transformers cannot express sequential reasoning problems over long input lengths, intuitively because their computational depth is bounded. However, prior work treats the depth as a constant, leaving it unclear to what degree bounded depth may suffice for solving problems over short inputs, or how increasing the transformer’s depth affects its expressive power. We address these questions by analyzing the expressive power of transformers whose depth can grow minimally with context length $n$. We show even highly uniform transformers with depth $\Theta(\log n)$ can express two important problems: recognizing regular languages, which captures state tracking abilities, and graph connectivity, which underlies multi-step reasoning. Notably, both of these problems cannot be expressed by fixed-depth transformers under standard complexity conjectures, demonstrating the expressivity benefit of growing depth. Moreover, our theory quantitatively predicts how depth must grow with input length to express these problems, showing that depth scaling is more efficient than scaling width or chain-of-thought steps. Empirically, we find our theoretical depth requirements for regular language recognition match the practical depth requirements of transformers remarkably well. Thus, our results clarify precisely how depth affects transformers’ reasoning capabilities, providing potential practical insights for designing models that are better at sequential reasoning.
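One common intuition for why logarithmic depth helps with state tracking is that automaton transitions compose associatively, so a non-empty input of length n can be reduced pairwise in about log2(n) rounds. The sketch below illustrates that idea with a plain tree reduction. It is not the transformer construction from the paper, and the names `compose`, `tree_reduce`, and `run` are hypothetical.

```python
def compose(f, g):
    """Return the map 'apply f, then g'; maps are tuples indexed by state."""
    return tuple(g[s] for s in f)


def tree_reduce(maps):
    """Combine adjacent maps pairwise; return (combined map, number of rounds)."""
    rounds = 0
    while len(maps) > 1:
        paired = [compose(maps[i], maps[i + 1]) for i in range(0, len(maps) - 1, 2)]
        if len(maps) % 2:  # an unpaired last map carries over unchanged
            paired.append(maps[-1])
        maps = paired
        rounds += 1
    return maps[0], rounds


# Parity automaton: state 0 = even number of 1s seen so far, state 1 = odd.
TRANSITIONS = {"0": (0, 1), "1": (1, 0)}


def run(word):
    """Final state (starting from state 0) and rounds used, for a non-empty word."""
    combined, rounds = tree_reduce([TRANSITIONS[c] for c in word])
    return combined[0], rounds
```

Each round here plays the role of one layer of parallel computation, so the number of rounds grows with the logarithm of the input length rather than with the length itself.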

[LG-46] Dyads: Artist-Centric AI-Generated Dance Duets

Link: https://arxiv.org/abs/2503.03954
Authors: Zixuan Wang, Luis Zerkowski, Ilya Vidrin, Mariel Pettee
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)

Abstract:Existing AI-generated dance methods primarily train on motion capture data from solo dance performances, but a critical feature of dance in nearly any genre is the interaction of two or more bodies in space. Moreover, many works at the intersection of AI and dance fail to incorporate the ideas and needs of the artists themselves into their development process, yielding models that produce far more useful insights for the AI community than for the dance community. This work addresses both needs of the field by proposing an AI method to model the complex interactions between pairs of dancers and detailing how the technical methodology can be shaped by ongoing co-creation with the artistic stakeholders who curated the movement data. Our model is a probability-and-attention-based Variational Autoencoder that generates a choreographic partner conditioned on an input dance sequence. We construct a custom loss function to enhance the smoothness and coherence of the generated choreography. Our code is open-source, and we also document strategies for other interdisciplinary research teams to facilitate collaboration and strong communication between artists and technologists.

[LG-47] Safe LLM-Controlled Robots with Formal Guarantees via Reachability Analysis

Link: https://arxiv.org/abs/2503.03911
Authors: Ahmad Hafez, Alireza Naderi Akhormeh, Amr Hegazy, Amr Alanwar
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:The deployment of Large Language Models (LLMs) in robotic systems presents unique safety challenges, particularly in unpredictable environments. Although LLMs, leveraging zero-shot learning, enhance human-robot interaction and decision-making capabilities, their inherent probabilistic nature and lack of formal guarantees raise significant concerns for safety-critical applications. Traditional model-based verification approaches often rely on precise system models, which are difficult to obtain for real-world robotic systems and may not be fully trusted due to modeling inaccuracies, unmodeled dynamics, or environmental uncertainties. To address these challenges, this paper introduces a safety assurance framework for LLM-controlled robots based on data-driven reachability analysis, a formal verification technique that ensures all possible system trajectories remain within safe operational limits. Our framework specifically investigates the problem of instructing an LLM to navigate the robot to a specified goal and assesses its ability to generate low-level control actions that successfully guide the robot safely toward that goal. By leveraging historical data to construct reachable sets of states for the robot-LLM system, our approach provides rigorous safety guarantees against unsafe behaviors without relying on explicit analytical models. We validate the framework through experimental case studies in autonomous navigation and task planning, demonstrating its effectiveness in mitigating risks associated with LLM-generated commands. This work advances the integration of formal methods into LLM-based robotics, offering a principled and practical approach to ensuring safety in next-generation autonomous systems.

[LG-48] On the Convergence of Adam-Type Algorithm for Bilevel Optimization under Unbounded Smoothness

Link: https://arxiv.org/abs/2503.03908
Authors: Xiaochuan Gong, Jie Hao, Mingrui Liu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 49 pages, 5 figures

Abstract:Adam has become one of the most popular optimizers for training modern deep neural networks, such as transformers. However, its applicability is largely restricted to single-level optimization problems. In this paper, we aim to extend vanilla Adam to tackle bilevel optimization problems, which have important applications in machine learning, such as meta-learning. In particular, we study stochastic bilevel optimization problems where the lower-level function is strongly convex and the upper-level objective is nonconvex with potentially unbounded smoothness. This unbounded smooth objective function covers a broad class of neural networks, including transformers, which may exhibit non-Lipschitz gradients. In this work, we introduce AdamBO, a single-loop Adam-type method that achieves $\widetilde{O}(\epsilon^{-4})$ oracle complexity to find $\epsilon$-stationary points, where the oracle calls involve stochastic gradient or Hessian/Jacobian-vector product evaluations. The key to our analysis is a novel randomness decoupling lemma that provides refined control over the lower-level variable. We conduct extensive experiments on various machine learning tasks involving bilevel formulations with recurrent neural networks (RNNs) and transformers, demonstrating the effectiveness of our proposed Adam-type algorithm.

[LG-49] The Signed Two-Space Proximity Model for Learning Representations in Protein-Protein Interaction Networks

Link: https://arxiv.org/abs/2503.03904
Authors: Nikolaos Nakis, Chrysoula Kosma, Anastasia Brativnyk, Michail Chatzianastasis, Iakovos Evdaimon, Michalis Vazirgiannis
Subjects: Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
Comments: Preprint

Abstract:Accurately predicting complex protein-protein interactions (PPIs) is crucial for decoding biological processes, from cellular functioning to disease mechanisms. However, experimental methods for determining PPIs are computationally expensive. Thus, attention has been recently drawn to machine learning approaches. Furthermore, insufficient effort has been made toward analyzing signed PPI networks, which capture both activating (positive) and inhibitory (negative) interactions. To accurately represent biological relationships, we present the Signed Two-Space Proximity Model (S2-SPM) for signed PPI networks, which explicitly incorporates both types of interactions, reflecting the complex regulatory mechanisms within biological systems. This is achieved by leveraging two independent latent spaces to differentiate between positive and negative interactions while representing protein similarity through proximity in these spaces. Our approach also enables the identification of archetypes representing extreme protein profiles. S2-SPM’s superior performance in predicting the presence and sign of interactions in SPPI networks is demonstrated in link prediction tasks against relevant baseline methods. Additionally, the biological prevalence of the identified archetypes is confirmed by an enrichment analysis of Gene Ontology (GO) terms, which reveals that distinct biological tasks are associated with archetypal groups formed by both interactions. This study is also validated regarding statistical significance and sensitivity analysis, providing insights into the functional roles of different interaction types. Finally, the robustness and consistency of the extracted archetype structures are confirmed using the Bayesian Normalized Mutual Information (BNMI) metric, proving the model’s reliability in capturing meaningful SPPI patterns.

[LG-50] LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation

Link: https://arxiv.org/abs/2503.03890
Authors: Qian Feng, David S. Martinez Lema, Jianxiang Feng, Zhaopeng Chen, Alois Knoll
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages

Abstract:Learning dexterous manipulation from few-shot demonstrations is a significant yet challenging problem for advanced, human-like robotic systems. Dense distilled feature fields have addressed this challenge by distilling rich semantic features from 2D visual foundation models into the 3D domain. However, their reliance on neural rendering models such as Neural Radiance Fields (NeRF) or Gaussian Splatting results in high computational costs. In contrast, previous approaches based on sparse feature fields either suffer from inefficiencies due to multi-view dependencies and extensive training or lack sufficient grasp dexterity. To overcome these limitations, we propose Language-ENhanced Sparse Distilled Feature Field (LensDFF), which efficiently distills view-consistent 2D features onto 3D points using our novel language-enhanced feature fusion strategy, thereby enabling single-view few-shot generalization. Based on LensDFF, we further introduce a few-shot dexterous manipulation framework that integrates grasp primitives into the demonstrations to generate stable and highly dexterous grasps. Moreover, we present a real2sim grasp evaluation pipeline for efficient grasp assessment and hyperparameter tuning. Through extensive simulation experiments based on the real2sim pipeline and real-world experiments, our approach achieves competitive grasping performance, outperforming state-of-the-art approaches.

[LG-51] Pretrained LLMs as Real-Time Controllers for Robot Operated Serial Production Line

Link: https://arxiv.org/abs/2503.03889
Authors: Muhammad Waseem, Kshitij Bhatta, Chen Li, Qing Chang
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 20 pages, 7 figures

Abstract:The manufacturing industry is undergoing a transformative shift, driven by cutting-edge technologies like 5G, AI, and cloud computing. Despite these advancements, effective system control, which is crucial for optimizing production efficiency, remains a complex challenge due to the intricate, knowledge-dependent nature of manufacturing processes and the reliance on domain-specific expertise. Conventional control methods often demand heavy customization, considerable computational resources, and lack transparency in decision-making. In this work, we investigate the feasibility of using Large Language Models (LLMs), particularly GPT-4, as a straightforward, adaptable solution for controlling manufacturing systems, specifically, mobile robot scheduling. We introduce an LLM-based control framework to assign mobile robots to different machines in robot assisted serial production lines, evaluating its performance in terms of system throughput. Our proposed framework outperforms traditional scheduling approaches such as First-Come-First-Served (FCFS), Shortest Processing Time (SPT), and Longest Processing Time (LPT). While it achieves performance that is on par with state-of-the-art methods like Multi-Agent Reinforcement Learning (MARL), it offers a distinct advantage by delivering comparable throughput without the need for extensive retraining. These results suggest that the proposed LLM-based solution is well-suited for scenarios where technical expertise, computational resources, and financial investment are limited, while decision transparency and system scalability are critical concerns.

[LG-52] Seldonian Reinforcement Learning for Ad Hoc Teamwork

Link: https://arxiv.org/abs/2503.03885
Authors: Edoardo Zorzi, Alberto Castellini, Leonidas Bakopoulos, Georgios Chalkiadakis, Alessandro Farinelli
Subjects: Machine Learning (cs.LG)

Abstract:Most offline RL algorithms return optimal policies but do not provide statistical guarantees on undesirable behaviors. This could generate reliability issues in safety-critical applications, such as in some multiagent domains where agents, and possibly humans, need to interact to reach their goals without harming each other. In this work, we propose a novel offline RL approach, inspired by Seldonian optimization, which returns policies with good performance and statistically guaranteed properties with respect to predefined undesirable behaviors. In particular, our focus is on Ad Hoc Teamwork settings, where agents must collaborate with new teammates without prior coordination. Our method requires only a pre-collected dataset, a set of candidate policies for our agent, and a specification about the possible policies followed by the other players – it does not require further interactions, training, or assumptions on the type and architecture of the policies. We test our algorithm in Ad Hoc Teamwork problems and show that it consistently finds reliable policies while improving sample efficiency with respect to standard ML baselines.

[LG-53] DeepGrav: Anomalous Gravitational-Wave Detection Through Deep Latent Features

Link: https://arxiv.org/abs/2503.03799
Authors: Jianqi Yan (1), Alex P. Leung (1), Zhiyuan Pei (2), David C. Y. Hui (3), Sangin Kim (3) ((1) The University of Hong Kong, (2) Macau University of Science and Technology, (3) Chungnam National University)
Subjects: Machine Learning (cs.LG); High Energy Astrophysical Phenomena (astro-ph.HE); General Relativity and Quantum Cosmology (gr-qc)
Comments: 6 pages, 3 figures. A concise introduction to the winning solution for the NSF HDR A3D3 GW challenge. Our training code is publicly available at this https URL

Abstract:This work introduces a novel deep learning-based approach for gravitational wave anomaly detection, aiming to overcome the limitations of traditional matched filtering techniques in identifying unknown waveform gravitational wave signals. We introduce a modified convolutional neural network architecture inspired by ResNet that leverages residual blocks to extract high-dimensional features, effectively capturing subtle differences between background noise and gravitational wave signals. This network architecture learns a high-dimensional projection while preserving discrepancies with the original input, facilitating precise identification of gravitational wave signals. In our experiments, we implement an innovative data augmentation strategy that generates new data by computing the arithmetic mean of multiple signal samples while retaining the key features of the original signals. In the NSF HDR A3D3: Detecting Anomalous Gravitational Wave Signals competition, our team (group name: easonyan123) took first place, with our model achieving a true negative rate (TNR) of 0.9708 during the development/validation phase and 0.9832 on an unseen challenge dataset during the final/testing phase, the highest among all competitors. These results demonstrate that our method not only achieves excellent generalization performance but also maintains robust adaptability in addressing the complex uncertainties inherent in gravitational wave anomaly detection.
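The augmentation described in the abstract, averaging several signal samples to create a new one, could look roughly like the sketch below. The function name `mean_augment`, the `group_size` parameter, and the random choice of which samples to average are assumptions for illustration; the authors' released training code may differ.

```python
import random


def mean_augment(samples, group_size, rng):
    """Average `group_size` distinct samples element-wise into one new sample."""
    group = rng.sample(samples, group_size)
    length = len(group[0])
    return [sum(s[i] for s in group) / group_size for i in range(length)]


# Three toy "signals" of equal length (made-up numbers).
signals = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0], [4.0, 6.0, 8.0]]
augmented = mean_augment(signals, group_size=3, rng=random.Random(0))
```

With `group_size` equal to the number of signals, every sample is used and the result is simply their element-wise mean.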

[LG-54] A Survey on Semantic Communications in Internet of Vehicles

Link: https://arxiv.org/abs/2503.03767
Authors: Sha Ye, Qiong Wu, Pingyi Fan, Qiang Fan
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: This paper has been submitted to Entropy

Abstract:Internet of Vehicles (IoV), as the core of the intelligent transportation system, enables comprehensive interconnection between vehicles and their surroundings through multiple communication modes, which is significant for autonomous driving and intelligent traffic management. However, with the emergence of new applications, traditional communication technologies face the problems of scarce spectrum resources and high latency. Semantic communication, which focuses on extracting, transmitting, and recovering useful semantic information from messages, can reduce redundant data transmission, improve spectrum utilization, and provide innovative solutions to communication challenges in the IoV. This paper systematically reviews the state of the art of semantic communications in the IoV, elaborates the technical background of IoV and semantic communications, and discusses in depth the key technologies of semantic communications in IoV, including semantic information extraction, semantic communication architecture, resource allocation and management, and so on. Through specific case studies, it demonstrates that semantic communications can be effectively employed in the scenarios of traffic environment perception and understanding, intelligent driving decision support, IoV service optimization, and intelligent traffic management. Additionally, it analyzes the current challenges and future research directions. This survey reveals that semantic communications have broad application prospects in IoV, but existing problems must be solved by combining advanced technologies in order to promote their wide application in IoV and contribute to the development of the intelligent transportation system.

[LG-55] Efficiently Escaping Saddle Points under Generalized Smoothness via Self-Bounding Regularity

Link: https://arxiv.org/abs/2503.04712
Authors: Daniel Yiming Cao, August Y. Chen, Karthik Sridharan, Benjamin Tang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 79 pages

Abstract:In this paper, we study the problem of non-convex optimization on functions that are not necessarily smooth using first order methods. Smoothness (functions whose gradient and/or Hessian are Lipschitz) is not satisfied by many machine learning problems in both theory and practice, motivating a recent line of work studying the convergence of first order methods to first order stationary points under appropriate generalizations of smoothness. We develop a novel framework to study convergence of first order methods to first and second order stationary points under generalized smoothness, under more general smoothness assumptions than the literature. Using our framework, we show appropriate variants of GD and SGD (e.g. with appropriate perturbations) can converge not just to first order but also second order stationary points in runtime polylogarithmic in the dimension. To our knowledge, our work contains the first such result, as well as the first ‘non-textbook’ rate for non-convex optimization under generalized smoothness. We demonstrate that several canonical non-convex optimization problems fall under our setting and framework.

[LG-56] Coarse graining and reduced order models for plume ejection dynamics

Link: https://arxiv.org/abs/2503.04690
Authors: Ike Griss Salas, Megan R. Ebers, Jake Stevens-Haas, J. Nathan Kutz
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Mathematical Physics (math-ph)

Abstract:Monitoring the atmospheric dispersion of pollutants is increasingly critical for environmental impact assessments. High-fidelity computational models are often employed to simulate plume dynamics, guiding decision-making and prioritizing resource deployment. However, such models can be prohibitively expensive to simulate, as they require resolving turbulent flows at fine spatial and temporal resolutions. Moreover, there are at least two distinct dynamical regimes of interest in the plume: (i) the initial ejection of the plume where turbulent mixing is generated by the shear-driven Kelvin-Helmholtz instability, and (ii) the ensuing turbulent diffusion and advection which is often modeled by the Gaussian plume model. We address the challenge of modeling the initial plume generation. Specifically, we propose a data-driven framework that identifies a reduced-order analytical model for plume dynamics – directly from video data. We extract a time series of plume center and edge points from video snapshots and evaluate different regressions based on their extrapolation performance to generate a time series of coefficients that characterize the plume’s overall direction and spread. We regress to a sinusoidal model inspired by the Kelvin-Helmholtz instability for the edge points in order to identify the plume’s dispersion and vorticity. Overall, this reduced-order modeling framework provides a data-driven and lightweight approach to capture the dominant features of the initial nonlinear point-source plume dynamics, agnostic to plume type and starting only from video. The resulting model is a precursor to standard models such as the Gaussian plume model and has the potential to enable rapid assessment and evaluation of critical environmental hazards, such as methane leaks, chemical spills, and pollutant dispersal from smokestacks.

[LG-57] Propagating Model Uncertainty through Filtering-based Probabilistic Numerical ODE Solvers

Link: https://arxiv.org/abs/2503.04684
Authors: Dingling Yao, Filip Tronarp, Nathanael Bosch
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Filtering-based probabilistic numerical solvers for ordinary differential equations (ODEs), also known as ODE filters, have been established as efficient methods for quantifying numerical uncertainty in the solution of ODEs. In practical applications, however, the underlying dynamical system often contains uncertain parameters, requiring the propagation of this model uncertainty to the ODE solution. In this paper, we demonstrate that ODE filters, despite their probabilistic nature, do not automatically solve this uncertainty propagation problem. To address this limitation, we present a novel approach that combines ODE filters with numerical quadrature to properly marginalize over uncertain parameters, while accounting for both parameter uncertainty and numerical solver uncertainty. Experiments across multiple dynamical systems demonstrate that the resulting uncertainty estimates closely match reference solutions. Notably, we show how the numerical uncertainty from the ODE solver can help prevent overconfidence in the propagated uncertainty estimates, especially when using larger step sizes. Our results illustrate that probabilistic numerical methods can effectively quantify both numerical and parametric uncertainty in dynamical systems.
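A minimal sketch of the marginalization step, under stated assumptions: each quadrature node supplies a Gaussian solver output (mean and variance) for one parameter value, and the marginal moments follow from the laws of total expectation and total variance. The three-point Gauss-Hermite rule and the helper names are choices made for this illustration. In practice the per-node moments would come from an ODE filter, which is replaced here by a stand-in function.

```python
import math

# Three-point Gauss-Hermite rule for a standard normal variable.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def marginalize(solver, mu, sigma):
    """Marginal mean and variance of the solution when theta ~ N(mu, sigma^2).

    solver(theta) must return (mean, variance) of the numerical solution;
    it stands in for an ODE filter (assumption).
    """
    moments = [solver(mu + sigma * z) for z in NODES]
    mean = sum(w * m for w, (m, _) in zip(WEIGHTS, moments))
    second = sum(w * (v + m * m) for w, (m, v) in zip(WEIGHTS, moments))
    return mean, second - mean * mean


def toy_solver(theta, t=1.0, solver_var=1e-4):
    # Exact solution of x' = -theta * x with x(0) = 1, plus a fixed
    # stand-in value for the solver's own numerical variance.
    return math.exp(-theta * t), solver_var


mean, var = marginalize(toy_solver, mu=1.0, sigma=0.1)
```

Because the marginal variance is the weighted solver variance plus the spread of the per-node means, it cannot fall below the solver's own variance, which matches the abstract's point that numerical uncertainty helps prevent overconfidence.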

[LG-58] Leveraging priors on distribution functions for multi-arm bandits

Link: https://arxiv.org/abs/2503.04518
Authors: Sumit Vashishtha, Odalric-Ambrym Maillard
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We introduce Dirichlet Process Posterior Sampling (DPPS), a Bayesian non-parametric algorithm for multi-arm bandits based on Dirichlet Process (DP) priors. Like Thompson-sampling, DPPS is a probability-matching algorithm, i.e., it plays an arm based on its posterior-probability of being optimal. Instead of assuming a parametric class for the reward generating distribution of each arm, and then putting a prior on the parameters, in DPPS the reward generating distribution is directly modeled using DP priors. DPPS provides a principled approach to incorporate prior belief about the bandit environment, and in the noninformative limit of the DP posteriors (i.e. Bayesian Bootstrap), we recover Non Parametric Thompson Sampling (NPTS), a popular non-parametric bandit algorithm, as a special case of DPPS. We employ stick-breaking representation of the DP priors, and show excellent empirical performance of DPPS in challenging synthetic and real world bandit environments. Finally, using an information-theoretic analysis, we show non-asymptotic optimality of DPPS in the Bayesian regret setup.
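For intuition about the noninformative limit mentioned above, the sketch below scores each arm with a Bayesian-bootstrap draw, that is, Dirichlet(1, ..., 1) weights over that arm's observed rewards. It is a simplified illustration rather than DPPS or NPTS as defined in the paper, the function names are hypothetical, and every arm is assumed to have at least one observation.

```python
import random


def bayesian_bootstrap_mean(rewards, rng):
    """One posterior draw of an arm's mean under the Bayesian bootstrap."""
    # Normalized Exp(1) draws are Dirichlet(1, ..., 1) weights.
    raw = [rng.expovariate(1.0) for _ in rewards]
    total = sum(raw)
    return sum(w / total * r for w, r in zip(raw, rewards))


def choose_arm(history, rng):
    """Play the arm whose sampled mean is largest (probability matching)."""
    draws = [bayesian_bootstrap_mean(rewards, rng) for rewards in history]
    return max(range(len(history)), key=lambda a: draws[a])


rng = random.Random(0)
history = [[0.2, 0.3, 0.1], [0.9, 0.8, 1.0]]  # made-up observed rewards per arm
arm = choose_arm(history, rng)
```

Each sampled mean is a convex combination of the observed rewards, so in this toy history the second arm is always selected because its reward range lies entirely above the first arm's.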

[LG-59] A Morse Transform for Drug Discovery

Link: https://arxiv.org/abs/2503.04507
Authors: Alexander M. Tanaka, Aras T. Asaad, Richard Cooper, Vidit Nanda
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 25 pages, 5 main figures, 2 main tables, 6 supplementary figures and 4 supplementary tables

Abstract:We introduce a new ligand-based virtual screening (LBVS) framework that uses piecewise linear (PL) Morse theory to predict ligand binding potential. We model ligands as simplicial complexes via a pruned Delaunay triangulation, and catalogue the critical points across multiple directional height functions. This produces a rich feature vector, consisting of crucial topological features – peaks, troughs, and saddles – that characterise ligand surfaces relevant to binding interactions. Unlike contemporary LBVS methods that rely on computationally-intensive deep neural networks, we require only a lightweight classifier. The Morse theoretic approach achieves state-of-the-art performance on standard datasets while offering an interpretable feature vector and scalable method for ligand prioritization in early-stage drug discovery.

[LG-60] Accurate predictive model of band gap with selected important features based on explainable machine learning

Link: https://arxiv.org/abs/2503.04492
Authors: Joohwi Lee,Kaito Miyamoto
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, SI is included

Abstract:In the rapidly advancing field of materials informatics, nonlinear machine learning models have demonstrated exceptional predictive capabilities for material properties. However, their black-box nature limits interpretability, and they may incorporate features that do not contribute to, or even deteriorate, model performance. This study employs explainable ML (XML) techniques, including permutation feature importance and the SHapley Additive exPlanation, applied to a pristine support vector regression model designed to predict band gaps at the GW level using 18 input features. Guided by XML-derived individual feature importance, a simple framework is proposed to construct reduced-feature predictive models. Model evaluations indicate that an XML-guided compact model, consisting of the top five features, achieves comparable accuracy to the pristine model on in-domain datasets while demonstrating superior generalization with lower prediction errors on out-of-domain data. Additionally, the study underscores the necessity for eliminating strongly correlated features to prevent misinterpretation and overestimation of feature importance before applying XML. This study highlights XML’s effectiveness in developing simplified yet highly accurate machine learning models by clarifying feature roles.
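
Permutation feature importance, one of the two techniques named in the abstract, can be sketched in a few lines. This is a generic illustration, not the authors' pipeline. It assumes a model given as a plain Python callable, rows stored as lists, and mean squared error as the score; the function names are hypothetical.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of a callable model over a list of feature rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(targets)


def permutation_importance(model, rows, targets, column, repeats=20, seed=0):
    """Average increase in MSE when one feature column is shuffled.

    A value near zero suggests the model does not rely on that feature.
    """
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    increases = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        increases.append(mse(model, permuted, targets) - baseline)
    return sum(increases) / repeats
```

As the abstract notes, strongly correlated features should be removed before reading such scores, because shuffling one of two correlated columns can misstate how much the model depends on the underlying signal.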

[LG-61] InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference ICLR2025

Link: https://arxiv.org/abs/2503.04483
Authors: Tianyu Cui,Song-Jun Xu,Artem Moskalev,Shuwei Li,Tommaso Mansi,Mangal Prakash,Rui Liao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: ICLR 2025 AI4NA Oral, ICLR 2025 MLGenX Spotlight, ICLR 2025 LMRL

Abstract:Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes. While supervised models are reported to achieve high performance for this task, they rely on costly ground truth (GT) labels and risk learning gene-specific biases, such as class imbalances of GT interactions, rather than true regulatory mechanisms. To address these issues, we introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors, improving GRN inference without GT labels. InfoSEM can also integrate GT labels as an additional prior when available, avoiding biases and further enhancing performance. Additionally, we propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery and reveals learned biases of existing supervised methods. InfoSEM outperforms existing models by 38.5% across four datasets using textual embeddings prior and further boosts performance by 11.1% when integrating labeled data as priors.

[LG-62] Poisoning Bayesian Inference via Data Deletion and Replication

Link: https://arxiv.org/abs/2503.04480
Authors: Matthieu Carreau,Roi Naveiro,William N. Caballero
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Research in adversarial machine learning (AML) has shown that statistical models are vulnerable to maliciously altered data. However, despite advances in Bayesian machine learning models, most AML research remains concentrated on classical techniques. Therefore, we focus on extending the white-box model poisoning paradigm to attack generic Bayesian inference, highlighting its vulnerability in adversarial contexts. A suite of attacks are developed that allow an attacker to steer the Bayesian posterior toward a target distribution through the strategic deletion and replication of true observations, even when only sampling access to the posterior is available. Analytic properties of these algorithms are proven and their performance is empirically examined in both synthetic and real-world scenarios. With relatively little effort, the attacker is able to substantively alter the Bayesian’s beliefs and, by accepting more risk, they can mold these beliefs to their will. By carefully constructing the adversarial posterior, surgical poisoning is achieved such that only targeted inferences are corrupted and others are minimally disturbed.

[LG-63] An artificially intelligent magnetic resonance spectroscopy quantification method: Comparison between QNet and LCModel on the cloud computing platform CloudBrain-MRS

Link: https://arxiv.org/abs/2503.04469
Authors: Meijin Lin,Lin Guo,Dicheng Chen,Jianshu Chen,Zhangren Tu,Xu Huang,Jianhua Wang,Ji Qi,Yuan Long,Zhiguo Huang,Di Guo,Xiaobo Qu,Haiwei Han
Categories: Medical Physics (physics.med-ph); Machine Learning (cs.LG)

Abstract: Objectives: This work aimed to statistically compare the metabolite quantification of human brain magnetic resonance spectroscopy (MRS) between the deep learning method QNet and the classical method LCModel through an easy-to-use intelligent cloud computing platform, CloudBrain-MRS. Materials and Methods: In this retrospective study, two 3 T MRI scanners, Philips Ingenia and Achieva, collected 61 and 46 in vivo 1H magnetic resonance (MR) spectra of healthy participants, respectively, from the pregenual anterior cingulate cortex from September to October 2021. Bland-Altman, Pearson correlation, and reasonability analyses were performed to assess the degree of agreement, linear correlation, and reasonability between the two quantification methods. Results: Fifteen healthy volunteers (12 females and 3 males, age range: 21-35 years, mean age/standard deviation = 27.4/3.9 years) were recruited. The analyses showed high to good consistency and very strong to moderate correlation between the two methods for quantification of total N-acetylaspartate (tNAA), total choline (tCho), and inositol (Ins) (relative half interval of limits of agreement = 3.04%, 9.3%, and 18.5%, respectively; Pearson correlation coefficient r = 0.775, 0.927, and 0.469, respectively). In addition, the quantification results of QNet are more likely to be closer to previously reported average values than those of LCModel. Conclusion: There were high or good degrees of consistency between the quantification results of QNet and LCModel for tNAA, tCho, and Ins, and QNet generally gives more reasonable quantification than LCModel.
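
The two agreement statistics named in the abstract can be sketched as follows. This is the textbook form of Bland-Altman limits of agreement (mean difference plus or minus 1.96 sample standard deviations) and the Pearson coefficient. The paper's "relative half interval" normalization is not specified in the abstract and is not reproduced here; the function names are hypothetical.

```python
import math
import statistics


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    spread = 1.96 * statistics.stdev(diffs)
    return bias, bias - spread, bias + spread


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

The two statistics answer different questions: a high correlation shows that the methods rank samples similarly, while narrow limits of agreement show that their values are interchangeable.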

[LG-64] Reproducibility Assessment of Magnetic Resonance Spectroscopy of Pregenual Anterior Cingulate Cortex across Sessions and Vendors via the Cloud Computing Platform CloudBrain-MRS

Link: https://arxiv.org/abs/2503.04453
Authors: Runhan Chen,Meijin Lin,Jianshu Chen,Liangjie Lin,Jiazheng Wang,Xiaoqing Li,Jianhua Wang,Xu Huang,Ling Qian,Shaoxing Liu,Yuan Long,Di Guo,Xiaobo Qu,Haiwei Han
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

Abstract: Given the need to elucidate the mechanisms underlying illnesses and their treatment, as well as the lack of harmonization of acquisition and post-processing protocols among different magnetic resonance system vendors, this work aims to determine whether metabolite concentrations obtained from different sessions, machine models, and even different vendors of 3 T scanners can be highly reproducible and be pooled for diagnostic analysis, which is very valuable for research on rare diseases. Participants underwent magnetic resonance imaging (MRI) scanning once on two separate days within one week on each machine (one session per day, each session including two proton magnetic resonance spectroscopy (1H-MRS) scans with no more than a 5-minute interval between scans and no off-bed activity). The measured metabolite concentrations were analyzed for within- and between-session reliability using the coefficient of variation (CV) and intraclass correlation coefficient (ICC), and for reproducibility across machines using the correlation coefficient. Within and between sessions, nearly all CV values, whether for the group of all first scans, all second scans, or a whole session, were below 20%, and most of the ICCs for metabolites ranged from moderate (0.4-0.59) to excellent (0.75-1), indicating high data reliability. For reproducibility across the three scanners, all Pearson correlation coefficients approached 1, with most around 0.9, and the majority were statistically significant (P < 0.01). Additionally, intra-vendor reproducibility was greater than inter-vendor reproducibility.

[LG-65] A Graph-Partitioning Based Continuous Optimization Approach to Semi-supervised Clustering Problems

Link: https://arxiv.org/abs/2503.04447
Authors: Wei Liu,Xin Liu,Michael K. Ng,Zaikun Zhang
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Semi-supervised clustering is a basic problem in various applications. Most existing methods require knowledge of the ideal cluster number, which is often difficult to obtain in practice. Besides, satisfying the must-link constraints is another major challenge for these methods. In this work, we view the semi-supervised clustering task as a partitioning problem on a graph associated with the given dataset, where the similarity matrix includes a scaling parameter to reflect the must-link constraints. Utilizing a relaxation technique, we formulate the graph partitioning problem into a continuous optimization model that does not require the exact cluster number, but only an overestimate of it. We then propose a block coordinate descent algorithm to efficiently solve this model, and establish its convergence result. Based on the obtained solution, we can construct the clusters that theoretically meet the must-link constraints under mild assumptions. Furthermore, we verify the effectiveness and efficiency of our proposed method through comprehensive numerical experiments.

[LG-66] Determinant Estimation under Memory Constraints and Neural Scaling Laws

Link: https://arxiv.org/abs/2503.04424
Authors: Siavash Ameli,Chris van der Heide,Liam Hodgkinson,Fred Roosta,Michael W. Mahoney
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract: Calculating or accurately estimating log-determinants of large positive semi-definite matrices is of fundamental importance in many machine learning tasks. While its cubic computational complexity can already be prohibitive, in modern applications, even storing the matrices themselves can pose a memory bottleneck. To address this, we derive a novel hierarchical algorithm based on block-wise computation of the LDL decomposition for large-scale log-determinant calculation in memory-constrained settings. In extreme cases where matrices are highly ill-conditioned, accurately computing the full matrix itself may be infeasible. This is particularly relevant when considering kernel matrices at scale, including the empirical Neural Tangent Kernel (NTK) of neural networks trained on large datasets. Under the assumption of neural scaling laws in the test error, we show that the ratio of pseudo-determinants satisfies a power-law relationship, allowing us to derive corresponding scaling laws. This enables accurate estimation of NTK log-determinants from a tiny fraction of the full dataset; in our experiments, this results in a $\sim 100{,}000\times$ speedup with improved accuracy over competing approximations. Using these techniques, we successfully estimate log-determinants for dense matrices of extreme sizes, which were previously deemed intractable and inaccessible due to their enormous scale and computational demands.
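
The idea behind block-wise LDL computation can be seen from the standard two-block identity below. This is a textbook fact rather than the paper's hierarchical algorithm, and it assumes the leading block $A$ is invertible; the symbols $A$, $B$, $D$, and $S$ are notation introduced here for illustration.

```latex
\[
M = \begin{pmatrix} A & B \\ B^{\top} & D \end{pmatrix}
  = \begin{pmatrix} I & 0 \\ B^{\top} A^{-1} & I \end{pmatrix}
    \begin{pmatrix} A & 0 \\ 0 & S \end{pmatrix}
    \begin{pmatrix} I & A^{-1} B \\ 0 & I \end{pmatrix},
\qquad S = D - B^{\top} A^{-1} B,
\]
\[
\log\det M = \log\det A + \log\det S .
\]
```

Because the two triangular factors have unit determinant, only $A$ and the Schur complement $S$ contribute, so the blocks can be processed one at a time without holding all of $M$ in memory at once.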

[LG-67] Time-varying Factor Augmented Vector Autoregression with Grouped Sparse Autoencoder

Link: https://arxiv.org/abs/2503.04386
Authors: Yiyong Luo,Brooks Paige,Jim Griffin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: Recent economic events, including the global financial crisis and COVID-19 pandemic, have exposed limitations in linear Factor Augmented Vector Autoregressive (FAVAR) models for forecasting and structural analysis. Nonlinear dimension reduction techniques, particularly autoencoders, have emerged as promising alternatives in a FAVAR framework, but challenges remain in identifiability, interpretability, and integration with traditional nonlinear time series methods. We address these challenges through two contributions. First, we introduce a Grouped Sparse autoencoder that employs the Spike-and-Slab Lasso prior, with parameters under this prior being shared across variables of the same economic category, thereby achieving semi-identifiability and enhancing model interpretability. Second, we incorporate time-varying parameters into the VAR component to better capture evolving economic dynamics. Our empirical application to the US economy demonstrates that the Grouped Sparse autoencoder produces more interpretable factors through its parsimonious structure, and its combination with time-varying parameter VAR shows superior performance in both point and density forecasting. Impulse response analysis reveals that monetary policy shocks during recessions generate more moderate responses with higher uncertainty compared to expansionary periods.

[LG-68] Learning Causal Response Representations through Direct Effect Analysis

Link: https://arxiv.org/abs/2503.04358
Authors: Homer Durand,Gherardo Varando,Gustau Camps-Valls
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
Comments: 32 pages, 15 figures, stat.ML

Abstract: We propose a novel approach for learning causal response representations. Our method aims to extract directions in which a multidimensional outcome is most directly caused by a treatment variable. By bridging conditional independence testing with causal representation learning, we formulate an optimisation problem that maximises the evidence against conditional independence between the treatment and outcome, given a conditioning set. This formulation employs flexible regression models tailored to specific applications, creating a versatile framework. The problem is addressed through a generalised eigenvalue decomposition. We show that, under mild assumptions, the distribution of the largest eigenvalue can be bounded by a known F-distribution, enabling testable conditional independence. We also provide theoretical guarantees for the optimality of the learned representation in terms of signal-to-noise ratio and Fisher information maximisation. Finally, we demonstrate the empirical effectiveness of our approach in simulation and real-world experiments. Our results underscore the utility of this framework in uncovering direct causal effects within complex, multivariate settings.

[LG-69] TRANSIT your events into a new mass: Fast background interpolation for weakly-supervised anomaly searches

Link: https://arxiv.org/abs/2503.04342
Authors: Ivan Oleksiyuk,Svyatoslav Voloshynovskiy,Tobias Golling
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments: 34 pages, 14 figures

Abstract: We introduce a new model for conditional and continuous data morphing called TRansport Adversarial Network for Smooth InTerpolation (TRANSIT). We apply it to create a background data template for weakly-supervised searches at the LHC. The method smoothly transforms sideband events to match signal region mass distributions. We demonstrate the performance of TRANSIT using the LHC Olympics R&D dataset. The model captures non-linear mass correlations of features and produces a template that offers a competitive anomaly sensitivity compared to state-of-the-art transport-based template generators. Moreover, the computational training time required for TRANSIT is an order of magnitude lower than that of competing deep learning methods. This makes it ideal for analyses that iterate over many signal regions and signal models. Unlike generative models, which must learn a full probability density distribution, i.e., the correlations between all the variables, the proposed transport model only has to learn a smooth conditional shift of the distribution. This allows for a simpler, more efficient residual architecture, enabling mass uncorrelated features to pass the network unchanged while the mass correlated features are adjusted accordingly. Furthermore, we show that the latent space of the model provides a set of mass decorrelated features useful for anomaly detection without background sculpting.

[LG-70] Generalization in Federated Learning: A Conditional Mutual Information Framework

Link: https://arxiv.org/abs/2503.04091
Authors: Ziqiao Wang,Cheng Long,Yongyi Mao
Categories: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 31 pages

Abstract:Federated Learning (FL) is a widely adopted privacy-preserving distributed learning framework, yet its generalization performance remains less explored compared to centralized learning. In FL, the generalization error consists of two components: the out-of-sample gap, which measures the gap between the empirical and true risk for participating clients, and the participation gap, which quantifies the risk difference between participating and non-participating clients. In this work, we apply an information-theoretic analysis via the conditional mutual information (CMI) framework to study FL’s two-level generalization. Beyond the traditional supersample-based CMI framework, we introduce a superclient construction to accommodate the two-level generalization setting in FL. We derive multiple CMI-based bounds, including hypothesis-based CMI bounds, illustrating how privacy constraints in FL can imply generalization guarantees. Furthermore, we propose fast-rate evaluated CMI bounds that recover the best-known convergence rate for two-level FL generalization in the small empirical risk regime. For specific FL model aggregation strategies and structured loss functions, we refine our bounds to achieve improved convergence rates with respect to the number of participating clients. Empirical evaluations confirm that our evaluated CMI bounds are non-vacuous and accurately capture the generalization behavior of FL algorithms.

[LG-71] Conformal Prediction with Upper and Lower Bound Models

Link: https://arxiv.org/abs/2503.04071
Authors: Miao Li,Michael Klamkin,Mathieu Tanneau,Reza Zandehshahvar,Pascal Van Hentenryck
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:This paper studies a Conformal Prediction (CP) methodology for building prediction intervals in a regression setting, given only deterministic lower and upper bounds on the target variable. It proposes a new CP mechanism (CPUL) that goes beyond post-processing by adopting a model selection approach over multiple nested interval construction methods. Paradoxically, many well-established CP methods, including CPUL, may fail to provide adequate coverage in regions where the bounds are tight. To remedy this limitation, the paper proposes an optimal thresholding mechanism, OMLT, that adjusts CPUL intervals in tight regions with undercoverage. The combined CPUL-OMLT is validated on large-scale learning tasks where the goal is to bound the optimal value of a parametric optimization problem. The experimental results demonstrate substantial improvements over baseline methods across various datasets.
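
For orientation, a plain split-conformal baseline in this setting might look like the sketch below. It is not CPUL or OMLT. It assumes the midpoint of the two bounds as the point prediction and the absolute residual as the conformity score, then intersects the resulting interval with the known bounds; the function names are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of calibration scores.

    Returns infinity when the calibration set is too small for level alpha.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def clipped_interval(lower, upper, quantile):
    """Interval around the midpoint of the bounds, clipped to [lower, upper]."""
    midpoint = 0.5 * (lower + upper)
    return max(lower, midpoint - quantile), min(upper, midpoint + quantile)


def calibrate(lowers, uppers, targets, alpha):
    """Compute the quantile of |target - midpoint| on a calibration set."""
    scores = [abs(y - 0.5 * (lo + up)) for lo, up, y in zip(lowers, uppers, targets)]
    return conformal_quantile(scores, alpha)
```

Clipping cannot reduce coverage here, because the target is assumed to lie between the deterministic bounds by construction.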

[LG-72] Quantitative Flow Approximation Properties of Narrow Neural ODEs

Link: https://arxiv.org/abs/2503.04068
Authors: Karthik Elamvazhuthi
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract: In this note, we revisit the problem of flow approximation properties of neural ordinary differential equations (NODEs). The approximation properties have been considered as a flow controllability problem in recent literature. The neural ODE is considered "narrow" when the parameters have dimension equal to the input of the neural network, and hence have limited width. We derive the relation of narrow NODEs in approximating flows of shallow but wide NODEs. Due to existing results on approximation properties of shallow neural networks, this facilitates understanding which kind of flows of dynamical systems can be approximated using narrow neural ODEs. While approximation properties of narrow NODEs have been established in literature, the proofs often involve extensive constructions or require invoking deep controllability theorems from control theory. In this paper, we provide a simpler proof technique that involves only ideas from ODEs and Grönwall’s lemma. Moreover, we provide an estimate on the number of switches needed for the time dependent weights of the narrow NODE to mimic the behavior of a NODE with a single layer wide neural network as the velocity field.

[LG-73] Reheated Gradient-based Discrete Sampling for Combinatorial Optimization

Link: https://arxiv.org/abs/2503.04047
Authors: Muheng Li,Ruqi Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract: Recently, gradient-based discrete sampling has emerged as a highly efficient, general-purpose solver for various combinatorial optimization (CO) problems, achieving performance comparable to or surpassing the popular data-driven approaches. However, we identify a critical issue in these methods, which we term "wandering in contours". This behavior refers to sampling new different solutions that share very similar objective values for a long time, leading to computational inefficiency and suboptimal exploration of potential solutions. In this paper, we introduce a novel reheating mechanism inspired by the concept of critical temperature and specific heat in physics, aimed at overcoming this limitation. Empirically, our method demonstrates superiority over existing sampling-based and data-driven algorithms across a diverse array of CO problems.

[LG-74] Machine learning driven search of hydrogen storage materials

Link: https://arxiv.org/abs/2503.04027
Authors: Tanumoy Banerjee,Kevin Ji,Weiyi Xia,Gaoyuan Ouyang,Tyler Del Rose,Ihor Z. Hlova,Benjamin Ueland,Duane D. Johnson,Cai-Zhuan Wang,Ganesh Balasubramanian,Prashant Singh
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 34 pages, 12 figures, 87 references

Abstract: The transition to a low-carbon economy demands efficient and sustainable energy-storage solutions, with hydrogen emerging as a promising clean-energy carrier and with metal hydrides recognized for their hydrogen-storage capacity. Here, we leverage machine learning (ML) to predict hydrogen-to-metal (H/M) ratios and solution energy by incorporating thermodynamic parameters and local lattice distortion (LLD) as key features. Our best-performing ML model provides improvements to H/M ratios and solution energies over a broad class of ternary alloys (easily extendable to multi-principal-element alloys), such as Ti-Nb-X (X = Mo, Cr, Hf, Ta, V, Zr) and Co-Ni-X (X = Al, Mg, V). Ti-Nb-Mo alloys reveal compositional effects in H-storage behavior, in particular Ti, Nb, and V enhance H-storage capacity, while Mo reduces H/M and hydrogen weight percent by 40-50%. We attribute this to slow hydrogen kinetics in molybdenum-rich alloys, which is validated by our pressure-composition isotherm (PCT) experiments on pure Ti and Ti5Mo95 alloys. Density functional theory (DFT) and molecular simulations also confirm that Ti and Nb promote H diffusion, whereas Mo hinders it, highlighting the interplay between electronic structure, lattice distortions, and hydrogen uptake. Notably, our Gradient Boosting Regression model identifies LLD as a critical factor in H/M predictions. To aid material selection, we present two periodic tables illustrating elemental effects on (a) H2 wt% and (b) solution energy, derived from ML, and provide a reference for identifying alloying elements that enhance hydrogen solubility and storage.

[LG-75] Data-Driven Probabilistic Air-Sea Flux Parameterization

Link: https://arxiv.org/abs/2503.03990
Authors: Jiarong Wu,Pavel Perezhogin,David John Gagne,Brandon Reichl,Aneesh C. Subramanian,Elizabeth Thompson,Laure Zanna
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:Accurately quantifying air-sea fluxes is important for understanding air-sea interactions and improving coupled weather and climate systems. This study introduces a probabilistic framework to represent the highly variable nature of air-sea fluxes, which is missing in deterministic bulk algorithms. Assuming Gaussian distributions conditioned on the input variables, we use artificial neural networks and eddy-covariance measurement data to estimate the mean and variance by minimizing negative log-likelihood loss. The trained neural networks provide alternative mean flux estimates to existing bulk algorithms, and quantify the uncertainty around the mean estimates. Stochastic parameterization of air-sea turbulent fluxes can be constructed by sampling from the predicted distributions. Tests in a single-column forced upper-ocean model suggest that changes in flux algorithms influence sea surface temperature and mixed layer depth seasonally. The ensemble spread in stochastic runs is most pronounced during spring restratification.
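
The training objective described in the abstract, a Gaussian negative log-likelihood over a predicted mean and variance, can be written out for a single observation as below. This is a minimal sketch of the loss and of the sampling step only, with no neural network; the function names `gaussian_nll` and `sample_flux` are hypothetical.

```python
import math
import random


def gaussian_nll(observed, mean, variance):
    """Negative log-likelihood of one observation under N(mean, variance)."""
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    return 0.5 * (math.log(2.0 * math.pi * variance) + (observed - mean) ** 2 / variance)


def sample_flux(mean, variance, rng):
    """Draw one stochastic flux value from the predicted distribution."""
    return rng.gauss(mean, math.sqrt(variance))
```

For a fixed mean, the loss is smallest when the predicted variance equals the squared residual, which is what pushes the variance output to track the observed scatter.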

[LG-76] Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows ICLR2025

Link: https://arxiv.org/abs/2503.03989
Authors: Xiangxin Zhou,Yi Xiao,Haowei Lin,Xinheng He,Jiaqi Guan,Yang Wang,Qiang Liu,Feng Zhou,Liang Wang,Jianzhu Ma
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: Accepted to ICLR 2025

Abstract:The dynamic nature of proteins, influenced by ligand interactions, is essential for comprehending protein function and progressing drug discovery. Traditional structure-based drug design (SBDD) approaches typically target binding sites with rigid structures, limiting their practical application in drug development. While molecular dynamics simulation can theoretically capture all the biologically relevant conformations, the transition rate is dictated by the intrinsic energy barrier between them, making the sampling process computationally expensive. To overcome the aforementioned challenges, we propose to use generative modeling for SBDD considering conformational changes of protein pockets. We curate a dataset of apo and multiple holo states of protein-ligand complexes, simulated by molecular dynamics, and propose a full-atom flow model (and a stochastic version), named DynamicFlow, that learns to transform apo pockets and noisy ligands into holo pockets and corresponding 3D ligand molecules. Our method uncovers promising ligand molecules and corresponding holo conformations of pockets. Additionally, the resultant holo-like states provide superior inputs for traditional SBDD approaches, playing a significant role in practical drug discovery.

[LG-77] Image Data Augmentation for the TAIGA-IACT Experiment with Conditional Generative Adversarial Networks

Link: https://arxiv.org/abs/2503.03982
Authors: Yu. Yu. Dubenskaya,A. P. Kryukov,E. O. Gres,S. P. Polyakov,E. B. Postnikov,P. A. Volchugov,A. A. Vlaskina,D. P. Zhurov
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
Comments: 19 pages, 10 figures, Proceedings of The 8th International Conference on Deep Learning in Computational Physics, June 19-21, 2024, Moscow, Russia

Abstract:Modern Imaging Atmospheric Cherenkov Telescopes (IACTs) generate a huge amount of data that must be classified automatically, ideally in real time. Currently, machine learning-based solutions are increasingly being used to solve classification problems. However, these classifiers require proper training data sets to work correctly. The problem with training neural networks on real IACT data is that these data need to be pre-labeled, whereas such labeling is difficult and its results are estimates. In addition, the distribution of incoming events is highly imbalanced. Firstly, there is an imbalance in the types of events, since the number of detected gamma quanta is significantly less than the number of protons. Secondly, the energy distribution of particles of the same type is also imbalanced, since high-energy particles are extremely rare. This imbalance results in poorly trained classifiers that, once trained, do not handle rare events correctly. Using only conventional Monte Carlo event simulation methods to solve this problem is possible, but extremely resource-intensive and time-consuming. To address this issue, we propose to perform data augmentation with artificially generated events of the desired type and energy using conditional generative adversarial networks (cGANs), distinguishing classes by energy values. In the paper, we describe a simple algorithm for generating balanced data sets using cGANs. Thus, the proposed neural network model produces both imbalanced data sets for physical analysis as well as balanced data sets suitable for training other neural networks.

[LG-78] Improving the Temporal Resolution of SOHO/MDI Magnetograms of Solar Active Regions Using a Deep Generative Model

Link: https://arxiv.org/abs/2503.03959
Authors: Jialiang Li,Vasyl Yurchyshyn,Jason T. L. Wang,Haimin Wang,Yasser Abduallah,Khalid A. Alobaid,Chunhui Xu,Ruizhu Chen,Yan Xu
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 11 pages, 7 figures

Abstract:We present a novel deep generative model, named GenMDI, to improve the temporal resolution of line-of-sight (LOS) magnetograms of solar active regions (ARs) collected by the Michelson Doppler Imager (MDI) on board the Solar and Heliospheric Observatory (SOHO). Unlike previous studies that focus primarily on spatial super-resolution of MDI magnetograms, our approach can perform temporal super-resolution, which generates and inserts synthetic data between observed MDI magnetograms, thus providing finer temporal structure and enhanced details in the LOS data. The GenMDI model employs a conditional diffusion process, which synthesizes images by considering both preceding and subsequent magnetograms, ensuring that the generated images are not only of high-quality, but also temporally coherent with the surrounding data. Experimental results show that the GenMDI model performs better than the traditional linear interpolation method, especially in ARs with dynamic evolution in magnetic fields.
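
The linear interpolation baseline that the abstract compares against can be sketched as a pixel-wise blend of the two neighbouring magnetograms. This illustrates the baseline only, not GenMDI. It assumes frames stored as equally sized nested lists and a fractional time `t` between 0 and 1; `interpolate_frames` is a hypothetical name.

```python
def interpolate_frames(before, after, t):
    """Blend two equally sized 2D frames pixel by pixel.

    t = 0 returns the earlier frame and t = 1 returns the later frame.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [
        [(1.0 - t) * b + t * a for b, a in zip(row_before, row_after)]
        for row_before, row_after in zip(before, after)
    ]
```

A blend of this kind can only fade structures in and out, so it cannot represent features that move or emerge between observations, which is consistent with the abstract's note that the generative model helps most in dynamically evolving regions.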

[LG-79] The optical and infrared are connected

Link: https://arxiv.org/abs/2503.03816
Authors: Christian K. Jespersen,Peter Melchior,David N. Spergel,Andy D. Goulding,ChangHoon Hahn,Kartheik G. Iyer
Categories: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: 17 pages, 14 figures. 12 pages of Appendix. Submitted to ApJ

Abstract: Galaxies are often modelled as composites of separable components with distinct spectral signatures, implying that different wavelength ranges are only weakly correlated. They are not. We present a data-driven model which exploits subtle correlations between physical processes to accurately predict infrared (IR) WISE photometry from a neural summary of optical SDSS spectra. The model achieves accuracies of $\chi^2_N \approx 1$ for all photometric bands in WISE, as well as good colors. We are also able to tightly constrain typically IR-derived properties, e.g. the bolometric luminosities of AGN and dust parameters such as $q_{\mathrm{PAH}}$. We find that current SED-fitting methods are incapable of making comparable predictions, and that model misspecification often leads to correlated biases in star-formation rates and AGN luminosities. To help improve SED models, we determine what features of the optical spectrum are responsible for our improved predictions, and identify several lines (CaII, SrII, FeI, [OII] and H$\alpha$), which point to the complex chronology of star formation and chemical enrichment being incorrectly modelled.

[LG-80] Non-Gaussianities in Collider Metric Binning

Link: https://arxiv.org/abs/2503.03809
Authors: Andrew J. Larkoski
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments: 5 pages, 3 figures

Abstract:Metrics for rigorously defining a distance between two events have been used to study the properties of the dataspace manifold of particle collider physics. The probability distribution of pairwise distances on this dataspace is unique with probability 1, and so this suggests a method to search for and identify new physics by the deviation of measurement from a null hypothesis prediction. To quantify the deviation statistically, we directly calculate the probability distribution of the number of event pairs that land in the bin a fixed distance apart. This distribution is not generically Gaussian and the ratio of the standard deviation to the mean entries in a bin scales inversely with the square-root of the number of events in the data ensemble. If the dataspace manifold exhibits some enhanced symmetry, the number of entries is Gaussian, and further fluctuations about the mean scale away like the inverse of the number of events. We define a robust measure of the non-Gaussianity of the bin-by-bin statistics of the distance distribution, and demonstrate in simulated data of jets from quantum chromodynamics sensitivity to the parton-to-hadron transition and that the manifold of events enjoys enhanced symmetries as their energy increases.
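
The quantity the abstract studies, the number of event pairs whose distance falls in a given bin, can be illustrated with a short sketch. The Euclidean distance used as the default below is only a stand-in assumed for illustration; the paper works with a metric on collider events, and the name `count_pairs_in_bin` is hypothetical.

```python
import math
from itertools import combinations


def count_pairs_in_bin(events, lo, hi, dist=math.dist):
    """Count unordered event pairs whose distance d satisfies lo <= d < hi.

    `events` is a sequence of equal-length coordinate tuples, and `dist` is any
    function returning a distance between two events (Euclidean by default,
    which is an assumption of this sketch).
    """
    return sum(1 for a, b in combinations(events, 2) if lo <= dist(a, b) < hi)
```

Because N events give N(N-1)/2 pairs that share events with one another, the entries in a bin are not independent draws, which is consistent with the abstract's statement that the bin count is not generically Gaussian.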

[LG-81] Neural Models of Task Adaptation: A Tutorial on Spiking Networks for Executive Control

Link: https://arxiv.org/abs/2503.03784
Authors: Ashwin Viswanathan Kannan,Madhumitha Ganesan
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 6 pages

Abstract:Understanding cognitive flexibility and task-switching mechanisms in neural systems requires biologically plausible computational models. This tutorial presents a step-by-step approach to constructing a spiking neural network (SNN) that simulates task-switching dynamics within the cognitive control network. The model incorporates biologically realistic features, including lateral inhibition, adaptive synaptic weights through unsupervised Spike Timing-Dependent Plasticity (STDP), and precise neuronal parameterization within physiologically relevant ranges. The SNN is implemented using Leaky Integrate-and-Fire (LIF) neurons, which represent excitatory (glutamatergic) and inhibitory (GABAergic) populations. We utilize two real-world datasets as tasks, demonstrating how the network learns and dynamically switches between them. Experimental design follows cognitive psychology paradigms to analyze neural adaptation, synaptic weight modifications, and emergent behaviors such as Long-Term Potentiation (LTP), Long-Term Depression (LTD), and Task-Set Reconfiguration (TSR). Through a series of structured experiments, this tutorial illustrates how variations in task-switching intervals affect performance and multitasking efficiency. The results align with empirically observed neuronal responses, offering insights into the computational underpinnings of executive function. By following this tutorial, researchers can develop and extend biologically inspired SNN models for studying cognitive processes and neural adaptation.
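
As a rough illustration of the Leaky Integrate-and-Fire (LIF) neuron named in the abstract, here is a minimal single-neuron sketch using forward-Euler integration. It is not the tutorial's implementation: the function name `simulate_lif` and all parameter values (time constant, resting, reset and threshold potentials, resistance) are assumptions chosen only to give plausible behaviour, and lateral inhibition and STDP are omitted.

```python
def simulate_lif(input_current, dt=1.0, tau_m=20.0, v_rest=-65.0,
                 v_reset=-70.0, v_thresh=-50.0, resistance=10.0):
    """Simulate one LIF neuron and return (voltage trace, spike step indices).

    Membrane equation: tau_m * dV/dt = -(V - v_rest) + resistance * I.
    Units are nominally ms, mV, MOhm and nA; all defaults are illustrative.
    """
    v = v_rest
    trace, spikes = [], []
    for step, current in enumerate(input_current):
        # Forward-Euler update of the membrane potential.
        v += (-(v - v_rest) + resistance * current) * (dt / tau_m)
        if v >= v_thresh:
            # Threshold crossing: record a spike and reset the potential.
            spikes.append(step)
            v = v_reset
        trace.append(v)
    return trace, spikes
```

With these assumed defaults, a constant input of 3.0 drives the steady-state potential above threshold and produces repeated spikes, whereas an input of 1.0 stays sub-threshold.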

[LG-82] A Phylogenetic Approach to Genomic Language Modeling

Link: https://arxiv.org/abs/2503.03773
Authors: Carlos Albors,Jianan Canal Li,Gonzalo Benegas,Chengzhong Ye,Yun S. Song
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments: 15 pages, 7 figures

Abstract:Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model’s applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.

[LG-83] Fusion of Various Optimization Based Feature Smoothing Methods for Wearable and Non-invasive Blood Glucose Estimation

Link: https://arxiv.org/abs/2503.03770
Authors: Yiting Wei(1),Bingo Wing-Kuen Ling(1),Danni Chen(1),Yuheng Dai(1),Qing Liu(1) ((1) Guangdong University of Technology, Guangzhou, China)
Categories: Medical Physics (physics.med-ph); Machine Learning (cs.LG)
Comments: This version corrects several typos

Abstract:Recently, the wearable and non-invasive blood glucose estimation approach has been proposed. However, due to the unreliability of the acquisition device, the presence of the noise and the variations of the acquisition environments, the obtained features and the reference blood glucose values are highly unreliable. To address this issue, this paper proposes a polynomial fitting approach to smooth the obtained features or the reference blood glucose values. First, the blood glucose values are estimated based on the individual optimization approaches. Second, the absolute difference values between the estimated blood glucose values and the actual blood glucose values based on each optimization approach are computed. Third, these absolute difference values for each optimization approach are sorted in the ascending order. Fourth, for each sorted blood glucose value, the optimization method corresponding to the minimum absolute difference value is selected. Fifth, the accumulate probability of each selected optimization method is computed. If the accumulate probability of any selected optimization method at a point is greater than a threshold value, then the accumulate probabilities of these three selected optimization methods at that point are reset to zero. A range of the sorted blood glucose values are defined as that with the corresponding boundaries points being the previous reset point and this reset point. Hence, after performing the above procedures for all the sorted reference blood glucose values in the validation set, the regions of the sorted reference blood glucose values and the corresponding optimization methods in these regions are determined. The computer numerical simulation results show that our proposed method yields the mean absolute relative deviation (MARD) at 0.0930 and the percentage of the test data falling in the zone A of the Clarke error grid at 94.1176%.
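
Two pieces of the procedure above can be sketched briefly: the reported metric (MARD) and the step that picks, for each sample, the optimization method with the smallest absolute error. This is a simplified illustration under stated assumptions, not the authors' code. The function names and the dict-of-lists input layout are hypothetical, and the later steps (sorting, cumulative probabilities, threshold resets, region boundaries) are deliberately left out.

```python
def mard(estimates, references):
    """Mean absolute relative deviation: mean of |estimate - reference| / reference."""
    if len(estimates) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(e - r) / r for e, r in zip(estimates, references))
    return total / len(references)


def best_method_per_sample(estimates_by_method, references):
    """For each sample, name the method whose estimate is closest to the reference.

    `estimates_by_method` maps a method name to its list of estimates
    (a hypothetical layout assumed for this sketch).
    """
    names = list(estimates_by_method)
    return [
        min(names, key=lambda name: abs(estimates_by_method[name][i] - ref))
        for i, ref in enumerate(references)
    ]
```

For example, estimates of 110 and 90 against references of 100 and 100 give a MARD of 0.1, the same fractional scale as the 0.0930 reported in the abstract.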

Information Retrieval

Attachment Download

Click to download today's full paper list
