This blog post presents the latest paper list retrieved from Arxiv.org on 2025-05-05. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: Paper data is retrieved from Arxiv.org daily and updated automatically at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.


Overview (2025-05-05)

A total of 383 new papers were posted today, including:

  • Natural Language Processing: 34 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 88 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 66 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 129 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability – The Strange Science Part I.ii

[Quick Read]: This paper addresses the lack of a universal approach for evaluating explanations in neural network interpretability research, i.e., how to judge whether an explanation is good. The key to its solution is a pluralist Explanatory Virtues Framework grounded in the Philosophy of Science, which systematically evaluates and improves the quality of explanations in Mechanistic Interpretability (MI) from four perspectives: Bayesian, Kuhnian, Deutschian, and Nomological. The framework highlights the potential of Compact Proofs to integrate many explanatory virtues, and points to future research directions, including clearly defining explanatory simplicity, unifying explanations, and deriving universal principles for neural networks.

Link: https://arxiv.org/abs/2505.01372
Authors: Kola Ayonrinde, Louis Jaburi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 13 pages (plus appendices), 5 figures


Abstract:Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question “What makes a good explanation?” We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science - the Bayesian, Kuhnian, Deutschian, and Nomological - to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems.

[NLP-1] TRAVELER: A Benchmark for Evaluating Temporal Reasoning across Vague, Implicit and Explicit References

[Quick Read]: This paper addresses the limited systematic evaluation of temporal reference resolution in natural language understanding, in particular of how well systems resolve explicit, implicit, and vague temporal references. The key to its solution is TRAVELER, a novel synthetic benchmark dataset that follows a Question Answering paradigm and pairs questions involving temporal references with their correct answers, enabling a comprehensive assessment of model performance across different types of temporal references and event-set lengths.

Link: https://arxiv.org/abs/2505.01325
Authors: Svenja Kenneweg, Jörg Deigmöller, Philipp Cimiano, Julian Eggert
Affiliations: Technische Universität Chemnitz; Honda Research Institute Europe GmbH
Subjects: Computation and Language (cs.CL)
Comments: 24 pages, 6 figures, submitted to Springer Nature Computer Science


Abstract:Understanding and resolving temporal references is essential in Natural Language Understanding as we often refer to the past or future in daily communication. Although existing benchmarks address a system’s ability to reason about and resolve temporal references, systematic evaluation of specific temporal references remains limited. Towards closing this gap, we introduce TRAVELER, a novel synthetic benchmark dataset that follows a Question Answering paradigm and consists of questions involving temporal references with the corresponding correct answers. TRAVELER assesses models’ abilities to resolve explicit, implicit relative to speech time, and vague temporal references. Beyond investigating the performance of state-of-the-art LLMs depending on the type of temporal reference, our benchmark also allows evaluation of performance in relation to the length of the set of events. For the category of vague temporal references, ground-truth answers were established via human surveys on Prolific, following a procedure similar to the one from Kenneweg et al. To demonstrate the benchmark’s applicability, we evaluate four state-of-the-art LLMs using a question-answering task encompassing 3,300 questions. Our findings show that while the benchmarked LLMs can answer questions over event sets with a handful of events and explicit temporal references successfully, performance clearly deteriorates with larger event set length and when temporal references get less explicit. Notably, the vague question category exhibits the lowest performance across all models. The benchmark is publicly available at: this https URL
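
To make the task format concrete, here is a hypothetical TRAVELER-style item and a resolver for the explicit case (field names, events, and dates are invented for illustration; the actual benchmark schema may differ):

```python
from datetime import date

# Hypothetical TRAVELER-style items: each pairs a question containing a
# temporal reference with the events it must be resolved against.
events = [
    {"name": "dentist appointment", "date": date(2025, 4, 28)},
    {"name": "team offsite", "date": date(2025, 3, 10)},
]

items = [
    # Explicit reference: the temporal window is stated outright.
    {"question": "Which events happened in the last 7 days?",
     "reference_type": "explicit", "speech_time": date(2025, 5, 2)},
    # Vague reference: "recently" is underspecified; ground truth would
    # come from human judgments (e.g. the Prolific surveys in the paper).
    {"question": "Which events happened recently?",
     "reference_type": "vague", "speech_time": date(2025, 5, 2)},
]

def resolve_explicit(item, events, window_days=7):
    """Resolve an explicit 'last N days' reference by date arithmetic."""
    cutoff = item["speech_time"].toordinal() - window_days
    return [e["name"] for e in events if e["date"].toordinal() >= cutoff]

print(resolve_explicit(items[0], events))  # ['dentist appointment']
```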

[NLP-2] Helping Big Language Models Protect Themselves: An Enhanced Filtering and Summarization System

[Quick Read]: This paper addresses the safety of Large Language Models (LLMs) against sophisticated adversarial attacks, manipulative prompts, and encoded malicious inputs. Existing countermeasures typically require retraining, which is computationally costly and hard to deploy. This study proposes a defense paradigm that needs no retraining or fine-tuning; its key is a framework with a prompt filtering module and a summarization module. The filtering module uses Natural Language Processing (NLP) techniques such as zero-shot classification, keyword analysis, and encoded-content detection to detect, decode, and classify harmful inputs, while the summarization module processes adversarial research literature to give the LLM context-aware defense knowledge, strengthening its resistance to adversarial misuse.

Link: https://arxiv.org/abs/2505.01315
Authors: Sheikh Samit Muhaimin, Spyridon Mastorakis
Affiliations: University of Notre Dame
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:The recent growth in the use of Large Language Models has made them vulnerable to sophisticated adversarial assaults, manipulative prompts, and encoded malicious inputs. Existing countermeasures frequently necessitate retraining models, which is computationally costly and impracticable for deployment. Without the need for retraining or fine-tuning, this study presents a unique defense paradigm that allows LLMs to recognize, filter, and defend against adversarial or malicious inputs on their own. There are two main parts to the suggested framework: (1) A prompt filtering module that uses sophisticated Natural Language Processing (NLP) techniques, including zero-shot classification, keyword analysis, and encoded content detection (e.g. base64, hexadecimal, URL encoding), to detect, decode, and classify harmful inputs; and (2) A summarization module that processes and summarizes adversarial research literature to give the LLM context-aware defense knowledge. This approach strengthens LLMs’ resistance to adversarial exploitation by fusing text extraction, summarization, and harmful prompt analysis. According to experimental results, this integrated technique has a 98.71% success rate in identifying harmful patterns, manipulative language structures, and encoded prompts. By employing a modest amount of adversarial research literature as context, the methodology also allows the model to react correctly to harmful inputs with a larger percentage of jailbreak resistance and refusal rate. While maintaining the quality of LLM responses, the framework dramatically increases LLM’s resistance to hostile misuse, demonstrating its efficacy as a quick and easy substitute for time-consuming, retraining-based defenses.
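
The encoded-content detection step of the filtering module can be pictured with a minimal sketch like the following (heuristics and ordering are invented; the paper's module is more elaborate):

```python
import base64
import binascii
import re
from urllib.parse import unquote

def try_decode(token: str):
    """Attempt to decode a suspicious token as base64, hex, or URL encoding."""
    # Base64: decode, then require printable UTF-8 to limit false positives.
    try:
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
        if decoded.isprintable():
            return "base64", decoded
    except (binascii.Error, UnicodeDecodeError, ValueError):
        pass
    # Hexadecimal: even-length runs of hex digits.
    if re.fullmatch(r"(?:[0-9a-fA-F]{2})+", token):
        try:
            return "hex", bytes.fromhex(token).decode("utf-8")
        except UnicodeDecodeError:
            pass
    # URL encoding: only meaningful if percent-escapes actually change it.
    if "%" in token and unquote(token) != token:
        return "url", unquote(token)
    return None, token

scheme, payload = try_decode("aWdub3JlIGFsbCBwcmV2aW91cyBpbnN0cnVjdGlvbnM=")
print(scheme, payload)  # base64 ignore all previous instructions
```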

[NLP-3] A Transformer-based Neural Architecture Search Method GECCO2023

[Quick Read]: This paper addresses how to search for neural network architectures that yield better translation quality in neural machine translation. The key to its solution is a Transformer-based neural architecture search method that explores cross multi-head attention computation schemes over different encoder-decoder combinations and uses perplexity as an auxiliary evaluation metric alongside the BLEU score, so that a multi-objective genetic algorithm can iteratively improve each neural network in the population and arrive at better-performing models.

Link: https://arxiv.org/abs/2505.01314
Authors: Shang Wang, Huanrong Tang, Jianquan Ouyang
Affiliations: Xiangtan University
Subjects: Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments: GECCO 2023


Abstract:This paper presents a neural architecture search method based on Transformer architecture, searching cross multihead attention computation ways for different number of encoder and decoder combinations. In order to search for neural network structures with better translation results, we considered perplexity as an auxiliary evaluation metric for the algorithm in addition to BLEU scores and iteratively improved each individual neural network within the population by a multi-objective genetic algorithm. Experimental results show that the neural network structures searched by the algorithm outperform all the baseline models, and that the introduction of the auxiliary evaluation metric can find better models than considering only the BLEU score as an evaluation metric.

[NLP-4] A Factorized Probabilistic Model of the Semantics of Vague Temporal Adverbials Relative to Different Event Types

[Quick Read]: This paper addresses the underspecified semantics of vague temporal adverbials such as "recently", "just", and "a long time ago", which describe the temporal distance between a past event and the utterance time without committing to an exact duration. The key to its solution is a factorized model that captures the semantics of these adverbials as probability distributions and composes them with event-specific distributions to yield a contextualized meaning for an adverbial applied to a specific event. The model's parameters are fit on existing data capturing native speakers' judgments of whether these vague temporal adverbials apply to events that took place a given time ago.

Link: https://arxiv.org/abs/2505.01311
Authors: Svenja Kenneweg, Jörg Deigmöller, Julian Eggert, Philipp Cimiano
Affiliations: Bielefeld University; Honda Research Institute Europe
Subjects: Computation and Language (cs.CL)
Comments: 7 pages, 1 figure, to be published in CogSci Proceedings 2025


Abstract:Vague temporal adverbials, such as recently, just, and a long time ago, describe the temporal distance between a past event and the utterance time but leave the exact duration underspecified. In this paper, we introduce a factorized model that captures the semantics of these adverbials as probabilistic distributions. These distributions are composed with event-specific distributions to yield a contextualized meaning for an adverbial applied to a specific event. We fit the model’s parameters using existing data capturing judgments of native speakers regarding the applicability of these vague temporal adverbials to events that took place a given time ago. Comparing our approach to a non-factorized model based on a single Gaussian distribution for each pair of event and temporal adverbial, we find that while both models have similar predictive power, our model is preferable in terms of Occam’s razor, as it is simpler and has better extendability.
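
A minimal numerical sketch of the factorized composition, with all distribution shapes and parameters invented for illustration rather than taken from the paper's fits:

```python
import numpy as np

days = np.arange(0, 365)  # temporal distance in days before utterance time

def normalized(p):
    return p / p.sum()

# Adverbial distribution: "recently" favors small temporal distances
# (an exponential shape chosen purely for illustration).
p_recently = normalized(np.exp(-days / 14.0))

# Event-specific distribution: how long ago events of this type typically
# happened (a log-normal-like bump around ~120 days, again illustrative).
p_event = normalized(np.exp(-((np.log(days + 1.0) - np.log(120.0)) ** 2) / 0.5))

# Factorized composition: the pointwise product, renormalized, gives the
# contextualized meaning of "recently" applied to this event type.
p_composed = normalized(p_recently * p_event)

print("most probable distance:", days[p_composed.argmax()], "days ago")
```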

[NLP-5] Anti-adversarial Learning: Desensitizing Prompts for Large Language Models

[Quick Read]: Against the background of the widespread use of Large Language Models (LLMs), this paper addresses the risk that private and sensitive data in user prompts is exposed to cloud-hosted LLMs. Traditional approaches such as homomorphic encryption, secure multi-party computation, and federated learning are of limited use in LLM scenarios due to heavy computational costs and user-participation requirements. The proposed solution is PromptObfus, whose key idea is "anti-adversarial" learning: it perturbs privacy words in the prompt to obscure sensitive information while keeping the model's predictions stable. Concretely, PromptObfus frames prompt desensitization as a masked language modeling task, replaces sensitive words with a [MASK] token, trains a desensitization model to generate candidate replacements, and then selects the best replacement based on gradient feedback from a surrogate model, protecting privacy while preserving task performance.

Link: https://arxiv.org/abs/2505.01273
Authors: Xuan Li, Zhe Yin, Xiaodong Gu, Beijun Shen
Affiliations: School of Computer Science, Shanghai Jiao Tong University, Shanghai, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:With the widespread use of LLMs, preserving privacy in user prompts has become crucial, as prompts risk exposing privacy and sensitive data to the cloud LLMs. Traditional techniques like homomorphic encryption, secure multi-party computation, and federated learning face challenges due to heavy computational costs and user participation requirements, limiting their applicability in LLM scenarios. In this paper, we propose PromptObfus, a novel method for desensitizing LLM prompts. The core idea of PromptObfus is “anti-adversarial” learning, which perturbs privacy words in the prompt to obscure sensitive information while retaining the stability of model predictions. Specifically, PromptObfus frames prompt desensitization as a masked language modeling task, replacing privacy-sensitive terms with a [MASK] token. A desensitization model is trained to generate candidate replacements for each masked position. These candidates are subsequently selected based on gradient feedback from a surrogate model, ensuring minimal disruption to the task output. We demonstrate the effectiveness of our approach on three NLP tasks. Results show that PromptObfus effectively prevents privacy inference from remote LLMs while preserving task performance.
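
A rough sketch of the masked-LM desensitization idea; for simplicity, the surrogate-model gradient feedback described in the paper is replaced here by a direct check of prediction stability, and the model checkpoints are generic defaults rather than the authors':

```python
from transformers import pipeline

# A fill-mask model stands in for the trained desensitization model.
fill_mask = pipeline("fill-mask", model="roberta-base")
# A sentiment classifier acts as the surrogate task model whose
# prediction should remain stable after desensitization.
surrogate = pipeline("sentiment-analysis")

prompt = "My colleague John missed the deadline again."
masked = prompt.replace("John", fill_mask.tokenizer.mask_token)
original_label = surrogate(prompt)[0]["label"]

# Pick the first candidate that hides the private token while leaving the
# surrogate's prediction unchanged (a crude stand-in for gradient feedback).
for cand in fill_mask(masked):
    candidate_prompt = cand["sequence"]
    if "John" in candidate_prompt:
        continue  # replacement still leaks the sensitive term
    if surrogate(candidate_prompt)[0]["label"] == original_label:
        print(candidate_prompt)
        break
```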

[NLP-6] PREMISE: Matching-based Prediction for Accurate Review Recommendation

[Quick Read]: This paper addresses how to fuse multimodal information effectively to improve performance on the Multimodal Review Helpfulness Prediction (MRHP) task. The key to its solution is the PREMISE architecture, which computes multi-scale, multi-field representations, filters duplicated semantics, and produces matching scores as feature vectors, significantly boosting performance while keeping computational cost low.

Link: https://arxiv.org/abs/2505.01255
Authors: Wei Han, Hui Chen, Soujanya Poria
Affiliations: Singapore University of Technology and Design; National University of Singapore
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: 19 pages, 16 figures


Abstract:We present PREMISE (PREdict with Matching ScorEs), a new architecture for matching-based learning in the multimodal field for the multimodal review helpfulness (MRHP) task. Distinct from previous fusion-based methods which obtain multimodal representations via cross-modal attention for downstream tasks, PREMISE computes the multi-scale and multi-field representations, filters duplicated semantics, and then obtains a set of matching scores as feature vectors for the downstream recommendation task. This new architecture significantly boosts the performance for such multimodal tasks whose context-matching content is highly correlated with the targets of that task, compared to the state-of-the-art fusion-based methods. Experimental results on two publicly available datasets show that PREMISE achieves promising performance with less computational cost.

[NLP-7] EvalxNLP: A Framework for Benchmarking Post-Hoc Explainability Methods on NLP Models

[Quick Read]: This paper addresses the lack of interpretability of Natural Language Processing (NLP) models in high-stakes applications, and the difficulty of choosing an explanation method suited to a given use case amid the growing variety of explainability methods and stakeholder needs. The key to its solution is the EvalxNLP framework, which integrates eight widely recognized explainability techniques, supports generating and evaluating feature attribution methods along key properties such as faithfulness, plausibility, and complexity, and offers interactive LLM-based textual explanations to help users understand and assess the results, promoting the systematic comparison and advancement of Explainable AI (XAI) techniques in NLP.

Link: https://arxiv.org/abs/2505.01238
Authors: Mahdi Dhaini, Kafaite Zahra Hussain, Efstratios Zaradoukas, Gjergji Kasneci
Affiliations: Technical University of Munich; School of Computation, Information and Technology; Department of Computer Science
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the xAI World Conference (2025) - System Demonstration


Abstract:As Natural Language Processing (NLP) models continue to evolve and become integral to high-stakes applications, ensuring their interpretability remains a critical challenge. Given the growing variety of explainability methods and diverse stakeholder requirements, frameworks that help stakeholders select appropriate explanations tailored to their specific use cases are increasingly important. To address this need, we introduce EvalxNLP, a Python framework for benchmarking state-of-the-art feature attribution methods for transformer-based NLP models. EvalxNLP integrates eight widely recognized explainability techniques from the Explainable AI (XAI) literature, enabling users to generate and evaluate explanations based on key properties such as faithfulness, plausibility, and complexity. Our framework also provides interactive, LLM-based textual explanations, facilitating user understanding of the generated explanations and evaluation outcomes. Human evaluation results indicate high user satisfaction with EvalxNLP, suggesting it is a promising framework for benchmarking explanation methods across diverse user groups. By offering a user-friendly and extensible platform, EvalxNLP aims at democratizing explainability tools and supporting the systematic comparison and advancement of XAI techniques in NLP.

[NLP-8] Gender Bias in Explainability: Investigating Performance Disparity in Post-hoc Methods

[Quick Read]: This paper addresses the fairness of explanation methods with respect to performance disparities across subgroups, showing that widely used post-hoc feature attribution methods for language models exhibit significant gender disparities in faithfulness, robustness, and complexity. The key insight is that these disparities do not stem solely from biased training data and can persist even after models are pre-trained or fine-tuned on particularly unbiased datasets. This highlights the need to consider the fairness of explanations when developing and applying explainability methods, so as to avoid biased outcomes for certain subgroups, and suggests incorporating explanation fairness as a requirement in regulatory frameworks.

Link: https://arxiv.org/abs/2505.01198
Authors: Mahdi Dhaini, Ege Erdogan, Nils Feldhus, Gjergji Kasneci
Affiliations: Technical University of Munich; School of Computation, Information and Technology; Department of Computer Science; Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data; German Research Center for Artificial Intelligence (DFKI)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2025


Abstract:While research on applications and evaluations of explanation methods continues to expand, fairness of the explanation methods concerning disparities in their performance across subgroups remains an often overlooked aspect. In this paper, we address this gap by showing that, across three tasks and five language models, widely used post-hoc feature attribution methods exhibit significant gender disparity with respect to their faithfulness, robustness, and complexity. These disparities persist even when the models are pre-trained or fine-tuned on particularly unbiased datasets, indicating that the disparities we observe are not merely consequences of biased training data. Our results highlight the importance of addressing disparities in explanations when developing and applying explainability methods, as these can lead to biased outcomes against certain subgroups, with particularly critical implications in high-stakes contexts. Furthermore, our findings underscore the importance of incorporating the fairness of explanations, alongside overall model fairness and explainability, as a requirement in regulatory frameworks.

[NLP-9] On the Limitations of Steering in Language Model Alignment

[Quick Read]: This paper addresses the effectiveness and limitations of steering vectors for aligning language model behavior at inference time, particularly their applicability in complex scenarios. The key to its solution is a framework combining transformer hook interventions with antonym-based function vectors to assess how prompt structure and context complexity affect steering effectiveness, laying a methodological foundation for future research into the steering capabilities of reasoning models.

Link: https://arxiv.org/abs/2505.01162
Authors: Chebrolu Niranjan, Kokil Jaidka, Gerard Christopher Yeo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Steering vectors are a promising approach to aligning language model behavior at inference time. In this paper, we propose a framework to assess the limitations of steering vectors as alignment mechanisms. Using a framework of transformer hook interventions and antonym-based function vectors, we evaluate the role of prompt structure and context complexity in steering effectiveness. Our findings indicate that steering vectors are promising for specific alignment tasks, such as value alignment, but may not provide a robust foundation for general-purpose alignment in LLMs, particularly in complex scenarios. We establish a methodological foundation for future investigations into steering capabilities of reasoning models.
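
The basic mechanics of a steering-vector intervention via a transformer forward hook can be sketched as follows (layer index and scale are arbitrary, and the steering vector is random here rather than derived from antonym-based function vectors as in the paper):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

steer = torch.randn(model.config.n_embd)  # placeholder steering direction
steer = steer / steer.norm()

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; hidden states are the first element.
    return (output[0] + 4.0 * steer,) + output[1:]  # scale chosen arbitrarily

# Intervene at an intermediate block via a forward hook.
handle = model.transformer.h[6].register_forward_hook(add_steering)

ids = tok("The weather today is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=10, do_sample=False)
print(tok.decode(out[0]))
handle.remove()  # restore unsteered behavior
```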

[NLP-10] MateICL: Mitigating Attention Dispersion in Large-Scale In-Context Learning

[Quick Read]: This paper addresses attention dispersion in large-scale In-Context Learning (ICL), where growing context length disperses attention and limits the effectiveness of extended contexts. The key to its solution, Mitigating Attention Dispersion in large-scale ICL (MateICL), is to split the context into multiple windows that are processed separately and then introduce an additional layer that recalibrates the attention weights to prioritize the query tokens, maintaining effective self-attention as the context grows.

Link: https://arxiv.org/abs/2505.01110
Authors: Murtadha Ahmed, Wenbo, Liu Yunfeng
Affiliations: Zhuiyi AI Lab
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in In-Context Learning (ICL). However, the fixed position length constraints in pre-trained models limit the number of demonstration examples. Recent efforts to extend context suffer from attention dispersion as the number of demonstrations increases. In this paper, we introduce Mitigating Attention Dispersion in large-scale ICL (MateICL) that enables LLMs to maintain effective self-attention as the context size grows. We first split the context into multiple windows, each filled to the model’s context capacity, which are processed separately. Then, we introduce an additional layer to recalibrate the attention weights, prioritizing the query tokens as the number of demonstrations increases. Our empirical results show that MateICL can effectively leverage larger contexts to improve ICL performance. Compared to retrieval-based baselines, MateICL consistently achieves better performance without requiring an externally trained retrieval model. Despite recent advances in inference strategies (e.g., 32k token contexts), our results demonstrate that MateICL remains beneficial in computationally resource-constrained settings. The code is publicly available at this https URL.
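
Schematically, the window splitting and query-prioritizing recalibration might look like this (a toy illustration of the mechanism as described in the abstract, not the released implementation; all weights are invented):

```python
import numpy as np

def split_into_windows(demos, capacity):
    """Split demonstrations into windows that each fit the context capacity."""
    return [demos[i:i + capacity] for i in range(0, len(demos), capacity)]

def recalibrated_weights(window_logits, query_logit, beta=2.0):
    """Re-weight per-window contributions with a boosted query logit, so the
    query keeps priority as more demonstration windows are added."""
    logits = np.array(window_logits + [beta * query_logit])
    w = np.exp(logits - logits.max())
    return w / w.sum()

demos = [f"example-{i}" for i in range(10)]
windows = split_into_windows(demos, capacity=4)  # -> 3 windows

# Hypothetical raw attention logits for each window and for the query.
weights = recalibrated_weights(window_logits=[1.2, 0.8, 1.0], query_logit=1.1)
print(len(windows), weights)  # the query's share stays dominant
```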

[NLP-11] Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages

【速读】: 该论文试图解决在低资源语言环境下,生成准确且上下文相关的放射学报告的挑战,特别是在医疗领域中,如何提升视觉-语言模型(Vision-Language Models, VLMs)的性能。解决方案的关键在于通过语言特定的微调和领域特定的训练,提升模型在多语言医疗场景下的表现,研究结果表明,语言特定模型在生成放射学报告方面显著优于通用和领域特定模型,同时,结合医学术语的微调进一步提升了模型性能,强调了语言和领域适应性在提高报告质量和准确性中的关键作用。

Link: https://arxiv.org/abs/2505.01096
Authors: Marco Salmè, Rosa Sicilia, Paolo Soda, Valerio Guarrasi
Affiliations: Università Campus Bio-Medico di Roma; Umeå University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:


Abstract:The integration of artificial intelligence in healthcare has opened new horizons for improving medical diagnostics and patient care. However, challenges persist in developing systems capable of generating accurate and contextually relevant radiology reports, particularly in low-resource languages. In this study, we present a comprehensive benchmark to evaluate the performance of instruction-tuned Vision-Language Models (VLMs) in the specialized task of radiology report generation across three low-resource languages: Italian, German, and Spanish. Employing the LLaVA architectural framework, we conducted a systematic evaluation of pre-trained models utilizing general datasets, domain-specific datasets, and low-resource language-specific datasets. In light of the unavailability of models that possess prior knowledge of both the medical domain and low-resource languages, we analyzed various adaptations to determine the most effective approach for these contexts. The results revealed that language-specific models substantially outperformed both general and domain-specific models in generating radiology reports, emphasizing the critical role of linguistic adaptation. Additionally, models fine-tuned with medical terminology exhibited enhanced performance across all languages compared to models with generic knowledge, highlighting the importance of domain-specific training. We also explored the influence of the temperature parameter on the coherence of report generation, providing insights for optimal model settings. Our findings highlight the importance of tailored language and domain-specific training for improving the quality and accuracy of radiological reports in multilingual settings. This research not only advances our understanding of VLMs adaptability in healthcare but also points to significant avenues for future investigations into model tuning and language-specific adaptations.

[NLP-12] Multimodal Transformers are Hierarchical Modal-wise Heterogeneous Graphs

[Quick Read]: This paper addresses the efficiency of multimodal fusion in Multimodal Sentiment Analysis (MSA): Multimodal Transformers (MulTs), the mainstream approach, have made substantial progress but are computationally inefficient. The key to its solution is to model MulTs as Hierarchical Modal-wise Heterogeneous Graphs (HMHGs) and to propose an Interlaced Mask (IM) mechanism over this graph-structured representation, yielding the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). GsiT is formally equivalent to MulTs, achieves efficient weight sharing via IM without information disorder, and performs All-Modal-In-One fusion with only 1/3 of the parameters of pure MulTs while improving performance.

Link: https://arxiv.org/abs/2505.01068
Authors: Yijie Jin, Junjie Peng, Xuanchao Lin, Haochen Yuan, Lan Wang, Cangzhi Zheng
Affiliations: Shanghai University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiments, and existing models have made significant progress in this area. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although act as the paradigm, MulTs suffer from efficiency concerns. In this work, from the perspective of efficiency optimization, we propose and prove that MulTs are hierarchical modal-wise heterogeneous graphs (HMHGs), and we introduce the graph-structured representation pattern of MulTs. Based on this pattern, we propose an Interlaced Mask (IM) mechanism to design the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). It is formally equivalent to MulTs which achieves an efficient weight-sharing mechanism without information disorder through IM, enabling All-Modal-In-One fusion with only 1/3 of the parameters of pure MulTs. A Triton kernel called Decomposition is implemented to ensure avoiding additional computational overhead. Moreover, it achieves significantly higher performance than traditional MulTs. To further validate the effectiveness of GsiT itself and the HMHG concept, we integrate them into multiple state-of-the-art models and demonstrate notable performance improvements and parameter reduction on widely used MSA datasets.

[NLP-13] Do We Need a Detailed Rubric for Automated Essay Scoring using Large Language Models?

[Quick Read]: This paper investigates whether a detailed rubric is necessary for automated essay scoring (AES) based on large language models (LLMs) and how rubric detail affects scoring accuracy. The key to its solution is an experiment comparing rubrics at different levels of detail (a full rubric, a simplified rubric, and no rubric) across multiple LLMs on the TOEFL11 dataset, assessing whether a simplified rubric can reduce token usage while maintaining high scoring accuracy.

Link: https://arxiv.org/abs/2505.01035
Authors: Lui Yoshida
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted in AIED 2025. This preprint has not undergone any post-submission improvements or corrections


Abstract:This study investigates the necessity and impact of a detailed rubric in automated essay scoring (AES) using large language models (LLMs). While using rubrics is standard in LLM-based AES, creating detailed rubrics requires substantial effort and increases token usage. We examined how different levels of rubric detail affect scoring accuracy across multiple LLMs using the TOEFL11 dataset. Our experiments compared three conditions: a full rubric, a simplified rubric, and no rubric, using four different LLMs (Claude 3.5 Haiku, Gemini 1.5 Flash, GPT-4o-mini, and Llama 3 70B Instruct). Results showed that three out of four models maintained similar scoring accuracy with the simplified rubric compared to the detailed one, while significantly reducing token usage. However, one model (Gemini 1.5 Flash) showed decreased performance with more detailed rubrics. The findings suggest that simplified rubrics may be sufficient for most LLM-based AES applications, offering a more efficient alternative without compromising scoring accuracy. However, model-specific evaluation remains crucial as performance patterns vary across different LLMs.
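
The three rubric conditions compare roughly as follows (rubric wording invented; the study's actual rubrics are longer):

```python
FULL_RUBRIC = """Score 1-5. 5: well organized, varied syntax, rare errors.
4: mostly organized, occasional errors that do not impede meaning. 3: ..."""
SIMPLE_RUBRIC = "Score 1-5, where 5 is excellent and 1 is poor."

def build_prompt(essay, rubric=None):
    """Assemble an AES prompt under one of the three rubric conditions."""
    parts = ["You are an essay rater. Rate the following TOEFL essay."]
    if rubric is not None:  # full or simplified condition; None = no rubric
        parts.append("Rubric:\n" + rubric)
    parts.append("Essay:\n" + essay + "\nRespond with only the numeric score.")
    return "\n\n".join(parts)

essay = "Technology has changed education in many ways..."
for condition in (FULL_RUBRIC, SIMPLE_RUBRIC, None):
    # Prompt length is a rough proxy for the token usage the study measures.
    print(len(build_prompt(essay, condition)))
```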

[NLP-14] Value Portrait: Understanding Values of LLMs with Human-aligned Benchmark

[Quick Read]: This paper addresses biases in existing benchmarks for evaluating the value orientations of language models, which stem from fragile human or machine annotations and from test scenarios that diverge from real-world usage. The key to its solution is the Value Portrait benchmark, which improves ecological validity by building items from real user-LLM interactions, and establishes item reliability by having human subjects rate each item's similarity to their own thinking and correlating these ratings with the subjects' actual value scores.

Link: https://arxiv.org/abs/2505.01015
Authors: Jongwook Han, Dongmin Choi, Woojung Song, Eun-Ju Lee, Yohan Jo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 32 pages, 7 figures


Abstract:The importance of benchmarks for assessing the values of language models has been pronounced due to the growing need of more authentic, human-aligned responses. However, existing benchmarks rely on human or machine annotations that are vulnerable to value-related biases. Furthermore, the tested scenarios often diverge from real-world contexts in which models are commonly used to generate text and express values. To address these issues, we propose the Value Portrait benchmark, a reliable framework for evaluating LLMs’ value orientations with two key characteristics. First, the benchmark consists of items that capture real-life user-LLM interactions, enhancing the relevance of assessment results to real-world LLM usage and thus ecological validity. Second, each item is rated by human subjects based on its similarity to their own thoughts, and correlations between these ratings and the subjects’ actual value scores are derived. This psychometrically validated approach ensures that items strongly correlated with specific values serve as reliable items for assessing those values. Through evaluating 27 LLMs with our benchmark, we find that these models prioritize Benevolence, Security, and Self-Direction values while placing less emphasis on Tradition, Power, and Achievement values. Also, our analysis reveals biases in how LLMs perceive various demographic groups, deviating from real human data.

[NLP-15] Towards the Resistance of Neural Network Watermarking to Fine-tuning

[Quick Read]: This paper addresses how to embed ownership information as a watermark into a deep neural network (DNN) such that the watermark is robust to fine-tuning. The key to its solution is the observation that, when the input feature of a convolutional layer contains only low-frequency components, specific frequency components of the convolutional filter are not changed by gradient descent during fine-tuning; a revised Fourier transform is proposed to extract these frequency components from the filter. Moreover, these frequency components are equivariant to weight scaling and weight permutations, so the watermark information can be stably encoded into specific frequency components of the convolutional filter.

Link: https://arxiv.org/abs/2505.01007
Authors: Ling Tang, Yuefeng Chen, Hui Xue, Quanshi Zhang
Affiliations: Shanghai Jiao Tong University; Alibaba Group
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:This paper proves a new watermarking method to embed the ownership information into a deep neural network (DNN), which is robust to fine-tuning. Specifically, we prove that when the input feature of a convolutional layer only contains low-frequency components, specific frequency components of the convolutional filter will not be changed by gradient descent during the fine-tuning process, where we propose a revised Fourier transform to extract frequency components from the convolutional filter. Additionally, we also prove that these frequency components are equivariant to weight scaling and weight permutations. In this way, we design a watermark module to encode the watermark information to specific frequency components in a convolutional filter. Preliminary experiments demonstrate the effectiveness of our method.
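
The equivariance to weight scaling is easy to check numerically; the sketch below uses a standard 2-D FFT as a stand-in for the paper's revised Fourier transform:

```python
import numpy as np

kernel = np.random.randn(3, 3)   # one convolutional filter
freq = np.fft.fft2(kernel)       # frequency components of the filter

# Weight scaling: scaling the kernel scales every frequency component by
# the same factor, so a watermark encoded in component ratios survives it.
assert np.allclose(np.fft.fft2(2.0 * kernel), 2.0 * freq)

# A watermark bit could be encoded in a chosen frequency bin; reading it
# back is a simple lookup.
watermark_bin = (1, 2)
print("magnitude at watermark bin:", abs(freq[watermark_bin]))
```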

[NLP-16] Token-free Models for Sarcasm Detection

[Quick Read]: This paper addresses the vocabulary mismatch and out-of-vocabulary problems introduced by tokenization in natural language processing (NLP). The key to its solution is to use token-free models such as ByT5 and CANINE, which operate directly on raw text at the byte or character level and thus avoid the limitations of conventional tokenization. Experiments show that both models outperform token-based baselines and state-of-the-art approaches on sarcasm detection in both social media and non-social-media domains.

Link: https://arxiv.org/abs/2505.01006
Authors: Sumit Mamtani, Maitreya Sonawane, Kanika Agarwal, Nishanth Sanjeev
Affiliations: New York University
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Tokenization is a foundational step in most natural language processing (NLP) pipelines, yet it introduces challenges such as vocabulary mismatch and out-of-vocabulary issues. Recent work has shown that models operating directly on raw text at the byte or character level can mitigate these limitations. In this paper, we evaluate two token-free models, ByT5 and CANINE, on the task of sarcasm detection in both social media (Twitter) and non-social media (news headlines) domains. We fine-tune and benchmark these models against token-based baselines and state-of-the-art approaches. Our results show that ByT5-small and CANINE outperform token-based counterparts and achieve new state-of-the-art performance, improving accuracy by 0.77% and 0.49% on the News Headlines and Twitter Sarcasm datasets, respectively. These findings underscore the potential of token-free models for robust NLP in noisy and informal domains such as social media.
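
The byte-level view that makes these models tokenizer-free is easy to see with the ByT5 tokenizer (this only shows the encoding; the sarcasm fine-tuning itself is omitted):

```python
from transformers import AutoTokenizer

byt5_tok = AutoTokenizer.from_pretrained("google/byt5-small")

text = "Oh great, another Monday."
ids = byt5_tok(text).input_ids

# ByT5 ids are just UTF-8 byte values offset by 3 (for pad/eos/unk), so
# there is no vocabulary to mismatch and no out-of-vocabulary failure mode.
print(ids[:5])
print(bytes(i - 3 for i in ids[:-1]).decode("utf-8"))  # round-trips the text
```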

[NLP-17] VTS-LLM: Domain-Adaptive LLM Agent for Enhancing Awareness in Vessel Traffic Services through Natural Language ITSC2025

[Quick Read]: This paper addresses the limitations of traditional Vessel Traffic Services (VTS) systems in spatiotemporal reasoning and intuitive human interaction as traffic grows more complex and data becomes heterogeneous and multimodal. The key to its solution is the VTS-LLM Agent, a domain-adaptive Large Language Model (LLM) agent for interactive decision support in VTS operations. It formalizes risk-prone vessel identification as a knowledge-augmented Text-to-SQL task that combines structured vessel databases with external maritime knowledge, and it incorporates NER-based relational reasoning, agent-based domain knowledge injection, a semantic-algebra intermediate representation, and query rethink mechanisms to strengthen domain grounding and context-aware understanding.

Link: https://arxiv.org/abs/2505.00989
Authors: Sijin Sun, Liangbin Zhao, Ming Deng, Xiuju Fu
Affiliations: Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR IHPC), Singapore; National University of Singapore, Singapore; Shanghai University, China
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 5 figures, 7 tables, submitted to ITSC2025


Abstract:Vessel Traffic Services (VTS) are essential for maritime safety and regulatory compliance through real-time traffic management. However, with increasing traffic complexity and the prevalence of heterogeneous, multimodal data, existing VTS systems face limitations in spatiotemporal reasoning and intuitive human interaction. In this work, we propose VTS-LLM Agent, the first domain-adaptive large LLM agent tailored for interactive decision support in VTS operations. We formalize risk-prone vessel identification as a knowledge-augmented Text-to-SQL task, combining structured vessel databases with external maritime knowledge. To support this, we construct a curated benchmark dataset consisting of a custom schema, domain-specific corpus, and a query-SQL test set in multiple linguistic styles. Our framework incorporates NER-based relational reasoning, agent-based domain knowledge injection, semantic algebra intermediate representation, and query rethink mechanisms to enhance domain grounding and context-aware understanding. Experimental results show that VTS-LLM outperforms both general-purpose and SQL-focused baselines under command-style, operational-style, and formal natural language queries, respectively. Moreover, our analysis provides the first empirical evidence that linguistic style variation introduces systematic performance challenges in Text-to-SQL modeling. This work lays the foundation for natural language interfaces in vessel traffic services and opens new opportunities for proactive, LLM-driven maritime real-time traffic management.

[NLP-18] Position: Enough of Scaling LLMs! Let's Focus on Downscaling

[Quick Read]: This paper addresses the computational inefficiency, environmental impact, and deployment constraints caused by the current over-reliance on neural scaling laws in the development of large language models (LLMs). The key to its solution is a holistic downscaling framework that seeks to maintain model performance while drastically reducing resource demands, advocating a more sustainable, efficient, and accessible path for LLM development.

Link: https://arxiv.org/abs/2505.00985
Authors: Ayan Sengupta, Yash Goel, Tanmoy Chakraborty
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:We challenge the dominant focus on neural scaling laws and advocate for a paradigm shift toward downscaling in the development of large language models (LLMs). While scaling laws have provided critical insights into performance improvements through increasing model and dataset size, we emphasize the significant limitations of this approach, particularly in terms of computational inefficiency, environmental impact, and deployment constraints. To address these challenges, we propose a holistic framework for downscaling LLMs that seeks to maintain performance while drastically reducing resource demands. This paper outlines practical strategies for transitioning away from traditional scaling paradigms, advocating for a more sustainable, efficient, and accessible approach to LLM development.

[NLP-19] Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models

[Quick Read]: This paper addresses the data inefficiency of Large Language Models (LLMs) when learning from small, specialized corpora, noting that existing synthetic data generation methods neglect cross-document knowledge associations and thus limit the diversity and depth of generated content. The key to its solution is the Synthetic-on-Graph (SoG) framework, which builds a context graph to capture cross-document knowledge associations and uses a graph-walk strategy for knowledge-associated sampling, improving the diversity and coherence of synthetic data so that models can learn complex knowledge structures and handle rare knowledge.

Link: https://arxiv.org/abs/2505.00979
Authors: Xuhui Jiang, Shengjie Ma, Chengjin Xu, Cehao Yang, Liyu Zhang, Jian Guo
Affiliations: DataArc Tech Ltd.; IDEA Research; Gaoling School of Artificial Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Large Language Models (LLMs) have achieved remarkable success but remain data-inefficient, especially when learning from small, specialized corpora with limited and proprietary data. Existing synthetic data generation methods for continue pre-training focus on intra-document content and overlook cross-document knowledge associations, limiting content diversity and depth. We propose Synthetic-on-Graph (SoG), a synthetic data generation framework that incorporates cross-document knowledge associations for efficient corpus expansion. SoG constructs a context graph by extracting entities and concepts from the original corpus, representing cross-document associations, and employing a graph walk strategy for knowledge-associated sampling. This enhances synthetic data diversity and coherence, enabling models to learn complex knowledge structures and handle rare knowledge. To further improve synthetic data quality, we integrate Chain-of-Thought (CoT) and Contrastive Clarifying (CC) synthetic, enhancing reasoning processes and discriminative power. Experiments show that SoG outperforms the state-of-the-art (SOTA) method in a multi-hop document QA dataset while performing comparably to the SOTA method on the reading comprehension task datasets, which also underscores the better generalization capability of SoG. Our work advances synthetic data generation and provides practical solutions for efficient knowledge acquisition in LLMs, especially in domains with limited data availability.
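
In miniature, the context graph and graph-walk sampling might look like this (toy hand-built graph; SoG extracts entities and edges from the corpus and uses a more elaborate sampling policy):

```python
import random

# Context graph: entities/concepts as nodes, cross-document associations
# as edges. Hand-built here; SoG extracts it from the original corpus.
graph = {
    "aspirin": ["cyclooxygenase", "pain"],
    "cyclooxygenase": ["aspirin", "prostaglandins"],
    "prostaglandins": ["cyclooxygenase", "inflammation"],
    "pain": ["aspirin", "inflammation"],
    "inflammation": ["prostaglandins", "pain"],
}

def graph_walk(graph, start, length, seed=0):
    """Sample a knowledge-associated chain of entities via a random walk."""
    rng = random.Random(seed)
    node, path = start, [start]
    for _ in range(length):
        node = rng.choice(graph[node])
        path.append(node)
    return path

# The entities along one walk would seed one synthetic document, tying
# together knowledge that never co-occurs in any single source document.
print(graph_walk(graph, "aspirin", length=3))
```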

[NLP-20] A Character-based Diffusion Embedding Algorithm for Enhancing the Generation Quality of Generative Linguistic Steganographic Texts

[Quick Read]: This paper addresses the difficulty of generating high-quality steganographic text, which stems from the limited text generation capability of existing models and from embedding algorithms that fail to mitigate the negative effects of sensitive information's properties (such as semantic content or randomness) on text quality. The key to its solution is a new embedding algorithm, the Character-based Diffusion Embedding Algorithm (CDEA), which exploits the properties of sensitive information: based on character-level statistical properties and power-law-based grouping, it increases the selection frequency of high-probability candidate words in the candidate pool and decreases that of low-probability candidates, improving the overall quality of the steganographic text while preserving semantic coherence and logical fluency.

Link: https://arxiv.org/abs/2505.00977
Authors: Yingquan Chen, Qianmu Li, Xiaocong Wu, Huifeng Li, Qing Chang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:


Abstract:Generating high-quality steganographic text is a fundamental challenge in the field of generative linguistic steganography. This challenge arises primarily from two aspects: firstly, the capabilities of existing models in text generation are limited; secondly, embedding algorithms fail to effectively mitigate the negative impacts of sensitive information’s properties, such as semantic content or randomness. Specifically, to ensure that the recipient can accurately extract hidden information, embedding algorithms often have to consider selecting candidate words with relatively low probabilities. This phenomenon leads to a decrease in the number of high-probability candidate words and an increase in low-probability candidate words, thereby compromising the semantic coherence and logical fluency of the steganographic text and diminishing the overall quality of the generated steganographic material. To address this issue, this paper proposes a novel embedding algorithm, character-based diffusion embedding algorithm (CDEA). Unlike existing embedding algorithms that strive to eliminate the impact of sensitive information’s properties on the generation process, CDEA leverages sensitive information’s properties. It enhances the selection frequency of high-probability candidate words in the candidate pool based on general statistical properties at the character level and grouping methods based on power-law distributions, while reducing the selection frequency of low-probability candidate words in the candidate pool. Furthermore, to ensure the effective transformation of sensitive information in long sequences, we also introduce the XLNet model. Experimental results demonstrate that the combination of CDEA and XLNet significantly improves the quality of generated steganographic text, particularly in terms of perceptual-imperceptibility.
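
The candidate-pool reweighting at the heart of CDEA can be caricatured as follows (the grouping is reduced to a median split and the boost factor is invented; the paper groups candidates by a power-law distribution at the character level):

```python
import numpy as np

def reweight_candidates(probs, boost=1.5):
    """Boost high-probability candidates and suppress low-probability ones,
    then renormalize - the opposite of embedding schemes that are pushed
    toward low-probability words."""
    probs = np.asarray(probs, dtype=float)
    factors = np.where(probs >= np.median(probs), boost, 1.0 / boost)
    reweighted = probs * factors
    return reweighted / reweighted.sum()

# Next-token probabilities over a candidate pool from the language model.
lm_probs = [0.45, 0.25, 0.15, 0.10, 0.05]
print(reweight_candidates(lm_probs).round(3))
```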

[NLP-21] Attack and defense techniques in large language models: A survey and new perspectives

[Quick Read]: This paper addresses the security and ethical vulnerabilities of Large Language Models (LLMs), which pose serious reliability challenges in real-world applications. The key to its solution is a systematic survey of attack and defense techniques: it categorizes attacks into adversarial prompt attacks, optimized attacks, model theft, and attacks on LLM applications, analyzes their mechanisms and implications, and reviews prevention-based and detection-based defense strategies. The paper stresses the challenges of adapting to a dynamic threat landscape, balancing usability with robustness, and coping with resource constraints when deploying defenses, and identifies adaptive scalable defenses, explainable security techniques, and standardized evaluation frameworks as future directions.

Link: https://arxiv.org/abs/2505.00976
Authors: Zhiyu Liao, Kang Chen, Yuanguo Lin, Kangkang Li, Yunxuan Liu, Hefeng Chen, Xingwang Huang, Yuanhui Yu
Affiliations: Jimei University; Wenzhou-Kean University; Jiangsu Normal University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:Large Language Models (LLMs) have become central to numerous natural language processing tasks, but their vulnerabilities present significant security and ethical challenges. This systematic survey explores the evolving landscape of attack and defense techniques in LLMs. We classify attacks into adversarial prompt attack, optimized attacks, model theft, as well as attacks on application of LLMs, detailing their mechanisms and implications. Consequently, we analyze defense strategies, including prevention-based and detection-based defense methods. Although advances have been made, challenges remain to adapt to the dynamic threat landscape, balance usability with robustness, and address resource constraints in defense implementation. We highlight open problems, including the need for adaptive scalable defenses, explainable security techniques, and standardized evaluation frameworks. This survey provides actionable insights and directions for developing secure and resilient LLMs, emphasizing the importance of interdisciplinary collaboration and ethical considerations to mitigate risks in real-world applications.

[NLP-22] Llama-Nemotron: Efficient Reasoning Models

[Quick Read]: This paper addresses the challenges of reasoning capability, inference efficiency, and flexibility for enterprise use in large language models. The key to its solution is the Llama-Nemotron series, a family of heterogeneous reasoning models built through neural architecture search, knowledge distillation, continued pretraining, and a reasoning-focused post-training stage comprising supervised fine-tuning and large-scale reinforcement learning. The models are also the first open-source models to support a dynamic reasoning toggle, letting users switch between standard chat and reasoning modes during inference, which improves adaptability and efficiency across application scenarios.

Link: https://arxiv.org/abs/2505.00949
Authors: Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, Ido Shahaf, Oren Tropp, Ehud Karpas, Ran Zilberstein, Jiaqi Zeng, Soumye Singhal, Alexander Bukharin, Yian Zhang, Tugrul Konuk, Gerald Shen, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Yoshi Suhara, Olivier Delalleau, Zijia Chen, Zhilin Wang, David Mosallanezhad, Adi Renduchintala, Haifeng Qian, Dima Rekesh, Fei Jia, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad, Sean Narenthiran, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Igor Gitman, Ivan Moshkov, Wei Du, Shubham Toshniwal, George Armstrong, Branislav Kisacanin, Matvei Novikov, Daria Gitman, Evelina Bakhturina, Jane Polak Scowcroft, John Kamalu, Dan Su, Kezhi Kong, Markus Kliegl, Rabeeh Karimi, Ying Lin, Sanjeev Satheesh, Jupinder Parmar, Pritam Gundecha, Brandon Norick, Joseph Jennings, Shrimai Prabhumoye, Syeda Nahida Akter, Mostofa Patwary, Abhinav Khattar, Deepak Narayanan, Roger Waleffe, Jimmy Zhang, Bor-Yiing Su, Guyue Huang, Terry Kong, Parth Chadha, Sahil Jain, Christine Harvey, Elad Segal, Jining Huang, Sergey Kashirsky, Robert McQueen, Izzy Putterman, George Lam, Arun Venkatesan, Sherry Wu, Vinh Nguyen, Manoj Kilaru, Andrew Wang, Anna Warno, Abhilash Somasamudramath, Sandip Bhaskar, Maka Dong, Nave Assaf, Shahar Mor, Omer Ullman Argov, Scot Junkin, Oleksandr Romanenko, Pedro Larroy, Monika Katariya, Marco Rovinelli, Viji Balas, Nicholas Edelman, Anahita Bhiwandiwalla, Muthu Subramaniam
Affiliations: NVIDIA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:We introduce the Llama-Nemotron series of models, an open family of heterogeneous reasoning models that deliver exceptional reasoning capabilities, inference efficiency, and an open license for enterprise use. The family comes in three sizes – Nano (8B), Super (49B), and Ultra (253B) – and performs competitively with state-of-the-art reasoning models such as DeepSeek-R1 while offering superior inference throughput and memory efficiency. In this report, we discuss the training procedure for these models, which entails using neural architecture search from Llama 3 models for accelerated inference, knowledge distillation, and continued pretraining, followed by a reasoning-focused post-training stage consisting of two main parts: supervised fine-tuning and large scale reinforcement learning. Llama-Nemotron models are the first open-source models to support a dynamic reasoning toggle, allowing users to switch between standard chat and reasoning modes during inference. To further support open research and facilitate model development, we provide the following resources: 1. We release the Llama-Nemotron reasoning models – LN-Nano, LN-Super, and LN-Ultra – under the commercially permissive NVIDIA Open Model License Agreement. 2. We release the complete post-training dataset: Llama-Nemotron-Post-Training-Dataset. 3. We also release our training codebases: NeMo, NeMo-Aligner, and Megatron-LM.

[NLP-23] Large Language Model-Driven Dynamic Assessment of Grammatical Accuracy in English Language Learner Writing

[Quick Read]: This paper investigates how Large Language Models (LLMs) can scale up Dynamic Assessment (DA) for broader use in language learning. The key to its solution is DynaWrite, a modular, microservices-based grammatical tutoring application that supports multiple LLMs to generate dynamic feedback for learners of English. Experiments show that GPT-4o outperformed other models at DA, particularly in the clarity, consistency, and progressive explicitness of its hints, demonstrating that LLMs can effectively support dynamic assessment at scale.

Link: https://arxiv.org/abs/2505.00931
Authors: Timur Jaganov, John Blake, Julián Villegas, Nicholas Carr
Affiliations: University of Aizu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 8 figures. This work has been submitted to the IEEE for possible publication


Abstract:This study investigates the potential for Large Language Models (LLMs) to scale-up Dynamic Assessment (DA). To facilitate such an investigation, we first developed DynaWrite-a modular, microservices-based grammatical tutoring application which supports multiple LLMs to generate dynamic feedback to learners of English. Initial testing of 21 LLMs, revealed GPT-4o and neural chat to have the most potential to scale-up DA in the language learning classroom. Further testing of these two candidates found both models performed similarly in their ability to accurately identify grammatical errors in user sentences. However, GPT-4o consistently outperformed neural chat in the quality of its DA by generating clear, consistent, and progressively explicit hints. Real-time responsiveness and system stability were also confirmed through detailed performance testing, with GPT-4o exhibiting sufficient speed and stability. This study shows that LLMs can be used to scale-up dynamic assessment and thus enable dynamic assessment to be delivered to larger groups than possible in traditional teacher-learner settings.

[NLP-24] How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias ICML2025

[Quick Read]: This paper studies two representative regular language recognition tasks, "even pairs" and "parity check", whose goal is to determine whether the occurrences of certain subsequences in a given sequence are even. It theoretically analyzes the training dynamics under gradient descent of a one-layer transformer (an attention layer followed by a linear layer) on these tasks. While "even pairs" can be solved directly by a one-layer transformer, "parity check" requires integrating Chain-of-Thought (CoT), either at the inference stage of a transformer already trained on "even pairs" or into the training of a one-layer transformer. The key finding is that joint training of the attention and linear layers exhibits two distinct phases: in the first, the attention layer grows rapidly and maps data sequences to separable vectors; in the second, the attention layer stabilizes while the linear layer grows logarithmically and approaches, in direction, a max-margin hyperplane, improving classification performance.

Link: https://arxiv.org/abs/2505.00926
Authors: Ruiquan Huang, Yingbin Liang, Jing Yang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: accepted by ICML 2025


Abstract:Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as "even pairs" and "parity check", the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well-trained for the even pairs task, or into the training of a one-layer transformer. For both problems, our analysis shows that the joint training of attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly, mapping data sequences into separable vectors. In the second phase, the attention layer becomes stable, while the linear layer grows logarithmically and approaches in direction to a max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of O(1/t). Our experiments validate those theoretical results.
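
The two tasks are simple to state in code (binary alphabet; labeling conventions may differ from the paper's setup). The equivalence of "even pairs" to comparing the first and last symbols is what makes it solvable by a one-layer transformer:

```python
def even_pairs(seq: str) -> bool:
    """True iff the number of unequal adjacent pairs ('ab' or 'ba') is even.
    Each unequal pair flips the symbol, so this holds exactly when the first
    and last symbols agree - a feature one attention layer can extract."""
    return sum(a != b for a, b in zip(seq, seq[1:])) % 2 == 0

def parity_check(seq: str) -> bool:
    """True iff the number of 'b's is even - the sequential counting that
    motivates integrating Chain-of-Thought."""
    return seq.count("b") % 2 == 0

assert even_pairs("abba") and not even_pairs("abab")
print(parity_check("abab"))  # True: two 'b's
```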

[NLP-25] NeMo-Inspector: A Visualization Tool for LLM Generation Analysis NAACL2025

[Quick Read]: This paper addresses quality assurance for the synthetic datasets used to adapt Large Language Models (LLMs) to new tasks and improve their overall capability, and in particular how to analyze and clean synthetic data efficiently. The key to its solution is NeMo-Inspector, an open-source tool with integrated inference capabilities that simplifies the analysis of synthetic datasets, significantly reducing the proportion of low-quality samples and improving the accuracy of models trained on the cleaned data.

Link: https://arxiv.org/abs/2505.00903
Authors: Daria Gitman, Igor Gitman, Evelina Bakhturina
Affiliations: NVIDIA Corporation
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Presented at the NAACL 2025 conference


Abstract:Adapting Large Language Models (LLMs) to novel tasks and enhancing their overall capabilities often requires large, high-quality training datasets. Synthetic data, generated at scale, serves a valuable alternative when real-world data is scarce or difficult to obtain. However, ensuring the quality of synthetic datasets is challenging, as developers must manually inspect and refine numerous samples to identify errors and areas for improvement. This process is time-consuming and requires specialized tools. We introduce NeMo-Inspector, an open-source tool designed to simplify the analysis of synthetic datasets with integrated inference capabilities. We demonstrate its effectiveness through two real-world cases. Analysis and cleaning of the synthetically generated GSM-Plus dataset with NeMo-Inspector led to a significant decrease in low-quality samples from 46.99% to 19.51%. The tool also helped identify and correct generation errors in OpenMath models, improving accuracy by 1.92% on the MATH dataset and by 4.17% on the GSM8K dataset for a Meta-Llama-3-8B model fine-tuned on synthetic data generated from Nemotron-4-340B.

[NLP-26] SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered LLM-Guided Distillation

[Quick Read]: This paper addresses efficient path planning for robots in large-scale, dynamic environments, where traditional methods are limited by computational cost and poor adaptability to dynamic scenes. The key to its solution is the SmallPlan framework, which uses Large Language Models (LLMs) as teacher models to train lightweight Small Language Models (SLMs) for high-level path planning. Through a simulation-powered, interleaved training strategy combining LLM-guided supervised fine-tuning (SFT) with reinforcement learning (RL), the SLMs not only complete navigation tasks but also become aware of key factors such as travel distance and number of trials, enabling resource-efficient path planning suited to edge-device deployment.

Link: https://arxiv.org/abs/2505.00831
Authors: Quang P. M. Pham, Khoi T. N. Nguyen, Nhi H. Doan, Cuong A. Pham, Kentaro Inui, Dezhen Song
Affiliations: MBZUAI - Mohamed bin Zayed University of Artificial Intelligence; VinUniversity
Subjects: Robotics (cs.RO); Computation and Language (cs.CL)
Comments: Paper is under review


Abstract:Efficient path planning in robotics, particularly within large-scale, dynamic environments, remains a significant hurdle. While Large Language Models (LLMs) offer strong reasoning capabilities, their high computational cost and limited adaptability in dynamic scenarios hinder real-time deployment on edge devices. We present SmallPlan – a novel framework leveraging LLMs as teacher models to train lightweight Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate across scene graphs that compactly represent full-scaled 3D scenes. The SLMs are trained in a simulation-powered, interleaved manner with LLM-guided supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not only enables SLMs to successfully complete navigation tasks but also makes them aware of important factors like travel distance and number of trials. Through experiments, we demonstrate that the fine-tuned SLMs perform competitively with larger models like GPT-4o on sequential path planning, without suffering from hallucination and overfitting. SmallPlan is resource-efficient, making it well-suited for edge-device deployment and advancing practical autonomous robotics.

[NLP-27] Knowledge-augmented Pre-trained Language Models for Biomedical Relation Extraction

[Quick Read]: This paper addresses how to improve automatic relation extraction (RE) from biomedical literature, and in particular the limited comparability and generalizability of findings across different pre-trained language models (PLMs) and context-augmentation strategies. The key to its solution is to systematically evaluate several PLMs within a consistent evaluation framework, select the best model via comprehensive hyperparameter optimization, and then enhance it with external context such as textual entity descriptions, relational information from knowledge graphs, and molecular structure encodings. The results show that the choice of the underlying language model and thorough hyperparameter optimization are the main drivers of strong extraction performance, while external context yields substantial benefits mainly when fine-tuning smaller PLMs.

Link: https://arxiv.org/abs/2505.00814
Authors: Mario Sänger, Ulf Leser
Affiliations: Humboldt-Universität zu Berlin; AstraZeneca
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Automatic relationship extraction (RE) from biomedical literature is critical for managing the vast amount of scientific knowledge produced each year. In recent years, utilizing pre-trained language models (PLMs) has become the prevalent approach in RE. Several studies report improved performance when incorporating additional context information while fine-tuning PLMs for RE. However, variations in the PLMs applied, the databases used for augmentation, hyper-parameter optimization, and evaluation methods complicate direct comparisons between studies and raise questions about the generalizability of these findings. Our study addresses this research gap by evaluating PLMs enhanced with contextual information on five datasets spanning four relation scenarios within a consistent evaluation framework. We evaluate three baseline PLMs and first conduct extensive hyperparameter optimization. After selecting the top-performing model, we enhance it with additional data, including textual entity descriptions, relational information from knowledge graphs, and molecular structure encodings. Our findings illustrate the importance of i) the choice of the underlying language model and ii) a comprehensive hyperparameter optimization for achieving strong extraction performance. Although inclusion of context information yield only minor overall improvements, an ablation study reveals substantial benefits for smaller PLMs when such external data was included during fine-tuning.

[NLP-28] A Mathematical Philosophy of Explanations in Mechanistic Interpretability – The Strange Science Part I.i

[Quick Read]: This paper addresses the interpretability of neural networks, specifically Mechanistic Interpretability (MI), which seeks to understand neural networks through causal explanations. The key to its solution is the notion of Explanatory Faithfulness and a definition of MI as the practice of producing model-level, ontic, causal-mechanistic, and falsifiable explanations, which distinguishes MI from other interpretability paradigms and delineates its inherent limits. The paper also formulates the Principle of Explanatory Optimism, argued to be a necessary precondition for the success of mechanistic interpretability.

Link: https://arxiv.org/abs/2505.00808
Authors: Kola Ayonrinde, Louis Jaburi
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages (plus appendices), 2 figures


Abstract:Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI’s inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.

[NLP-29] Reasoning Capabilities and Invariability of Large Language Models

[Quick Read]: This paper addresses the shortcomings of Large Language Models (LLMs) on simple reasoning tasks, and in particular their dependence on prompts. The key to its solution is a new benchmark dataset of simple questions requiring shallow logical reasoning about geometric figures, so that answers depend only on deduction rather than prior intuition about the world. An empirical analysis of zero-shot and few-shot prompting across 24 LLMs of different sizes, plus an additional chain-of-thought test on 22 LLMs, reveals how prompting strategies affect model performance.

Link: https://arxiv.org/abs/2505.00776
Authors: Alessandro Raganato, Rafael Peñaloza, Marco Viviani, Gabriella Pasi
Affiliations: University of Milano-Bicocca
Subjects: Computation and Language (cs.CL)
Comments: Accepted for publication in the Proceedings of the 23rd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2024)

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities in manipulating natural language across multiple applications, but their ability to handle simple reasoning tasks is often questioned. In this work, we aim to provide a comprehensive analysis of LLMs’ reasoning competence, specifically focusing on their prompt dependency. In particular, we introduce a new benchmark dataset with a series of simple reasoning questions demanding shallow logical reasoning. Aligned with cognitive psychology standards, the questions are confined to a basic domain revolving around geometric figures, ensuring that responses are independent of any pre-existing intuition about the world and rely solely on deduction. An empirical analysis involving zero-shot and few-shot prompting across 24 LLMs of different sizes reveals that, while LLMs with over 70 billion parameters perform better in the zero-shot setting, there is still a large room for improvement. An additional test with chain-of-thought prompting over 22 LLMs shows that this additional prompt can aid or damage the performance of models, depending on whether the rationale is required before or after the answer.
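To make the prompt-dependency comparison concrete, here is a minimal, hypothetical sketch of how zero-shot and chain-of-thought prompt variants of a deductive geometry question could be built and scored; the question wording, the answer format, and the `query_llm` stub are illustrative assumptions, not the paper's benchmark.

```python
# Sketch: constructing and scoring prompt variants for a deductive question.
QUESTION = (
    "A square and a triangle are drawn. The square is left of the triangle. "
    "Is the triangle right of the square? Answer yes or no."
)

def build_prompt(question: str, style: str) -> str:
    """Return a prompt in one of the compared styles."""
    if style == "zero_shot":
        return question
    if style == "cot_before_answer":
        # Rationale requested before the final answer.
        return question + " Think step by step, then give the final answer."
    if style == "cot_after_answer":
        # Answer first, explanation afterwards.
        return question + " Give the answer first, then explain your reasoning."
    raise ValueError(f"unknown style: {style}")

def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (local model or API)."""
    return "yes"  # stub so the sketch runs end-to-end

def exact_match(response: str, gold: str) -> bool:
    # Normalize and check that the gold label is the final token.
    tokens = response.lower().strip().rstrip(".").split()
    return bool(tokens) and tokens[-1] == gold

for style in ("zero_shot", "cot_before_answer", "cot_after_answer"):
    print(style, exact_match(query_llm(build_prompt(QUESTION, style)), "yes"))
```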

[NLP-30] Multi-Modal Language Models as Text-to-Image Model Evaluators

【Quick Read】: This paper tackles the gradual deprecation of static-dataset benchmarks for evaluating text-to-image (T2I) generative models and looks for a more effective alternative. The key idea is to use multimodal large language models (MLLMs) as evaluator agents that iteratively generate prompts for a T2I model, assessing both prompt-generation consistency and image aesthetics, and reproducing the model rankings of existing benchmarks with a small fraction of the prompts.

Link: https://arxiv.org/abs/2505.00759
Authors: Jiahui Chen, Candace Ross, Reyhane Askari-Hemmat, Koustuv Sinha, Melissa Hall, Michal Drozdzal, Adriana Romero-Soriano
Affiliations: Meta
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The steady improvements of text-to-image (T2I) generative models lead to slow deprecation of automatic evaluation benchmarks that rely on static datasets, motivating researchers to seek alternative ways to evaluate the T2I progress. In this paper, we explore the potential of multi-modal large language models (MLLMs) as evaluator agents that interact with a T2I model, with the objective of assessing prompt-generation consistency and image aesthetics. We present Multimodal Text-to-Image Eval (MT2IE), an evaluation framework that iteratively generates prompts for evaluation, scores generated images and matches T2I evaluation of existing benchmarks with a fraction of the prompts used in existing static benchmarks. Moreover, we show that MT2IE’s prompt-generation consistency scores have higher correlation with human judgment than scores previously introduced in the literature. MT2IE generates prompts that are efficient at probing T2I model performance, producing the same relative T2I model rankings as existing benchmarks while using only 1/80th the number of prompts for evaluation.
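The evaluator-agent loop can be sketched as follows. This is a minimal, hypothetical outline assuming three callables (an MLLM prompt proposer, the T2I model under test, and an MLLM consistency judge); it is not the MT2IE implementation.

```python
# Sketch: iterative MLLM-driven evaluation of a T2I model.
from typing import Callable, List, Tuple

def evaluate_t2i(
    propose_prompt: Callable[[List[Tuple[str, float]]], str],  # MLLM: history -> new probe
    generate_image: Callable[[str], object],                   # T2I model under evaluation
    score_consistency: Callable[[str, object], float],         # MLLM judge: (prompt, image) -> [0, 1]
    n_rounds: int = 10,
) -> float:
    """Iteratively probe a T2I model and return its mean consistency score."""
    history: List[Tuple[str, float]] = []
    for _ in range(n_rounds):
        prompt = propose_prompt(history)   # later probes can depend on earlier scores
        image = generate_image(prompt)
        score = score_consistency(prompt, image)
        history.append((prompt, score))
    return sum(s for _, s in history) / len(history)

# Stubs so the sketch runs; swap in real MLLM / diffusion calls in practice.
mean = evaluate_t2i(
    propose_prompt=lambda h: f"a red cube on a blue table, variant {len(h)}",
    generate_image=lambda p: None,
    score_consistency=lambda p, img: 0.8,
)
print(f"mean consistency: {mean:.2f}")
```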

[NLP-31] A Survey on Large Language Model based Human-Agent Systems

【Quick Read】: This paper addresses the practical limitations of fully autonomous agents based on Large Language Models (LLMs), including limited reliability, difficulty with complex tasks, and safety and ethical risks. The key to its solution is LLM-based Human-Agent Systems (LLM-HAS), which incorporate human-provided information, feedback, or control into the agent system to improve performance, reliability, and safety.

Link: https://arxiv.org/abs/2505.00753
Authors: Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Yuwei Cao, Dongyuan Li, Renhe Jiang, Philip S. Yu
Affiliations: University of Illinois Chicago; The University of Tokyo; Tsinghua University; Google DeepMind
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Paper lists and resources are available at this https URL

Abstract:Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world applications. To overcome these limitations, LLM-based human-agent systems (LLM-HAS) incorporate human-provided information, feedback, or control into the agent system to enhance system performance, reliability and safety. This paper provides the first comprehensive and structured survey of LLM-HAS. It clarifies fundamental concepts, systematically presents core components shaping these systems, including environment profiling, human feedback, interaction types, orchestration and communication, explores emerging applications, and discusses unique challenges and opportunities. By consolidating current knowledge and offering a structured overview, we aim to foster further research and innovation in this rapidly evolving interdisciplinary field. Paper lists and resources are available at this https URL.

[NLP-32] FinBERT-QA: Financial Question Answering with pre-trained BERT Language Models

【Quick Read】: This paper aims to overcome the challenges of data scarcity and language specificity in the financial domain in order to improve Question Answering (QA) systems that support the decision making of financial advisers. The key to its solution is a transformer-based pre-trained BERT language model optimized via transfer learning and fine-tuning; in particular, financial non-factoid answer selection is cast as a re-ranking problem that combines a BM25 retriever with BERT variants for efficient and accurate answer selection.

Link: https://arxiv.org/abs/2505.00725
Authors: Bithiah Yuan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Submitted in partial fulfillment of the requirements for the Master of Science degree in Computer Science at the University of Freiburg, July 31, 2020

Abstract:Motivated by the emerging demand in the financial industry for the automatic analysis of unstructured and structured data at scale, Question Answering (QA) systems can provide lucrative and competitive advantages to companies by facilitating the decision making of financial advisers. Consequently, we propose a novel financial QA system using the transformer-based pre-trained BERT language model to address the limitations of data scarcity and language specificity in the financial domain. Our system focuses on financial non-factoid answer selection, which retrieves a set of passage-level texts and selects the most relevant as the answer. To increase efficiency, we formulate the answer selection task as a re-ranking problem, in which our system consists of an Answer Retriever using BM25, a simple information retrieval approach, to first return a list of candidate answers, and an Answer Re-ranker built with variants of pre-trained BERT language models to re-rank and select the most relevant answers. We investigate various learning, further pre-training, and fine-tuning approaches for BERT. Our experiments suggest that FinBERT-QA, a model built from applying the Transfer and Adapt further fine-tuning and pointwise learning approach, is the most effective, improving the state-of-the-art results of task 2 of the FiQA dataset by 16% on MRR, 17% on NDCG, and 21% on Precision@1.
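A minimal retrieve-then-re-rank sketch in the spirit of this pipeline is shown below; the tiny corpus and the off-the-shelf MS MARCO cross-encoder are stand-ins, not the FinBERT-QA model or the FiQA data. Requires `pip install rank-bm25 sentence-transformers`.

```python
# Sketch: BM25 answer retriever followed by a pointwise BERT-style re-ranker.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = [
    "A dividend is a distribution of a company's earnings to shareholders.",
    "An ETF is a basket of securities that trades on an exchange like a stock.",
    "Compound interest is interest earned on both principal and prior interest.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "What is an exchange traded fund?"

# Stage 1: BM25 returns a shortlist of candidate answers.
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: a pre-trained cross-encoder scores each (query, candidate) pair.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, corpus[i]) for i in top_k])

best = max(zip(top_k, rerank_scores), key=lambda t: t[1])[0]
print("Selected answer:", corpus[best])
```

Re-ranking only the BM25 shortlist keeps the expensive BERT forward passes to a handful of candidates, which is the efficiency argument the abstract makes.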

[NLP-33] F1-EN-3M: Three Million Synthetic Moral Fables for Training Small Open Language Models

【Quick Read】: This paper addresses the lack in modern Natural Language Processing (NLP) of a large, structured corpus of moral stories that couples coherent narratives with explicit ethical lessons. The key to its solution is TF1-EN-3M, the first dataset of three million English fables generated exclusively by instruction-tuned models no larger than 8B parameters; each story follows a six-slot structured template (character - trait - setting - conflict - resolution - moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space.

Link: https://arxiv.org/abs/2504.20605
Authors: Mihai Nadas, Laura Diosan, Andrei Piscoran, Andreea Tomescu
Affiliations: Babeș-Bolyai University; KlusAI Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments:

Abstract:Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character - trait - setting - conflict - resolution - moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
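A minimal sketch of a combinatorial prompt engine over the six-slot scaffold might look like the following; the slot values and prompt wording are illustrative assumptions, not the TF1-EN-3M generation code.

```python
# Sketch: Cartesian-product prompt engine over the six-slot fable scaffold.
import itertools

SLOTS = {
    "character": ["a fox", "a tortoise"],
    "trait": ["boastful", "patient"],
    "setting": ["a sunlit meadow", "a winter forest"],
    "conflict": ["a race it cannot win", "a storm that ruins its stores"],
    "resolution": ["learns to ask for help", "shares what little it has"],
    "moral": ["pride comes before a fall", "kindness is repaid in kind"],
}

TEMPLATE = (
    "Write a short fable about {character} who is {trait}, set in {setting}. "
    "The story's conflict: {conflict}. It resolves when the character "
    "{resolution}. End with the moral: '{moral}'."
)

def prompts():
    """Yield one prompt per combination of slot values."""
    keys = list(SLOTS)
    for values in itertools.product(*(SLOTS[k] for k in keys)):
        yield TEMPLATE.format(**dict(zip(keys, values)))

# Two options per slot already give 2^6 = 64 distinct prompts; larger slot
# vocabularies are what scale the same idea to millions of fables.
print(sum(1 for _ in prompts()))
print(next(iter(prompts())))
```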

Computer Vision

[CV-0] GENMO: A GENeralist Model for Human MOtion

【Quick Read】: This paper addresses the limited knowledge transfer and model-maintenance overhead caused by treating motion generation and motion estimation as separate tasks with specialized models in human motion modeling. The key to its solution is GENMO, a unified generalist model for human motion that reformulates motion estimation as constrained motion generation, forcing the output motion to precisely satisfy observed conditioning signals and thereby bridging estimation and generation within a single framework.

Link: https://arxiv.org/abs/2505.01425
Authors: Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, Ye Yuan
Affiliations: NVIDIA
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO’s effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.

[CV-1] VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models

【Quick Read】: This paper targets content authenticity, provenance tracking, and misuse in video generation, where existing watermarking techniques fail under video-specific manipulations (frame insertion, dropping, or reordering) and degrade visual quality. The key to its solution is VIDSTAMP, a framework that embeds per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models, using a two-stage fine-tuning strategy (first on static image datasets to promote spatial message separation, then on synthesized video sequences to restore temporal consistency) to achieve high-capacity, flexible watermarks with minimal perceptual impact.

Link: https://arxiv.org/abs/2505.01406
Authors: Mohammadreza Teymoorianfard, Shiqing Ma, Amir Houmansadr
Affiliations: University of Massachusetts Amherst
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:The rapid rise of video diffusion models has enabled the generation of highly realistic and temporally coherent videos, raising critical concerns about content authenticity, provenance, and misuse. Existing watermarking approaches, whether passive, post-hoc, or adapted from image-based techniques, often struggle to withstand video-specific manipulations such as frame insertion, dropping, or reordering, and typically degrade visual quality. In this work, we introduce VIDSTAMP, a watermarking framework that embeds per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models. By fine-tuning the model’s decoder through a two-stage pipeline, first on static image datasets to promote spatial message separation, and then on synthesized video sequences to restore temporal consistency, VIDSTAMP learns to embed high-capacity, flexible watermarks with minimal perceptual impact. Leveraging architectural components such as 3D convolutions and temporal attention, our method imposes no additional inference cost and offers better perceptual quality than prior methods, while maintaining comparable robustness against common distortions and tampering. VIDSTAMP embeds 768 bits per video (48 bits per frame) with a bit accuracy of 95.0%, achieves a log P-value of -166.65 (lower is better), and maintains a video quality score of 0.836, comparable to unwatermarked outputs (0.838) and surpassing prior methods in capacity-quality tradeoffs. Code: this https URL
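The reported detection numbers can be sanity-checked under a simple null model: if a video carries no watermark, each decoded bit matches the message with probability 0.5, so the detection p-value is a binomial tail. The sketch below (an assumption about how the metric is defined, not VIDSTAMP code) reproduces the order of magnitude of the reported log p-value from the 95% bit accuracy.

```python
# Sketch: log10 p-value of matching k of n message bits by chance.
import math

def log10_pvalue(matches: int, n_bits: int) -> float:
    """log10 of P(X >= matches) for X ~ Binomial(n_bits, 0.5)."""
    # Log of each tail term via lgamma, to avoid floating-point underflow.
    log_terms = [
        math.lgamma(n_bits + 1) - math.lgamma(k + 1) - math.lgamma(n_bits - k + 1)
        - n_bits * math.log(2.0)
        for k in range(matches, n_bits + 1)
    ]
    m = max(log_terms)  # log-sum-exp for numerical stability
    log_p = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return log_p / math.log(10.0)

n = 768              # bits embedded per video (48 bits per frame, 16 frames implied)
k = round(0.95 * n)  # decoded matches at the reported 95% bit accuracy
print(f"bit accuracy: {k / n:.3f}, log10 p-value: {log10_pvalue(k, n):.2f}")
```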

[CV-2] Multimodal Doctor-in-the-Loop: A Clinically-Guided Explainable Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer

【Quick Read】: This paper addresses the accuracy and explainability of predicting pathological response to neoadjuvant therapy in non-small cell lung cancer patients. Given the limitations of existing radiomics and unimodal deep learning approaches, the study combines multimodal deep learning with intrinsically explainable Artificial Intelligence (eXplainable Artificial Intelligence) techniques. The key is an intermediate fusion strategy that integrates imaging and clinical data for efficient cross-modal interaction, together with a Multimodal Doctor-in-the-Loop approach that embeds clinicians' domain knowledge directly into training, improving clinical relevance and explainability.

Link: https://arxiv.org/abs/2505.01390
Authors: Alice Natalina Caragliano, Claudia Tacconi, Carlo Greco, Lorenzo Nibid, Edy Ippolito, Michele Fiore, Giuseppe Perrone, Sara Ramella, Paolo Soda, Valerio Guarrasi
Affiliations: Università Campus Bio-Medico di Roma; Fondazione Policlinico Universitario Campus Bio-Medico; Umeå University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2502.17503

Abstract:This study proposes a novel approach combining Multimodal Deep Learning with intrinsic eXplainable Artificial Intelligence techniques to predict pathological response in non-small cell lung cancer patients undergoing neoadjuvant therapy. Due to the limitations of existing radiomics and unimodal deep learning approaches, we introduce an intermediate fusion strategy that integrates imaging and clinical data, enabling efficient interaction between data modalities. The proposed Multimodal Doctor-in-the-Loop method further enhances clinical relevance by embedding clinicians’ domain knowledge directly into the training process, guiding the model’s focus gradually from broader lung regions to specific lesions. Results demonstrate improved predictive accuracy and explainability, providing insights into optimal data integration strategies for clinical applications.
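A minimal PyTorch sketch of intermediate fusion follows: imaging and clinical data are encoded separately and joined mid-network so the modalities interact before the prediction head. The toy encoders and layer sizes are illustrative assumptions, not the paper's architecture.

```python
# Sketch: intermediate fusion of an imaging branch and a clinical branch.
import torch
import torch.nn as nn

class IntermediateFusionNet(nn.Module):
    def __init__(self, img_channels=1, clinical_dim=12, hidden=64):
        super().__init__()
        self.img_encoder = nn.Sequential(   # stand-in CNN branch
            nn.Conv2d(img_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, hidden),
        )
        self.clin_encoder = nn.Sequential(  # tabular clinical branch
            nn.Linear(clinical_dim, hidden), nn.ReLU(),
        )
        self.fusion_head = nn.Sequential(   # joint layers after fusion
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),           # pathological-response logit
        )

    def forward(self, image, clinical):
        # Fuse mid-network so the modalities can interact before prediction.
        z = torch.cat([self.img_encoder(image), self.clin_encoder(clinical)], dim=1)
        return self.fusion_head(z)

model = IntermediateFusionNet()
logit = model(torch.randn(4, 1, 64, 64), torch.randn(4, 12))
print(logit.shape)  # torch.Size([4, 1])
```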

[CV-3] Global Collinearity-aware Polygonizer for Polygonal Building Mapping in Remote Sensing

【Quick Read】: This paper tackles polygonal building mapping from remote sensing imagery with a novel algorithm, the Global Collinearity-aware Polygonizer (GCP). The key to GCP is an instance segmentation framework that processes binary masks from any instance segmentation model: polylines sampled along mask contours are refined by a transformer-based regression module so they accurately fit the target building outlines, after which a collinearity-aware polygon simplification module uses dynamic programming to optimize an objective balancing the simplicity and fidelity of the polygons, achieving globally optimal solutions. The module's optimization objective is also integrated seamlessly into network training, improving the cohesiveness of the whole pipeline.

Link: https://arxiv.org/abs/2505.01385
Authors: Fahong Zhang, Yilei Shi, Xiao Xiang Zhu
Affiliations: Technical University of Munich; Munich Center for Machine Learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:This paper addresses the challenge of mapping polygonal buildings from remote sensing images and introduces a novel algorithm, the Global Collinearity-aware Polygonizer (GCP). GCP, built upon an instance segmentation framework, processes binary masks produced by any instance segmentation model. The algorithm begins by collecting polylines sampled along the contours of the binary masks. These polylines undergo a refinement process using a transformer-based regression module to ensure they accurately fit the contours of the targeted building instances. Subsequently, a collinearity-aware polygon simplification module simplifies these refined polylines and generate the final polygon representation. This module employs dynamic programming technique to optimize an objective function that balances the simplicity and fidelity of the polygons, achieving globally optimal solutions. Furthermore, the optimized collinearity-aware objective is seamlessly integrated into network training, enhancing the cohesiveness of the entire pipeline. The effectiveness of GCP has been validated on two public benchmarks for polygonal building mapping. Further experiments reveal that applying the collinearity-aware polygon simplification module to arbitrary polylines, without prior knowledge, enhances accuracy over traditional methods such as the Douglas-Peucker algorithm. This finding underscores the broad applicability of GCP. The code for the proposed method will be made available at this https URL.
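The dynamic-programming simplification step can be sketched as follows. This is a generic fidelity-versus-simplicity DP over polyline vertices; the per-segment error and the lambda penalty are illustrative stand-ins for GCP's collinearity-aware objective, not the paper's formulation.

```python
# Sketch: globally optimal polyline simplification via dynamic programming.
import math

def seg_error(pts, i, j):
    """Max perpendicular distance of pts[i+1:j] from segment pts[i]-pts[j]."""
    (x1, y1), (x2, y2) = pts[i], pts[j]
    dx, dy, norm = x2 - x1, y2 - y1, math.hypot(x2 - x1, y2 - y1)
    if norm == 0:
        return 0.0
    return max(
        (abs(dy * (px - x1) - dx * (py - y1)) / norm for px, py in pts[i + 1 : j]),
        default=0.0,
    )

def simplify(pts, lam=0.5):
    """Pick the vertex subset minimizing total error + lam * #segments."""
    n = len(pts)
    cost, back = [math.inf] * n, [0] * n
    cost[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            c = cost[i] + seg_error(pts, i, j) + lam  # lam penalizes each kept segment
            if c < cost[j]:
                cost[j], back[j] = c, i
    keep, j = [], n - 1
    while j > 0:                 # walk the optimal chain back to the start
        keep.append(j)
        j = back[j]
    return [pts[0]] + [pts[k] for k in reversed(keep)]

contour = [(0, 0), (1, 0.05), (2, -0.02), (3, 0), (3, 1), (3, 2), (0, 2)]
print(simplify(contour, lam=0.5))  # near-collinear noise vertices are dropped
```

Unlike greedy heuristics such as Douglas-Peucker, the DP explores all segmentations and returns the global optimum of the stated objective, which is the property the abstract emphasizes.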

[CV-4] Monitoring morphometric drift in lifelong learning segmentation of the spinal cord

【Quick Read】: This paper concerns the stability of morphometric measures as spinal cord segmentation models are continually updated, in particular maintaining reliable normative values derived from healthy participants when models are retrained on new datasets. The key to its solution is a lifelong learning framework that automatically monitors morphometric drift across model updates: an automated GitHub Actions workflow records the morphometric values derived from each model's predictions over time, providing a quick feedback loop for developing future segmentation models.

Link: https://arxiv.org/abs/2505.01364
Authors: Enamundram Naga Karthik, Sandrine Bédard, Jan Valošek, Christoph S. Aigner, Elise Bannier, Josef Bednařík, Virginie Callot, Anna Combes, Armin Curt, Gergely David, Falk Eippert, Lynn Farner, Michael G Fehlings, Patrick Freund, Tobias Granberg, Cristina Granziera, RHSCIR Network Imaging Group, Ulrike Horn, Tomáš Horák, Suzanne Humphreys, Markus Hupp, Anne Kerbrat, Nawal Kinany, Shannon Kolind, Petr Kudlička, Anna Lebret, Lisa Eunyoung Lee, Caterina Mainero, Allan R. Martin, Megan McGrath, Govind Nair, Kristin P. O’Grady, Jiwon Oh, Russell Ouellette, Nikolai Pfender, Dario Pfyffer, Pierre-François Pradat, Alexandre Prat, Emanuele Pravatà, Daniel S. Reich, Ilaria Ricchi, Naama Rotem-Kohavi, Simon Schading-Sassenhausen, Maryam Seif, Andrew Smith, Seth A Smith, Grace Sweeney, Roger Tam, Anthony Traboulsee, Constantina Andrada Treaba, Charidimos Tsagkas, Zachary Vavasour, Dimitri Van De Ville, Kenneth Arnold Weber II, Sarath Chandar, Julien Cohen-Adad
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Morphometric measures derived from spinal cord segmentations can serve as diagnostic and prognostic biomarkers in neurological diseases and injuries affecting the spinal cord. While robust, automatic segmentation methods to a wide variety of contrasts and pathologies have been developed over the past few years, whether their predictions are stable as the model is updated using new datasets has not been assessed. This is particularly important for deriving normative values from healthy participants. In this study, we present a spinal cord segmentation model trained on a multisite (n=75) dataset, including 9 different MRI contrasts and several spinal cord pathologies. We also introduce a lifelong learning framework to automatically monitor the morphometric drift as the model is updated using additional datasets. The framework is triggered by an automatic GitHub Actions workflow every time a new model is created, recording the morphometric values derived from the model’s predictions over time. As a real-world application of the proposed framework, we employed the spinal cord segmentation model to update a recently-introduced normative database of healthy participants containing commonly used measures of spinal cord morphometry. Results showed that: (i) our model outperforms previous versions and pathology-specific models on challenging lumbar spinal cord cases, achieving an average Dice score of 0.95 ± 0.03; (ii) the automatic workflow for monitoring morphometric drift provides a quick feedback loop for developing future segmentation models; and (iii) the scaling factor required to update the database of morphometric measures is nearly constant among slices across the given vertebral levels, showing minimum drift between the current and previous versions of the model monitored by the framework. The model is freely available in Spinal Cord Toolbox v7.0.

[CV-5] FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors

【Quick Read】: This paper addresses text-driven object insertion in 3D scenes. Existing methods rely on spatial priors such as 2D masks or 3D bounding boxes and struggle to keep inserted objects consistent, limiting flexibility and scalability in practice. The key to its solution is FreeInsert, a framework that leverages foundation models (multimodal large language models, language-image generative models, and diffusion models) to disentangle object generation from spatial placement, enabling unsupervised, flexible object insertion in 3D scenes without spatial priors.

Link: https://arxiv.org/abs/2505.01322
Authors: Chenxi Li, Weijie Wang, Qiang Li, Bruno Lepri, Nicu Sebe, Weizhi Nie
Affiliations: Tianjin University; University of Trento; Fondazione Bruno Kessler; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-driven object insertion in 3D scenes is an emerging task that enables intuitive scene editing through natural language. However, existing 2D editing-based methods often rely on spatial priors such as 2D masks or 3D bounding boxes, and they struggle to ensure consistency of the inserted object. These limitations hinder flexibility and scalability in real-world applications. In this paper, we propose FreeInsert, a novel framework that leverages foundation models including MLLMs, LGMs, and diffusion models to disentangle object generation from spatial placement. This enables unsupervised and flexible object insertion in 3D scenes without spatial priors. FreeInsert starts with an MLLM-based parser that extracts structured semantics, including object types, spatial relationships, and attachment regions, from user instructions. These semantics guide both the reconstruction of the inserted object for 3D consistency and the learning of its degrees of freedom. We leverage the spatial reasoning capabilities of MLLMs to initialize object pose and scale. A hierarchical, spatially aware refinement stage further integrates spatial semantics and MLLM-inferred priors to enhance placement. Finally, the appearance of the object is improved using the inserted-object image to enhance visual fidelity. Experimental results demonstrate that FreeInsert achieves semantically coherent, spatially precise, and visually realistic 3D insertions without relying on spatial priors, offering a user-friendly and flexible editing experience.

[CV-6] A Neural Architecture Search Method using Auxiliary Evaluation Metric based on ResNet Architecture GECCO2023

【Quick Read】: This paper considers how Neural Architecture Search (NAS) can effectively find high-performing network structures. The key to its solution is a search space built around the ResNet framework, taking the parameters of convolution, pooling, and fully connected layers, as well as the connectivity of the residual network, as search targets. Beyond recognition accuracy, the loss value on the validation set is used as a secondary optimization objective, improving search efficiency and model performance.

Link: https://arxiv.org/abs/2505.01313
Authors: Shang Wang, Huanrong Tang, Jianquan Ouyang
Affiliations: Xiangtan University
Categories: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
Comments: GECCO 2023

Abstract:This paper proposes a neural architecture search space using ResNet as a framework, with search objectives including parameters for convolution, pooling, fully connected layers, and connectivity of the residual network. In addition to recognition accuracy, this paper uses the loss value on the validation set as a secondary objective for optimization. The experimental results demonstrate that the search space of this paper together with the optimisation approach can find competitive network architectures on the MNIST, Fashion-MNIST and CIFAR100 datasets.

[CV-7] Diffusion-based Adversarial Purification from the Perspective of the Frequency Domain

【Quick Read】: This paper addresses the irreversible damage that adversarial purification causes to normal semantics: lacking distributional information about adversarial perturbations in the pixel domain, conventional diffusion-based purification tends to over-destroy image content and structure. The key to its solution is a frequency-domain perspective that decomposes the image into amplitude and phase spectra and observes that the damage caused by adversarial perturbations to both spectra increases monotonically with frequency, so the content and structural information of the original image can be extracted from the less-damaged low-frequency components. By treating the amplitude and phase spectra separately during the reverse process, the method removes adversarial perturbations while maximally preserving image content and structure.

Link: https://arxiv.org/abs/2505.01267
Authors: Gaozheng Pei, Ke Ma, Yingfei Sun, Qianqian Xu, Qingming Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The diffusion-based adversarial purification methods attempt to drown adversarial perturbations into a part of isotropic noise through the forward process, and then recover the clean images through the reverse process. Due to the lack of distribution information about adversarial perturbations in the pixel domain, it is often unavoidable to damage normal semantics. We turn to the frequency domain perspective, decomposing the image into amplitude spectrum and phase spectrum. We find that for both spectra, the damage caused by adversarial perturbations tends to increase monotonically with frequency. This means that we can extract the content and structural information of the original clean sample from the frequency components that are less damaged. Meanwhile, theoretical analysis indicates that existing purification methods indiscriminately damage all frequency components, leading to excessive damage to the image. Therefore, we propose a purification method that can eliminate adversarial perturbations while maximizing the preservation of the content and structure of the original image. Specifically, at each time step during the reverse process, for the amplitude spectrum, we replace the low-frequency components of the estimated image’s amplitude spectrum with the corresponding parts of the adversarial image. For the phase spectrum, we project the phase of the estimated image into a designated range of the adversarial image’s phase spectrum, focusing on the low frequencies. Empirical evidence from extensive experiments demonstrates that our method significantly outperforms most current defense methods.
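The per-step frequency operation can be sketched in NumPy as below; the low-frequency radius and the phase tolerance are illustrative hyperparameters, and the random arrays stand in for the reverse-process estimate and the adversarial input.

```python
# Sketch: swap low-frequency amplitudes and clamp low-frequency phases.
import numpy as np

def low_freq_mask(shape, radius=8):
    """Boolean mask selecting frequencies within `radius` of DC (centered)."""
    h, w = shape
    yy, xx = np.ogrid[:h, :w]
    return (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2

def frequency_guidance(estimate, adversarial, radius=8, phase_tol=0.3):
    F_est = np.fft.fftshift(np.fft.fft2(estimate))
    F_adv = np.fft.fftshift(np.fft.fft2(adversarial))
    amp, phase = np.abs(F_est), np.angle(F_est)
    mask = low_freq_mask(estimate.shape, radius)

    # Amplitude: copy the (less damaged) low-frequency amplitudes of the input.
    amp = np.where(mask, np.abs(F_adv), amp)

    # Phase: project low-frequency phases into a band around the input's phase.
    adv_phase = np.angle(F_adv)
    delta = np.angle(np.exp(1j * (phase - adv_phase)))        # wrapped difference
    proj = adv_phase + np.clip(delta, -phase_tol, phase_tol)  # clamp to tolerance
    phase = np.where(mask, proj, phase)

    F_new = np.fft.ifftshift(amp * np.exp(1j * phase))
    return np.real(np.fft.ifft2(F_new))

x_est = np.random.rand(64, 64)  # stand-in for the reverse-process estimate
x_adv = np.random.rand(64, 64)  # stand-in for the adversarial input
print(frequency_guidance(x_est, x_adv).shape)
```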

[CV-8] FlowDubber: Movie Dubbing with LLM -based Semantic-aware Learning and Flow Matching based Voice Enhancing

【Quick Read】: This paper addresses insufficient audio-visual synchronization and acoustic quality in movie dubbing; existing methods focus on reducing word error rate while neglecting lip-sync and audio quality. The key to its solution is FlowDubber, an LLM-based flow matching architecture that adopts Qwen2.5 as the LLM backbone and combines semantic-aware learning, dual contrastive alignment (DCA), and flow-based voice enhancement (FVE) to achieve high-quality audio-visual sync and accurate pronunciation while improving acoustic quality.

Link: https://arxiv.org/abs/2505.01263
Authors: Gaoxiang Cong, Liang Li, Jiadong Pan, Zhedong Zhang, Amin Beheshti, Anton van den Hengel, Yuankai Qi, Qingming Huang
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; Hangzhou Dianzi University; Macquarie University; University of Adelaide; UCAS
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning while achieving better acoustic quality via the proposed voice-enhanced flow matching than previous works. First, we introduce Qwen2.5 as the backbone of LLM to learn the in-context sequence from movie scripts and reference audio. Then, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level. Next, dual contrastive aligning (DCA) boosts mutual alignment with lip movement, reducing ambiguities where similar phonemes might be confused. Finally, the proposed Flow-based Voice Enhancing (FVE) improves acoustic quality in two aspects, which introduces an LLM-based acoustics flow matching guidance to strengthen clarity and uses affine style prior to enhance identity when recovering noise into mel-spectrograms via gradient vector field prediction. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks. The demos are available at this https URL.

[CV-9] CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking

【Quick Read】: This paper addresses the limited ability of online multi-object tracking to model complex interactions between tracking cues, a consequence of relying on hand-crafted rules for temporal association. The key to its solution is CAMEL (Context-Aware Multi-Cue ExpLoitation), a transformer-based association module that learns robust association strategies directly from data, breaking free from hand-crafted heuristics while retaining the modular advantages of Tracking-by-Detection (TbD). CAMEL employs two transformer-based modules and a novel association-centric training scheme to effectively model the complex interactions between tracked targets and their various association cues.

Link: https://arxiv.org/abs/2505.01257
Authors: Vladimir Somers, Baptiste Standaert, Victor Joos, Alexandre Alahi, Christophe De Vleeschouwer
Affiliations: UCLouvain; EPFL; Sportradar
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Online multi-object tracking has been recently dominated by tracking-by-detection (TbD) methods, where recent advances rely on increasingly sophisticated heuristics for tracklet representation, feature fusion, and multi-stage matching. The key strength of TbD lies in its modular design, enabling the integration of specialized off-the-shelf models like motion predictors and re-identification. However, the extensive usage of human-crafted rules for temporal associations makes these methods inherently limited in their ability to capture the complex interplay between various tracking cues. In this work, we introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation, that learns resilient association strategies directly from data, breaking free from hand-crafted heuristics while maintaining TbD’s valuable modularity. At its core, CAMEL employs two transformer-based modules and relies on a novel association-centric training scheme to effectively model the complex interactions between tracked targets and their various association cues. Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models. Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks. Our code is available at this https URL.

[CV-10] Fusing Foveal Fixations Using Linear Retinal Transformations and Bayesian Experimental Design

【Quick Read】: This paper addresses the problem faced by humans (and many vertebrates) of fusing multiple fixations of a scene into a representation of the whole, where each fixation uses a high-resolution fovea and progressively lower peripheral resolution. The key to its solution is to explicitly represent the retinal transformation of a fixation as a linear downsampling of a high-resolution latent image of the scene, exploiting the known geometry. This linear transformation enables exact inference for the latent variables in Factor Analysis (FA) and mixtures-of-FA models, and allows the choice of "where to look next" to be formulated and solved as a Bayesian experimental design problem under the Expected Information Gain criterion.

Link: https://arxiv.org/abs/2505.01249
Authors: Christopher K. I. Williams
Affiliations: University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 4 figures

Abstract:Humans (and many vertebrates) face the problem of fusing together multiple fixations of a scene in order to obtain a representation of the whole, where each fixation uses a high-resolution fovea and decreasing resolution in the periphery. In this paper we explicitly represent the retinal transformation of a fixation as a linear downsampling of a high-resolution latent image of the scene, exploiting the known geometry. This linear transformation allows us to carry out exact inference for the latent variables in factor analysis (FA) and mixtures of FA models of the scene. Further, this allows us to formulate and solve the choice of “where to look next” as a Bayesian experimental design problem using the Expected Information Gain criterion. Experiments on the Frey faces and MNIST datasets demonstrate the effectiveness of our models.

[CV-11] CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment CVPR2025

【Quick Read】: This paper targets three issues in audio-visual learning: temporal granularity mismatch between modalities, conflicting optimization objectives, and weak spatial localization. The keys to its solution are: first, treating audio as a temporal sequence aligned with video frames instead of relying on global representations, resolving the cross-modal granularity mismatch; second, separating the contrastive and reconstruction objectives through dedicated global tokens to ease the conflict between objectives; and third, introducing learnable register tokens that reduce the semantic load on patch tokens, improving spatial localization.

Link: https://arxiv.org/abs/2505.01237
Authors: Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass
Affiliations: Goethe University of Frankfurt; MIT; IBM Research; MIT-IBM Watson AI Lab; Tuebingen AI Center/University of Tuebingen
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: To be published at CVPR 2025, code available at this https URL

Abstract:Recent advances in audio-visual learning have shown promising results in learning representations across modalities. However, most approaches rely on global audio representations that fail to capture fine-grained temporal correspondences with visual frames. Additionally, existing methods often struggle with conflicting optimization objectives when trying to jointly learn reconstruction and cross-modal alignment. In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We address three key challenges: First, we tackle the granularity mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. Second, we resolve conflicting optimization goals by separating contrastive and reconstruction objectives through dedicated global tokens. Third, we improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens. We evaluate the proposed approach on AudioSet, VGG Sound, and the ADE20K Sound dataset on zero-shot retrieval, classification and localization tasks demonstrating state-of-the-art performance and outperforming more complex architectures.

[CV-12] Compensating Spatiotemporally Inconsistent Observations for Online Dynamic 3D Gaussian Splatting SIGGRAPH2025

【Quick Read】: This paper addresses insufficient temporal consistency in online dynamic scene reconstruction: when learning from live video streams, existing methods often produce noticeable artifacts in static regions. The paper identifies that errors such as noise in real-world recordings harm temporal consistency, and the key to its solution is a method that restores the ideal observation by subtracting the learned error, improving both the temporal consistency and the rendering quality of the reconstruction.

Link: https://arxiv.org/abs/2505.01235
Authors: Youngsik Yun, Jeongmin Bae, Hyunseung Son, Seoha Kim, Hahyun Lee, Gun Bang, Youngjung Uh
Affiliations: Yonsei University; Electronics and Telecommunications Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2025, Project page: this https URL

Abstract:Online reconstruction of dynamic scenes is significant as it enables learning scenes from live-streaming video inputs, while existing offline dynamic reconstruction methods rely on recorded video inputs. However, previous online reconstruction approaches have primarily focused on efficiency and rendering quality, overlooking the temporal consistency of their results, which often contain noticeable artifacts in static regions. This paper identifies that errors such as noise in real-world recordings affect temporal inconsistency in online reconstruction. We propose a method that enhances temporal consistency in online reconstruction from observations with temporal inconsistency which is inevitable in cameras. We show that our method restores the ideal observation by subtracting the learned error. We demonstrate that applying our method to various baselines significantly enhances both temporal consistency and rendering quality across datasets. Code, video results, and checkpoints are available at this https URL.

[CV-13] Core-Set Selection for Data-efficient Land Cover Segmentation

【Quick Read】: This paper addresses the complexity, potential bias and noise, and heavy computational cost of the large datasets that deep learning models for remote sensing image segmentation traditionally depend on. The key to its solution is six novel core-set selection methods that pick representative sample subsets from remote sensing segmentation datasets based on imagery only, labels only, or both, improving training effectiveness. Experiments show that subsets selected with these methods outperform a random-selection baseline on several common land cover classification datasets, and some methods even outperform training on all available data, underscoring the importance and potential of data-centric learning in remote sensing.

Link: https://arxiv.org/abs/2505.01225
Authors: Keiller Nogueira, Akram Zaytar, Wanli Ma, Ribana Roscher, Ronny Hänsch, Caleb Robinson, Anthony Ortiz, Simone Nsutezo, Rahul Dodhia, Juan M. Lavista Ferres, Oktay Karakuş, Paul L. Rosin
Affiliations: University of Liverpool; Microsoft AI for Good Research Lab; Cardiff University; Forschungszentrum Jülich GmbH; University of Bonn; German Aerospace Center (DLR)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The increasing accessibility of remotely sensed data and the potential of such data to inform large-scale decision-making has driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models must be trained on large datasets. However, the common assumption that broadly larger datasets lead to better outcomes tends to overlook the complexities of the data distribution, the potential for introducing biases and noise, and the computational resources required for processing and storing vast datasets. Therefore, effective solutions should consider both the quantity and quality of data. In this paper, we propose six novel core-set selection methods for selecting important subsets of samples from remote sensing image segmentation datasets that rely on imagery only, labels only, and a combination of each. We benchmark these approaches against a random-selection baseline on three commonly used land cover classification datasets: DFC2022, Vaihingen, and Potsdam. In each of the datasets, we demonstrate that training on a subset of samples outperforms the random baseline, and some approaches outperform training on all available data. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at this https URL.
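For intuition, here is one classic core-set baseline (k-center greedy over feature embeddings); the paper proposes six methods of its own, so this standard strategy is shown only to make the selection idea concrete.

```python
# Sketch: k-center greedy selection over image feature embeddings.
import numpy as np

def k_center_greedy(features: np.ndarray, budget: int, seed: int = 0) -> list:
    """Pick `budget` samples so every sample is close to some selected one."""
    rng = np.random.default_rng(seed)
    n = len(features)
    selected = [int(rng.integers(n))]  # arbitrary first center
    dists = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(np.argmax(dists))    # farthest point from the current set
        selected.append(nxt)
        # Keep, for each sample, its distance to the nearest selected center.
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return selected

feats = np.random.rand(1000, 128)      # e.g., pooled encoder features per image
subset = k_center_greedy(feats, budget=100)
print(len(subset), subset[:5])
```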

[CV-14] RD-UIE: Relation-Driven State Space Modeling for Underwater Image Enhancement

【Quick Read】: This paper addresses content degradation and color distortion in underwater image enhancement (UIE) caused by wavelength-dependent attenuation. State space models such as Mamba show promise for long-range dependency modeling, but their unfolding operations and fixed scan paths cannot adapt to local object semantics and global relation modeling, limiting their effectiveness in complex underwater environments. The key to this paper's solution is a sorting-based scanning mechanism that dynamically reorders scanning sequences to prioritize structural and semantic features, combined with a Visually Self-adaptive State Block (VSSB) that harmonizes dynamic sorting with input-dependent dynamic convolution, eliminating global focus bias and improving the coherent integration of global context with local relational cues.

Link: https://arxiv.org/abs/2505.01224
Authors: Kui Jiang, Yan Luo, Junjun Jiang, Xin Xu, Fei Ma, Fei Yu
Affiliations: Harbin Institute of Technology; Wuhan University of Science and Technology; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications, where wavelength-dependent attenuation causes severe content degradation and color distortion. While recent state space models like Mamba show potential for long-range dependency modeling, their unfolding operations and fixed scan paths on 1D sequences fail to adapt to local object semantics and global relation modeling, limiting their efficacy in complex underwater environments. To address this, we enhance conventional Mamba with the sorting-based scanning mechanism that dynamically reorders scanning sequences based on statistical distribution of spatial correlation of all pixels. In this way, it encourages the network to prioritize the most informative components–structural and semantic features. Upon building this mechanism, we devise a Visually Self-adaptive State Block (VSSB) that harmonizes dynamic sorting of Mamba with input-dependent dynamic convolution, enabling coherent integration of global context and local relational cues. This exquisite design helps eliminate global focus bias, especially for widely distributed contents, which greatly weakens the statistical frequency. For robust feature extraction and refinement, we design a cross-feature bridge (CFB) to adaptively fuse multi-scale representations. These efforts compose the novel relation-driven Mamba framework for effective UIE (RD-UIE). Extensive experiments on underwater enhancement benchmarks demonstrate RD-UIE outperforms the state-of-the-art approach WMamba in both quantitative metrics and visual fidelity, averagely achieving 0.55 dB performance gain on the three benchmarks. Our code is available at this https URL

[CV-15] High Dynamic Range Novel View Synthesis with Single Exposure ICML2025

【Quick Read】: This paper addresses the motion artifacts and high capture and storage costs that come from relying on multi-exposure low dynamic range (LDR) images in high dynamic range novel view synthesis (HDR-NVS). The key to its solution is to introduce, for the first time, the single-exposure HDR-NVS problem and propose Mono-HDR-3D, which contains two dedicated modules formulated by LDR image formation principles: one converting LDR colors to their HDR counterparts and one converting HDR images back to LDR format, enabling unsupervised learning in a closed loop.

Link: https://arxiv.org/abs/2505.01212
Authors: Kaixuan Zhang, Hu Wang, Minxian Li, Mingwu Ren, Mao Ye, Xiatian Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: It has been accepted by ICML 2025

Abstract:High Dynamic Range Novel View Synthesis (HDR-NVS) aims to establish a 3D scene HDR model from Low Dynamic Range (LDR) imagery. Typically, multiple-exposure LDR images are employed to capture a wider range of brightness levels in a scene, as a single LDR image cannot represent both the brightest and darkest regions simultaneously. While effective, this multiple-exposure HDR-NVS approach has significant limitations, including susceptibility to motion artifacts (e.g., ghosting and blurring), high capture and storage costs. To overcome these challenges, we introduce, for the first time, the single-exposure HDR-NVS problem, where only single exposure LDR images are available during training. We further introduce a novel approach, Mono-HDR-3D, featuring two dedicated modules formulated by the LDR image formation principles, one for converting LDR colors to HDR counterparts, and the other for transforming HDR images to LDR format so that unsupervised learning is enabled in a closed loop. Designed as a meta-algorithm, our approach can be seamlessly integrated with existing NVS models. Extensive experiments show that Mono-HDR-3D significantly outperforms previous methods. Source code will be released.

[CV-16] -Graph: Enhancing Sparse-view Camera Pose Estimation by Pairwise Translation Graph

【Quick Read】: This paper addresses the weak performance of sparse-view camera pose estimation, in particular the fact that existing methods overlook the translation information between pairs of viewpoints, leading to suboptimal results with limited views. The key to its solution is T-Graph, a lightweight, plug-and-play module that processes paired image features with a multilayer perceptron (MLP) and builds a fully connected translation graph whose nodes represent cameras and whose edges encode their translation relationships; two pairwise translation representations, relative-t and pair-t, further enhance the model's adaptability and robustness across application scenarios.

Link: https://arxiv.org/abs/2505.01207
Authors: Qingyu Xian, Weiqin Jiao, Hao Cheng, Berend Jan van der Zwaag, Yanqiu Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Sparse-view camera pose estimation, which aims to estimate the 6-Degree-of-Freedom (6-DoF) poses from a limited number of images captured from different viewpoints, is a fundamental yet challenging problem in remote sensing applications. Existing methods often overlook the translation information between each pair of viewpoints, leading to suboptimal performance in sparse-view scenarios. To address this limitation, we introduce T-Graph, a lightweight, plug-and-play module to enhance camera pose estimation in sparse-view settings. T-graph takes paired image features as input and maps them through a Multilayer Perceptron (MLP). It then constructs a fully connected translation graph, where nodes represent cameras and edges encode their translation relationships. It can be seamlessly integrated into existing models as an additional branch in parallel with the original prediction, maintaining efficiency and ease of use. Furthermore, we introduce two pairwise translation representations, relative-t and pair-t, formulated under different local coordinate systems. While relative-t captures intuitive spatial relationships, pair-t offers a rotation-disentangled alternative. The two representations contribute to enhanced adaptability across diverse application scenarios, further improving our module’s robustness. Extensive experiments on two state-of-the-art methods (RelPose++ and Forge) using public datasets (CO3D and IMC PhotoTourism) validate both the effectiveness and generalizability of T-Graph. The results demonstrate consistent improvements across various metrics, notably camera center accuracy, which improves by 1% to 6% from 2 to 8 viewpoints.

[CV-17] Efficient Vision-based Vehicle Speed Estimation

【Quick Read】: This paper addresses efficient vehicle speed estimation from traffic camera footage, where the core challenge is achieving real-time performance while maintaining high detection and speed estimation accuracy. The key to its solution is a set of improvements to a 3D bounding box method built on 2D detections and vanishing point geometry, together with post-training quantization of the models to strike the best balance between computational cost and accuracy for real-world, real-time deployment.

Link: https://arxiv.org/abs/2505.01203
Authors: Andrej Macko, Lukáš Gajdošech, Viktor Kocur
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to Journal of Real-Time Image Processing (JRTIP)

Abstract:This paper presents a computationally efficient method for vehicle speed estimation from traffic camera footage. Building upon previous work that utilizes 3D bounding boxes derived from 2D detections and vanishing point geometry, we introduce several improvements to enhance real-time performance. We evaluate our method in several variants on the BrnoCompSpeed dataset in terms of vehicle detection and speed estimation accuracy. Our extensive evaluation across various hardware platforms, including edge devices, demonstrates significant gains in frames per second (FPS) compared to the prior state-of-the-art, while maintaining comparable or improved speed estimation accuracy. We analyze the trade-off between accuracy and computational cost, showing that smaller models utilizing post-training quantization offer the best balance for real-world deployment. Our best performing model beats previous state-of-the-art in terms of median vehicle speed estimation error (0.58 km/h vs. 0.60 km/h), detection precision (91.02% vs 87.08%) and recall (91.14% vs. 83.32%) while also being 5.5 times faster.
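Post-training static quantization in PyTorch follows the calibrate-then-convert pattern sketched below; the tiny CNN and random calibration batches are placeholders, not the paper's detection models or their export path.

```python
# Sketch: post-training static quantization (calibrate, then convert to int8).
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # float -> int8 boundary at the input
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 2)
        self.dequant = tq.DeQuantStub()  # int8 -> float boundary at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.pool(self.relu(self.conv(x))).flatten(1)
        return self.dequant(self.fc(x))

model = TinyNet().eval()
model.qconfig = tq.get_default_qconfig("fbgemm")  # x86 CPU backend
prepared = tq.prepare(model)                      # insert activation observers
for _ in range(8):                                # calibration pass over sample data
    prepared(torch.randn(4, 3, 32, 32))
quantized = tq.convert(prepared)                  # fold observers into int8 kernels
print(quantized(torch.randn(1, 3, 32, 32)).shape)
```

Because no retraining is involved, PTQ trades a small accuracy drop for substantially smaller and faster models, which matches the accuracy-versus-FPS trade-off the abstract reports on edge devices.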

[CV-18] STMotion: Training-free Scene-awarenText-to-motion Generation ICME2025

【Quick Read】: This paper addresses generating motion sequences that match text descriptions within diverse 3D scenes; conventional approaches depend on large-scale ground-truth motion sequences in diverse 3D scenes, which are costly to obtain in practice. The key to its solution is TSTMotion, a Training-free Scene-aware Text-to-Motion framework that uses foundation models together to reason, predict, and validate scene-aware motion guidance and incorporates that guidance into pre-trained blank-background motion generators, yielding scene-aware, text-driven motion sequences.

Link: https://arxiv.org/abs/2505.01182
Authors: Ziyan Guo, Haoxuan Qu, Hossein Rahmani, Dewen Soh, Ping Hu, Qiuhong Ke, Jun Liu
Affiliations: Singapore University of Technology and Design; Lancaster University; University of Electronic Science and Technology of China; Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICME2025

Abstract:Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which poses practical challenges due to the expensive cost. To mitigate this challenge, we are the first to propose a Training-free Scene-aware Text-to-Motion framework, dubbed TSTMotion, that efficiently empowers pre-trained blank-background motion generators with the scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we adopt foundation models together to reason, predict and validate a scene-aware motion guidance. Then, the motion guidance is incorporated into the blank-background motion generators with two modifications, resulting in scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework. We release our code on the project page: this https URL.

[CV-19] FreePCA: Integrating Consistency Information across Long-short Frames in Training-free Long Video Generation via Principal Component Analysis CVPR2025

【Quick Read】: This paper addresses the distribution shift caused by varying frame counts in long video generation, which degrades the visual quality and motion consistency of generated videos. The key to its solution is to apply Principal Component Analysis (PCA) to precisely decouple global and local information into consistent appearance and motion intensity information, enabling a refined, complementary integration of global consistency and local quality.

Link: https://arxiv.org/abs/2505.01172
Authors: Jiangtong Tan, Hu Yu, Jie Huang, Jie Xiao, Feng Zhao
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025

Abstract:Long video generation involves generating extended videos using models trained on short videos, suffering from distribution shifts due to varying frame counts. It necessitates the use of local information from the original short frames to enhance visual and motion quality, and global information from the entire long frames to ensure appearance consistency. Existing training-free methods struggle to effectively integrate the benefits of both, as appearance and motion in videos are closely coupled, leading to motion inconsistency and visual quality. In this paper, we reveal that global and local information can be precisely decoupled into consistent appearance and motion intensity information by applying Principal Component Analysis (PCA), allowing for refined complementary integration of global consistency and local quality. With this insight, we propose FreePCA, a training-free long video generation paradigm based on PCA that simultaneously achieves high consistency and quality. Concretely, we decouple consistent appearance and motion intensity features by measuring cosine similarity in the principal component space. Critically, we progressively integrate these features to preserve original quality and ensure smooth transitions, while further enhancing consistency by reusing the mean statistics of the initial noise. Experiments demonstrate that FreePCA can be applied to various video diffusion models without requiring training, leading to substantial improvements. Code is available at this https URL.
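A rough NumPy sketch of the decoupling idea follows: the temporal mean acts as the consistent appearance component, PCA of the per-frame residual spans the motion-intensity directions, and cosine similarity between consecutive frames' PC coordinates gives a consistency signal. The shapes and random features are assumptions; FreePCA applies this kind of analysis inside a video diffusion model's feature space, not to raw arrays like these.

```python
# Sketch: PCA-based appearance/motion decoupling of per-frame features.
import numpy as np

frames = np.random.rand(16, 512)      # (num_frames, feature_dim) stand-in

# Consistent appearance: the component shared by all frames.
appearance = frames.mean(axis=0)
residual = frames - appearance        # per-frame variation

# Principal components of the variation span the "motion" subspace.
_, _, components = np.linalg.svd(residual, full_matrices=False)
motion_coords = residual @ components.T

# Cosine similarity between consecutive frames' PC coordinates: a motion
# consistency signal that long- and short-window statistics can be blended on.
a, b = motion_coords[:-1], motion_coords[1:]
cos_sim = (a * b).sum(axis=1) / (
    np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
)
print("motion intensity per frame:", np.round(np.linalg.norm(motion_coords, axis=1), 2))
print("frame-to-frame cosine similarity:", np.round(cos_sim, 2))

# Exact recombination: appearance + motion reconstructs the original features.
recon = appearance + motion_coords @ components
assert np.allclose(recon, frames)
```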

[CV-20] NeuroLoc: Encoding Navigation Cells for 6-DOF Camera Localization

【Quick Read】: This paper addresses the scene ambiguity, environmental disturbances, and dynamic object transformations that hamper camera localization during autonomous navigation in unknown environments. The key to its solution, NeuroLoc, borrows the navigation mechanisms of biological brains (grid cells, place cells, and head direction cells): a Hebbian learning module driven by place cells stores and replays historical information to resolve scene fuzziness; head direction cell-inspired internal direction learning serves as a multi-head attention embedding to recover the true orientation in similar scenes; and a 3D grid center prediction in the pose regression module reduces erroneous predictions, improving robustness and pose regression in complex environments.

Link: https://arxiv.org/abs/2505.01113
Authors: Xun Li, Jian Yang, Fenli Jia, Muyu Wang, Qi Wu, Jun Wu, Jinpeng Mi, Jilin Hu, Peidong Liang, Xuan Tang, Ke Li, Xiong You, Xian Wei
Affiliations: Software Engineering Institute, East China Normal University; School of Geospatial Information, Information Engineering University; University of Shanghai for Science and Technology; Fujian (Quanzhou) Institute of Advanced Manufacturing Technology, China
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Recently, camera localization has been widely adopted in autonomous robotic navigation due to its efficiency and convenience. However, autonomous navigation in unknown environments often suffers from scene ambiguity, environmental disturbances, and dynamic object transformation in camera localization. To address this problem, inspired by the biological brain navigation mechanism (such as grid cells, place cells, and head direction cells), we propose a novel neurobiological camera location method, namely NeuroLoc. Firstly, we designed a Hebbian learning module driven by place cells to save and replay historical information, aiming to restore the details of historical representations and solve the issue of scene fuzziness. Secondly, we utilized the head direction cell-inspired internal direction learning as multi-head attention embedding to help restore the true orientation in similar scenes. Finally, we added a 3D grid center prediction in the pose regression module to reduce the final wrong prediction. We evaluate the proposed NeuroLoc on commonly used benchmark indoor and outdoor datasets. The experimental results show that our NeuroLoc can enhance the robustness in complex environments and improve the performance of pose regression by using only a single image.

[CV-21] Self-Supervision Enhances Instance-based Multiple Instance Learning Methods in Digital Pathology: A Benchmark Study

【Quick Read】: This paper asks how to choose the better Multiple Instance Learning (MIL) approach for whole slide image (WSI) classification. Instance-based MIL methods are naturally more interpretable, but embedding-based MIL methods have historically been preferred for their robustness to weak feature extractors. With the progress of self-supervised learning (SSL), feature embedding quality has improved dramatically; through large-scale experiments, the paper shows that with a good SSL feature extractor, simple instance-based MIL methods with very few parameters can match or surpass complex embedding-based methods, suggesting the key is to develop well-adapted SSL methods for WSI rather than ever more complex embedding-based MIL architectures.

Link: https://arxiv.org/abs/2505.01109
Authors: Ali Mammadov, Loic Le Folgoc, Julien Adam, Anne Buronfosse, Gilles Hayem, Guillaume Hocquet, Pietro Gori
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in the Journal of Medical Imaging (SPIE)

Abstract:Multiple Instance Learning (MIL) has emerged as the best solution for Whole Slide Image (WSI) classification. It consists of dividing each slide into patches, which are treated as a bag of instances labeled with a global label. MIL includes two main approaches: instance-based and embedding-based. In the former, each patch is classified independently, and then the patch scores are aggregated to predict the bag label. In the latter, bag classification is performed after aggregating patch embeddings. Even if instance-based methods are naturally more interpretable, embedding-based MILs have usually been preferred in the past due to their robustness to poor feature extractors. However, recently, the quality of feature embeddings has drastically increased using self-supervised learning (SSL). Nevertheless, many authors continue to endorse the superiority of embedding-based MIL. To investigate this further, we conduct 710 experiments across 4 datasets, comparing 10 MIL strategies, 6 self-supervised methods with 4 backbones, 4 foundation models, and various pathology-adapted techniques. Furthermore, we introduce 4 instance-based MIL methods never used before in the pathology domain. Through these extensive experiments, we show that with a good SSL feature extractor, simple instance-based MILs, with very few parameters, obtain similar or better performance than complex, state-of-the-art (SOTA) embedding-based MIL methods, setting new SOTA results on the BRACS and Camelyon16 datasets. Since simple instance-based MIL methods are naturally more interpretable and explainable to clinicians, our results suggest that more effort should be put into well-adapted SSL methods for WSI rather than into complex embedding-based MIL methods.
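A minimal PyTorch sketch of instance-based MIL follows: each patch embedding is scored independently by a tiny classifier, and the patch scores are aggregated (here, the mean of the top-k) into the slide-level logit. The feature dimension and pooling rule are illustrative assumptions, not one of the paper's specific variants.

```python
# Sketch: instance-based MIL with per-patch scoring and top-k aggregation.
import torch
import torch.nn as nn

class InstanceMIL(nn.Module):
    def __init__(self, feat_dim=384, top_k=8):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)  # very few parameters
        self.top_k = top_k

    def forward(self, bag):                   # bag: (num_patches, feat_dim)
        patch_logits = self.scorer(bag).squeeze(-1)       # per-patch scores
        k = min(self.top_k, patch_logits.numel())
        slide_logit = patch_logits.topk(k).values.mean()  # aggregate to the bag
        return slide_logit, patch_logits  # patch scores keep interpretability

model = InstanceMIL()
bag = torch.randn(500, 384)  # e.g., 500 frozen SSL patch embeddings from one WSI
slide_logit, patch_logits = model(bag)
print(slide_logit.shape, patch_logits.shape)
```

Because the per-patch scores are produced before aggregation, they can be mapped back onto the slide as a heatmap, which is the interpretability advantage the abstract highlights.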

[CV-22] VSC: Visual Search Compositional Text-to-Image Diffusion Model

【Quick Read】: This paper addresses the difficulty text-to-image diffusion models have in accurately binding attributes to their corresponding objects (attribute-object binding), especially for prompts containing multiple attribute-object pairs. The problem stems largely from commonly used text encoders such as CLIP, which struggle to encode complex linguistic relationships and modifiers, so performance drops as prompt complexity grows. The key to this work is a compositional generation method based on pairwise image embeddings: complex prompts are decomposed into sub-prompts, corresponding images are generated, and visual prototypes are computed and fused with the text embeddings to enhance the representation, combined with segmentation-based localization training to fix cross-attention misalignment and improve the accuracy of binding multiple attributes to objects.

Link: https://arxiv.org/abs/2505.01104
Authors: Do Huu Dat, Nam Hyeonu, Po-Yuan Mao, Tae-Hyun Oh
Affiliations: VinUniversity; POSTECH; Academia Sinica; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts containing multiple attribute-object pairs. This challenge primarily arises from the limitations of commonly used text encoders, such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches have attempted to mitigate these issues through attention map control during inference and the use of layout information or fine-tuning during training, yet they face performance drops with increased prompt complexity. In this work, we introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. By applying segmentation-based localization training, we address cross-attention misalignment, achieving improved accuracy in binding multiple attributes to objects. Our approaches outperform existing compositional text-to-image diffusion models on the benchmark T2I CompBench, achieving better image quality, evaluated by humans, and emerging robustness under scaling number of binding pairs in the prompt.
zh
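
以下是一段概念性的 Python(PyTorch)示意代码,演示摘要中"将复杂提示分解为子提示、计算视觉原型并与文本嵌入融合"这一步骤的可能形式。函数名、嵌入维度与融合系数 alpha 均为假设,仅用于说明融合机制,并非论文官方实现。

```python
import torch

def fuse_visual_prototype(text_emb, sub_image_embs, alpha=0.3):
    # 视觉原型:对各子提示生成图像的嵌入取平均
    prototype = torch.stack(sub_image_embs).mean(dim=0)
    prototype = prototype / prototype.norm(dim=-1, keepdim=True)
    # 以加权残差方式与文本嵌入融合(alpha 为假设的融合系数)
    fused = (1 - alpha) * text_emb + alpha * prototype
    return fused / fused.norm(dim=-1, keepdim=True)

# 用法示意:假设嵌入维度为 768,提示被拆成两个子提示
text_emb = torch.randn(768)
sub_image_embs = [torch.randn(768), torch.randn(768)]
print(fuse_visual_prototype(text_emb, sub_image_embs).shape)  # torch.Size([768])
```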

[CV-23] Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation

【速读】:该论文旨在解决将生成式AI模型适配到医疗领域所面临的独特挑战,特别是医学数据的复杂性和对临床准确性的严格需求。其解决方案的关键在于提出一个专门针对多模态医学数据生成的框架,该框架能够生成多视角胸部X光图像及其相关临床报告,从而弥合通用视觉-语言模型与医疗专业需求之间的差距。通过利用MIMIC-CXR数据集,该框架在生成高质量图像和语义连贯报告方面表现出色,并在下游疾病分类任务中展现出与真实数据相当甚至更优的性能。

链接: https://arxiv.org/abs/2505.01091
作者: Daniele Molino,Francesco di Feola,Linlin Shen,Paolo Soda,Valerio Guarrasi
机构: Università Campus Bio-Medico di Roma(罗马大学校园生物医学大学); Umeå University(于默奥大学); Shenzhen University(深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: arXiv admin note: substantial text overlap with arXiv:2501.04614

点击查看摘要

Abstract:Generative models have revolutionized Artificial Intelligence (AI), particularly in multimodal applications. However, adapting these models to the medical domain poses unique challenges due to the complexity of medical data and the stringent need for clinical accuracy. In this work, we introduce a framework specifically designed for multimodal medical data generation. By enabling the generation of multi-view chest X-rays and their associated clinical report, it bridges the gap between general-purpose vision-language models and the specialized requirements of healthcare. Leveraging the MIMIC-CXR dataset, the proposed framework shows superior performance in generating high-fidelity images and semantically coherent reports. Our quantitative evaluation reveals significant results in terms of FID and BLEU scores, showcasing the quality of the generated data. Notably, our framework achieves comparable or even superior performance compared to real data on downstream disease classification tasks, underlining its potential as a tool for medical research and diagnostics. This study highlights the importance of domain-specific adaptations in enhancing the relevance and utility of generative models for clinical applications, paving the way for future advancements in synthetic multimodal medical data generation.
zh

[CV-24] Improving Editability in Image Generation with Layer-wise Memory CVPR2025

【速读】:该论文旨在解决多对象图像编辑中序列编辑的挑战,即在多次修改过程中保持已有编辑结果并自然融入新对象的问题(sequential image editing)。其解决方案的关键在于提出两种核心方法:一是支持粗略掩码输入以保留现有内容并自然整合新元素,二是通过分层记忆机制存储先前编辑的潜在表示和提示嵌入,实现跨多次修改的一致性编辑。此外,还引入了背景一致性引导和跨注意力中的多查询解缠机制,以确保场景连贯性和对现有内容的自然适应。

链接: https://arxiv.org/abs/2505.01079
作者: Daneul Kim,Jaeah Lee,Jaesik Park
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: CVPR 2025. Project page : this https URL

点击查看摘要

Abstract:Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing, especially with maintaining previous edits while adapting new objects naturally into the existing content. These limitations significantly hinder complex editing scenarios where multiple objects need to be modified while preserving their contextual relationships. We address this fundamental challenge through two key proposals: enabling rough mask inputs that preserve existing content while naturally integrating new elements and supporting consistent editing across multiple modifications. Our framework achieves this through layer-wise memory, which stores latent representations and prompt embeddings from previous edits. We propose Background Consistency Guidance that leverages memorized latents to maintain scene coherence and Multi-Query Disentanglement in cross-attention that ensures natural adaptation to existing content. To evaluate our method, we present a new benchmark dataset incorporating semantic alignment metrics and interactive editing scenarios. Through comprehensive experiments, we demonstrate superior performance in iterative image editing tasks with minimal user effort, requiring only rough masks while maintaining high-quality results throughout multiple editing steps.
zh
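
下面的示意代码(非官方实现)用一个极简的缓存类说明"分层记忆 + 背景一致性引导"的基本思想:记录历次编辑的潜变量与提示嵌入,并在粗略掩码之外回填上一次编辑的潜变量以维持背景一致。类名与张量形状均为假设。

```python
import torch

class LayerwiseMemory:
    """按编辑步骤缓存潜变量与提示嵌入的极简示意(非论文官方实现)。"""
    def __init__(self):
        self.latents, self.prompt_embs = [], []

    def store(self, latent, prompt_emb):
        self.latents.append(latent.detach())
        self.prompt_embs.append(prompt_emb.detach())

    def background_guidance(self, current_latent, rough_mask):
        # 掩码内保留当前编辑结果,掩码外回填上一次编辑的潜变量
        if not self.latents:
            return current_latent
        prev = self.latents[-1]
        return rough_mask * current_latent + (1 - rough_mask) * prev

memory = LayerwiseMemory()
latent = torch.randn(1, 4, 64, 64)          # 假设的潜变量形状
memory.store(latent, torch.randn(77, 768))  # 假设的提示嵌入形状
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
out = memory.background_guidance(torch.randn(1, 4, 64, 64), mask)
```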

[CV-25] Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs NEURIPS2024

【速读】:该论文试图解决在缺乏标注数据的场景下,进行细粒度视觉识别(Fine-grained Visual Recognition, FGVR)的问题,特别是在医学影像等领域中,由于隐私和标注成本等问题导致无法获取专家标注的大型数据集。传统FGVR模型依赖预定义的标签集,而在这种情况下无法适用,因此需要一种能够在无约束输出空间中预测标签的方法,即无词汇表细粒度视觉识别(Vocabulary-Free FGVR, VF-FGVR)。解决方案的关键在于提出一种名为Nearest-Neighbor Label Refinement (NeaR) 的新方法,该方法通过使用多模态大语言模型(Multimodal Large Language Models, MLLMs)生成的标签来微调下游CLIP模型,构建一个弱监督数据集,从而有效处理MLLM生成标签中的噪声、随机性和开放性问题,并为高效的VF-FGVR建立新的基准。

链接: https://arxiv.org/abs/2505.01064
作者: Hari Chandana Kuchibhotla,Sai Srinivas Kancheti,Abbavaram Gowtham Reddy,Vineeth N Balasubramanian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: preprint; earlier version accepted at NeurIPS 2024 Workshop on Adaptive Foundation Models

点击查看摘要

Abstract:Fine-grained Visual Recognition (FGVR) involves distinguishing between visually similar categories, which is inherently challenging due to subtle inter-class differences and the need for large, expert-annotated datasets. In domains like medical imaging, such curated datasets are unavailable due to issues like privacy concerns and high annotation costs. In such scenarios lacking labeled data, an FGVR model cannot rely on a predefined set of training labels, and hence has an unconstrained output space for predictions. We refer to this task as Vocabulary-Free FGVR (VF-FGVR), where a model must predict labels from an unconstrained output space without prior label information. While recent Multimodal Large Language Models (MLLMs) show potential for VF-FGVR, querying these models for each test input is impractical because of high costs and prohibitive inference times. To address these limitations, we introduce Nearest-Neighbor Label Refinement (NeaR), a novel approach that fine-tunes a downstream CLIP model using labels generated by an MLLM. Our approach constructs a weakly supervised dataset from a small, unlabeled training set, leveraging MLLMs for label generation. NeaR is designed to handle the noise, stochasticity, and open-endedness inherent in labels generated by MLLMs, and establishes a new benchmark for efficient VF-FGVR.
zh
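
摘要中的"近邻标签精炼"思想可以用如下概念示意来理解:对 MLLM 生成的噪声标签,在嵌入空间中取 k 近邻做多数投票修正。以下代码为假设性草图,与论文的具体精炼策略可能不同。

```python
import numpy as np
from collections import Counter

def near_label_refine(embs, noisy_labels, k=5):
    """用近邻多数投票修正 MLLM 生成的噪声标签(概念示意)。"""
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = embs @ embs.T
    np.fill_diagonal(sim, -np.inf)          # 排除自身
    refined = []
    for i in range(len(noisy_labels)):
        nn_idx = np.argsort(sim[i])[-k:]    # k 个最相似样本
        votes = [noisy_labels[j] for j in nn_idx] + [noisy_labels[i]]
        refined.append(Counter(votes).most_common(1)[0][0])
    return refined

embs = np.random.randn(100, 512)                        # 假设的 CLIP 图像嵌入
labels = list(np.random.choice(["sparrow", "finch"], 100))
print(near_label_refine(embs, labels)[:5])
```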

[CV-26] GeloVec: Higher Dimensional Geometric Smoothing for Coherent Visual Feature Extraction in Image Segmentation

【速读】:该论文试图解决传统语义分割方法在特征映射过程中存在的边界不稳定和上下文不连续问题。其解决方案的关键在于提出了一种基于卷积神经网络(CNN)的注意力平滑框架GeloVec,该框架通过高维几何平滑方法建立视觉连贯区域之间的鲁棒流形关系,并结合改进的切比雪夫距离度量与多空间变换,以稳定特征提取提升分割精度。核心创新在于自适应采样权重系统,该系统在n维特征空间中计算几何距离,实现了优异的边缘保持同时维持类内同质性。

链接: https://arxiv.org/abs/2505.01057
作者: Boris Kriuk,Matey Yordanov
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 3 figures, 3 tables

点击查看摘要

Abstract:This paper introduces GeloVec, a new CNN-based attention smoothing framework for semantic segmentation that addresses critical limitations in conventional approaches. While existing attention-backed segmentation methods suffer from boundary instability and contextual discontinuities during feature mapping, our framework implements a higher-dimensional geometric smoothing method to establish robust manifold relationships between visually coherent regions. GeloVec combines modified Chebyshev distance metrics with multispatial transformations to enhance segmentation accuracy through stabilized feature extraction. The core innovation lies in the adaptive sampling weights system that calculates geometric distances in n-dimensional feature space, achieving superior edge preservation while maintaining intra-class homogeneity. The multispatial transformation matrix incorporates tensorial projections with orthogonal basis vectors, creating more discriminative feature representations without sacrificing computational efficiency. Experimental validation across multiple benchmark datasets demonstrates significant improvements in segmentation performance, with mean Intersection over Union (mIoU) gains of 2.1%, 2.7%, and 2.4% on Caltech Birds-200, LSDSC, and FSSD datasets respectively compared to state-of-the-art methods. GeloVec’s mathematical foundation in Riemannian geometry provides theoretical guarantees on segmentation stability. Importantly, our framework maintains computational efficiency through parallelized implementation of geodesic transformations and exhibits strong generalization capabilities across disciplines due to the absence of information loss during transformations.
zh
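
下面用 NumPy 给出"基于切比雪夫距离的自适应采样权重"的最简示意:在 n 维特征空间中,与中心特征的切比雪夫距离越小的邻域获得越高权重。温度参数 tau 与特征维度均为假设,并非论文原始设定。

```python
import numpy as np

def adaptive_sampling_weights(center_feat, neighbor_feats, tau=1.0):
    """基于切比雪夫距离的自适应采样权重(概念示意,tau 为假设的温度参数)。"""
    # 切比雪夫距离:各特征维度差值的最大值
    d = np.max(np.abs(neighbor_feats - center_feat), axis=1)
    w = np.exp(-d / tau)
    return w / w.sum()   # softmax 形式归一化

center = np.random.randn(64)            # 中心像素的 n 维特征
neighbors = np.random.randn(8, 64)      # 8 邻域像素特征
print(adaptive_sampling_weights(center, neighbors))
```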

[CV-27] Transferable Adversarial Attacks on Black-Box Vision-Language Models

【速读】:该论文试图解决视觉大语言模型(Vision Large Language Models, VLLMs)在面对对抗样本时的安全性问题,特别是针对其在多模态输入(文本与图像结合)场景下的脆弱性。研究发现,针对文本或视觉单一模态的对抗攻击可以迁移至专有黑盒VLLMs,而该论文进一步揭示了目标性对抗样本在多种主流VLLMs中的高度可迁移性。解决方案的关键在于识别并利用通用扰动(universal perturbations),这些扰动能够对大量图像进行修改,从而诱导模型产生特定的错误解释,如将危险内容误判为安全、忽略敏感信息或生成符合攻击者意图的错误响应。研究结果表明,这一漏洞广泛存在于当前最先进的VLLMs中,凸显了构建鲁棒防御机制的紧迫性。

链接: https://arxiv.org/abs/2505.01050
作者: Kai Hu,Weichen Yu,Li Zhang,Alexander Robey,Andy Zou,Chengming Xu,Haoqi Hu,Matt Fredrikson
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision Large Language Models (VLLMs) are increasingly deployed to offer advanced capabilities on inputs comprising both text and images. While prior research has shown that adversarial attacks can transfer from open-source to proprietary black-box models in text-only and vision-only contexts, the extent and effectiveness of such vulnerabilities remain underexplored for VLLMs. We present a comprehensive analysis demonstrating that targeted adversarial examples are highly transferable to widely-used proprietary VLLMs such as GPT-4o, Claude, and Gemini. We show that attackers can craft perturbations to induce specific attacker-chosen interpretations of visual information, such as misinterpreting hazardous content as safe, overlooking sensitive or restricted material, or generating detailed incorrect responses aligned with the attacker’s intent. Furthermore, we discover that universal perturbations – modifications applicable to a wide set of images – can consistently induce these misinterpretations across multiple proprietary VLLMs. Our experimental results on object recognition, visual question answering, and image captioning show that this vulnerability is common across current state-of-the-art models, and underscore an urgent need for robust mitigations to ensure the safe and secure deployment of VLLMs.
zh

[CV-28] Edge Detection based on Channel Attention and Inter-region Independence Test

【速读】:该论文旨在解决现有边缘检测方法在高精度工业场景中因噪声放大和非显著细节过度保留而导致的应用限制问题。其解决方案的关键在于提出CAM-EDIT框架,该框架结合了通道注意力机制(Channel Attention Mechanism, CAM)与基于独立性检验的边缘检测(Edge Detection via Independence Testing, EDIT)。CAM模块通过多通道融合自适应增强区分性边缘特征,而EDIT模块则利用区域统计独立性分析(采用Fisher精确检验和卡方检验)抑制无关特征,从而提升检测精度与噪声鲁棒性。

链接: https://arxiv.org/abs/2505.01040
作者: Ru-yu Yan,Da-Qing Zhang
机构: University of Science and Technology Liaoning(辽宁科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing edge detection methods often suffer from noise amplification and excessive retention of non-salient details, limiting their applicability in high-precision industrial scenarios. To address these challenges, we propose CAM-EDIT, a novel framework that integrates Channel Attention Mechanism (CAM) and Edge Detection via Independence Testing (EDIT). The CAM module adaptively enhances discriminative edge features through multi-channel fusion, while the EDIT module employs region-wise statistical independence analysis (using Fisher’s exact test and chi-square test) to suppress uncorrelated features. Comprehensive experiments on BSDS500 and NYUDv2 datasets demonstrate state-of-the-art performance. Among the nine comparison algorithms, the F-measure scores of CAM-EDIT are 0.635 and 0.460, representing improvements of 19.2% to 26.5% over traditional methods (Canny, CannySR), and better than the latest learning-based methods (TIP2020, MSCNGP). Noise robustness evaluations further reveal a 2.2% PSNR improvement under Gaussian noise compared to baseline methods. Qualitative results exhibit cleaner edge maps with reduced artifacts, demonstrating its potential for high-precision industrial applications.
zh
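
摘要提到的区域独立性检验可以用 scipy 的卡方检验直观示意:比较两个相邻区域的梯度幅值直方图是否统计相关。以下仅为概念草图(实际方法还结合了 Fisher 精确检验与通道注意力),显著性水平与分箱数均为假设。

```python
import numpy as np
from scipy.stats import chi2_contingency

def regions_dependent(region_a, region_b, bins=8, alpha=0.05):
    """用卡方检验判断两个相邻区域的梯度分布是否统计相关(概念示意)。"""
    edges = np.linspace(0.0, 1.0, bins + 1)
    h_a, _ = np.histogram(region_a, bins=edges)
    h_b, _ = np.histogram(region_b, bins=edges)
    table = np.vstack([h_a, h_b]) + 1       # +1 平滑,避免期望频数为零
    _, p, _, _ = chi2_contingency(table)
    return p < alpha    # p 值小:分布显著相关,可能跨越真实边缘

a = np.random.rand(400)        # 假设的区域 A 归一化梯度幅值
b = np.random.rand(400) ** 3   # 分布不同的区域 B
print(regions_dependent(a, b))
```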

[CV-29] Edge-preserving Image Denoising via Multi-scale Adaptive Statistical Independence Testing

【速读】:该论文旨在解决传统边缘检测方法在生成过于详细的边缘图时影响清晰度的问题,以及固定窗口统计检验中存在的尺度不匹配和计算冗余问题。其解决方案的关键在于提出一种基于多尺度自适应独立性检验的边缘检测与去噪方法(Multi-scale Adaptive Independence Testing-based Edge Detection and Denoising, EDD-MAIT),该方法结合了通道注意力机制与独立性检验,并采用梯度驱动的自适应窗口策略,动态调整窗口大小以提升细节保留和噪声抑制效果。

链接: https://arxiv.org/abs/2505.01032
作者: Ruyu Yan,Da-Qing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Edge detection is crucial in image processing, but existing methods often produce overly detailed edge maps, affecting clarity. Fixed-window statistical testing faces issues like scale mismatch and computational redundancy. To address these, we propose EDD-MAIT, a novel Multi-scale Adaptive Independence Testing-based Edge Detection and Denoising method that integrates a channel attention mechanism with independence testing. A gradient-driven adaptive window strategy adjusts window sizes dynamically, improving detail preservation and noise suppression. EDD-MAIT achieves better robustness, accuracy, and efficiency, outperforming traditional and learning-based methods on BSDS500 and BIPED datasets, with improvements in F-score, MSE, PSNR, and reduced runtime. It also shows robustness against Gaussian noise, generating accurate and clean edge maps in noisy environments.
zh

[CV-30] Fine-Tuning Without Forgetting: Adaptation of YOLOv8 Preserves COCO Performance

【速读】:该论文试图解决预训练目标检测模型在细粒度领域适应过程中如何平衡模型专精化与保持原有通用能力的问题,具体而言,是探究在不引发灾难性遗忘的情况下,应将预训练主干网络微调到何种深度以优化特定任务性能。解决方案的关键在于系统性地评估不同微调深度对模型性能的影响,通过逐步解冻主干网络层(冻结点分别为第22层、第15层和第10层),并在目标细粒度水果检测数据集及原始COCO验证集上进行严格评估,结果表明深入微调主干网络(解冻至第10层)能够显著提升细粒度任务性能,同时几乎不会影响模型在通用基准上的表现,从而证明了在不产生灾难性遗忘的前提下,利用中后期主干特征进行专精化是有效的。

链接: https://arxiv.org/abs/2505.01016
作者: Vishal Gandhi,Sagar Gandhi
机构: Joyspace AI(乔伊斯空间人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The success of large pre-trained object detectors hinges on their adaptability to diverse downstream tasks. While fine-tuning is the standard adaptation method, specializing these models for challenging fine-grained domains necessitates careful consideration of feature granularity. The critical question remains: how deeply should the pre-trained backbone be fine-tuned to optimize for the specialized task without incurring catastrophic forgetting of the original general capabilities? Addressing this, we present a systematic empirical study evaluating the impact of fine-tuning depth. We adapt a standard YOLOv8n model to a custom, fine-grained fruit detection dataset by progressively unfreezing backbone layers (freeze points at layers 22, 15, and 10) and training. Performance was rigorously evaluated on both the target fruit dataset and, using a dual-head evaluation architecture, on the original COCO validation set. Our results demonstrate unequivocally that deeper fine-tuning (unfreezing down to layer 10) yields substantial performance gains (e.g., +10% absolute mAP50) on the fine-grained fruit task compared to only training the head. Strikingly, this significant adaptation and specialization resulted in negligible performance degradation (0.1% absolute mAP difference) on the COCO benchmark across all tested freeze levels. We conclude that adapting mid-to-late backbone features is highly effective for fine-grained specialization. Critically, our results demonstrate this adaptation can be achieved without the commonly expected penalty of catastrophic forgetting, presenting a compelling case for exploring deeper fine-tuning strategies, particularly when targeting complex domains or when maximizing specialized performance is paramount.
zh
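
渐进式解冻可以用如下通用 PyTorch 草图表达:冻结骨干前 N 层、只训练其余部分。这里不依赖 ultralytics 的具体 API,模型结构为假设的占位实现,仅用于说明"冻结点"这一概念。

```python
import torch.nn as nn

def set_freeze_point(model: nn.Module, freeze_until: int):
    """冻结前 freeze_until 层的通用示意(非 ultralytics 官方 API)。
    假设各层可通过 model.model[i] 顺序索引,与 YOLOv8 的层编号约定类似。"""
    for i, layer in enumerate(model.model):
        requires_grad = i >= freeze_until
        for p in layer.parameters():
            p.requires_grad = requires_grad

# 用法示意:构造一个 25 层的占位模型模拟骨干
class Dummy(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.ModuleList(nn.Linear(8, 8) for _ in range(25))

m = Dummy()
set_freeze_point(m, freeze_until=10)   # 对应文中"冻结点在第 10 层"的设置
trainable = sum(p.requires_grad for p in m.parameters())
print(trainable)                       # 只剩第 10 层之后的参数可训练
```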

[CV-31] 3D Human Pose Estimation via Spatial Graph Order Attention and Temporal Body Aware Transformer

【速读】:该论文试图解决基于Transformer和图卷积网络(GCN)的3D人体姿态估计方法中存在的问题,即Transformer方法在骨骼表示中忽略关节之间的空间邻域关系或局部时间模式,而GCN方法则常常缺乏针对姿态特异性的表示。解决方案的关键在于利用GCN的图建模能力,通过多个不同阶数的图来表示每个骨骼,并引入一种新的图阶注意力模块,动态强调每个关节最具有代表性的阶数。此外,还提出了一种时间上的身体感知Transformer,以全局身体特征依赖性建模并考虑关节间的局部骨骼特征依赖性。

链接: https://arxiv.org/abs/2505.01003
作者: Kamel Aouaidjia,Aofan Li,Wenhao Zhang,Chongsheng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 9 figures, 7 tables

点击查看摘要

Abstract:Nowadays, Transformers and Graph Convolutional Networks (GCNs) are the prevailing techniques for 3D human pose estimation. However, Transformer-based methods either ignore the spatial neighborhood relationships between the joints when used for skeleton representations or disregard the local temporal patterns of the local joint movements in skeleton sequence modeling, while GCN-based methods often neglect the need for pose-specific representations. To address these problems, we propose a new method that exploits the graph modeling capability of GCN to represent each skeleton with multiple graphs of different orders, incorporated with a newly introduced Graph Order Attention module that dynamically emphasizes the most representative orders for each joint. The resulting spatial features of the sequence are further processed using a proposed temporal Body Aware Transformer that models the global body feature dependencies in the sequence with awareness of the local inter-skeleton feature dependencies of joints. Given that our 3D pose output aligns with the central 2D pose in the sequence, we improve the self-attention mechanism to be aware of the central pose while diminishing its focus gradually towards the first and the last poses. Extensive experiments on Human3.6M, MPI-INF-3DHP, and HumanEva-I datasets demonstrate the effectiveness of the proposed method. Code and models are made available on Github.
zh
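
摘要中的"图阶注意力"可粗略理解为:用邻接矩阵的幂 A^k 聚合 k 阶邻域特征,再在阶数维度上做 softmax 加权,动态强调每个关节最具代表性的阶数。以下 PyTorch 草图仅为概念示意,层设计与论文实现可能不同。

```python
import torch
import torch.nn as nn

class GraphOrderAttention(nn.Module):
    """对多阶邻接特征做逐关节注意力加权的概念示意(非论文官方实现)。"""
    def __init__(self, feat_dim, max_order=3):
        super().__init__()
        self.max_order = max_order
        self.score = nn.Linear(feat_dim, 1)   # 为每个(关节, 阶数)组合打分

    def forward(self, x, adj):
        # x: (B, J, C) 关节特征;adj: (J, J) 行归一化邻接矩阵
        feats, a_k = [], torch.eye(adj.size(0), device=adj.device)
        for _ in range(self.max_order):
            a_k = a_k @ adj                    # k 阶邻接:A^k
            feats.append(torch.einsum("ij,bjc->bic", a_k, x))
        feats = torch.stack(feats, dim=2)      # (B, J, K, C)
        attn = torch.softmax(self.score(feats), dim=2)  # 在阶数维 K 上归一化
        return (attn * feats).sum(dim=2)       # 动态强调最具代表性的阶数

x = torch.randn(2, 17, 64)                        # 2 个样本、17 个关节
adj = torch.softmax(torch.randn(17, 17), dim=-1)  # 占位的归一化邻接矩阵
print(GraphOrderAttention(64)(x, adj).shape)      # torch.Size([2, 17, 64])
```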

[CV-32] Deterministic-to-Stochastic Diverse Latent Feature Mapping for Human Motion Synthesis

【速读】:该论文旨在解决人类运动合成中由于基于分数的生成模型(Score-Based Generative Models, SGMs)训练过程中涉及复杂曲率轨迹而导致的训练不稳定问题。其解决方案的关键在于提出一种确定性到随机的多样化潜在特征映射(Deterministic-to-Stochastic Diverse Latent Feature Mapping, DSDFM)方法,该方法通过两个阶段实现:第一阶段学习人类运动的潜在空间分布,第二阶段通过设计的确定性特征映射和随机多样化输出生成过程,建立高斯分布与潜在空间分布之间的联系,从而提升生成运动的多样性和准确性。相较于以往的SGMs方法,DSDFM训练更为稳定且无需引入额外训练步骤。

链接: https://arxiv.org/abs/2505.00998
作者: Yu Hua,Weiming Liu,Gui Xu,Yaqing Hou,Yew-Soon Ong,Qiang Zhang
机构: Nanyang Technological University (南洋理工大学); ByteDance Inc. (字节跳动公司); Dalian University (大连大学); Dalian University of Technology (大连理工大学); A*STAR (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human motion synthesis aims to generate plausible human motion sequences, which has raised widespread attention in computer animation. Recent score-based generative models (SGMs) have demonstrated impressive results on this task. However, their training process involves complex curvature trajectories, leading to an unstable training process. In this paper, we propose a Deterministic-to-Stochastic Diverse Latent Feature Mapping (DSDFM) method for human motion synthesis. DSDFM consists of two stages. The first human motion reconstruction stage aims to learn the latent space distribution of human motions. The second diverse motion generation stage aims to build connections between the Gaussian distribution and the latent space distribution of human motions, thereby enhancing the diversity and accuracy of the generated human motions. This stage is achieved by the designed deterministic feature mapping procedure with DerODE and the stochastic diverse output generation procedure with DivSDE. DSDFM is easy to train compared to previous SGMs-based methods and can enhance diversity without introducing additional training stages. Through qualitative and quantitative experiments, DSDFM achieves state-of-the-art results surpassing the latest methods, validating its superiority in human motion synthesis.
zh

[CV-33] Optimizing Indoor Farm Monitoring Efficiency Using UAV: Yield Estimation in a GNSS-Denied Cherry Tomato Greenhouse ICRA

【速读】:该论文旨在解决农业劳动力减少和人工成本上升背景下,温室中机器人产量估算的挑战。其关键解决方案是开发一种轻量级无人机(UAV),配备RGB-D相机、3D LiDAR和IMU传感器,并采用LiDAR-惯性里程计算法实现GNSS拒止环境下的精确定位,以及使用3D多目标跟踪算法来估算樱桃番茄的数量和重量。

链接: https://arxiv.org/abs/2505.00995
作者: Taewook Park,Jinwoo Lee,Hyondong Oh,Won-Jae Yun,Kyu-Wha Lee
机构: Ulsan National Institute of Science and Technology (UNIST)(蔚山科学技术大学); Metafarmers(元农场)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at 2025 ICRA workshop on field robotics

点击查看摘要

Abstract:As the agricultural workforce declines and labor costs rise, robotic yield estimation has become increasingly important. While unmanned ground vehicles (UGVs) are commonly used for indoor farm monitoring, their deployment in greenhouses is often constrained by infrastructure limitations, sensor placement challenges, and operational inefficiencies. To address these issues, we develop a lightweight unmanned aerial vehicle (UAV) equipped with an RGB-D camera, a 3D LiDAR, and an IMU sensor. The UAV employs a LiDAR-inertial odometry algorithm for precise navigation in GNSS-denied environments and utilizes a 3D multi-object tracking algorithm to estimate the count and weight of cherry tomatoes. We evaluate the system using two datasets: one from a harvesting row and another from a growing row. In the harvesting-row dataset, the proposed system achieves 94.4% counting accuracy and 87.5% weight estimation accuracy within a 13.2-meter flight completed in 10.5 seconds. For the growing-row dataset, which consists of occluded unripened fruits, we qualitatively analyze tracking performance and highlight future research directions for improving perception in greenhouses with strong occlusions. Our findings demonstrate the potential of UAVs for efficient robotic yield estimation in commercial greenhouses.
zh

[CV-34] On-demand Test-time Adaptation for Edge Devices

【速读】:该论文试图解决持续测试时适应(Continual Test-time adaptation, CTTA)在资源受限的边缘设备上应用时存在的高内存开销和能量消耗问题。其解决方案的关键在于提出一种按需测试时适应(on-demand TTA)框架OD-TTA,该框架通过三种创新技术实现高效且准确的适应:轻量级领域偏移检测机制以减少计算开销,源领域选择模块以确保准确性和鲁棒性,以及解耦的批归一化(Batch Normalization)更新方案以实现内存高效的适应。

链接: https://arxiv.org/abs/2505.00986
作者: Xiao Ma,Young D. Kwon,Dong Ma
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual Test-time adaptation (CTTA) continuously adapts the deployed model on every incoming batch of data. While achieving optimal accuracy, existing CTTA approaches present poor real-world applicability on resource-constrained edge devices, due to the substantial memory overhead and energy consumption. In this work, we first introduce a novel paradigm – on-demand TTA – which triggers adaptation only when a significant domain shift is detected. Then, we present OD-TTA, an on-demand TTA framework for accurate and efficient adaptation on edge devices. OD-TTA comprises three innovative techniques: 1) a lightweight domain shift detection mechanism to activate TTA only when it is needed, drastically reducing the overall computation overhead, 2) a source domain selection module that chooses an appropriate source model for adaptation, ensuring high and robust accuracy, 3) a decoupled Batch Normalization (BN) update scheme to enable memory-efficient adaptation with small batch sizes. Extensive experiments show that OD-TTA achieves comparable and even better performance while reducing the energy and computation overhead remarkably, making TTA a practical reality.
zh
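
轻量级领域偏移检测的一种常见做法,是将当前批次的特征统计量与源域 BN 统计量做逐通道高斯 KL 散度比较,超过阈值才触发适应。以下为假设性示意(阈值需按任务标定),并非论文原始检测器。

```python
import torch

def gaussian_kl(mu_p, var_p, mu_q, var_q, eps=1e-5):
    # 一维高斯之间的 KL 散度,逐通道计算
    return 0.5 * (torch.log((var_q + eps) / (var_p + eps))
                  + (var_p + (mu_p - mu_q) ** 2) / (var_q + eps) - 1)

def shift_detected(feat, src_mean, src_var, threshold=0.5):
    """按需触发 TTA 的偏移检测示意:比较当前批次统计量与源域 BN 统计量。
    threshold 为假设的触发阈值,实际需在验证数据上标定。"""
    mu = feat.mean(dim=(0, 2, 3))
    var = feat.var(dim=(0, 2, 3))
    score = gaussian_kl(mu, var, src_mean, src_var).mean()
    return score.item() > threshold

feat = torch.randn(8, 32, 14, 14) * 2 + 1     # 模拟发生偏移后的特征
print(shift_detected(feat, torch.zeros(32), torch.ones(32)))
```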

[CV-35] LMDepth: Lightweight Mamba-based Monocular Depth Estimation for Real-World Deployment

【速读】:该论文旨在解决单目深度估计中性能与计算效率难以平衡的问题,从而提升在资源受限设备上的部署可行性。其解决方案的关键在于提出LMDepth网络,该网络基于Mamba结构,通过引入改进的金字塔空间池化模块以实现多尺度特征聚合与上下文提取,并在解码器中集成多个深度Mamba块,利用线性计算替代复杂的注意力机制,从而在保持高精度深度信息重建的同时显著降低计算开销。

链接: https://arxiv.org/abs/2505.00980
作者: Jiahuan Long,Xin Zhou
机构: Shanghai Jiao Tong University (上海交通大学); University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular depth estimation provides an additional depth dimension to RGB images, making it widely applicable in various fields such as virtual reality, autonomous driving and robotic navigation. However, existing depth estimation algorithms often struggle to effectively balance performance and computational efficiency, which poses challenges for deployment on resource-constrained devices. To address this, we propose LMDepth, a lightweight Mamba-based monocular depth estimation network, designed to reconstruct high-precision depth information while maintaining low computational overhead. Specifically, we propose a modified pyramid spatial pooling module that serves as a multi-scale feature aggregator and context extractor, ensuring global spatial information for accurate depth estimation. Moreover, we integrate multiple depth Mamba blocks into the decoder. Designed with linear computations, the Mamba Blocks enable LMDepth to efficiently decode depth information from global features, providing a lightweight alternative to Transformer-based architectures that depend on complex attention mechanisms. Extensive experiments on the NYUDv2 and KITTI datasets demonstrate the effectiveness of our proposed LMDepth. Compared to previous lightweight depth estimation methods, LMDepth achieves higher performance with fewer parameters and lower computational complexity (measured by GFLOPs). We further deploy LMDepth on an embedded platform with INT8 quantization, validating its practicality for real-world edge applications.
zh
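
金字塔空间池化(PSP)模块的常见 PyTorch 写法如下,供理解"多尺度特征聚合与上下文提取"之用;论文中对该模块的具体改进细节此处未复现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidSpatialPooling(nn.Module):
    """金字塔空间池化的常见实现示意(论文中的改进版本细节为假设)。"""
    def __init__(self, in_ch, out_ch, scales=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),                 # 多尺度池化
                          nn.Conv2d(in_ch, out_ch, 1, bias=False),
                          nn.ReLU(inplace=True))
            for s in scales)
        self.fuse = nn.Conv2d(in_ch + out_ch * len(scales), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        outs = [x] + [F.interpolate(b(x), size=(h, w), mode="bilinear",
                                    align_corners=False)
                      for b in self.branches]
        return self.fuse(torch.cat(outs, dim=1))   # 拼接后融合全局上下文

x = torch.randn(1, 64, 32, 32)
print(PyramidSpatialPooling(64, 32)(x).shape)      # torch.Size([1, 32, 32, 32])
```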

[CV-36] Generating Animated Layouts as Structured Text Representations CVPR2025

【速读】:该论文试图解决在文本到视频模型中对文本元素和动画图形进行精确控制的挑战,特别是在视频广告等应用中。解决方案的关键在于引入了Animated Layout Generation(动画布局生成),通过结构化文本表示实现对视频内容的细粒度控制,并提出了VAKER(Video Ad maKER)框架,该框架结合三阶段生成流程与非结构化文本推理,实现了动态布局轨迹在特定视频帧中的自动化生成。

链接: https://arxiv.org/abs/2505.00975
作者: Yeonsang Shin,Jihwan Kim,Yumin Song,Kyungseung Lee,Hyunhee Chung,Taeyoung Na
机构: Seoul National University (首尔国立大学); SK telecom (SK电信)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AI for Content Creation (AI4CC) Workshop at CVPR 2025

点击查看摘要

Abstract:Despite the remarkable progress in text-to-video models, achieving precise control over text elements and animated graphics remains a significant challenge, especially in applications such as video advertisements. To address this limitation, we introduce Animated Layout Generation, a novel approach to extend static graphic layouts with temporal dynamics. We propose a Structured Text Representation for fine-grained video control through hierarchical visual elements. To demonstrate the effectiveness of our approach, we present VAKER (Video Ad maKER), a text-to-video advertisement generation pipeline that combines a three-stage generation process with Unstructured Text Reasoning for seamless integration with LLMs. VAKER fully automates video advertisement generation by incorporating dynamic layout trajectories for objects and graphics across specific video frames. Through extensive evaluations, we demonstrate that VAKER significantly outperforms existing methods in generating video advertisements. Project Page: this https URL
zh
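
"结构化文本表示"可以想象成如下形式的布局描述:每个视觉元素带有跨指定帧的边界框轨迹。字段名与取值仅为示意,并非 VAKER 的官方格式。

```python
import json

# 一个假设的动画布局结构化文本示例:字段名仅作示意
animated_layout = {
    "canvas": {"width": 1080, "height": 1920, "fps": 30},
    "elements": [
        {
            "type": "text",
            "content": "限时五折",
            "trajectory": [  # 指定帧上的动态布局轨迹
                {"frame": 0,  "bbox": [100, 1400, 500, 1500], "opacity": 0.0},
                {"frame": 15, "bbox": [100, 1200, 500, 1300], "opacity": 1.0},
            ],
        },
        {
            "type": "graphic",
            "asset": "logo.png",
            "trajectory": [{"frame": 0, "bbox": [40, 40, 200, 200], "opacity": 1.0}],
        },
    ],
}
print(json.dumps(animated_layout, ensure_ascii=False, indent=2))
```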

[CV-37] CDFormer: Cross-Domain Few-Shot Object Detection Transformer Against Feature Confusion

【速读】:该论文旨在解决跨域小样本目标检测(CD-FSOD)中的特征混淆问题,包括物体-背景混淆和物体-物体混淆。为应对这些挑战,作者提出了CDFormer,其核心解决方案是通过两个关键模块:物体-背景区分(OBD)和物体-物体区分(OOD)。OBD模块利用可学习的背景标记来区分物体与背景,而OOD模块则增强了不同类别物体之间的区分能力。

链接: https://arxiv.org/abs/2505.00938
作者: Boyuan Meng,Xiaohan Zhang,Peilin Li,Zhe Wu,Yiming Li,Wenkai Zhao,Beinan Yu,Hui-Liang Shen
机构: Zhejiang University(浙江大学); Jinhua Institute of Zhejiang University(浙江大学金华研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-domain few-shot object detection (CD-FSOD) aims to detect novel objects across different domains with limited class instances. Feature confusion, including object-background confusion and object-object confusion, presents significant challenges in both cross-domain and few-shot settings. In this work, we introduce CDFormer, a cross-domain few-shot object detection transformer against feature confusion, to address these challenges. The method specifically tackles feature confusion through two key modules: object-background distinguishing (OBD) and object-object distinguishing (OOD). The OBD module leverages a learnable background token to differentiate between objects and background, while the OOD module enhances the distinction between objects of different classes. Experimental results demonstrate that CDFormer outperforms previous state-of-the-art approaches, achieving 12.9% mAP, 11.0% mAP, and 10.4% mAP improvements under the 1/5/10 shot settings, respectively, when fine-tuned.
zh

[CV-38] Autonomous Embodied Agents: When Robotics Meets Deep Learning Reasoning

【速读】:该论文旨在解决如何构建能够自主执行任务的智能体(intelligent agents)在室内环境中的问题,特别是在复杂和可能未知的环境中进行有效交互与决策。其解决方案的关键在于通过深度学习和强化学习方法,使智能体在基于3D模型的高保真机器人仿真环境中进行训练,从而实现对环境信息的感知、编码与利用,以及对目标导向行为的优化。该过程涵盖了从概念设计到实际部署的完整流程,并通过详细的实验验证了所提出方法的有效性。

链接: https://arxiv.org/abs/2505.00935
作者: Roberto Bigazzi
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Ph.D. Dissertation

点击查看摘要

Abstract:The increase in available computing power and the Deep Learning revolution have allowed the exploration of new topics and frontiers in Artificial Intelligence research. A new field called Embodied Artificial Intelligence, which lies at the intersection of Computer Vision, Robotics, and Decision Making, has been gaining importance during the last few years, as it aims to foster the development of smart autonomous robots and their deployment in society. The recent availability of large collections of 3D models for photorealistic robotic simulation has allowed faster and safer training of learning-based agents for millions of frames and a careful evaluation of their behavior before deploying the models on real robotic platforms. These intelligent agents are intended to perform a certain task in a possibly unknown environment. To this end, during the training in simulation, the agents learn to perform continuous interactions with the surroundings, such as gathering information from the environment, encoding and extracting useful cues for the task, and performing actions towards the final goal, where every action of the agent influences the interactions. This dissertation follows the complete creation process of embodied agents for indoor environments, from their concept to their implementation and deployment. We aim to contribute to research in Embodied AI and autonomous agents, in order to foster future work in this field. We present a detailed analysis of the procedure behind implementing an intelligent embodied agent, comprising a thorough description of the current state of the art in the literature, technical explanations of the proposed methods, and accurate experimental studies on relevant robotic tasks.
zh

[CV-39] Are Minimal Radial Distortion Solvers Really Necessary for Relative Pose Estimation?

【速读】:该论文试图解决在存在径向畸变的相机之间进行相对位姿估计的问题,传统方法通常依赖于最小化径向畸变的求解器,但这类求解器在计算效率和实现复杂度上均显著高于针孔相机求解器。论文提出两种简化方案:第一种方案将高效的针孔求解器与采样的径向去畸变参数结合,在应用针孔求解器前先对图像进行去畸变;第二种方案则采用先进的神经网络直接估计畸变参数,而非从预设值中采样。关键在于通过简单的方法替代复杂的最小化径向畸变求解器,实验表明在实际应用中复杂求解器并非必需。

链接: https://arxiv.org/abs/2505.00866
作者: Viktor Kocur,Charalambos Tzamos,Yaqing Ding,Zuzana Berger Haladova,Torsten Sattler,Zuzana Kukelova
机构: FMPh, Univerzita Komenského v Bratislave (FMPh, 斯洛伐克科门斯基大学); Faculty of Electrical Engineering, Czech Technical University (电气工程学院,捷克技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2410.05984

点击查看摘要

Abstract:Estimating the relative pose between two cameras is a fundamental step in many applications such as Structure-from-Motion. The common approach to relative pose estimation is to apply a minimal solver inside a RANSAC loop. Highly efficient solvers exist for pinhole cameras. Yet, (nearly) all cameras exhibit radial distortion. Not modeling radial distortion leads to (significantly) worse results. However, minimal radial distortion solvers are significantly more complex than pinhole solvers, both in terms of run-time and implementation efforts. This paper compares radial distortion solvers with two simple-to-implement approaches that do not use minimal radial distortion solvers: The first approach combines an efficient pinhole solver with sampled radial undistortion parameters, where the sampled parameters are used for undistortion prior to applying the pinhole solver. The second approach uses a state-of-the-art neural network to estimate the distortion parameters rather than sampling them from a set of potential values. Extensive experiments on multiple datasets, and different camera setups, show that complex minimal radial distortion solvers are not necessary in practice. We discuss under which conditions a simple sampling of radial undistortion parameters is preferable over calibrating cameras using a learning-based prior approach. Code and newly created benchmark for relative pose estimation under radial distortion are available at this https URL.
zh
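
第一种方案的流程可用如下 NumPy 草图示意:对每个采样的畸变参数 λ,先用单参数除法模型去畸变,再调用针孔求解器,按内点数择优。pinhole_solver 为假设的外部函数,这里用随机打分的占位实现演示接口。

```python
import numpy as np

def undistort_division(points, lam):
    """单参数除法模型去畸变:x_u = x_d / (1 + lam * r^2)。"""
    r2 = np.sum(points ** 2, axis=1, keepdims=True)
    return points / (1.0 + lam * r2)

def solve_with_sampled_lambda(pts1, pts2, pinhole_solver, lambdas):
    """方案一的极简示意:遍历采样的畸变参数,先去畸变再调用针孔求解器,
    以内点数挑选最优结果(pinhole_solver 为假设的外部函数)。"""
    best = (None, -1)
    for lam in lambdas:
        u1, u2 = undistort_division(pts1, lam), undistort_division(pts2, lam)
        model, num_inliers = pinhole_solver(u1, u2)
        if num_inliers > best[1]:
            best = ((lam, model), num_inliers)
    return best

# 用法示意:用随机打分的占位求解器演示接口
dummy_solver = lambda a, b: ("E", int(np.random.randint(0, 100)))
pts = np.random.randn(50, 2)
print(solve_with_sampled_lambda(pts, pts, dummy_solver,
                                lambdas=[-0.4, -0.2, 0.0])[1])
```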

[CV-40] The Comparability of Model Fusion to Measured Data in Confuser Rejection

【速读】:该论文试图解决在大规模深度学习网络建模与训练中数据收集不足的问题,特别是在合成孔径雷达(Synthetic Aperture Radar, SAR)领域,由于数据采集成本高,导致可观察到的独特目标和工作条件有限。解决方案的关键在于利用计算能力替代质量不佳的实测数据,通过集成多个在合成数据上训练的模型来提高模型性能。此外,为应对实际环境中可能存在的未知目标,集成技术需结合混淆拒绝机制,使模型能够拒绝识别未训练过的未知目标,仅对已知目标进行分类。

链接: https://arxiv.org/abs/2505.00836
作者: Conor Flynn,Christopher Ebersole,Edmund Zelnio
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); Air Force Research Laboratory (美国空军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Conference paper for SPIE Defense and Commercial Sensing Algorithms for Synthetic Aperture Radar Imagery XXXII. 14 pages, 9 figures

点击查看摘要

Abstract:Data collection has always been a major issue in the modeling and training of large deep learning networks, as no dataset can account for every slight deviation we might see in live usage. Collecting samples can be especially costly for Synthetic Aperture Radar (SAR), limiting the number of unique targets and operating conditions we are able to observe. To counter this lack of data, simulators have been developed utilizing the shooting and bouncing ray method to allow for the generation of synthetic SAR data on 3D models. While effective, the synthetically generated data does not perfectly correlate with the measured data, leading to issues when training models solely on synthetic data. We aim to use computational power as a substitution for this lack of quality measured data, by ensembling many models trained on synthetic data. Synthetic data is also not complete, as we do not know what targets might be present in a live environment. Therefore, our ensembling techniques must account for these unknown targets by applying confuser rejection, in which our models reject unknown targets they are presented with and only classify those they have been trained on.
zh
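
"集成 + 混淆拒绝"的最小示意:对多个模型的 softmax 概率取平均,若最大置信度低于阈值则拒绝分类。阈值 tau 为假设值,应在验证集上标定。

```python
import numpy as np

def ensemble_predict_with_rejection(prob_list, tau=0.7):
    """多模型 softmax 概率取平均,置信度低于阈值时判为混淆目标并拒绝。
    tau 为假设值,实际应在验证集上标定。"""
    mean_prob = np.mean(prob_list, axis=0)       # (N, num_classes)
    conf = mean_prob.max(axis=1)
    pred = mean_prob.argmax(axis=1)
    pred[conf < tau] = -1                        # -1 表示拒绝识别
    return pred

# 三个在合成数据上训练的模型,对 4 个样本、5 类目标的输出(随机占位)
logits = np.random.rand(3, 4, 5)
probs = logits / logits.sum(axis=-1, keepdims=True)
print(ensemble_predict_with_rejection(list(probs)))
```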

[CV-41] Advancing Wheat Crop Analysis: A Survey of Deep Learning Approaches Using Hyperspectral Imaging

【速读】:该论文旨在解决小麦生产中面临的病虫害、气候变化和水资源短缺等问题,以及传统作物监测方法在早期问题检测中的不足。其解决方案的关键在于利用高光谱成像(Hyperspectral Imaging, HSI)技术进行非破坏性和高效的作物健康评估,并结合深度学习方法处理HSI数据的高维度特性及标注样本有限的问题,以提升小麦品种分类、病害检测和产量估计等应用的准确性与效率。

链接: https://arxiv.org/abs/2505.00805
作者: Fadi Abdeladhim Zidi,Abdelkrim Ouafi,Fares Bougourzi,Cosimo Distante,Abdelmalik Taleb-Ahmed
机构: Mohamed Khider University (穆罕默德·基德尔大学); Junia (朱尼亚); CNRS (法国国家科学研究中心); Centrale Lille (里尔中央理工学院); Univerity of Polytechnique Hauts-de-France (上法兰西理工大学); National Research Council of Italy (意大利国家研究委员会); Université Polytechnique Hauts-de-France (上法兰西理工大学); Université de Lille (里尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:As one of the most widely cultivated and consumed crops, wheat is essential to global food security. However, wheat production is increasingly challenged by pests, diseases, climate change, and water scarcity, threatening yields. Traditional crop monitoring methods are labor-intensive and often ineffective for early issue detection. Hyperspectral imaging (HSI) has emerged as a non-destructive and efficient technology for remote crop health assessment. However, the high dimensionality of HSI data and limited availability of labeled samples present notable challenges. In recent years, deep learning has shown great promise in addressing these challenges due to its ability to extract and analyze complex structures. Despite advancements in applying deep learning methods to HSI data for wheat crop analysis, no comprehensive survey currently exists in this field. This review addresses this gap by summarizing benchmark datasets, tracking advancements in deep learning methods, and analyzing key applications such as variety classification, disease detection, and yield estimation. It also highlights the strengths, limitations, and future opportunities in leveraging deep learning methods for HSI-based wheat crop analysis. We have listed the current state-of-the-art papers and will continue to track and update them at the following link: this https URL.
zh

[CV-42] SpatialLLM : A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models CVPR2025

【速读】:该论文试图解决当前大型多模态模型(Large Multimodal Models, LMMs)在3D空间推理能力上的不足问题,这一缺陷源于3D训练数据的稀缺性以及现有模型设计对2D数据的偏向。解决方案的关键在于系统性地研究3D感知数据、架构和训练设置的影响,并引入SpatialLLM,这是一种具备先进3D空间推理能力的大型多模态模型。其核心创新包括构建两类3D感知的训练数据集:专注于物体3D位置和方向的3D感知探测数据,以及用于复杂空间关系的3D感知对话数据,并首次在真实图像上构建包含3D方向关系的VQA数据。同时,将这些数据与LMM的架构和训练设计相结合,为实现卓越的3D推理能力提供了优化路径。

链接: https://arxiv.org/abs/2505.00788
作者: Wufei Ma,Luoxin Ye,Nessa McWeeney,Celso M de Melo,Alan Yuille,Jieneng Chen
机构: Johns Hopkins University (约翰霍普金斯大学); DEVCOM Army Research Laboratory (国防部陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025 highlight, camera ready version

点击查看摘要

Abstract:Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack this capability for 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing SpatialLLM, a large multi-modal model with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on object’s 3D location and orientation, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporate 3D orientation relationships on real images. Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities. Our SpatialLLM advances machines toward highly capable 3D-informed reasoning, surpassing GPT-4o performance by 8.7%. Our systematic empirical design and the resulting findings offer valuable insights for future research in this direction.
zh

[CV-43] AI-ready Snow Radar Echogram Dataset (SRED) for climate change monitoring

【速读】:该论文试图解决在冰川雷达回波图中高精度追踪内部层的问题,这对于理解冰盖动力学和量化格陵兰岛及其他极地地区因当前全球气候变暖导致的冰量加速排放的影响至关重要。解决方案的关键在于引入了首个“适合深度学习”的雷达回波图数据集,该数据集源自2012年美国国家航空航天局(NASA)冰桥(Operation Ice Bridge, OIB)任务中获取的Snow Radar机载数据,包含13,717个标注和57,815个弱标注的回波图,覆盖多种雪区(干雪区、消融区、湿雪区)并具有不同的沿轨分辨率。该数据集与配套的基准测试框架为推进雷达回波图层追踪和积雪积累估算提供了重要资源。

链接: https://arxiv.org/abs/2505.00786
作者: Oluwanisola Ibikunle,Hara Talasila,Debvrat Varshney,Jilu Li,John Paden,Maryam Rahnemoonfar
机构: University of Kansas (堪萨斯大学); University of Maryland Baltimore County (马里兰大学巴尔的摩县分校); Lehigh University (利哈伊大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tracking internal layers in radar echograms with high accuracy is essential for understanding ice sheet dynamics and quantifying the impact of accelerated ice discharge in Greenland and other polar regions due to contemporary global climate warming. Deep learning algorithms have become the leading approach for automating this task, but the absence of a standardized and well-annotated echogram dataset has hindered the ability to test and compare algorithms reliably, limiting the advancement of state-of-the-art methods for the radar echogram layer tracking problem. This study introduces the first comprehensive "deep learning ready" radar echogram dataset derived from Snow Radar airborne data collected during the National Aeronautics and Space Administration Operation Ice Bridge (OIB) mission in 2012. The dataset contains 13,717 labeled and 57,815 weakly-labeled echograms covering diverse snow zones (dry, ablation, wet) with varying along-track resolutions. To demonstrate its utility, we evaluated the performance of five deep learning models on the dataset. Our results show that while current computer vision segmentation algorithms can identify and track snow layer pixels in echogram images, advanced end-to-end models are needed to directly extract snow depth and annual accumulation from echograms, reducing or eliminating post-processing. The dataset and accompanying benchmarking framework provide a valuable resource for advancing radar echogram layer tracking and snow accumulation estimation, advancing our understanding of the polar ice sheets’ response to climate warming.
zh

[CV-44] Person detection and re-identification in open-world settings of retail stores and public spaces

【速读】:该论文旨在解决在开放世界环境中进行人员重识别(Person Re-Identification)任务的系统设计与实现问题,特别是在多摄像头和多人场景下,如何有效检测、定位并识别不同时间和空间下的同一人员。其解决方案的关键在于结合多种计算机视觉技术,构建复杂系统的架构,并通过优化跟踪与重识别模型来提升在不同光照条件和视频源下的性能。此外,论文还通过实验分析了系统在实际应用场景中的敏感性,并提出了未来研究方向与改进策略。

链接: https://arxiv.org/abs/2505.00772
作者: Branko Brkljač,Milan Brkljač
机构: University of Novi Sad (诺维萨德大学); Alfa BK Univeristy (阿尔法BK大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, 1 table, associated code implementation and accompanying test videos with experimental results are available at the following link: this https URL , paper submitted to the 2nd International Scientific Conference ‘ALFATECH - Smart Cities and modern technologies - 2025’, Belgrade, Serbia, Feb. 28, 2025

点击查看摘要

Abstract:Practical applications of computer vision in smart cities usually assume system integration and operation in challenging open-world environments. In the case of the person re-identification task, the main goal is to retrieve information on whether a specific person has appeared in another place at a different time instance of the same video, or across multiple camera feeds. This typically assumes collecting raw data from video surveillance cameras in different places and under varying illumination conditions. In the considered open-world setting it also requires detection and localization of the person inside the analyzed video frame before the main re-identification step. With multi-person and multi-camera setups, the system complexity becomes higher, requiring sophisticated tracking solutions and re-identification models. In this work we will discuss existing challenges in system design architectures, consider possible solutions based on different computer vision techniques, and describe applications of such systems in retail stores and public spaces for improved marketing analytics. In order to analyse the sensitivity of the person re-identification task under different open-world environments, the performance of one close-to-real-time solution will be demonstrated over several video captures and live camera feeds. Finally, based on the conducted experiments, we will indicate further research directions and possible system improvements.
zh

[CV-45] Efficient On-Chip Implementation of 4D Radar-Based 3D Object Detection on Hailo-8L

【速读】:该论文旨在解决在低功耗嵌入式环境中实现4D雷达辅助的3D目标检测实时处理的问题。其关键解决方案是提出了一种张量变换方法,该方法在编译过程中将5D输入重塑为4D格式,从而使得基于4D雷达的3D目标检测模型能够在仅支持4D张量的Hailo-8L AI加速器上直接部署,而无需修改模型结构。

链接: https://arxiv.org/abs/2505.00757
作者: Woong-Chan Byun,Dong-Hee Paek,Seung-Hyun Song,Seung-Hyun Kong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4pages, 2 figures

点击查看摘要

Abstract:4D radar has attracted attention in autonomous driving due to its ability to enable robust 3D object detection even under adverse weather conditions. To practically deploy such technologies, it is essential to achieve real-time processing within low-power embedded environments. Addressing this, we present the first on-chip implementation of a 4D radar-based 3D object detection model on the Hailo-8L AI accelerator. Although conventional 3D convolutional neural network (CNN) architectures require 5D inputs, the Hailo-8L only supports 4D tensors, posing a significant challenge. To overcome this limitation, we introduce a tensor transformation method that reshapes 5D inputs into 4D formats during the compilation process, enabling direct deployment without altering the model structure. The proposed system achieves 46.47% AP_3D and 52.75% AP_BEV, maintaining comparable accuracy to GPU-based models while achieving an inference speed of 13.76 Hz. These results demonstrate the applicability of 4D radar-based perception technologies to autonomous driving systems.
zh
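
论文核心的张量变换思路,本质上是把 3D CNN 所需的 5D 体素特征 (N, C, D, H, W) 折叠成 Hailo-8L 支持的 4D 张量 (N, C*D, H, W)。下面是一个概念性示意,编译期的具体实现细节为假设:

```python
import torch

def to_hailo_4d(voxel_feat):
    """把 5D 输入 (N, C, D, H, W) 折叠为 4D (N, C*D, H, W) 的示意,
    对应文中在编译过程中进行的张量变换思路(实现细节为假设)。"""
    n, c, d, h, w = voxel_feat.shape
    return voxel_feat.reshape(n, c * d, h, w)

x = torch.randn(1, 16, 8, 128, 128)    # 假设的 4D 雷达体素特征
print(to_hailo_4d(x).shape)            # torch.Size([1, 128, 128, 128])
```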

[CV-46] P2P-Insole: Human Pose Estimation Using Foot Pressure Distribution and Motion Sensors

【速读】:该论文试图解决低成本、非侵入式且隐私友好的3D人体骨骼数据估计与可视化问题(3D human skeletal data estimation and visualization)。解决方案的关键在于采用集成惯性测量单元(IMU)的鞋垫式传感器(insole-type sensors),通过足底压力分布、加速度和旋转数据进行多模态信息融合,并利用Transformer模型进行高效的时间特征提取,从而提升复杂运动模式识别的准确性。

链接: https://arxiv.org/abs/2505.00755
作者: Atsuya Watanabe,Ratna Aisuwarya,Lei Jing
机构: University of Aizu (会津大学); Andalas University (安达拉斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work presents P2P-Insole, a low-cost approach for estimating and visualizing 3D human skeletal data using insole-type sensors integrated with IMUs. Each insole, fabricated with e-textile garment techniques, costs under USD 1, making it significantly cheaper than commercial alternatives and ideal for large-scale production. Our approach uses foot pressure distribution, acceleration, and rotation data to overcome limitations, providing a lightweight, minimally intrusive, and privacy-aware solution. The system employs a Transformer model for efficient temporal feature extraction, enriched by first and second derivatives in the input stream. Including multimodal information, such as accelerometers and rotational measurements, improves the accuracy of complex motion pattern recognition. These facts are demonstrated experimentally, while error metrics show the robustness of the approach in various posture estimation tasks. This work could be the foundation for a low-cost, practical application in rehabilitation, injury prevention, and health monitoring while enabling further development through sensor optimization and expanded datasets.
zh
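
摘要中"在输入流中加入一阶与二阶导数"可以用时间差分近似实现,如下 NumPy 草图所示;通道数与传感器布局为假设。

```python
import numpy as np

def with_derivatives(seq):
    """在输入流中拼接一阶与二阶差分(近似导数)的示意。
    seq: (T, C) 的足底压力 / IMU 时序。"""
    d1 = np.diff(seq, n=1, axis=0, prepend=seq[:1])
    d2 = np.diff(d1, n=1, axis=0, prepend=d1[:1])
    return np.concatenate([seq, d1, d2], axis=1)   # (T, 3C)

seq = np.random.randn(100, 38)        # 假设:压力点 + IMU 通道共 38 维
print(with_derivatives(seq).shape)    # (100, 114)
```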

[CV-47] DARTer: Dynamic Adaptive Representation Tracker for Nighttime UAV Tracking ICMR2025

【速读】:该论文旨在解决夜间无人机(UAV)跟踪中由于极端光照变化和视角变换导致的跟踪性能严重下降的问题。现有方法要么依赖计算成本高的光照增强技术,要么引入冗余的领域自适应机制,未能充分利用不同视角下的动态特征。其解决方案的关键在于提出一种端到端的跟踪框架DARTer,该框架通过动态特征融合器(Dynamic Feature Blender, DFB)有效融合静态与动态模板的多视角夜间特征,提升表示鲁棒性;同时利用动态特征激活器(Dynamic Feature Activator, DFA)根据提取特征自适应激活视觉Transformer层,显著减少冗余计算,提高效率。

链接: https://arxiv.org/abs/2505.00752
作者: Xuzhao Li,Xuchen Li,Shiyu Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICMR 2025

点击查看摘要

Abstract:Nighttime UAV tracking presents significant challenges due to extreme illumination variations and viewpoint changes, which severely degrade tracking performance. Existing approaches either rely on light enhancers with high computational costs or introduce redundant domain adaptation mechanisms, failing to fully utilize the dynamic features in varying perspectives. To address these issues, we propose DARTer (Dynamic Adaptive Representation Tracker), an end-to-end tracking framework designed for nighttime UAV scenarios. DARTer leverages a Dynamic Feature Blender (DFB) to effectively fuse multi-perspective nighttime features from static and dynamic templates, enhancing representation robustness. Meanwhile, a Dynamic Feature Activator (DFA) adaptively activates Vision Transformer layers based on extracted features, significantly improving efficiency by reducing redundant computations. Our model eliminates the need for complex multi-task loss functions, enabling a streamlined training process. Extensive experiments on multiple nighttime UAV tracking benchmarks demonstrate the superiority of DARTer over state-of-the-art trackers. These results confirm that DARTer effectively balances tracking accuracy and efficiency, making it a promising solution for real-world nighttime UAV tracking applications.
zh

[CV-48] InstructAttribute: Fine-grained Object Attributes editing with Instruction

【速读】:该论文旨在解决在文本到图像扩散模型中对细粒度属性(如颜色和材质)进行精确控制的问题,现有方法在修改对象属性的同时难以保持结构一致性和其他区域的图像一致性。解决方案的关键在于提出一种无需训练的方法——结构保持与属性增强(Structure-Preserving and Attribute Amplification, SPAA),通过编辑自注意力图(self-attention maps)和交叉注意力值(cross-attention values)实现对对象颜色和材质变换的精准控制,并构建了基于多模态大语言模型(Multimodal Large Language Models, MLLM)的自动化数据筛选与指令标注管道,从而训练出一个基于指令的细粒度属性编辑模型InstructAttribute。

链接: https://arxiv.org/abs/2505.00751
作者: Xingxi Yin,Jingfeng Zhang,Zhi Li,Yicheng Li,Yin Zhang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models, renowned for their advanced generative abilities, are extensively utilized in image editing applications, demonstrating remarkable effectiveness. However, achieving precise control over fine-grained attributes still presents considerable challenges. Existing image editing techniques either fail to modify the attributes of an object or struggle to preserve its structure and maintain consistency in other areas of the image. To address these challenges, we propose the Structure-Preserving and Attribute Amplification (SPAA), a training-free method which enables precise control over the color and material transformations of objects by editing the self-attention maps and cross-attention values. Furthermore, we constructed the Attribute Dataset, which encompasses nearly all colors and materials associated with various objects, by integrating multimodal large language models (MLLM) to develop an automated pipeline for data filtering and instruction labeling. Training on this dataset, we present our InstructAttribute, an instruction-based model designed to facilitate fine-grained editing of color and material attributes. Extensive experiments demonstrate that our method achieves superior performance in object-level color and material editing, outperforming existing instruction-based image editing approaches.
zh

[CV-49] Wireless Communication as an Information Sensor for Multi-agent Cooperative Perception: A Survey

【速读】:该论文旨在解决自动驾驶车辆在复杂交通环境中感知能力受限的问题,通过车联网(V2X)通信实现多智能体间的信息共享以扩展感知范围。其解决方案的关键在于将V2X通信视为一种动态的“信息传感器”,并从信息中心化的角度出发,重点研究信息表示、信息融合以及大规模部署三个核心维度,以应对通信带宽限制、异构性、移动性及可扩展性等挑战。

链接: https://arxiv.org/abs/2505.00747
作者: Zhiying Song,Tenghui Xie,Fuxi Wen,Jun Li
机构: Tsinghua University (清华大学)
类目: Other Computer Science (cs.OH); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Cooperative perception extends the perception capabilities of autonomous vehicles by enabling multi-agent information sharing via Vehicle-to-Everything (V2X) communication. Unlike traditional onboard sensors, V2X acts as a dynamic “information sensor” characterized by limited communication, heterogeneity, mobility, and scalability. This survey provides a comprehensive review of recent advancements from the perspective of information-centric cooperative perception, focusing on three key dimensions: information representation, information fusion, and large-scale deployment. We categorize information representation into data-level, feature-level, and object-level schemes, and highlight emerging methods for reducing data volume and compressing messages under communication constraints. In information fusion, we explore techniques under both ideal and non-ideal conditions, including those addressing heterogeneity, localization errors, latency, and packet loss. Finally, we summarize system-level approaches to support scalability in dense traffic scenarios. Compared with existing surveys, this paper introduces a new perspective by treating V2X communication as an information sensor and emphasizing the challenges of deploying cooperative perception in real-world intelligent transportation systems.
zh

[CV-50] Entropy Heat-Mapping: Localizing GPT-Based OCR Errors with Sliding-Window Shannon Analysis

【速读】:该论文试图解决基于生成式 AI (Generative AI) 的光学字符识别 (OCR) 中局部识别错误的定位问题,即如何利用模型输出的 token 级别置信度信号来识别可能的错误区域。解决方案的关键在于通过 Shannon 熵构建热力图,将每个 token 的不确定性转化为可视化的“不确定性地形”,并使用固定长度滑动窗口扫描熵序列以识别高熵热点区域,这些区域很可能包含 OCR 错误,如缺失符号、括号不匹配或混乱的文本。

链接: https://arxiv.org/abs/2505.00746
作者: Alexei Kaltchenko
机构: Wilfrid Laurier University (威尔弗里德·劳里埃大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages

点击查看摘要

Abstract:Vision-language models such as OpenAI GPT-4o can transcribe mathematical documents directly from images, yet their token-level confidence signals are seldom used to pinpoint local recognition mistakes. We present an entropy-heat-mapping proof-of-concept that turns per-token Shannon entropy into a visual "uncertainty landscape". By scanning the entropy sequence with a fixed-length sliding window, we obtain hotspots that are likely to contain OCR errors such as missing symbols, mismatched braces, or garbled prose. Using a small, curated set of scanned research pages rendered at several resolutions, we compare the highlighted hotspots with the actual transcription errors produced by GPT-4o. Our analysis shows that the vast majority of true errors are indeed concentrated inside the high-entropy regions. This study demonstrates, in a minimally engineered setting, that sliding-window entropy can serve as a practical, lightweight aid for post-editing GPT-based OCR. All code, sample data, and annotation guidelines are released to encourage replication and further research.
zh
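
滑动窗口香农熵的计算非常直接:先算每个 token 的熵,再对熵序列做定长窗口平均,取平均熵最高的窗口作为热点。以下为概念性复现草图,窗口长度与候选概率的获取方式均为假设。

```python
import numpy as np

def entropy_hotspots(token_probs, window=16, top_k=3):
    """逐 token 香农熵 + 定长滑动窗口,返回熵最高的若干窗口起点(概念示意)。
    token_probs: 每个 token 的候选概率分布列表(已归一化,来源为假设)。"""
    ent = np.array([-np.sum(p * np.log2(p)) for p in token_probs])   # 逐 token 熵
    win = np.convolve(ent, np.ones(window) / window, mode="valid")   # 窗口平均熵
    order = np.argsort(win)[::-1]
    return [(int(i), float(win[i])) for i in order[:top_k]]

# 模拟 200 个 token,每个 token 有 5 个候选概率
probs = [np.random.dirichlet(np.ones(5)) for _ in range(200)]
for start, h in entropy_hotspots(probs):
    print(f"窗口起点 {start}, 平均熵 {h:.2f} bits")
```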

[CV-51] Responsive DNN Adaptation for Video Analytics against Environment Shift via Hierarchical Mobile-Cloud Collaborations

【速读】:该论文旨在解决移动视频分析系统在面对不同部署环境时,现有模型适应框架因依赖云端计算而导致的适应响应延迟和性能下降问题。其解决方案的关键在于提出MOCHA框架,通过移动设备与云端资源之间的分层协作,优化持续模型适应的响应性,具体包括:在请求云端模型检索和端到端微调之前,利用设备端的模型复用和快速微调减少适应响应延迟;通过云端基础模型分析领域语义构建结构化分类体系以加速历史专家模型的检索;并通过维护车载专家模型缓存实现频繁场景下的高效本地模型复用,从而显著提升模型适应过程中的准确性并降低响应延迟和重新训练时间。

链接: https://arxiv.org/abs/2505.00745
作者: Maozhe Zhao,Shengzhong Liu,Fan Wu,Guihai Chen
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Sensys 2025 final version

点击查看摘要

Abstract:Mobile video analysis systems often encounter various deploying environments, where environment shifts present greater demands for responsiveness in adaptations of deployed “expert DNN models”. Existing model adaptation frameworks primarily operate in a cloud-centric way, exhibiting degraded performance during adaptation and delayed reactions to environment shifts. Instead, this paper proposes MOCHA, a novel framework optimizing the responsiveness of continuous model adaptation through hierarchical collaborations between mobile and cloud resources. Specifically, MOCHA (1) reduces adaptation response delays by performing on-device model reuse and fast fine-tuning before requesting cloud model retrieval and end-to-end retraining; (2) accelerates history expert model retrieval by organizing them into a structured taxonomy utilizing domain semantics analyzed by a cloud foundation model as indices; (3) enables efficient local model reuse by maintaining onboard expert model caches for frequent scenes, which proactively prefetch model weights from the cloud model database. Extensive evaluations with real-world videos on three DNN tasks show MOCHA improves the model accuracy during adaptation by up to 6.8% while saving the response delay and retraining time by up to 35.5x and 3.0x respectively.

[CV-52] Localizing Before Answering: A Benchmark for Grounded Medical Visual Question Answering IJCAI

【Quick Read】: This paper addresses the hallucination problem of Medical Large Multi-modal Models (LMMs), which frequently generate outputs that contradict the source evidence when interpreting medical data, largely due to inadequate localization reasoning. The key of the solution is HEAL-MedVQA, a benchmark for evaluating LMMs' localization ability and hallucination robustness, together with the Localize-before-Answer (LobA) framework, which trains models to first localize regions of interest and self-prompt to emphasize the segmented pathological areas, thereby generating grounded and reliable answers.

Link: https://arxiv.org/abs/2505.00744
Authors: Dung Nguyen,Minh Khoi Ho,Huy Ta,Thanh Tam Nguyen,Qi Chen,Kumar Rav,Quy Duong Dang,Satwik Ramchandre,Son Lam Phung,Zhibin Liao,Minh-Son To,Johan Verjans,Phi Le Nguyen,Vu Minh Hieu Phan
Institutions: Hanoi University of Science and Technology; Australian Institute for Machine Learning, The University of Adelaide; Griffith University; College of Medicine and Public Health, Flinders University; University of Wollongong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted at the International Joint Conference on Artificial Intelligence (IJCAI) 2025

Abstract:Medical Large Multi-modal Models (LMMs) have demonstrated remarkable capabilities in medical data interpretation. However, these models frequently generate hallucinations contradicting source evidence, particularly due to inadequate localization reasoning. This work reveals a critical limitation in current medical LMMs: instead of analyzing relevant pathological regions, they often rely on linguistic patterns or attend to irrelevant image areas when responding to disease-related queries. To address this, we introduce HEAL-MedVQA (Hallucination Evaluation via Localization MedVQA), a comprehensive benchmark designed to evaluate LMMs’ localization abilities and hallucination robustness. HEAL-MedVQA features (i) two innovative evaluation protocols to assess visual and textual shortcut learning, and (ii) a dataset of 67K VQA pairs, with doctor-annotated anatomical segmentation masks for pathological regions. To improve visual reasoning, we propose the Localize-before-Answer (LobA) framework, which trains LMMs to localize target regions of interest and self-prompt to emphasize segmented pathological areas, generating grounded and reliable answers. Experimental results demonstrate that our approach significantly outperforms state-of-the-art biomedical LMMs on the challenging HEAL-MedVQA benchmark, advancing robustness in medical VQA.

[CV-53] DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation ICMR

【Quick Read】: This paper targets two major limitations of Vision-and-Language Navigation (VLN): (1) existing methods do not fully exploit the detailed information in language instructions, limiting the agent's language understanding during task execution; and (2) current approaches overlook modeling of cross-modal object relationships and fail to exploit latent clues between objects, which hurts the accuracy and robustness of navigation decisions. The key of the solution is the proposed Dual Object Perception-Enhancement Network (DOPE): a Text Semantic Extraction (TSE) module extracts the relatively essential phrases from the instruction and feeds them to a Text Object Perception-Augmentation (TOPA) module to fully exploit details such as objects and actions, while an Image Object Perception-Augmentation (IOPA) module additionally models object information across modalities, improving navigation performance.

Link: https://arxiv.org/abs/2505.00743
Authors: Yinfeng Yu,Dongsheng Yang
Institutions: Xinjiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: Main paper (10 pages). Accepted for publication by ICMR (International Conference on Multimedia Retrieval) 2025

Abstract:Vision-and-Language Navigation (VLN) is a challenging task where an agent must understand language instructions and navigate unfamiliar environments using visual cues. The agent must accurately locate the target based on visual information from the environment and complete tasks through interaction with the surroundings. Despite significant advancements in this field, two major limitations persist: (1) Many existing methods input complete language instructions directly into multi-layer Transformer networks without fully exploiting the detailed information within the instructions, thereby limiting the agent’s language understanding capabilities during task execution; (2) Current approaches often overlook the modeling of object relationships across different modalities, failing to effectively utilize latent clues between objects, which affects the accuracy and robustness of navigation decisions. We propose a Dual Object Perception-Enhancement Network (DOPE) to address these issues to improve navigation performance. First, we design a Text Semantic Extraction (TSE) to extract relatively essential phrases from the text and input them into the Text Object Perception-Augmentation (TOPA) to fully leverage details such as objects and actions within the instructions. Second, we introduce an Image Object Perception-Augmentation (IOPA), which performs additional modeling of object information across different modalities, enabling the model to more effectively utilize latent clues between objects in images and text, enhancing decision-making accuracy. Extensive experiments on the R2R and REVERIE datasets validate the efficacy of the proposed approach.

[CV-54] Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

【Quick Read】: This paper aims to address the difficulty multimodal large language models (MLLMs) have in processing visual data accurately, especially in tasks requiring precise object recognition and fine visual detail: stringent token limits cause critical information to be omitted and hurt performance. The proposed solution is Zoomer, whose key lies in three innovations: a prompt-aware strategy that dynamically highlights relevant image regions; a spatial-preserving orchestration schema that maintains object integrity; and a budget-aware prompting method that balances global context against crucial visual details.

Link: https://arxiv.org/abs/2505.00742
Authors: Jiaxu Qian,Chendong Wang,Yifan Yang,Chaoyun Zhang,Huiqiang Jiang,Xufang Luo,Yu Kang,Qingwei Lin,Anlan Zhang,Shiqi Jiang,Ting Cao,Tianjun Mao,Suman Banerjee,Guyue Liu,Saravan Rajmohan,Dongmei Zhang,Yuqing Yang,Qi Zhang,Lili Qiu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Notes:

Abstract:Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omission of critical information, hampering performance. To address these limitations, we introduce Zoomer, a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. Zoomer features three key innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with crucial visual details. Comprehensive evaluations across multiple datasets demonstrate that Zoomer consistently outperforms baseline methods, achieving up to a 26.9% improvement in accuracy while significantly reducing token consumption.

[CV-55] Detection and Classification of Diseases in Multi-Crop Leaves using LSTM and CNN Models

【Quick Read】: This paper aims to address the serious challenge plant diseases pose to agriculture by detecting and classifying diseases early, reducing crop losses and improving crop management practices. The key of the solution is applying Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models, trained and validated on a large image dataset covering 38 disease classes, to classify plant leaf diseases efficiently and accurately. Experiments show the CNN model reaches a high 99.1% training accuracy and 96.4% validation accuracy, demonstrating the effectiveness and scalability of deep learning for agricultural monitoring.

Link: https://arxiv.org/abs/2505.00741
Authors: Srinivas Kanakala,Sneha Ningappa
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes:

Abstract:Plant diseases pose a serious challenge to agriculture by reducing crop yield and affecting food quality. Early detection and classification of these diseases are essential for minimising losses and improving crop management practices. This study applies Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) models to classify plant leaf diseases using a dataset containing 70,295 training images and 17,572 validation images across 38 disease classes. The CNN model was trained using the Adam optimiser with a learning rate of 0.0001 and categorical cross-entropy as the loss function. After 10 training epochs, the model achieved a training accuracy of 99.1% and a validation accuracy of 96.4%. The LSTM model reached a validation accuracy of 93.43%. Performance was evaluated using precision, recall, F1-score, and confusion matrix, confirming the reliability of the CNN-based approach. The results suggest that deep learning models, particularly CNN, enable an effective solution for accurate and scalable plant disease classification, supporting practical applications in agricultural monitoring.
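
The reported training setup maps directly onto a few lines of Keras. The sketch below uses the hyperparameters stated in the abstract (Adam with a learning rate of 0.0001, categorical cross-entropy, 10 epochs, 38 classes); the layer stack and the 128x128 input size are illustrative assumptions, since the paper's exact architecture is not given here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 38  # disease classes reported in the abstract

# Illustrative CNN; the abstract does not specify the layer stack.
model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(128, 128, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# Settings taken from the abstract: Adam, lr=0.0001, categorical
# cross-entropy, trained for 10 epochs.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```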

[CV-56] Fast2comm:Collaborative perception combined with prior knowledge

【Quick Read】: This paper targets two major challenges of real-world collaborative perception: balancing perception performance against bandwidth limits, and coping with localization errors. The key of the solution is Fast2comm, a prior-knowledge-based collaborative perception framework built on three strategies: a prior-supervised confidence feature generation method that effectively distinguishes foreground from background; a ground-truth-bounding-box-based spatial prior feature selection strategy that shares only the most informative prior features, reducing background noise, improving bandwidth efficiency, and increasing robustness to localization inaccuracies; and decoupled feature-fusion strategies between training and testing that enable dynamic bandwidth adaptation.

Link: https://arxiv.org/abs/2505.00740
Authors: Zhengbin Zhang,Yan Wu,Hongkun Zhang
Institutions: Tongji University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Notes: 8 pages, 8 figures

Abstract:Collaborative perception has the potential to significantly enhance perceptual accuracy through the sharing of complementary information among agents. However, real-world collaborative perception faces persistent challenges, particularly in balancing perception performance and bandwidth limitations, as well as coping with localization errors. To address these challenges, we propose Fast2comm, a prior knowledge-based collaborative perception framework. Specifically, (1)we propose a prior-supervised confidence feature generation method, that effectively distinguishes foreground from background by producing highly discriminative confidence features; (2)we propose GT Bounding Box-based spatial prior feature selection strategy to ensure that only the most informative prior-knowledge features are selected and shared, thereby minimizing background noise and optimizing bandwidth efficiency while enhancing adaptability to localization inaccuracies; (3)we decouple the feature fusion strategies between model training and testing phases, enabling dynamic bandwidth adaptation. To comprehensively validate our framework, we conduct extensive experiments on both real-world and simulated datasets. The results demonstrate the superior performance of our model and highlight the necessity of the proposed methods. Our code is available at this https URL.

[CV-57] MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection

【Quick Read】: This paper addresses two key problems of Segment Anything Model 2 (SAM2) in interactive video object segmentation: (1) because segmentation relies solely on mask memory from the past six frames, objects may disappear during inference, limiting long-range tracking; and (2) the memory is built from fixed past frames and is therefore vulnerable to object disappearance or occlusion, since segmentation results stored in memory may be inaccurate. The key of the solution is the proposed MoSAM, with two core strategies: Motion-Guided Prompting (MGP), which represents object motion in both sparse and dense forms and injects it into SAM2 to strengthen target tracking; and Spatial-Temporal Memory Selection (ST-MS), which dynamically identifies frames likely to contain accurate segmentation at both pixel and frame level, improving the reliability of memory and the final segmentation.

Link: https://arxiv.org/abs/2505.00739
Authors: Qiushi Yang,Yuan Yao,Miaomiao Cui,Liefeng Bo
Institutions: Institute for Intelligent Computing, Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes:

Abstract:The recent Segment Anything Model 2 (SAM2) has demonstrated exceptional capabilities in interactive object segmentation for both images and videos. However, as a foundational model on interactive segmentation, SAM2 performs segmentation directly based on mask memory from the past six frames, leading to two significant challenges. Firstly, during inference in videos, objects may disappear since SAM2 relies solely on memory without accounting for object motion information, which limits its long-range object tracking capabilities. Secondly, its memory is constructed from fixed past frames, making it susceptible to challenges associated with object disappearance or occlusion, due to potentially inaccurate segmentation results in memory. To address these problems, we present MoSAM, incorporating two key strategies to integrate object motion cues into the model and establish more reliable feature memory. Firstly, we propose Motion-Guided Prompting (MGP), which represents the object motion in both sparse and dense manners, then injects them into SAM2 through a set of motion-guided prompts. MGP enables the model to adjust its focus towards the direction of motion, thereby enhancing the object tracking capabilities. Furthermore, acknowledging that past segmentation results may be inaccurate, we devise a Spatial-Temporal Memory Selection (ST-MS) mechanism that dynamically identifies frames likely to contain accurate segmentation in both pixel- and frame-level. By eliminating potentially inaccurate mask predictions from memory, we can leverage more reliable memory features to exploit similar regions for improving segmentation results. Extensive experiments on various benchmarks of video object segmentation and video instance segmentation demonstrate that our MoSAM achieves state-of-the-art results compared to other competitors.
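
Frame-level memory selection of the kind ST-MS performs can be approximated with a simple confidence heuristic. The sketch below is an illustration, not MoSAM's implementation: the score (mean probability margin inside the predicted mask) and the 0.6 reliability threshold are assumed stand-ins; the memory size of six frames follows the abstract.

```python
import numpy as np

def select_memory_frames(mask_probs, k=6, frame_thresh=0.6):
    """Pick up to k past frames whose masks look most reliable.

    `mask_probs` is a list of per-pixel foreground probability maps.
    Frame confidence is the mean margin from 0.5 inside the predicted
    mask, rescaled to [0, 1]; thresholds are illustrative."""
    scores = []
    for p in mask_probs:
        fg = p > 0.5
        margin = np.abs(p - 0.5) * 2.0
        scores.append(float(margin[fg].mean()) if fg.any() else 0.0)
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order[:k] if scores[i] >= frame_thresh]
```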

[CV-58] Unconstrained Large-scale 3D Reconstruction and Rendering across Altitudes

【Quick Read】: This paper tackles the difficulty of producing photorealistic, navigable 3D site models in disaster-relief or law-enforcement scenarios, where large volumes of carefully collected images are unavailable. The key of the solution is the first public benchmark dataset for 3D reconstruction and novel view synthesis built from multiple calibrated ground-level, security-level, and airborne cameras, covering real-world challenges such as limited numbers of images, heterogeneous camera types, inconsistent lighting, and extreme viewpoint differences across altitudes. The dataset allows independent evaluation of unposed-camera calibration and novel-view rendering quality, and provides baseline performance for follow-up research.

Link: https://arxiv.org/abs/2505.00734
Authors: Neil Joshi,Joshua Carney,Nathanael Kuo,Homer Li,Cheng Peng,Myron Brown
Institutions: The Johns Hopkins University Applied Physics Laboratory; Department of Computer Science, The Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes:

Abstract:Production of photorealistic, navigable 3D site models requires a large volume of carefully collected images that are often unavailable to first responders for disaster relief or law enforcement. Real-world challenges include limited numbers of images, heterogeneous unposed cameras, inconsistent lighting, and extreme viewpoint differences for images collected from varying altitudes. To promote research aimed at addressing these challenges, we have developed the first public benchmark dataset for 3D reconstruction and novel view synthesis based on multiple calibrated ground-level, security-level, and airborne cameras. We present datasets that pose real-world challenges, independently evaluate calibration of unposed cameras and quality of novel rendered views, demonstrate baseline performance using recent state-of-practice methods, and identify challenges for further research.

[CV-59] Safe-Construct: Redefining Construction Safety Violation Recognition as 3D Multi-View Engagement Task CVPR

【Quick Read】: This paper addresses safety-violation recognition in construction environments, which is currently underexplored in computer vision. Existing models rely mainly on 2D object detection and fail to capture the complexity of real-world violations, due to an oversimplified task formulation, inadequate validation under realistic conditions, the absence of standardized baselines, and the lack of synthetic data generators. The key of the solution is the Safe-Construct framework, which reformulates violation recognition as a 3D multi-view engagement task combining scene-level worker-object context with 3D spatial understanding, together with the Synthetic Indoor Construction Site Generator (SICSG), which produces diverse, scalable training data, improving recognition performance and system robustness.

Link: https://arxiv.org/abs/2504.10880
Authors: Aviral Chharia,Tianyu Ren,Tomotake Furuhata,Kenji Shimada
Institutions: Carnegie Mellon University; University of Illinois Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Notes: CVPR Workshop 2025; Project Website: this https URL

Abstract:Recognizing safety violations in construction environments is critical yet remains underexplored in computer vision. Existing models predominantly rely on 2D object detection, which fails to capture the complexities of real-world violations due to: (i) an oversimplified task formulation treating violation recognition merely as object detection, (ii) inadequate validation under realistic conditions, (iii) absence of standardized baselines, and (iv) limited scalability from the unavailability of synthetic dataset generators for diverse construction scenarios. To address these challenges, we introduce Safe-Construct, the first framework that reformulates violation recognition as a 3D multi-view engagement task, leveraging scene-level worker-object context and 3D spatial understanding. We also propose the Synthetic Indoor Construction Site Generator (SICSG) to create diverse, scalable training data, overcoming data limitations. Safe-Construct achieves a 7.6% improvement over state-of-the-art methods across four violation types. We rigorously evaluate our approach in near-realistic settings, incorporating four violations, four workers, 14 objects, and challenging conditions like occlusions (worker-object, worker-worker) and variable illumination (back-lighting, overexposure, sunlight). By integrating 3D multi-view spatial understanding and synthetic data generation, Safe-Construct sets a new benchmark for scalable and robust safety monitoring in high-risk industries. Project Website: this https URL

[CV-60] Can Foundation Models Really Segment Tumors? A Benchmarking Odyssey in Lung CT Imaging

【Quick Read】: This paper addresses the accuracy of lung tumor segmentation, which matters for diagnosis, treatment planning, and patient outcomes in oncology. Traditional segmentation models struggle with the complexity of tumor morphology, size, and location. The key of the solution is a comprehensive benchmarking analysis of deep-learning segmentation models, comparing traditional architectures (U-Net and DeepLabV3), self-configuring models (nnUNet), and foundation models (MedSAM and MedSAM 2). Results show that foundation models, especially MedSAM 2, outperform traditional models in both segmentation accuracy and computational efficiency, highlighting their potential for improving clinical workflows and patient outcomes.

Link: https://arxiv.org/abs/2505.01239
Authors: Elena Mulero Ayllón,Massimiliano Mantegna,Linlin Shen,Paolo Soda,Valerio Guarrasi,Matteo Tortora
Institutions: Università Campus Bio-Medico di Roma; Shenzhen University; Umeå University; University of Genoa
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Accurate lung tumor segmentation is crucial for improving diagnosis, treatment planning, and patient outcomes in oncology. However, the complexity of tumor morphology, size, and location poses significant challenges for automated segmentation. This study presents a comprehensive benchmarking analysis of deep learning-based segmentation models, comparing traditional architectures such as U-Net and DeepLabV3, self-configuring models like nnUNet, and foundation models like MedSAM and MedSAM 2. Evaluating performance across two lung tumor segmentation datasets, we assess segmentation accuracy and computational efficiency under various learning paradigms, including few-shot learning and fine-tuning. The results reveal that while traditional models struggle with tumor delineation, foundation models, particularly MedSAM 2, outperform them in both accuracy and computational efficiency. These findings underscore the potential of foundation models for lung tumor segmentation, highlighting their applicability in improving clinical workflows and patient outcomes.

[CV-61] A Survey on 3D Reconstruction Techniques in Plant Phenotyping: From Classical Methods to Neural Radiance Fields (NeRF), 3D Gaussian Splatting (3DGS), and Beyond

【Quick Read】: This paper aims to meet plant phenotyping's need for accurate, automated 3D reconstruction in support of precision agriculture and crop improvement. The core question is how to effectively capture and represent detailed plant morphology and structure so as to improve the accuracy and efficiency of phenotypic analysis. The key of the solution is a comprehensive review and assessment of 3D reconstruction techniques, covering classical methods, Neural Radiance Fields (NeRF), and the emerging 3D Gaussian Splatting (3DGS), analyzing their methodologies, applications, and performance in plant phenotyping and discussing their respective strengths, limitations, and future directions.

Link: https://arxiv.org/abs/2505.00737
Authors: Jiajia Li,Xinda Qi,Seyed Hamidreza Nabaei,Meiqi Liu,Dong Chen,Xin Zhang,Xunyuan Yin,Zhaojian Li
Institutions: Michigan State University; University of Virginia; Nanyang Technological University; Mississippi State University
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: 17 pages, 7 figures, 4 tables

Abstract:Plant phenotyping plays a pivotal role in understanding plant traits and their interactions with the environment, making it crucial for advancing precision agriculture and crop improvement. 3D reconstruction technologies have emerged as powerful tools for capturing detailed plant morphology and structure, offering significant potential for accurate and automated phenotyping. This paper provides a comprehensive review of the 3D reconstruction techniques for plant phenotyping, covering classical reconstruction methods, emerging Neural Radiance Fields (NeRF), and the novel 3D Gaussian Splatting (3DGS) approach. Classical methods, which often rely on high-resolution sensors, are widely adopted due to their simplicity and flexibility in representing plant structures. However, they face challenges such as data density, noise, and scalability. NeRF, a recent advancement, enables high-quality, photorealistic 3D reconstructions from sparse viewpoints, but its computational cost and applicability in outdoor environments remain areas of active research. The emerging 3DGS technique introduces a new paradigm in reconstructing plant structures by representing geometry through Gaussian primitives, offering potential benefits in both efficiency and scalability. We review the methodologies, applications, and performance of these approaches in plant phenotyping and discuss their respective strengths, limitations, and future prospects (this https URL). Through this review, we aim to provide insights into how these diverse 3D reconstruction techniques can be effectively leveraged for automated and high-throughput plant phenotyping, contributing to the next generation of agricultural technology.

[CV-62] Leverag ing Depth and Attention Mechanisms for Improved RGB Image Inpainting

【Quick Read】: This paper addresses the limitation of conventional deep-learning inpainting methods that rely only on RGB images and may ignore important depth information, restricting the model's understanding of a scene's spatial and structural context. The key of the solution is the joint processing of RGB images and depth maps: a dual-encoder architecture processes the two modalities separately, and the encoded features are fused in the decoder through an attention mechanism, improving the accuracy and contextual awareness of image inpainting.

Link: https://arxiv.org/abs/2505.00735
Authors: Jin Hyun Park,Harine Choi,Praewa Pitiphat
Institutions: Texas A&M University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Existing deep learning-based image inpainting methods typically rely on convolutional networks with RGB images to reconstruct images. However, relying exclusively on RGB images may neglect important depth information, which plays a critical role in understanding the spatial and structural context of a scene. Just as human vision leverages stereo cues to perceive depth, incorporating depth maps into the inpainting process can enhance the model’s ability to reconstruct images with greater accuracy and contextual awareness. In this paper, we propose a novel approach that incorporates both RGB and depth images for enhanced image inpainting. Our models employ a dual encoder architecture, where one encoder processes the RGB image and the other handles the depth image. The encoded features from both encoders are then fused in the decoder using an attention mechanism, effectively integrating the RGB and depth representations. We use two different masking strategies, line and square, to test the robustness of the model under different types of occlusions. To further analyze the effectiveness of our approach, we use Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations to examine the regions of interest the model focuses on during inpainting. We show that incorporating depth information alongside the RGB image significantly improves the reconstruction quality. Through both qualitative and quantitative comparisons, we demonstrate that the depth-integrated model outperforms the baseline, with attention mechanisms further enhancing inpainting performance, as evidenced by multiple evaluation metrics and visualization.
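
The dual-encoder-plus-attention fusion described here can be sketched compactly in PyTorch. The module below is illustrative, not the paper's code: the channel width, head count, and the residual-plus-LayerNorm arrangement are assumptions; RGB features act as queries and depth features supply keys and values.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse RGB and depth feature maps with cross-attention."""

    def __init__(self, channels=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, rgb_feat, depth_feat):
        # (B, C, H, W) feature maps -> (B, H*W, C) token sequences
        b, c, h, w = rgb_feat.shape
        q = rgb_feat.flatten(2).transpose(1, 2)
        kv = depth_feat.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)  # residual on the RGB stream
        return fused.transpose(1, 2).reshape(b, c, h, w)
```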

Artificial Intelligence

[AI-0] SIME: Enhancing Policy Self-Improvement with Modal-level Exploration

【Quick Read】: This paper tackles the difficulty robotic systems face in generating new and valuable data during self-improvement, which limits how efficiently they can keep enhancing their capabilities through interaction with the environment. The key of the solution is modal-level exploration and data selection: by introducing a modal-level exploration mechanism during policy execution, the robot produces more diverse, multi-modal interactions, and the most valuable trials and high-quality segments are then selected from these interactions for learning, enabling effective self-improvement.

Link: https://arxiv.org/abs/2505.01396
Authors: Yang Jin,Jun Lv,Wenye Yu,Hongjie Fang,Yong-Lu Li,Cewu Lu
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions, often failing to generate new, valuable data for learning. In this paper, we identify the key to successful self-improvement: modal-level exploration and data selection. By incorporating a modal-level exploration mechanism during policy execution, the robot can produce more diverse and multi-modal interactions. At the same time, we select the most valuable trials and high-quality segments from these interactions for learning. We successfully demonstrate effective robot self-improvement on both simulation benchmarks and real-world experiments. The capability for self-improvement will enable us to develop more robust and high-success-rate robotic control strategies at a lower cost. Our code and experiment scripts are available at this https URL

[AI-1] FalconWing: An Open-Source Platform for Ultra-Light Fixed-Wing Aircraft Research

【Quick Read】: This paper addresses purely vision-guided autonomous landing for a fixed-wing UAV without an IMU or motion-capture support. The key of the solution is a novel real-to-sim-to-real learning approach: a photorealistic simulation environment is built via 3D Gaussian splatting trained on real-world images, nonlinear dynamics are identified from vision-estimated real-flight data, and a multi-modal Vision Transformer (ViT) policy is trained through simulation-only imitation learning, ultimately achieving an 80% autonomous landing success rate on the hardware platform.

Link: https://arxiv.org/abs/2505.01383
Authors: Yan Miao,Will Shen,Hang Cui,Sayan Mitra
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:

Abstract:We present FalconWing – an open-source, ultra-lightweight (150 g) fixed-wing platform for autonomy research. The hardware platform integrates a small camera, a standard airframe, offboard computation, and radio communication for manual overrides. We demonstrate FalconWing’s capabilities by developing and deploying a purely vision-based control policy for autonomous landing (without IMU or motion capture) using a novel real-to-sim-to-real learning approach. Our learning approach: (1) constructs a photorealistic simulation environment via 3D Gaussian splatting trained on real-world images; (2) identifies nonlinear dynamics from vision-estimated real-flight data; and (3) trains a multi-modal Vision Transformer (ViT) policy through simulation-only imitation learning. The ViT architecture fuses single RGB image with the history of control actions via self-attention, preserving temporal context while maintaining real-time 20 Hz inference. When deployed zero-shot on the hardware platform, this policy achieves an 80% success rate in vision-based autonomous landings. Together with the hardware specifications, we also open-source the system dynamics, the software for photorealistic simulator and the learning approach.

[AI-2] BalancEdit: Dynamically Balancing the Generality-Locality Trade-off in Multi-modal Model Editing

【Quick Read】: This paper addresses the generality-locality trade-off in knowledge updating for multi-modal models: existing model-editing techniques overlook the distinct influence ranges of different facts, compromising model performance in both generality and local accuracy. The key of the solution is BalancEdit, which generates both positive and negative samples for each fact to precisely determine its influence scope and incorporates this information into the model's latent space via a discrete, localized codebook of edits, without modifying the underlying model weights, thereby dynamically achieving an optimal balance between generality and locality.

Link: https://arxiv.org/abs/2505.01343
Authors: Dongliang Guo,Mengxuan Hu,Zihan Guan,Thomas Hartvigsen,Sheng Li
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:

Abstract:Large multi-modal models inevitably decay over time as facts change and previously learned information becomes outdated. Traditional approaches such as fine-tuning are often impractical for updating these models due to their size and complexity. Instead, direct knowledge editing within the models presents a more viable solution. Current model editing techniques, however, typically overlook the unique influence ranges of different facts, leading to compromised model performance in terms of both generality and locality. To address this issue, we introduce the concept of the generality-locality trade-off in multi-modal model editing. We develop a new model editing dataset named OKEDIT, specifically designed to effectively evaluate this trade-off. Building on this foundation, we propose BalancEdit, a novel method for balanced model editing that dynamically achieves an optimal balance between generality and locality. BalancEdit utilizes a unique mechanism that generates both positive and negative samples for each fact to accurately determine its influence scope and incorporates these insights into the model’s latent space using a discrete, localized codebook of edits, without modifying the underlying model weights. To our knowledge, this is the first approach explicitly addressing the generality-locality trade-off in multi-modal model editing. Our comprehensive results confirm the effectiveness of BalancEdit, demonstrating minimal trade-offs while maintaining robust editing capabilities. Our code and dataset will be available.

[AI-3] Constrained Network Adversarial Attacks: Validity Robustness and Transferability

【Quick Read】: This paper addresses the problem that adversarial examples generated by existing attack methods in IoT environments often violate domain-specific constraints, rendering a large fraction of them invalid and thereby overstating real-world vulnerabilities. The key of the solution is using simpler surrogate models (such as a Multi-Layer Perceptron, MLP) to generate more valid adversarial examples, and on that basis analyzing how adversarial severity transfers across ML/DL models commonly used in IoT contexts, stressing that both domain constraints and model architecture must be considered when evaluating and designing robust models for security-critical IoT and network applications.

Link: https://arxiv.org/abs/2505.01328
Authors: Anass Grini,Oumaima Taheri,Btissam El Khamlichi,Amal El Fallah-Seghrouchni
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Notes:

Abstract:While machine learning has significantly advanced Network Intrusion Detection Systems (NIDS), particularly within IoT environments where devices generate large volumes of data and are increasingly susceptible to cyber threats, these models remain vulnerable to adversarial attacks. Our research reveals a critical flaw in existing adversarial attack methodologies: the frequent violation of domain-specific constraints, such as numerical and categorical limits, inherent to IoT and network traffic. This leads to up to 80.3% of adversarial examples being invalid, significantly overstating real-world vulnerabilities. These invalid examples, though effective in fooling models, do not represent feasible attacks within practical IoT deployments. Consequently, relying on these results can mislead resource allocation for defense, inflating the perceived susceptibility of IoT-enabled NIDS models to adversarial manipulation. Furthermore, we demonstrate that simpler surrogate models like Multi-Layer Perceptron (MLP) generate more valid adversarial examples compared to complex architectures such as CNNs and LSTMs. Using the MLP as a surrogate, we analyze the transferability of adversarial severity to other ML/DL models commonly used in IoT contexts. This work underscores the importance of considering both domain constraints and model architecture when evaluating and designing robust ML/DL models for security-critical IoT and network applications.
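The validity issue the paper highlights boils down to checking each perturbed record against domain constraints. A minimal sketch follows, with an assumed constraint-specification format (the paper derives the real constraints from IoT and network-traffic semantics):

```python
import numpy as np

def is_valid_example(x, feature_specs):
    """Check an adversarial traffic record against domain constraints.

    `feature_specs` maps feature index -> ("num", lo, hi) for bounded
    numeric features, or ("cat", allowed_values) for categorical ones.
    The spec format is illustrative."""
    for i, spec in feature_specs.items():
        if spec[0] == "num":
            _, lo, hi = spec
            if not (lo <= x[i] <= hi):
                return False
        else:
            _, allowed = spec
            if x[i] not in allowed:
                return False
    return True

# Example: packet size in [0, 65535] bytes, protocol one of {6, 17}.
specs = {0: ("num", 0, 65535), 1: ("cat", {6, 17})}
print(is_valid_example(np.array([1500, 6]), specs))   # True
print(is_valid_example(np.array([-3, 6]), specs))     # False: invalid attack
```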

[AI-4] Enhancing SPARQL Query Rewriting for Complex Ontology Alignments

【Quick Read】: This paper addresses the challenge of uniformly querying heterogeneous ontologies on the Semantic Web, which is complicated by ontology alignments, especially rich correspondences (c : c). Existing approaches focus on simple (s : s) and partially complex (s : c) correspondences and neglect the more expressive ones. The key of the solution is to start from a user's need expressed in natural language and leverage the principle of equivalence transitivity together with the capabilities of large language models such as GPT-4 to automatically rewrite SPARQL queries from a source ontology to a target ontology, efficiently handling complex correspondences and lowering the barrier for non-expert users to query aligned ontologies with SPARQL.

Link: https://arxiv.org/abs/2505.01309
Authors: Anicet Lepetit Ondo,Laurence Capus,Mamadou Bousso
Institutions: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Notes:

Abstract:SPARQL query rewriting is a fundamental mechanism for uniformly querying heterogeneous ontologies in the Linked Data Web. However, the complexity of ontology alignments, particularly rich correspondences (c : c), makes this process challenging. Existing approaches primarily focus on simple (s : s) and partially complex ( s : c) alignments, thereby overlooking the challenges posed by more expressive alignments. Moreover, the intricate syntax of SPARQL presents a barrier for non-expert users seeking to fully exploit the knowledge encapsulated in ontologies. This article proposes an innovative approach for the automatic rewriting of SPARQL queries from a source ontology to a target ontology, based on a user’s need expressed in natural language. It leverages the principles of equivalence transitivity as well as the advanced capabilities of large language models such as GPT-4. By integrating these elements, this approach stands out for its ability to efficiently handle complex alignments, particularly (c : c) correspondences , by fully exploiting their expressiveness. Additionally, it facilitates access to aligned ontologies for users unfamiliar with SPARQL, providing a flexible solution for querying heterogeneous data.
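
An LLM-assisted rewriting step of this kind can be sketched in a few lines. The prompt template below is an illustrative guess at how such a pipeline might be wired, not the authors' prompt; the paper's approach additionally exploits equivalence transitivity over the alignment correspondences.

```python
# Illustrative sketch using the OpenAI Python client (v1 API).
from openai import OpenAI

client = OpenAI()

def rewrite_sparql(need_nl, source_query, alignments):
    """Ask an LLM to rewrite a SPARQL query given alignment correspondences."""
    prompt = (
        "Rewrite the following SPARQL query from the source ontology to the "
        "target ontology.\n"
        f"User need (natural language): {need_nl}\n"
        f"Source query:\n{source_query}\n"
        f"Alignment correspondences (including complex c:c):\n{alignments}\n"
        "Return only the rewritten SPARQL query."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```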

[AI-5] Document Retrieval Augmented Fine-Tuning (DRAFT) for safety-critical software assessments

【Quick Read】: This paper aims to overcome the inefficiency of traditional manual assessment of safety-critical software against complex regulatory frameworks. The key of the solution is Document Retrieval-Augmented Fine-Tuning (DRAFT), a novel approach that strengthens a large language model (LLM) for safety-critical compliance assessment: it builds on Retrieval-Augmented Generation (RAG) by introducing a fine-tuning framework for a dual-retrieval architecture that simultaneously accesses software documentation and the applicable reference standards, combined with a semi-automated dataset-generation methodology.

Link: https://arxiv.org/abs/2505.01307
Authors: Regan Bolton,Mohammadreza Sheikhfathollahi,Simon Parkinson,Vanessa Vulovic,Gary Bamford,Dan Basher,Howard Parkinson
Institutions: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Notes:

Abstract:Safety critical software assessment requires robust assessment against complex regulatory frameworks, a process traditionally limited by manual evaluation. This paper presents Document Retrieval-Augmented Fine-Tuning (DRAFT), a novel approach that enhances the capabilities of a large language model (LLM) for safety-critical compliance assessment. DRAFT builds upon existing Retrieval-Augmented Generation (RAG) techniques by introducing a novel fine-tuning framework that accommodates our dual-retrieval architecture, which simultaneously accesses both software documentation and applicable reference standards. To fine-tune DRAFT, we develop a semi-automated dataset generation methodology that incorporates variable numbers of relevant documents with meaningful distractors, closely mirroring real-world assessment scenarios. Experiments with GPT-4o-mini demonstrate a 7% improvement in correctness over the baseline model, with qualitative improvements in evidence handling, response structure, and domain-specific reasoning. DRAFT represents a practical approach to improving compliance assessment systems while maintaining the transparency and evidence-based reasoning essential in regulatory domains.

[AI-6] Early Detection of Patient Deterioration from Real-Time Wearable Monitoring System

【Quick Read】: This paper addresses early detection of patient deterioration, in particular how to extract meaningful insights from diverse heart-rate data and how to handle missing values in wearable-device data. The key of the solution is the proposed TARL method, which models the structural relationships of representative subsequences (shapelets) in heart-rate time series: it builds a shapelet-transition knowledge graph to capture illness progression and potential future changes, and introduces a transition-aware knowledge embedding that reinforces relationships among shapelets and quantifies the impact of missing values, producing comprehensive heart-rate representations for early illness detection.

Link: https://arxiv.org/abs/2505.01305
Authors: Lo Pang-Yun Ting,Hong-Pei Chen,An-Shan Liu,Chun-Yin Yeh,Po-Lin Chen,Kun-Ta Chuang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes:

Abstract:Early detection of patient deterioration is crucial for reducing mortality rates. Heart rate data has shown promise in assessing patient health, and wearable devices offer a cost-effective solution for real-time monitoring. However, extracting meaningful insights from diverse heart rate data and handling missing values in wearable device data remain key challenges. To address these challenges, we propose TARL, an innovative approach that models the structural relationships of representative subsequences, known as shapelets, in heart rate time series. TARL creates a shapelet-transition knowledge graph to model shapelet dynamics in heart rate time series, indicating illness progression and potential future changes. We further introduce a transition-aware knowledge embedding to reinforce relationships among shapelets and quantify the impact of missing values, enabling the formulation of comprehensive heart rate representations. These representations capture explanatory structures and predict future heart rate trends, aiding early illness detection. We collaborate with physicians and nurses to gather ICU patient heart rate data from wearables and diagnostic metrics assessing illness severity for evaluating deterioration. Experiments on real-world ICU data demonstrate that TARL achieves both high reliability and early detection. A case study further showcases TARL’s explainable detection process, highlighting its potential as an AI-driven tool to assist clinicians in recognizing early signs of patient deterioration.

[AI-7] ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow

【Quick Read】: This paper tackles the challenge that collecting large-scale robot demonstrations for complex manipulation skills is prohibitively expensive. The key of the solution is semantic action flow, a core intermediate representation that captures the essential spatio-temporal manipulator-object interactions while remaining invariant to superficial visual differences. The ViSA-Flow framework learns this representation self-supervised from unlabeled large-scale video: a generative model is first pre-trained on human-object interaction videos to learn a robust prior over manipulation structure, and this prior is then efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations, transferring knowledge from human video observation to robotic execution.

Link: https://arxiv.org/abs/2505.01288
Authors: Changhe Chen,Quantao Yang,Xiaohao Xu,Nima Fazeli,Olov Andersson
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:

Abstract:One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans are able to learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation self-supervised from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction video data, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. We demonstrate through extensive experiments on the CALVIN benchmark and real-world tasks that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution. Videos are available at this https URL.

[AI-8] 2DXformer: Dual Transformers for Wind Power Forecasting with Dual Exogenous Variables ICDM2024

【Quick Read】: This paper addresses two limitations in wind power forecasting: insufficient modeling of inter-variable relationships, and treating endogenous and exogenous variables identically, which introduces unnecessary interactions and increases model complexity. The key of the solution is the proposed 2DXformer architecture: inputs are split into three classes of variables (exogenous static, exogenous dynamic, and endogenous) and embedded as variable tokens in a channel-independent manner; an attention mechanism captures correlations among exogenous variables; and a multi-layer perceptron with residual connections models the impact of exogenous variables on the endogenous ones, improving forecasting accuracy.

Link: https://arxiv.org/abs/2505.01286
Authors: Yajuan Zhang,Jiahai Jiang,Yule Yan,Liang Yang,Ping Zhang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted by ICDM 2024

Abstract:Accurate wind power forecasting can help formulate scientific dispatch plans, which is of great significance for maintaining the safety, stability, and efficient operation of the power system. In recent years, wind power forecasting methods based on deep learning have focused on extracting the spatiotemporal correlations among data, achieving significant improvements in forecasting accuracy. However, they exhibit two limitations. First, there is a lack of modeling for the inter-variable relationships, which limits the accuracy of the forecasts. Second, by treating endogenous and exogenous variables equally, it leads to unnecessary interactions between the endogenous and exogenous variables, increasing the complexity of the model. In this paper, we propose the 2DXformer, which, building upon the previous work's focus on spatiotemporal correlations, addresses the aforementioned two limitations. Specifically, we classify the inputs of the model into three types: exogenous static variables, exogenous dynamic variables, and endogenous variables. First, we embed these variables as variable tokens in a channel-independent manner. Then, we use the attention mechanism to capture the correlations among exogenous variables. Finally, we employ a multi-layer perceptron with residual connections to model the impact of exogenous variables on endogenous variables. Experimental results on two real-world large-scale datasets indicate that our proposed 2DXformer can further improve the performance of wind power forecasting. The code is available in this repository: this https URL.
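
The core interaction pattern is easy to sketch. The module below is a rough PyTorch illustration of the idea (attention among exogenous variable tokens, then a residual MLP injecting their summary into the endogenous token), not the 2DXformer code; the embedding width, head count, and mean-pooling of exogenous context are assumptions.

```python
import torch
import torch.nn as nn

class ExoEndoBlock(nn.Module):
    """Attend among exogenous tokens; inject a summary into the
    endogenous (wind power) token via a residual MLP."""

    def __init__(self, d=64):
        super().__init__()
        self.exo_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, exo_tokens, endo_token):
        # exo_tokens: (B, n_exo, d); endo_token: (B, d)
        exo, _ = self.exo_attn(exo_tokens, exo_tokens, exo_tokens)
        ctx = exo.mean(dim=1)  # summarize exogenous information
        return endo_token + self.mlp(torch.cat([endo_token, ctx], dim=-1))
```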

[AI-9] Reduced-order structure-property linkages for stochastic metamaterials

【Quick Read】: This paper aims to model and predict the complex relationship between the unit-cell design space of mechanical metamaterials and their effective mechanical properties efficiently, improving design efficiency and performance evaluation. The key of the solution is a materials informatics framework: principal component analysis of 2-point correlation functions extracts salient features from a large dataset of randomly generated 2D metamaterials; an FFT-based homogenization approach efficiently computes each unit cell's effective elastic stiffness; and Gaussian process regression builds reduced-order surrogates mapping unit-cell designs to their homogenized elastic constants. The method yields a high-value low-dimensional representation of the voluminous stochastic dataset and, via an uncertainty-driven active learning framework, drastically reduces the required training data: as little as 0.61% of the original dataset suffices to produce accurate and robust structure-property maps.

Link: https://arxiv.org/abs/2505.01283
Authors: Hooman Danesh,Maruthi Annamaraju,Tim Brepols,Stefanie Reese,Surya R. Kalidindi
Institutions: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:The capabilities of additive manufacturing have facilitated the design and production of mechanical metamaterials with diverse unit cell geometries. Establishing linkages between the vast design space of unit cells and their effective mechanical properties is critical for the efficient design and performance evaluation of such metamaterials. However, physics-based simulations of metamaterial unit cells across the entire design space are computationally expensive, necessitating a materials informatics framework to efficiently capture complex structure-property relationships. In this work, principal component analysis of 2-point correlation functions is performed to extract the salient features from a large dataset of randomly generated 2D metamaterials. Physics-based simulations are performed using a fast Fourier transform (FFT)-based homogenization approach to efficiently compute the homogenized effective elastic stiffness across the extensive unit cell designs. Subsequently, Gaussian process regression is used to generate reduced-order surrogates, mapping unit cell designs to their homogenized effective elastic constant. It is demonstrated that the adopted workflow enables a high-value low-dimensional representation of the voluminous stochastic metamaterial dataset, facilitating the construction of robust structure-property maps. Finally, an uncertainty-based active learning framework is utilized to train a surrogate model with a significantly smaller number of data points compared to the original full dataset. It is shown that a dataset as small as 0.61% of the entire dataset is sufficient to generate accurate and robust structure-property maps.
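
The feature-extraction-plus-surrogate half of this workflow can be prototyped with standard tools. The sketch below is illustrative, not the authors' pipeline: the number of principal components and the default GP kernel are assumptions, and the FFT-based stiffness homogenization that produces the targets is not shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor

def two_point_correlation(microstructure):
    """Autocorrelation of a binary 2D unit-cell image via FFT (periodic
    boundaries), i.e. the 2-point statistics of the solid phase."""
    f = np.fft.fft2(microstructure)
    corr = np.fft.ifft2(f * np.conj(f)).real / microstructure.size
    return np.fft.fftshift(corr)

def fit_structure_property_map(cells, stiffness, n_pc=10):
    """cells: (n_samples, H, W) binary designs;
    stiffness: homogenized effective elastic constants per design."""
    feats = np.stack([two_point_correlation(c).ravel() for c in cells])
    pca = PCA(n_components=n_pc).fit(feats)
    scores = pca.transform(feats)  # low-dimensional representation
    gp = GaussianProcessRegressor().fit(scores, stiffness)
    return pca, gp
```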

[AI-10] A Physics-preserved Transfer Learning Method for Differential Equations

【Quick Read】: This paper addresses domain shift in data-driven solvers for differential equations (DEs) caused by changing learning environments (data bias or equation changes), noting that existing transfer learning (TL) methods for DEs lack either generalizability to general DE problems or physics preservation during training. The key of the solution is the proposed Physics-preserved Optimal Tensor Transport (POTT) method: the data domain is characterized as a product distribution, the essential problems are quantified as distribution bias and operator bias, and the push-forward distribution induced by the POTT map is used to adapt the data-driven model to the target domain while adaptively correcting the domain shift and preserving physical information. Experiments demonstrate its superior performance, generalizability, and physics preservation.

Link: https://arxiv.org/abs/2505.01281
Authors: Hao-Ran Yang,Chuan-Xian Ren
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:While data-driven methods such as neural operator have achieved great success in solving differential equations (DEs), they suffer from domain shift problems caused by different learning environments (with data bias or equation changes), which can be alleviated by transfer learning (TL). However, existing TL methods adopted in DEs problems lack either generalizability in general DEs problems or physics preservation during training. In this work, we focus on a general transfer learning method that adaptively correct the domain shift and preserve physical information. Mathematically, we characterize the data domain as product distribution and the essential problems as distribution bias and operator bias. A Physics-preserved Optimal Tensor Transport (POTT) method that simultaneously admits generalizability to common DEs and physics preservation of specific problem is proposed to adapt the data-driven model to target domain utilizing the push-forward distribution induced by the POTT map. Extensive experiments demonstrate the superior performance, generalizability and physics preservation of the proposed POTT method.

[AI-11] Enhancing Obsolescence Forecasting with Deep Generative Data Augmentation: A Semi-Supervised Framework for Low-Data Industrial Applications

【Quick Read】: This paper addresses the limited accuracy of obsolescence forecasting for electronic components caused by insufficient data. The key of the solution is a deep-learning-based framework that augments the training dataset through deep generative modeling, generating new obsolescence cases, and then trains a classical machine-learning forecasting model on the augmented data. To exploit the augmented dataset effectively, the framework adapts existing supervised-learning classifiers to semi-supervised learning.

Link: https://arxiv.org/abs/2505.01261
Authors: Elie Saad,Mariem Besbes,Marc Zolghadri,Victor Czmil,Claude Baron,Vincent Bourgeois
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:The challenge of electronic component obsolescence is particularly critical in systems with long life cycles. Various obsolescence management methods are employed to mitigate its impact, with obsolescence forecasting being a highly sought-after and prominent approach. As a result, numerous machine learning-based forecasting methods have been proposed. However, machine learning models require a substantial amount of relevant data to achieve high precision, which is lacking in the current obsolescence landscape in some situations. This work introduces a novel framework for obsolescence forecasting based on deep learning. The proposed framework solves the lack of available data through deep generative modeling, where new obsolescence cases are generated and used to augment the training dataset. The augmented dataset is then used to train a classical machine learning-based obsolescence forecasting model. To train classical forecasting models using augmented datasets, existing classical supervised-learning classifiers are adapted for semi-supervised learning within this framework. The proposed framework demonstrates state-of-the-art results on benchmarking datasets.

[AI-12] Exploring the Impact of Explainable AI and Cognitive Capabilities on Users Decisions

【Quick Read】: This paper investigates, in AI-assisted decision-making, how presenting AI information (prediction, confidence, accuracy) and different explanation styles (example-based, feature-based, rule-based, and counterfactual) affect users' reliance on AI, decision accuracy, and cognitive load, and how individuals with different personality traits (high vs. low Need for Cognition, NFC) differ in how they prioritize XAI interface elements, in accuracy, and in cognitive load. The key of the solution is a systematic evaluation of these explanation styles and AI information in a loan-application scenario, leading to the conclusion that XAI interfaces need user-centric personalization that combines diverse explanation styles with user characteristics to optimize human-AI collaboration.

Link: https://arxiv.org/abs/2505.01192
Authors: Federico Maria Cau,Lucio Davide Spano
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Notes: 30 pages, 7 figures

Abstract:Artificial Intelligence (AI) systems are increasingly used for decision-making across domains, raising debates over the information and explanations they should provide. Most research on Explainable AI (XAI) has focused on feature-based explanations, with less attention on alternative styles. Personality traits like the Need for Cognition (NFC) can also lead to different decision-making outcomes among low and high NFC individuals. We investigated how presenting AI information (prediction, confidence, and accuracy) and different explanation styles (example-based, feature-based, rule-based, and counterfactual) affect accuracy, reliance on AI, and cognitive load in a loan application scenario. We also examined low and high NFC individuals’ differences in prioritizing XAI interface elements (loan attributes, AI information, and explanations), accuracy, and cognitive load. Our findings show that high AI confidence significantly increases reliance on AI while reducing cognitive load. Feature-based explanations did not enhance accuracy compared to other conditions. Although counterfactual explanations were less understandable, they enhanced overall accuracy, increasing reliance on AI and reducing cognitive load when AI predictions were correct. Both low and high NFC individuals prioritized explanations after loan attributes, leaving AI information as the least important. However, we found no significant differences between low and high NFC groups in accuracy or cognitive load, raising questions about the role of personality traits in AI-assisted decision-making. These findings highlight the need for user-centric personalization in XAI interfaces, incorporating diverse explanation styles and exploring multiple personality traits and other user characteristics to optimize human-AI collaboration.

[AI-13] Secure Cluster-Based Hierarchical Federated Learning in Vehicular Networks

【Quick Read】: This paper addresses the threat that adversarial and unreliable vehicles pose to model integrity and convergence in Hierarchical Federated Learning (HFL) for vehicular networks. The key of the solution is a defense framework that integrates dynamic vehicle selection with robust anomaly detection in a cluster-based HFL architecture: each vehicle's reliability is assessed from its historical accuracy, contribution frequency, and anomaly records; anomaly detection combines Z-score and cosine-similarity analyses of model updates to catch statistical outliers and directional deviations; an adaptive thresholding mechanism tightens the cosine threshold for consistently high-performing vehicles, and a weighted gradient-averaging mechanism gives more trustworthy vehicles larger weights; finally, a cross-cluster consistency check counters coordinated attacks, forming a multi-level defense that effectively filters out malicious contributions.

Link: https://arxiv.org/abs/2505.01186
Authors: M. Saeid HaghighiFard,Sinem Coleri
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Notes:

Abstract:Hierarchical Federated Learning (HFL) has recently emerged as a promising solution for intelligent decision-making in vehicular networks, helping to address challenges such as limited communication resources, high vehicle mobility, and data heterogeneity. However, HFL remains vulnerable to adversarial and unreliable vehicles, whose misleading updates can significantly compromise the integrity and convergence of the global model. To address these challenges, we propose a novel defense framework that integrates dynamic vehicle selection with robust anomaly detection within a cluster-based HFL architecture, specifically designed to counter Gaussian noise and gradient ascent attacks. The framework performs a comprehensive reliability assessment for each vehicle by evaluating historical accuracy, contribution frequency, and anomaly records. Anomaly detection combines Z-score and cosine similarity analyses on model updates to identify both statistical outliers and directional deviations in model updates. To further refine detection, an adaptive thresholding mechanism is incorporated into the cosine similarity metric, dynamically adjusting the threshold based on the historical accuracy of each vehicle to enforce stricter standards for consistently high-performing vehicles. In addition, a weighted gradient averaging mechanism is implemented, which assigns higher weights to gradient updates from more trustworthy vehicles. To defend against coordinated attacks, a cross-cluster consistency check is applied to identify collaborative attacks in which multiple compromised clusters coordinate misleading updates. Together, these mechanisms form a multi-level defense strategy to filter out malicious contributions effectively. Simulation results show that the proposed algorithm significantly reduces convergence time compared to benchmark methods across both 1-hop and 3-hop topologies.
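
The detection-and-aggregation core lends itself to a compact sketch. The Python below is an illustration of the combined Z-score/cosine test with an accuracy-adaptive threshold and trust-weighted averaging; all numeric thresholds and the linear form of the adaptive threshold are assumptions, not values from the paper.

```python
import numpy as np

def flag_anomalous_update(update, cohort_updates, ref_dir, hist_acc,
                          z_thresh=3.0, base_cos=0.2, sensitivity=0.5):
    """Flag a vehicle's model update as anomalous.

    Combines (i) a Z-score test of the update norm against the cohort and
    (ii) a cosine-similarity test against a reference direction, with the
    cosine threshold tightened for historically accurate vehicles."""
    norms = np.array([np.linalg.norm(u) for u in cohort_updates])
    z = (np.linalg.norm(update) - norms.mean()) / (norms.std() + 1e-12)
    cos = update @ ref_dir / (
        np.linalg.norm(update) * np.linalg.norm(ref_dir) + 1e-12)
    cos_thresh = base_cos + sensitivity * hist_acc  # adaptive threshold
    return abs(z) > z_thresh or cos < cos_thresh

def weighted_average(updates, trust):
    """Aggregate updates with weights proportional to trust scores."""
    w = np.asarray(trust, dtype=float)
    w = w / w.sum()
    return sum(wi * ui for wi, ui in zip(w, updates))
```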

[AI-14] EnviKal-Loc: Sub-10m Indoor LoRaWAN Localization using an Environmental-Aware Path Loss and Adaptive RSSI Smoothing

【Quick Read】: This paper addresses sub-10 m indoor LoRaWAN localization, which is hampered by complex environmental conditions, multipath fading, and transient obstructions. The key of the solution is combining adaptive filtering with an extended log-distance, multi-wall path loss and shadowing (PLS) model, augmenting conventional models with key LoRaWAN parameters (received signal strength indicator (RSSI), frequency, and signal-to-noise ratio (SNR)) and dynamic environmental indicators (temperature, humidity, carbon dioxide, particulate matter, and barometric pressure). An adaptive Kalman filter reduces RSSI fluctuations, isolating persistent trends from momentary noise and substantially improving the system's robustness.

Link: https://arxiv.org/abs/2505.01185
Authors: Nahshon Mokua Obiri,Kristof Van Laerhoven
Institutions: Unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Notes:

Abstract:LoRaWAN technology’s extensive coverage positions it as a strong contender for large-scale IoT deployments. However, achieving sub-10 m accuracy in indoor localization remains challenging due to complex environmental conditions, multipath fading, and transient obstructions. This paper proposes a lightweight but robust approach combining adaptive filtering with an extended log-distance, multi-wall path loss and shadowing (PLS) model. Our methodology augments conventional models with critical LoRaWAN parameters (received signal strength indicator (RSSI), frequency, and signal-to-noise ratio (SNR)) and dynamic environmental indicators (temperature, humidity, carbon dioxide, particulate matter, and barometric pressure). An adaptive Kalman filter reduces RSSI fluctuations, isolating persistent trends from momentary noise. Using a six-month dataset of 1,328,334 field measurements, we evaluate three models: the baseline COST 231 multi-wall model (MWM), the baseline model augmented with environmental parameters (MWM-EP), and a forward-only adaptive Kalman-filtered RSSI version of the latter (MWM-EP-KF). Results confirm that the MWM-EP-KF achieves a mean absolute error (MAE) of 5.81 m, outperforming both the MWM-EP (10.56 m) and the baseline MWM framework (17.98 m). Environmental augmentation reduces systematic errors by 41.22%, while Kalman filtering significantly enhances robustness under high RSSI volatility by 42.63%, on average across all devices. These findings present an interpretable, efficient solution for precise indoor LoRaWAN localization in dynamically changing environments.
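
Both building blocks, the multi-wall path loss model and the forward-only scalar Kalman filter over RSSI, are compact enough to sketch. The coefficients below (reference loss, path-loss exponent, per-wall penalty, noise variances) are placeholders to be fitted from measurements, not the paper's estimates.

```python
import numpy as np

def multiwall_pathloss(d, n_walls, pl0=40.0, gamma=2.8, d0=1.0, wall_loss=5.0):
    """Log-distance path loss with a per-wall penalty (COST 231 MWM style):
    PL(d) = PL0 + 10*gamma*log10(d/d0) + n_walls*wall_loss."""
    return pl0 + 10.0 * gamma * np.log10(d / d0) + n_walls * wall_loss

class RssiKalman:
    """Forward-only scalar Kalman filter smoothing an RSSI stream."""

    def __init__(self, q=0.05, r=4.0):
        self.q, self.r = q, r  # process / measurement noise variances
        self.x, self.p = None, 1.0

    def update(self, rssi):
        if self.x is None:          # initialize on first sample
            self.x = rssi
            return self.x
        self.p += self.q                 # predict
        k = self.p / (self.p + self.r)   # Kalman gain
        self.x += k * (rssi - self.x)    # correct
        self.p *= (1.0 - k)
        return self.x
```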

[AI-15] Explainable AI Based Diagnosis of Poisoning Attacks in Evolutionary Swarms GECCO’25

【Quick Read】: This paper addresses the vulnerability of team-level coordination strategies in distributed autonomous swarms to data poisoning attacks, which can lead to inaccurate cooperation or adversarial behavior among agents. The key of the solution is a framework that uses explainable AI methods to analyze the effects of such attacks: agent interactions are modeled with evolutionary intelligence, where an optimal coalition strategically emerges to perform coordinated tasks; the swarm model is then systematically poisoned with data-manipulation attacks, allowing the effects of poisoning on the team strategy to be quantified and diagnostic footprint characterizations to be extracted.

Link: https://arxiv.org/abs/2505.01181
Authors: Mehrdad Asadi,Roxana Rădulescu,Ann Nowé
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Notes: To appear in short form in Genetic and Evolutionary Computation Conference (GECCO '25 Companion), 2025

Abstract:Swarming systems, such as for example multi-drone networks, excel at cooperative tasks like monitoring, surveillance, or disaster assistance in critical environments, where autonomous agents make decentralized decisions in order to fulfill team-level objectives in a robust and efficient manner. Unfortunately, team-level coordinated strategies in the wild are vulnerable to data poisoning attacks, resulting in either inaccurate coordination or adversarial behavior among the agents. To address this challenge, we contribute a framework that investigates the effects of such data poisoning attacks, using explainable AI methods. We model the interaction among agents using evolutionary intelligence, where an optimal coalition strategically emerges to perform coordinated tasks. Then, through a rigorous evaluation, the swarm model is systematically poisoned using data manipulation attacks. We showcase the applicability of explainable AI methods to quantify the effects of poisoning on the team strategy and extract footprint characterizations that enable diagnosing. Our findings indicate that when the model is poisoned above 10%, non-optimal strategies resulting in inefficient cooperation can be identified.

[AI-16] LLM Security: Vulnerabilities Attacks Defenses and Countermeasures

【Quick Read】: This paper surveys the security threats and vulnerabilities large language models (LLMs) may face both during the training phase and after deployment. The key of its contribution is to systematically define and categorize attacks on LLMs, distinguishing training-phase attacks from attacks on already trained models, to analyze these attacks in depth, and to explore the corresponding defense mechanisms. Defenses are classified into two primary categories, prevention-based and detection-based; the paper also evaluates the effectiveness of known defenses against the different threats, aiming to provide a structured framework for securing LLMs and to identify areas needing further research.

Link: https://arxiv.org/abs/2505.01177
Authors: Francisco Aguilera-Martínez,Fernando Berzal
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Notes:

Abstract:As large language models (LLMs) continue to evolve, it is critical to assess the security threats and vulnerabilities that may arise both during their training phase and after models have been deployed. This survey seeks to define and categorize the various attacks targeting LLMs, distinguishing between those that occur during the training phase and those that affect already trained models. A thorough analysis of these attacks is presented, alongside an exploration of defense mechanisms designed to mitigate such threats. Defenses are classified into two primary categories: prevention-based and detection-based defenses. Furthermore, our survey summarizes possible attacks and their corresponding defense strategies. It also provides an evaluation of the effectiveness of the known defense mechanisms for the different security threats. Our survey aims to offer a structured framework for securing LLMs, while also identifying areas that require further research to improve and strengthen defenses against emerging security challenges.

[AI-17] Distilling Two-Timed Flow Models by Separately Matching Initial and Terminal Velocities

【Quick Read】: This paper considers how to distill a flow matching model, which learns a time-dependent vector field generating a smooth probability path from a noise distribution to the data distribution, into a simpler two-timed flow model (TTFM) while preserving generation quality. The key of the solution is a new loss function, the initial/terminal velocity matching (ITVM) loss, which adds redundant terms to match the velocities at the initial time point, removes the derivative from the velocity term at the terminal time point, and uses an exponential-moving-average (EMA)-stabilized version of the model under training to compute the target terminal average velocity, improving few-step generation performance.

Link: https://arxiv.org/abs/2505.01169
Authors: Pramook Khungurn,Pratch Piyawongwisal,Sira Sriswadi,Supasorn Suwajanakorn
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:A flow matching model learns a time-dependent vector field v_t(x) that generates a probability path {p_t}, 0 ≤ t ≤ 1, that interpolates between a well-known noise distribution p_0 and the data distribution p_1. It can be distilled into a two-timed flow model (TTFM) φ_{s,x}(t) that can transform a sample belonging to the distribution at an initial time s to another belonging to the distribution at a terminal time t in one function evaluation. We present a new loss function for TTFM distillation called the initial/terminal velocity matching (ITVM) loss that extends the Lagrangian Flow Map Distillation (LFMD) loss proposed by Boffi et al. by adding redundant terms to match the initial velocities at time s, removing the derivative from the terminal velocity term at time t, and using a version of the model under training, stabilized by exponential moving averaging (EMA), to compute the target terminal average velocity. Preliminary experiments show that our loss leads to better few-step generation performance on multiple types of datasets and model architectures over baselines.
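
To make the shape of this objective concrete, here is a minimal PyTorch sketch of an ITVM-style distillation step, reconstructed from the abstract alone. The toy `TTFM` network, the dummy `teacher_v` field, the finite-difference step `h`, and the exact form of the two matching terms are assumptions, not the authors' implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TTFM(nn.Module):
    """Toy two-timed flow model phi(s, x, t): state at time s -> state at time t."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 2, hidden), nn.SiLU(), nn.Linear(hidden, dim))

    def forward(self, s, x, t):
        inp = torch.cat([x, s.unsqueeze(-1), t.unsqueeze(-1)], dim=-1)
        # Residual form so that phi(s, x, s) == x by construction.
        return x + (t - s).unsqueeze(-1) * self.net(inp)

def itvm_style_loss(phi, ema_phi, teacher_v, x, s, t, h=1e-3):
    # (1) Initial-velocity term: d/dt phi(s, x, t) at t = s should match the
    #     teacher field at (x, s); approximated here by a finite difference.
    v_init = (phi(s, x, s + h) - x) / h
    loss_init = F.mse_loss(v_init, teacher_v(x, s))
    # (2) Terminal term without a derivative: match the student's average
    #     velocity over [s, t] to the teacher field evaluated at the EMA
    #     model's terminal point (detached, so no gradient flows through it).
    with torch.no_grad():
        x_t_ema = ema_phi(s, x, t)
    v_avg = (phi(s, x, t) - x) / (t - s).unsqueeze(-1)
    loss_term = F.mse_loss(v_avg, teacher_v(x_t_ema, t))
    return loss_init + loss_term

@torch.no_grad()
def ema_update(ema_phi, phi, decay=0.999):
    for p_ema, p in zip(ema_phi.parameters(), phi.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)

# Toy usage with a dummy teacher field standing in for the pretrained model.
teacher_v = lambda x, t: -x
phi = TTFM(); ema_phi = copy.deepcopy(phi)
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)
x = torch.randn(128, 2)
s = torch.rand(128) * 0.5
t = s + 0.1 + torch.rand(128) * 0.4
loss = itvm_style_loss(phi, ema_phi, teacher_v, x, s, t)
opt.zero_grad(); loss.backward(); opt.step(); ema_update(ema_phi, phi)
```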

[AI-18] Harmonizing Intra-coherence and Inter-divergence in Ensemble Attacks for Adversarial Transferability

[Quick Read]: This paper targets the threat that improved transferability of adversarial examples in model-ensemble attacks poses to the security of deep neural networks, focusing on two key shortcomings of existing methods: insufficient capture of gradient directions shared across models and the lack of an adaptive weight-allocation mechanism. The key to the solution is Harmonized Ensemble for Adversarial Transferability (HEAT), which introduces domain generalization into adversarial example generation for the first time and comprises two core modules: a Consensus Gradient Direction Synthesizer (which synthesizes shared gradient directions via Singular Value Decomposition) and a Dual-Harmony Weight Orchestrator (which dynamically balances intra-domain coherence and inter-domain diversity).

Link: https://arxiv.org/abs/2505.01168
Authors: Zhaoyang Ma, Zhihao Wu, Wang Lu, Xin Gao, Jinghang Yue, Taolin Zhang, Lipo Wang, Youfang Lin, Jing Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The development of model ensemble attacks has significantly improved the transferability of adversarial examples, but this progress also poses severe threats to the security of deep neural networks. Existing methods, however, face two critical challenges: insufficient capture of shared gradient directions across models and a lack of adaptive weight allocation mechanisms. To address these issues, we propose a novel method Harmonized Ensemble for Adversarial Transferability (HEAT), which introduces domain generalization into adversarial example generation for the first time. HEAT consists of two key modules: Consensus Gradient Direction Synthesizer, which uses Singular Value Decomposition to synthesize shared gradient directions; and Dual-Harmony Weight Orchestrator which dynamically balances intra-domain coherence, stabilizing gradients within individual models, and inter-domain diversity, enhancing transferability across models. Experimental results demonstrate that HEAT significantly outperforms existing methods across various datasets and settings, offering a new perspective and direction for adversarial attack research.
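
As a rough illustration of the consensus-gradient idea, the sketch below stacks per-model input gradients and takes the leading right-singular vector of an SVD as a shared attack direction, followed by a single FGSM-like step. This is one plausible reading of the abstract; the models, loss, and update rule are placeholders, and HEAT's actual modules (including the weight orchestrator) are not reproduced here.

```python
import torch
import torch.nn.functional as F

def consensus_direction(models, x, y):
    """Leading right-singular vector of the stacked per-model input gradients."""
    grads = []
    for m in models:
        x_req = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(m(x_req), y)
        (g,) = torch.autograd.grad(loss, x_req)
        grads.append(g.flatten(1))                 # (B, D) for each model
    G = torch.stack(grads, dim=1)                  # (B, K, D)
    _, _, Vh = torch.linalg.svd(G, full_matrices=False)
    v = Vh[:, 0, :]                                # (B, D): shared direction
    # Orient each vector to agree with the mean gradient (SVD sign ambiguity).
    sign = torch.sign((v * G.mean(dim=1)).sum(-1, keepdim=True))
    return (sign * v).view_as(x)

def fgsm_ensemble_step(models, x, y, eps=8 / 255):
    """One FGSM-like step along the consensus direction."""
    d = consensus_direction(models, x, y)
    return (x + eps * d.sign()).clamp(0, 1).detach()
```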

[AI-19] Risk Analysis and Design Against Adversarial Actions

[Quick Read]: This paper addresses how machine learning models can provide reliable predictions in the face of adversarial actions, a challenge arising because the data encountered at deployment often deviate from the conditions under which the model was trained. It proposes a versatile, well-principled framework for assessing a model's robustness against attacks of diverse types and intensities. The key to the solution is learning via relaxed optimization techniques and assessing model vulnerability in a distribution-free setting without requiring additional test data, thereby strengthening trust in the model's applicability and supporting the choice among competing alternatives.

Link: https://arxiv.org/abs/2505.01130
Authors: Marco C. Campi, Algo Carè, Luis G. Crespo, Simone Garatti, Federico A. Ramponi
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:Learning models capable of providing reliable predictions in the face of adversarial actions has become a central focus of the machine learning community in recent years. This challenge arises from observing that data encountered at deployment time often deviate from the conditions under which the model was trained. In this paper, we address deployment-time adversarial actions and propose a versatile, well-principled framework to evaluate the model’s robustness against attacks of diverse types and intensities. While we initially focus on Support Vector Regression (SVR), the proposed approach extends naturally to the broad domain of learning via relaxed optimization techniques. Our results enable an assessment of the model vulnerability without requiring additional test data and operate in a distribution-free setup. These results not only provide a tool to enhance trust in the model’s applicability but also aid in selecting among competing alternatives. Later in the paper, we show that our findings also offer useful insights for establishing new results within the out-of-distribution framework.

[AI-20] Multi-Objective Reinforcement Learning for Water Management AAMAS2025

[Quick Read]: This paper addresses the challenges of applying multi-objective reinforcement learning (MORL) to the real world, in particular the lack of complex, realistic environments and benchmarks. The key to the solution is a water-management case study of the Nile river basin, modeled as a MORL environment, which provides a realistic task for benchmarking existing MORL algorithms. The results show that specialized water-management methods outperform state-of-the-art MORL methods on this task, highlighting the scalability issues MORL algorithms face in real-world scenarios.

Link: https://arxiv.org/abs/2505.01094
Authors: Zuzanna Osika, Roxana Radelescu, Jazmin Zatarain Salazar, Frans Oliehoek, Pradeep K. Murukannaiah
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to AAMAS 2025

Abstract:Many real-world problems (e.g., resource management, autonomous driving, drug discovery) require optimizing multiple, conflicting objectives. Multi-objective reinforcement learning (MORL) extends classic reinforcement learning to handle multiple objectives simultaneously, yielding a set of policies that capture various trade-offs. However, the MORL field lacks complex, realistic environments and benchmarks. We introduce a water resource (Nile river basin) management case study and model it as a MORL environment. We then benchmark existing MORL algorithms on this task. Our results show that specialized water management methods outperform state-of-the-art MORL approaches, underscoring the scalability challenges MORL algorithms face in real-world scenarios.

[AI-21] Artificial Intelligence in Government: Why People Feel They Lose Control

[Quick Read]: This paper examines the governance challenges raised by the use of Artificial Intelligence (AI) in public administration, especially its implications for fairness, transparency, and accountability. The key to the solution is applying principal-agent theory (PAT) to analyze AI adoption as a special case of delegation, identifying three core tensions: assessability, dependency, and contestability. Experimental evidence shows that although efficiency gains initially bolster public trust, structural risks sharply erode institutional trust and citizens' sense of control in the long run, so policymakers need to address delegation risks transparently to maintain public trust.

Link: https://arxiv.org/abs/2505.01085
Authors: Alexander Wuttke, Adrian Rauchfleisch, Andreas Jungherr
Affiliation: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:The use of Artificial Intelligence (AI) in public administration is expanding rapidly, moving from automating routine tasks to deploying generative and agentic systems that autonomously act on goals. While AI promises greater efficiency and responsiveness, its integration into government functions raises concerns about fairness, transparency, and accountability. This article applies principal-agent theory (PAT) to conceptualize AI adoption as a special case of delegation, highlighting three core tensions: assessability (can decisions be understood?), dependency (can the delegation be reversed?), and contestability (can decisions be challenged?). These structural challenges may lead to a “failure-by-success” dynamic, where early functional gains obscure long-term risks to democratic legitimacy. To test this framework, we conducted a pre-registered factorial survey experiment across tax, welfare, and law enforcement domains. Our findings show that although efficiency gains initially bolster trust, they simultaneously reduce citizens’ perceived control. When the structural risks come to the foreground, institutional trust and perceived control both drop sharply, suggesting that hidden costs of AI adoption significantly shape public attitudes. The study demonstrates that PAT offers a powerful lens for understanding the institutional and political implications of AI in government, emphasizing the need for policymakers to address delegation risks transparently to maintain public trust.

[AI-22] MADIL: An MDL-based Framework for Efficient Program Synthesis in the ARC Benchmark

[Quick Read]: This paper addresses the efficiency of skill acquisition and generalization in artificial intelligence, in particular efficient learning and reasoning under minimal training requirements. The key to the solution is MADIL (MDL-based AI), a novel approach based on the Minimum Description Length (MDL) principle that achieves structured generalization through pattern-based decomposition, offering greater efficiency and interpretability at some cost in raw performance.

Link: https://arxiv.org/abs/2505.01081
Authors: Sébastien Ferré
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 54 pages

Abstract:Artificial Intelligence (AI) has achieved remarkable success in specialized tasks but struggles with efficient skill acquisition and generalization. The Abstraction and Reasoning Corpus (ARC) benchmark evaluates intelligence based on minimal training requirements. While Large Language Models (LLMs) have recently improved ARC performance, they rely on extensive pre-training and high computational costs. We introduce MADIL (MDL-based AI), a novel approach leveraging the Minimum Description Length (MDL) principle for efficient inductive learning. MADIL performs pattern-based decomposition, enabling structured generalization. While its performance (7% at ArcPrize 2024) remains below LLM-based methods, it offers greater efficiency and interpretability. This paper details MADIL’s methodology, its application to ARC, and experimental evaluations.

[AI-23] Retrieval Augmented Learning: A Retrial-based Large Language Model Self-Supervised Learning and Autonomous Knowledge Generation

[Quick Read]: This paper addresses the lack of domain-specific data in the pre-training of Large Language Models (LLMs), which severely limits LLM-based decision systems in specialized applications, while scenario-specific post-training demands substantial computational resources. The proposed solution is Retrial-Augmented Learning (RAL), whose key idea is a reward-free self-supervised learning framework that operates without model training: by developing Retrieval-Augmented Generation (RAG) into a module for organizing intermediate data, it realizes a three-stage autonomous knowledge-generation process of proposing a hypothesis, validating the hypothesis, and generating knowledge.

Link: https://arxiv.org/abs/2505.01073
Authors: Zongyuan Li, Pengfei Li, Runnan Qi, Yanan Ni, Lumin Jiang, Hui Wu, Xuebo Zhang, Kuihua Huang, Xian Guo
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The lack of domain-specific data in the pre-training of Large Language Models (LLMs) severely limits LLM-based decision systems in specialized applications, while post-training a model in the scenarios requires significant computational resources. In this paper, we present Retrial-Augmented Learning (RAL), a reward-free self-supervised learning framework for LLMs that operates without model training. By developing Retrieval-Augmented Generation (RAG) into a module for organizing intermediate data, we realized a three-stage autonomous knowledge generation of proposing a hypothesis, validating the hypothesis, and generating the knowledge. The method is evaluated in the LLM-PySC2 environment, a representative decision-making platform that combines sufficient complexity with domain-specific knowledge requirements. Experiments demonstrate that the proposed method effectively reduces hallucination by generating and utilizing validated knowledge, and increases decision-making performance at an extremely low cost. Meanwhile, the approach exhibits potential in out-of-distribution (OOD) tasks, robustness, and transferability, making it a cost-friendly but effective solution for decision-making problems and autonomous knowledge generation.

[AI-24] Improving Group Fairness in Knowledge Distillation via Laplace Approximation of Early Exits

[Quick Read]: This paper addresses the decline in group fairness in knowledge distillation (KD) caused by the gap between teacher and student feature representations, especially when labels are spuriously correlated with input attributes. The key to the solution is using Laplace-approximation-based methods to obtain well-calibrated uncertainty estimates that reweight challenging instances more effectively and thus improve group fairness. Compared with reweighting based on confidence margins, the authors hypothesize that the Laplace approximation identifies difficult or ambiguous instances more robustly.

Link: https://arxiv.org/abs/2505.01070
Authors: Edvin Fasth, Sagar Singh
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:Knowledge distillation (KD) has become a powerful tool for training compact student models using larger, pretrained teacher models, often requiring less data and computational resources. Teacher models typically possess more layers and thus exhibit richer feature representations compared to their student counterparts. Furthermore, student models tend to learn simpler, surface-level features in their early layers. This discrepancy can increase errors in groups where labels spuriously correlate with specific input attributes, leading to a decline in group fairness even when overall accuracy remains comparable to the teacher. To mitigate these challenges, Early-Exit Neural Networks (EENNs), which enable predictions at multiple intermediate layers, have been employed. Confidence margins derived from these early exits have been utilized to reweight both cross-entropy and distillation losses on a per-instance basis. In this paper, we propose that leveraging Laplace approximation-based methods to obtain well-calibrated uncertainty estimates can also effectively reweight challenging instances and improve group fairness. We hypothesize that Laplace approximation offers a more robust identification of difficult or ambiguous instances compared to margin-based approaches. To validate our claims, we benchmark our approach using a BERT-based model on the MultiNLI dataset.
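
A minimal sketch of the reweighting idea follows, assuming a diagonal Laplace approximation over a linear last layer (empirical Fisher plus an isotropic prior) rather than the paper's early-exit setup; the Monte-Carlo entropy estimate and the weighting rule are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F

def fit_diag_laplace(W, feats, labels, prior_prec=1.0):
    """Diagonal Laplace posterior over a linear last layer W (C x D):
    precision = empirical Fisher diagonal + prior precision."""
    fisher = torch.zeros_like(W)
    for f, y in zip(feats, labels):
        p = F.softmax(W @ f, dim=0)
        err = p - F.one_hot(torch.as_tensor(y), W.shape[0]).float()
        fisher += (err.unsqueeze(1) * f.unsqueeze(0)) ** 2   # per-sample grad^2
    return 1.0 / (fisher + prior_prec)                        # posterior variance

def predictive_entropy(W, var, f, n_samples=32):
    """MC predictive entropy under the Laplace posterior (weight sampling)."""
    probs = 0.0
    for _ in range(n_samples):
        Ws = W + var.sqrt() * torch.randn_like(W)
        probs = probs + F.softmax(Ws @ f, dim=0)
    probs = probs / n_samples
    return -(probs * probs.clamp(min=1e-12).log()).sum()

def instance_weights(entropies, alpha=1.0):
    """Up-weight uncertain (ambiguous) instances in the CE + KD loss."""
    e = torch.stack(entropies)
    return 1.0 + alpha * (e - e.min()) / (e.max() - e.min() + 1e-8)

# Per-instance training loss (illustrative): loss_i = w_i * (ce_i + lam * kd_i)
```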

[AI-25] A Rusty Link in the AI Supply Chain: Detecting Evil Configurations in Model Repositories

[Quick Read]: This paper addresses the security threats posed by the configuration files of pretrained models on AI model-hosting platforms such as Hugging Face, in particular the risk that these configuration files can be exploited to execute unauthorized code. The key to the proposed solution is CONFIGSCAN, an LLM-based tool that analyzes configuration files in the context of their associated runtime code and critical libraries, detecting suspicious elements efficiently with low false-positive rates and high accuracy.

Link: https://arxiv.org/abs/2505.01067
Authors: Ziqi Ding, Qian Fu, Junchen Ding, Gelei Deng, Yi Liu, Yuekang Li
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in large language models (LLMs) have spurred the development of diverse AI applications from code generation and video editing to text generation. However, AI supply chains such as Hugging Face, which host pretrained models and their associated configuration files contributed by the public, face significant security challenges. In particular, configuration files originally intended to set up models by specifying parameters and initial settings can be exploited to execute unauthorized code, yet research has largely overlooked their security compared to that of the models themselves. In this work, we present the first comprehensive study of malicious configurations on Hugging Face, identifying three attack scenarios (file, website, and repository operations) that expose inherent risks. To address these threats, we introduce CONFIGSCAN, an LLM-based tool that analyzes configuration files in the context of their associated runtime code and critical libraries, effectively detecting suspicious elements with low false positive rates and high accuracy. Our extensive evaluation uncovers thousands of suspicious repositories and configuration files, underscoring the urgent need for enhanced security validation in AI model hosting platforms.

[AI-26] Good News for Script Kiddies? Evaluating Large Language Models for Automated Exploit Generation

[Quick Read]: This paper studies the effectiveness of Large Language Models (LLMs) for automated exploit generation (AEG), evaluating both their cooperativeness and their technical proficiency. The key to the solution is a benchmark of refactored versions of software-security labs that mitigates dataset bias, together with an LLM-based attacker framework that systematically prompts LLMs to generate exploits. The framework reveals performance differences across models on AEG tasks and lays a foundation and direction for future research on LLM-driven AEG.

Link: https://arxiv.org/abs/2505.01065
Authors: David Jin, Qian Fu, Yuekang Li
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in code-related tasks, raising concerns about their potential for automated exploit generation (AEG). This paper presents the first systematic study on LLMs’ effectiveness in AEG, evaluating both their cooperativeness and technical proficiency. To mitigate dataset bias, we introduce a benchmark with refactored versions of five software security labs. Additionally, we design an LLM-based attacker to systematically prompt LLMs for exploit generation. Our experiments reveal that GPT-4 and GPT-4o exhibit high cooperativeness, comparable to uncensored models, while Llama3 is the most resistant. However, no model successfully generates exploits for refactored labs, though GPT-4o’s minimal errors highlight the potential for LLM-driven AEG advancements.

[AI-27] Model Tensor Planning

[Quick Read]: This paper addresses the poor exploration of sampling-based model predictive control (MPC) in nonlinear and contact-rich robotic tasks, which stems from locally greedy sampling schemes. The key to the solution is Model Tensor Planning (MTP), a novel sampling-based MPC framework that generates high-entropy control trajectories through structured tensor sampling and interpolates them with B-splines and Akima splines to ensure smooth and globally diverse control candidates. MTP further introduces a simple β-mixing strategy that blends local exploitative and global exploratory samples within a modified Cross-Entropy Method (CEM) update, balancing control refinement and exploration.

Link: https://arxiv.org/abs/2505.01059
Authors: An T. Le, Khai Nguyen, Minh Nhat Vu, João Carvalho, Jan Peters
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 22 pages, 9 figures

Abstract:Sampling-based model predictive control (MPC) offers strong performance in nonlinear and contact-rich robotic tasks, yet often suffers from poor exploration due to locally greedy sampling schemes. We propose *Model Tensor Planning* (MTP), a novel sampling-based MPC framework that introduces high-entropy control trajectory generation through structured tensor sampling. By sampling over randomized multipartite graphs and interpolating control trajectories with B-splines and Akima splines, MTP ensures smooth and globally diverse control candidates. We further propose a simple $\beta$-mixing strategy that blends local exploitative and global exploratory samples within the modified Cross-Entropy Method (CEM) update, balancing control refinement and exploration. Theoretically, we show that MTP achieves asymptotic path coverage and maximum entropy in the control trajectory space in the limit of infinite tensor depth and width. Our implementation is fully vectorized using JAX and compatible with MuJoCo XLA, supporting *just-in-time* (JIT) compilation and batched rollouts for real-time control with online domain randomization. Through experiments on various challenging robotic tasks, ranging from dexterous in-hand manipulation to humanoid locomotion, we demonstrate that MTP outperforms standard MPC and evolutionary strategy baselines in task success and control robustness. Design and sensitivity ablations confirm the effectiveness of MTP tensor sampling structure, spline interpolation choices, and mixing strategy. Altogether, MTP offers a scalable framework for robust exploration in model-based planning and control.
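
The β-mixing idea fits in a few lines of NumPy: draw one batch of exploitative samples around the current plan and one batch of exploratory samples from the full control range, then refit the sampling distribution from the elites. The mixing rule, sample counts, and the omission of spline smoothing are assumptions; see the paper for the actual MTP update.

```python
import numpy as np

def beta_mixed_cem_step(mu, sigma, cost_fn, n_samples=128, n_elite=16,
                        beta=0.3, u_low=-1.0, u_high=1.0, rng=None):
    """One CEM update over a (H, A) control plan that mixes local
    (exploitative) and global (exploratory) candidates with ratio beta."""
    rng = rng or np.random.default_rng()
    n_global = int(beta * n_samples)                # exploratory share
    n_local = n_samples - n_global                  # exploitative share
    local = rng.normal(mu, sigma, size=(n_local,) + mu.shape)
    glob = rng.uniform(u_low, u_high, size=(n_global,) + mu.shape)
    cand = np.clip(np.concatenate([local, glob], axis=0), u_low, u_high)
    costs = np.array([cost_fn(u) for u in cand])
    elite = cand[np.argsort(costs)[:n_elite]]       # refit from the elites
    return elite.mean(axis=0), elite.std(axis=0) + 1e-6

# Toy usage: steer a 10-step, 2-dim control plan toward zero cost.
mu, sigma = np.zeros((10, 2)), np.ones((10, 2))
cost = lambda u: float(np.sum(u ** 2))
for _ in range(20):
    mu, sigma = beta_mixed_cem_step(mu, sigma, cost)
```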

[AI-28] Low-Precision Training of Large Language Models: Methods, Challenges, and Opportunities

[Quick Read]: This paper addresses the efficiency and scalability problems caused by the high hardware demands of training large language models (LLMs). The key to the solution is low-precision training, which represents components such as weights, activations, and gradients in different numerical formats to improve training efficiency. To organize these methods systematically, the authors group them into three categories: fixed-point and integer-based methods, floating-point-based methods, and customized format-based methods, and additionally discuss quantization-aware training approaches, which resemble low-precision training in the forward pass.

Link: https://arxiv.org/abs/2505.01043
Authors: Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Guoxia Wang, Dianhai Yu, Yonggang Wen, Dacheng Tao
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have achieved impressive performance across various domains. However, the substantial hardware resources required for their training present a significant barrier to efficiency and scalability. To mitigate this challenge, low-precision training techniques have been widely adopted, leading to notable advancements in training efficiency. Despite these gains, low-precision training involves several components – such as weights, activations, and gradients – each of which can be represented in different numerical formats. The resulting diversity has created a fragmented landscape in low-precision training research, making it difficult for researchers to gain a unified overview of the field. This survey provides a comprehensive review of existing low-precision training methods. To systematically organize these approaches, we categorize them into three primary groups based on their underlying numerical formats, which is a key factor influencing hardware compatibility, computational efficiency, and ease of reference for readers. The categories are: (1) fixed-point and integer-based methods, (2) floating-point-based methods, and (3) customized format-based methods. Additionally, we discuss quantization-aware training approaches, which share key similarities with low-precision training during forward propagation. Finally, we highlight several promising research directions to advance this field. A collection of papers discussed in this survey is provided in this https URL.
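
For readers new to the area, quantization-aware training typically simulates low precision in the forward pass while letting gradients pass through a straight-through estimator (STE). A generic sketch, not tied to any specific method from the survey:

```python
import torch
import torch.nn as nn

class FakeQuant(torch.autograd.Function):
    """Simulate symmetric b-bit integer quantization in the forward pass;
    pass gradients straight through in the backward pass (STE)."""
    @staticmethod
    def forward(ctx, x, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = x.detach().abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None          # straight-through estimator

class QATLinear(nn.Linear):
    """Linear layer whose weights and inputs are fake-quantized on the fly."""
    def __init__(self, in_f, out_f, bits=8):
        super().__init__(in_f, out_f)
        self.bits = bits

    def forward(self, x):
        w_q = FakeQuant.apply(self.weight, self.bits)
        x_q = FakeQuant.apply(x, self.bits)
        return nn.functional.linear(x_q, w_q, self.bias)

layer = QATLinear(16, 4, bits=8)
out = layer(torch.randn(2, 16))        # forward sees int8-like values
out.sum().backward()                   # backward uses full-precision grads
```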

[AI-29] Stagnation in Evolutionary Algorithms: Convergence ≠ Optimality

[Quick Read]: This paper challenges the conventional view in evolutionary computation that stagnation impedes convergence of evolutionary algorithms and that convergence necessarily implies optimality. The key contribution is to show, for the first time, that stagnation of an individual can actually facilitate convergence of the whole population, and that convergence does not necessarily imply optimality, or even local optimality. The study provides several counterexamples demonstrating that convergence alone cannot guarantee the effectiveness of evolutionary algorithms.

Link: https://arxiv.org/abs/2505.01036
Authors: Xiaojun Zhou
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the evolutionary computation community, it is widely believed that stagnation impedes convergence in evolutionary algorithms, and that convergence inherently indicates optimality. However, this perspective is misleading. This study is the first to highlight that the stagnation of an individual can actually facilitate the convergence of the entire population, and that convergence does not necessarily imply optimality, not even local optimality. Convergence alone is insufficient to ensure the effectiveness of evolutionary algorithms. Several counterexamples are provided to illustrate this argument.

[AI-30] Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory IJCAI2025

[Quick Read]: This paper targets the efficiency of the remediation workflow for security vulnerabilities in Windows Active Directory (AD) systems, specifically optimizing the path-removal process over attack graphs to reduce the number of manual interventions. The key to the solution is formulating an Adaptive Path Removal Problem in which a wizard intelligently proposes attack paths and the IT admin selects edges to remove, minimizing the expected number of human-wizard interactions and thereby improving the automation and efficiency of security hardening.

Link: https://arxiv.org/abs/2505.01028
Authors: Huy Q. Ngo, Mingyu Guo, Hung Nguyen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: To appear in IJCAI 2025

Abstract:Security vulnerabilities in Windows Active Directory (AD) systems are typically modeled using an attack graph, and hardening AD systems involves an iterative workflow: security teams propose an edge to remove, and IT operations teams manually review these fixes before implementing the removal. As verification requires significant manual effort, we formulate an Adaptive Path Removal Problem to minimize the number of steps in this iterative removal process. In our model, a wizard proposes an attack path in each step and presents it as a set of multiple-choice options to the IT admin. The IT admin then selects one edge from the proposed set to remove. This process continues until the target $t$ is disconnected from the source $s$ or the number of proposed paths reaches $B$. The model aims to optimize the human effort by minimizing the expected number of interactions between the IT admin and the security wizard. We first prove that the problem is $\#P$-hard. We then propose a set of solutions including an exact algorithm, an approximate algorithm, and several scalable heuristics. Our best heuristic, called DPR, can operate effectively on larger-scale graphs compared to the exact algorithm and consistently outperforms the approximate algorithm across all graphs. We verify the effectiveness of our algorithms on several synthetic AD graphs and an AD attack graph collected from a real organization.

[AI-31] Improving Large Language Model Planning with Action Sequence Similarity

[Quick Read]: This paper addresses the limited planning performance of large language models (LLMs) caused by poorly chosen in-context signals, specifically how to improve planning capability via in-context learning (ICL). The key to the solution is GRASE-DC, a two-stage pipeline that re-samples exemplars by plan-side action sequence (AS) similarity and then curates them with dynamic clustering on AS, striking a balance between relevance and diversity and yielding significant improvements across a range of planning tasks.

Link: https://arxiv.org/abs/2505.01009
Authors: Xinran Zhao, Hanie Sedghi, Bernd Bohnet, Dale Schuurmans, Azade Nova
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 25 pages, 11 figures; published at ICLR 2025

Abstract:Planning is essential for artificial intelligence systems to look ahead and proactively determine a course of actions to reach objectives in the virtual and real world. Recent work on large language models (LLMs) sheds light on their planning capability in various tasks. However, it remains unclear what signals in the context influence the model performance. In this work, we explore how to improve the model planning capability through in-context learning (ICL), specifically, what signals can help select the exemplars. Through extensive experiments, we observe that commonly used problem similarity may result in false positives with drastically different plans, which can mislead the model. In response, we propose to sample and filter exemplars leveraging plan side action sequence similarity (AS). We propose GRASE-DC: a two-stage pipeline that first re-samples high AS exemplars and then curates the selected exemplars with dynamic clustering on AS to achieve a balance of relevance and diversity. Our experimental result confirms that GRASE-DC achieves significant performance improvement on various planning tasks (up to ~11-40 point absolute accuracy improvement with 27.3% fewer exemplars needed on average). With GRASE-DC* + VAL, where we iteratively apply GRASE-DC with a validator, we are able to even boost the performance by 18.9% more. Extensive analysis validates the consistent performance improvement of GRASE-DC with various backbone LLMs and on both classical planning and natural language planning benchmarks. GRASE-DC can further boost the planning accuracy by ~24 absolute points on harder problems using simpler problems as exemplars over a random baseline. This demonstrates its ability to generalize to out-of-distribution problems.
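
A toy sketch of the two-stage selection follows: rank pool exemplars by an action-sequence similarity to a draft plan, then greedily curate a diverse subset. The Jaccard-over-bigrams similarity and farthest-point curation stand in for the paper's AS metric and dynamic clustering; they are assumptions, not GRASE-DC itself.

```python
def as_similarity(plan_a, plan_b):
    """Jaccard similarity over action bigrams of two plans (lists of actions)."""
    bigrams = lambda p: {tuple(p[i:i + 2]) for i in range(len(p) - 1)}
    a, b = bigrams(plan_a), bigrams(plan_b)
    return len(a & b) / max(len(a | b), 1)

def select_exemplars(query_plan, pool, k_resample=20, k_final=5):
    """Stage 1: keep the top-k pool items by AS similarity to a draft plan.
    Stage 2: greedily pick a diverse subset among them (farthest-point)."""
    ranked = sorted(pool, key=lambda ex: -as_similarity(query_plan, ex["plan"]))
    shortlist = ranked[:k_resample]
    chosen = [shortlist.pop(0)]                      # most relevant first
    while shortlist and len(chosen) < k_final:
        # Pick the candidate least similar to anything already chosen.
        best = min(shortlist, key=lambda ex: max(
            as_similarity(ex["plan"], c["plan"]) for c in chosen))
        chosen.append(best)
        shortlist.remove(best)
    return chosen

pool = [
    {"task": "stack blocks", "plan": ["pick", "move", "place", "pick", "place"]},
    {"task": "sort blocks", "plan": ["scan", "pick", "place", "scan"]},
    {"task": "swap blocks", "plan": ["pick", "move", "place", "move"]},
]
draft = ["pick", "move", "place"]                    # draft plan for the query
print([ex["task"] for ex in select_exemplars(draft, pool, 3, 2)])
```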

[AI-32] Toward Data-centric Directed Graph Learning: An Entropy-driven Approach ICML2025

[Quick Read]: This paper addresses the failure of existing directed graph neural networks (DiGNNs) to fully exploit the data knowledge hidden in digraphs when leveraging directed edges, which leads to sub-optimal predictive performance. The key to the solution is EDEN, a data-centric directed graph learning framework grounded in a hierarchical encoding theory: it builds a coarse-grained Hierarchical Knowledge Tree (HKT) from a topology perspective and quantifies the mutual information between node features and labels to refine knowledge flow in the HKT, enabling data-centric knowledge distillation (KD) supervision that strengthens the encoding capability of model-centric neural networks.

Link: https://arxiv.org/abs/2505.00983
Authors: Xunkai Li, Zhengyu Wu, Kaichi Yu, Hongchao Qin, Guang Zeng, Rong-Hua Li, Guoren Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
Comments: Accepted by ICML 2025

Abstract:The directed graph (digraph), as a generalization of undirected graphs, exhibits superior representation capability in modeling complex topology systems and has garnered considerable attention in recent years. Despite the notable efforts made by existing DiGraph Neural Networks (DiGNNs) to leverage directed edges, they still fail to comprehensively delve into the abundant data knowledge concealed in the digraphs. This data-level limitation results in model-level sub-optimal predictive performance and underscores the necessity of further exploring the potential correlations between the directed edges (topology) and node profiles (feature and labels) from a data-centric perspective, thereby empowering model-centric neural networks with stronger encoding capabilities. In this paper, we propose **E**ntropy-driven **D**igraph knowl**E**dge distillatio**N** (EDEN), which can serve as a data-centric digraph learning paradigm or a model-agnostic hot-and-plug data-centric Knowledge Distillation (KD) module. The core idea is to achieve data-centric ML, guided by our proposed hierarchical encoding theory for structured data. Specifically, EDEN first utilizes directed structural measurements from a topology perspective to construct a coarse-grained Hierarchical Knowledge Tree (HKT). Subsequently, EDEN quantifies the mutual information of node profiles to refine knowledge flow in the HKT, enabling data-centric KD supervision within model training. As a general framework, EDEN can also naturally extend to undirected scenarios and demonstrate satisfactory performance. In our experiments, EDEN has been widely evaluated on 14 (di)graph datasets (homophily and heterophily) and across 4 downstream tasks. The results demonstrate that EDEN attains SOTA performance and exhibits strong improvement for prevalent (Di)GNNs.

[AI-33] Seeking to Collide: Online Safety-Critical Scenario Generation for Autonomous Driving with Retrieval Augmented Large Language Models

[Quick Read]: This paper addresses shortcomings of scenario generation for simulation-based testing of autonomous vehicles (AVs): existing methods either overfit to common driving patterns or operate offline and non-interactively, failing to expose rare, safety-critical corner cases. The key to the solution is an online, retrieval-augmented large language model (LLM) framework in which an LLM-based behavior analyzer infers the most dangerous intent of a background vehicle from the observed state and additional LLM agents are queried to synthesize feasible adversarial trajectories, augmented with a dynamic memorization and retrieval bank of intent-planner pairs to mitigate catastrophic forgetting and accelerate adaptation.

Link: https://arxiv.org/abs/2505.00972
Authors: Yuewen Mei, Tong Nie, Jian Sun, Ye Tian
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Simulation-based testing is crucial for validating autonomous vehicles (AVs), yet existing scenario generation methods either overfit to common driving patterns or operate in an offline, non-interactive manner that fails to expose rare, safety-critical corner cases. In this paper, we introduce an online, retrieval-augmented large language model (LLM) framework for generating safety-critical driving scenarios. Our method first employs an LLM-based behavior analyzer to infer the most dangerous intent of the background vehicle from the observed state, then queries additional LLM agents to synthesize feasible adversarial trajectories. To mitigate catastrophic forgetting and accelerate adaptation, we augment the framework with a dynamic memorization and retrieval bank of intent-planner pairs, automatically expanding its behavioral library when novel intents arise. Evaluations using the Waymo Open Motion Dataset demonstrate that our model reduces the mean minimum time-to-collision from 1.62 to 1.08 s and incurs a 75% collision rate, substantially outperforming baselines.

[AI-34] Tree-Sliced Wasserstein Distance with Nonlinear Projection ICML2025

[Quick Read]: This paper aims to overcome the limitations of the traditional Sliced Wasserstein (SW) distance in capturing the topological structure of the integration domain while preserving computational efficiency. The key to the solution is a nonlinear projectional framework that replaces the linear projections in the Tree-Sliced Wasserstein (TSW) distance with more general projections, ensuring the injectivity of the associated Radon transform and the well-definedness of the resulting metric. With appropriately designed projections, the method constructs efficient metrics on both Euclidean spaces and spheres and outperforms recent SW and TSW variants across several applications.

Link: https://arxiv.org/abs/2505.00968
Authors: Thanh Tran, Viet-Hoang Tran, Thanh Chu, Trang Pham, Laurent El Ghaoui, Tam Le, Tan M. Nguyen
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2025

Abstract:Tree-Sliced methods have recently emerged as an alternative to the traditional Sliced Wasserstein (SW) distance, replacing one-dimensional lines with tree-based metric spaces and incorporating a splitting mechanism for projecting measures. This approach enhances the ability to capture the topological structures of integration domains in Sliced Optimal Transport while maintaining low computational costs. Building on this foundation, we propose a novel nonlinear projectional framework for the Tree-Sliced Wasserstein (TSW) distance, substituting the linear projections in earlier versions with general projections, while ensuring the injectivity of the associated Radon Transform and preserving the well-definedness of the resulting metric. By designing appropriate projections, we construct efficient metrics for measures on both Euclidean spaces and spheres. Finally, we validate our proposed metric through extensive numerical experiments for Euclidean and spherical datasets. Applications include gradient flows, self-supervised learning, and generative models, where our methods demonstrate significant improvements over recent SW and TSW variants.

[AI-35] A Self-Supervised Transformer for Unusable Shared Bike Detection ITSC2025

[Quick Read]: This paper addresses the difficulty of detecting faulty bikes in bike-sharing systems (BSS), where traditional methods are limited in capturing dynamic spatiotemporal (ST) usage patterns and suffer from label scarcity. The key to the solution is a Self-Supervised Transformer (SSTransformer) framework that extracts ST features from GPS trajectories and trip records, uses a self-supervised pre-training strategy to strengthen feature extraction, and is then fine-tuned for efficient bike-status recognition.

Link: https://arxiv.org/abs/2505.00932
Authors: Yin Huang, Yongqi Dong, Youhua Tang, Alvaro García Hernandez
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 6 pages, 5 figures, under review by the 2025 IEEE International Conference on Intelligent Transportation Systems (IEEE ITSC 2025)

Abstract:The rapid expansion of bike-sharing systems (BSS) has greatly improved urban “last-mile” connectivity, yet large-scale deployments face escalating operational challenges, particularly in detecting faulty bikes. Existing detection approaches either rely on static model-based thresholds that overlook dynamic spatiotemporal (ST) usage patterns or employ supervised learning methods that struggle with label scarcity and class imbalance. To address these limitations, this paper proposes a novel Self-Supervised Transformer (SSTransformer) framework for automatically detecting unusable shared bikes, leveraging ST features extracted from GPS trajectories and trip records. The model incorporates a self-supervised pre-training strategy to enhance its feature extraction capabilities, followed by fine-tuning for efficient status recognition. In the pre-training phase, the Transformer encoder learns generalized representations of bike movement via a self-supervised objective; in the fine-tuning phase, the encoder is adapted to a downstream binary classification task. Comprehensive experiments on a real-world dataset of 10,730 bikes (1,870 unusable, 8,860 normal) from Chengdu, China, demonstrate that SSTransformer significantly outperforms traditional machine learning, ensemble learning, and deep learning baselines, achieving the best accuracy (97.81%), precision (0.8889), and F1-score (0.9358). This work highlights the effectiveness of self-supervised Transformer on ST data for capturing complex anomalies in BSS, paving the way toward more reliable and scalable maintenance solutions for shared mobility.

[AI-36] Dynamic and Distributed Routing in IoT Networks based on Multi-Objective Q-Learning

[Quick Read]: This paper addresses routing optimization in IoT networks under dynamically changing priorities: traditional routing protocols optimize static objectives, whereas in practice, as in smart monitoring systems, the latency and energy requirements of different transmissions can shift rapidly over time. The key to the solution is a dynamic, distributed routing algorithm based on multi-objective Q-learning that combines ideas from multi-objective optimization and Q-learning and introduces a novel greedy interpolation policy scheme to make near-optimal decisions when preferences change in real time, exploiting past knowledge to adapt quickly to unpredictable preference shifts and improving overall performance.

Link: https://arxiv.org/abs/2505.00918
Authors: Shubham Vaishnav, Praveen Kumar Donta, Sindri Magnússon
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:The last few decades have witnessed a rapid increase in IoT devices owing to their wide range of applications, such as smart healthcare monitoring systems, smart cities, and environmental monitoring. A critical task in IoT networks is sensing and transmitting information over the network. The IoT nodes gather data by sensing the environment and then transmit this data to a destination node via multi-hop communication, following some routing protocols. These protocols are usually designed to optimize possibly contradictory objectives, such as maximizing packet delivery ratio and energy efficiency. While most literature has focused on optimizing a static objective that remains unchanged, many real-world IoT applications require adapting to rapidly shifting priorities. For example, in monitoring systems, some transmissions are time-critical and require a high priority on low latency, while other transmissions are less urgent and instead prioritize energy efficiency. To meet such dynamic demands, we propose novel dynamic and distributed routing based on multiobjective Q-learning that can adapt to changes in preferences in real-time. Our algorithm builds on ideas from both multi-objective optimization and Q-learning. We also propose a novel greedy interpolation policy scheme to take near-optimal decisions for unexpected preference changes. The proposed scheme can approximate and utilize the Pareto-efficient solutions for dynamic preferences, thus utilizing past knowledge to adapt to unpredictable preferences quickly during runtime. Simulation results show that the proposed scheme outperforms state-of-the-art algorithms for various exploration strategies, preference variation patterns, and important metrics like overall reward, energy efficiency, and packet delivery ratio.
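
One plausible reading of the "greedy interpolation" scheme can be sketched as follows: keep vector-valued Q-tables learned under a few anchor preference weights, and for a fresh preference, scalarize with the new weights, blend the nearest anchors by inverse distance, and act greedily. The anchor set and blending rule are assumptions, not the paper's exact scheme.

```python
import numpy as np

class InterpolationPolicy:
    def __init__(self, anchor_ws, anchor_qs):
        # anchor_ws: (M, K) preference weights; anchor_qs: list of (S, A, K)
        # multi-objective Q-tables, one per anchor preference.
        self.ws = np.asarray(anchor_ws)
        self.qs = anchor_qs

    def act(self, state, w):
        w = np.asarray(w)
        d = np.linalg.norm(self.ws - w, axis=1)     # distance to each anchor
        near = np.argsort(d)[:2]                    # two nearest anchors
        mix = 1.0 / (d[near] + 1e-8)
        mix = mix / mix.sum()                       # inverse-distance blend
        # Scalarize each anchor's vector Q with the *current* preference w,
        # blend the two estimates, then act greedily.
        q = sum(m * (self.qs[i][state] @ w) for m, i in zip(mix, near))
        return int(np.argmax(q))

# Toy usage: 4 states, 3 actions, 2 objectives (e.g., latency vs. energy).
rng = np.random.default_rng(0)
policy = InterpolationPolicy(
    anchor_ws=[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    anchor_qs=[rng.random((4, 3, 2)) for _ in range(3)])
print(policy.act(state=2, w=[0.8, 0.2]))   # preference shifted toward latency
```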

[AI-37] Fine-Tuning without Performance Degradation

[Quick Read]: This paper addresses performance degradation and slow, inefficient learning when fine-tuning offline-learned policies online (the offline-to-online fine-tuning problem). The key to the solution is a new fine-tuning algorithm based on the "Jump Start" algorithm, which gradually allows more exploration according to online estimates of performance, enabling fast fine-tuning while significantly reducing performance degradation.

Link: https://arxiv.org/abs/2505.00913
Authors: Han Wang, Adam White, Martha White
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fine-tuning policies learned offline remains a major challenge in application domains. Monotonic performance improvement during *fine-tuning* is often challenging, as agents typically experience performance degradation at the early fine-tuning stage. The community has identified multiple difficulties in fine-tuning a learned network online, however, the majority of progress has focused on improving learning efficiency during fine-tuning. In practice, this comes at a serious cost during fine-tuning: initially, agent performance degrades as the agent explores and effectively overrides the policy learned offline. We show across a range of settings, many offline-to-online algorithms exhibit either (1) performance degradation or (2) slow learning (sometimes effectively no improvement) during fine-tuning. We introduce a new fine-tuning algorithm, based on an algorithm called Jump Start, that gradually allows more exploration based on online estimates of performance. Empirically, this approach achieves fast fine-tuning and significantly reduces performance degradations compared with existing algorithms designed to do the same.

[AI-38] Rethinking Time Encoding via Learnable Transformation Functions

[Quick Read]: This paper addresses how to effectively model temporal information and incorporate it into applications or models involving chronologically occurring events, noting that traditional time encodings fall short on the diverse and complex temporal patterns of the real world because they rely on specific inductive biases (e.g., trigonometric functions for periodicity). The key to the solution is Learnable Transformation-based Generalized Time Encoding (LeTE), which uses deep function learning to parameterize the non-linear transformations in time encoding, making them learnable and capable of modeling generalized temporal patterns, including diverse and complex temporal dynamics.

Link: https://arxiv.org/abs/2505.00887
Authors: Xi Chen, Yateng Tang, Jiarong Xu, Jiawei Zhang, Siwei Zhang, Sijia Peng, Xuehao Zheng, Yun Xiong
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 13 figures, 10 tables

Abstract:Effectively modeling time information and incorporating it into applications or models involving chronologically occurring events is crucial. Real-world scenarios often involve diverse and complex time patterns, which pose significant challenges for time encoding methods. While previous methods focus on capturing time patterns, many rely on specific inductive biases, such as using trigonometric functions to model periodicity. This narrow focus on single-pattern modeling makes them less effective in handling the diversity and complexities of real-world time patterns. In this paper, we investigate to improve the existing commonly used time encoding methods and introduce Learnable Transformation-based Generalized Time Encoding (LeTE). We propose using deep function learning techniques to parameterize non-linear transformations in time encoding, making them learnable and capable of modeling generalized time patterns, including diverse and complex temporal dynamics. By enabling learnable transformations, LeTE encompasses previous methods as specific cases and allows seamless integration into a wide range of tasks. Through extensive experiments across diverse domains, we demonstrate the versatility and effectiveness of LeTE.
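
A minimal sketch of making the time-encoding transformation learnable is shown below, assuming learnable frequencies and phases followed by a small residual MLP; the actual LeTE parameterization via deep function learning is richer than this.

```python
import torch
import torch.nn as nn

class LearnableTimeEncoding(nn.Module):
    """Sketch: sinusoidal features with learnable frequencies and phases,
    followed by a learnable non-linear transformation of the encoding."""
    def __init__(self, dim=32, hidden=64):
        super().__init__()
        self.freq = nn.Parameter(torch.exp(torch.linspace(0, -8, dim)))
        self.phase = nn.Parameter(torch.zeros(dim))
        self.transform = nn.Sequential(            # learnable transformation
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, t):                          # t: (B,) timestamps
        base = torch.sin(t.unsqueeze(-1) * self.freq + self.phase)
        return base + self.transform(base)         # residual, learnable reshaping

enc = LearnableTimeEncoding()
z = enc(torch.tensor([0.0, 3.5, 1200.0]))          # (3, 32) time embeddings
```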

[AI-39] Towards Explainable Temporal User Profiling with LLMs

[Quick Read]: This paper addresses the tendency of conventional user-profiling methods to overlook the evolving, nuanced nature of user interests when modeling preferences, especially the interplay between short-term and long-term preferences. The key to the solution is using large language models (LLMs) to generate natural-language summaries of users' interaction histories that distinguish recent behaviors from more persistent tendencies, thereby modeling temporal user preferences. The framework encodes the textual summaries with a pre-trained model and dynamically fuses the short-term and long-term embeddings with an attention mechanism into a comprehensive user representation, while also supporting interpretable explanations for recommendations.

Link: https://arxiv.org/abs/2505.00886
Authors: Milad Sabouri, Masoud Mansoury, Kun Lin, Bamshad Mobasher
Affiliation: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurately modeling user preferences is vital not only for improving recommendation performance but also for enhancing transparency in recommender systems. Conventional user profiling methods, such as averaging item embeddings, often overlook the evolving, nuanced nature of user interests, particularly the interplay between short-term and long-term preferences. In this work, we leverage large language models (LLMs) to generate natural language summaries of users’ interaction histories, distinguishing recent behaviors from more persistent tendencies. Our framework not only models temporal user preferences but also produces natural language profiles that can be used to explain recommendations in an interpretable manner. These textual profiles are encoded via a pre-trained model, and an attention mechanism dynamically fuses the short-term and long-term embeddings into a comprehensive user representation. Beyond boosting recommendation accuracy over multiple baselines, our approach naturally supports explainability: the interpretable text summaries and attention weights can be exposed to end users, offering insights into why specific items are suggested. Experiments on real-world datasets underscore both the performance gains and the promise of generating clearer, more transparent justifications for content-based recommendations.
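
The fusion step lends itself to a small sketch: score the short-term and long-term profile embeddings, softmax the scores into attention weights, and return both the fused user vector and the weights, which can be surfaced as part of an explanation. The module below is an illustrative reading, not the authors' code.

```python
import torch
import torch.nn as nn

class TemporalProfileFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # scores each profile view

    def forward(self, short_emb, long_emb):
        views = torch.stack([short_emb, long_emb], dim=1)            # (B, 2, D)
        attn = torch.softmax(self.score(views).squeeze(-1), dim=1)   # (B, 2)
        user = (attn.unsqueeze(-1) * views).sum(dim=1)               # (B, D)
        return user, attn   # attn[:, 0] is the weight on recent behavior

fuse = TemporalProfileFusion(dim=128)
user_vec, weights = fuse(torch.randn(4, 128), torch.randn(4, 128))
```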

[AI-40] Car Sensors Health Monitoring by Verification Based on Autoencoder and Random Forest Regression

[Quick Read]: This paper addresses health monitoring of vehicle sensors so that driver-assistance systems can operate reliably. The key to the solution is using machine learning and deep learning to analyze the intricate correlations among sensor data, detecting sensor failures with an autoencoder and estimating sensor values with random forest regression, combined with a statistical model based on the normal distribution to proactively identify potential sensor faults and issue early warnings.

Link: https://arxiv.org/abs/2505.00876
Authors: Sahar Torkhesari, Behnam Yousefimehr, Mehdi Ghatee
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, and 5 tables

Abstract:Driver assistance systems provide a wide range of crucial services, including closely monitoring the condition of vehicles. This paper showcases a groundbreaking sensor health monitoring system designed for the automotive industry. The ingenious system leverages cutting-edge techniques to process data collected from various vehicle sensors. It compares their outputs within the Electronic Control Unit (ECU) to evaluate the health of each sensor. To unravel the intricate correlations between sensor data, an extensive exploration of machine learning and deep learning methodologies was conducted. Through meticulous analysis, the most correlated sensor data were identified. These valuable insights were then utilized to provide accurate estimations of sensor values. Among the diverse learning methods examined, the combination of autoencoders for detecting sensor failures and random forest regression for estimating sensor values proved to yield the most impressive outcomes. A statistical model using the normal distribution has been developed to identify possible sensor failures proactively. By comparing the actual values of the sensors with their estimated values based on correlated sensors, faulty sensors can be detected early. When a defective sensor is detected, both the driver and the maintenance department are promptly alerted. Additionally, the system replaces the value of the faulty sensor with the estimated value obtained through analysis. This proactive approach was evaluated using data from twenty essential sensors in the Saipa Quick vehicle’s ECU, resulting in an impressive accuracy rate of 99%.

[AI-41] Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines

[Quick Read]: This paper investigates human-centered explainability in agentic pipelines, in particular how to make the inner workings of large language models (LLMs) transparent in actionable ways. The core finding is that Chain-of-Thought (CoT) reasoning alone in agentic pipelines neither improves output quality nor provides meaningful explainability, since the explanations it produces do not enhance end users' ability to understand the system or achieve their goals. Explainability mechanisms therefore need to be reconsidered for complex systems in which multiple models cooperate with minimal human control.

Link: https://arxiv.org/abs/2505.00875
Authors: Ramesh Manuvinakurike, Emanuel Moss, Elizabeth Anne Watkins, Saurav Sahay, Giuseppe Raffa, Lama Nachman
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Agentic pipelines present novel challenges and opportunities for human-centered explainability. The HCXAI community is still grappling with how best to make the inner workings of LLMs transparent in actionable ways. Agentic pipelines consist of multiple LLMs working in cooperation with minimal human control. In this research paper, we present early findings from an agentic pipeline implementation of a perceptive task guidance system. Through quantitative and qualitative analysis, we analyze how Chain-of-Thought (CoT) reasoning, a common vehicle for explainability in LLMs, operates within agentic pipelines. We demonstrate that CoT reasoning alone does not lead to better outputs, nor does it offer explainability, as it tends to produce explanations without explainability, in that they do not improve the ability of end users to better understand systems or achieve their goals.

[AI-42] IK Seed Generator for Dual-Arm Human-like Physicality Robot with Mobile Base

[Quick Read]: This paper addresses the difficulty of solving inverse kinematics (IK) for robot arms under mechanical constraints such as joint-angle limits. The key to the solution is defining a goodness measure for initial guesses based on the scaled Jacobian matrix, which computes the manipulability index while accounting for joint limits, and optimizing this measure with a genetic algorithm (GA) to generate high-quality initial guesses, thereby raising the success rate of a numerical IK solver.

Link: https://arxiv.org/abs/2505.00871
Authors: Jun Takamatsu, Atsushi Kanehira, Kazuhiro Sasabuchi, Naoki Wake, Katsushi Ikeuchi
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 12 figures, 4 tables

Abstract:Robots are strongly expected as a means of replacing human tasks. If a robot has a human-like physicality, the possibility of replacing human tasks increases. In the case of household service robots, it is desirable for them to be of a human-like size so that they do not become excessively large in order to coexist with humans in their operating environment. However, robots with size limitations tend to have difficulty solving inverse kinematics (IK) due to mechanical limitations, such as joint angle limitations. Conversely, if the difficulty coming from this limitation could be mitigated, one can expect that the use of such robots becomes more valuable. In numerical IK solvers, which are commonly used for robots with higher degrees-of-freedom (DOF), the solvability of IK depends on the initial guess given to the solver. Thus, this paper proposes a method for generating a good initial guess for a numerical IK solver given the target hand configuration. For this purpose, we define the goodness of an initial guess using the scaled Jacobian matrix, which can calculate the manipulability index considering the joint limits. These two factors are related to the difficulty of solving IK. We generate the initial guess by optimizing the goodness using the genetic algorithm (GA). To enumerate many possible IK solutions, we use the reachability map that represents the reachable area of the robot hand in the arm-base coordinate system. We conduct quantitative evaluation and prove that using an initial guess that is judged to be better using the goodness value increases the probability that IK is solved. Finally, as an application of the proposed method, we show that by generating good initial guesses for IK a robot actually achieves three typical scenarios.

[AI-43] ICQuant: Index Coding enables Low-bit LLM Quantization

[Quick Read]: This paper addresses inflated quantization ranges and large errors caused by outliers in the weights of large language models (LLMs) during low-bit post-training quantization (PTQ); existing outlier-suppression techniques either fail to shrink the quantization range effectively or incur (relatively) high bit overhead. The key to the proposed solution, ICQuant, is to exploit outlier statistics to design an efficient index-coding scheme for outlier-aware weight-only quantization. Whereas existing methods need about 1 extra bit per weight to halve the quantization range, ICQuant needs only about 0.3 bits, significantly reducing overhead in extreme compression regimes (e.g., 2-3 bits per weight) and improving quantized-model performance without fine-tuning.

Link: https://arxiv.org/abs/2505.00850
Authors: Xinlin Li, Osama Hanna, Christina Fragouli, Suhas Diggavi
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid deployment of Large Language Models (LLMs) highlights the need for efficient low-bit post-training quantization (PTQ), due to their high memory costs. A key challenge in weight quantization is the presence of outliers, which inflate quantization ranges and lead to large errors. While a number of outlier suppression techniques have been proposed, they either: fail to effectively shrink the quantization range, or incur (relatively) high bit overhead. In this paper, we present ICQuant, a novel framework that leverages outlier statistics to design an efficient index coding scheme for outlier-aware weight-only quantization. Compared to existing outlier suppression techniques requiring \approx 1 bit overhead to halve the quantization range, ICQuant requires only \approx 0.3 bits; a significant saving in extreme compression regimes (e.g., 2-3 bits per weight). ICQuant can be used on top of any existing quantizers to eliminate outliers, improving the quantization quality. Using just 2.3 bits per weight and simple scalar quantizers, ICQuant improves the zero-shot accuracy of the 2-bit Llama3-70B model by up to 130% and 150% relative to QTIP and QuIP#; and it achieves comparable performance to the best-known fine-tuned quantizer (PV-tuning) without fine-tuning.
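
To see what outlier-aware quantization with index coding buys, the sketch below splits off the largest-magnitude weights, quantizes the two groups over their own much tighter ranges, and records outlier positions as gaps that a real system would entropy-code. It is a simplified stand-in, not ICQuant's actual coding scheme or bit accounting.

```python
import numpy as np

def outlier_aware_quant(w, bits=2, outlier_frac=0.01):
    """Quantize inliers and outliers separately with uniform scalar
    quantizers; return packed groups plus the outlier index list."""
    thresh = np.quantile(np.abs(w), 1 - outlier_frac)
    mask = np.abs(w) > thresh
    def pack(x):
        lo, hi = float(x.min()), float(x.max())
        scale = (hi - lo) / (2 ** bits - 1) or 1.0
        return np.round((x - lo) / scale).astype(np.int32), lo, scale
    inliers, outliers = pack(w[~mask]), pack(w[mask])
    idx = np.flatnonzero(mask)          # positions to index-code
    gaps = np.diff(idx, prepend=0)      # small gaps compress well
    return inliers, outliers, idx, gaps

def dequant(q, lo, scale):
    return q * scale + lo

# Toy usage on a heavy-tailed weight vector.
w = np.random.standard_normal(4096) * np.random.standard_cauchy(4096) * 0.01
(inl, lo_i, s_i), (out, lo_o, s_o), idx, gaps = outlier_aware_quant(w)
w_hat = np.empty_like(w)
mask = np.zeros(w.size, dtype=bool); mask[idx] = True
w_hat[~mask] = dequant(inl, lo_i, s_i)
w_hat[mask] = dequant(out, lo_o, s_o)
print("reconstruction MSE:", float(np.mean((w - w_hat) ** 2)))
```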

[AI-44] OET: Optimization-based prompt injection Evaluation Toolkit

[Quick Read]: This paper addresses the security of Large Language Models (LLMs) against prompt injection attacks, in particular the lack of a standardized framework for rigorously evaluating existing defense strategies, especially under adaptive adversarial scenarios. The key to the solution is OET (Optimization-based Evaluation Toolkit), which systematically benchmarks prompt injection attacks and defenses through an adaptive testing framework; its core features include a modular workflow, adversarial string generation, dynamic attack execution, and comprehensive result analysis, and it combines optimization methods with white-box and black-box access to generate worst-case adversarial examples for strict red-teaming evaluation.

Link: https://arxiv.org/abs/2505.00843
Authors: Jinsheng Pan, Xiaogeng Liu, Chaowei Xiao
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, enabling their widespread adoption across various domains. However, their susceptibility to prompt injection attacks poses significant security risks, as adversarial inputs can manipulate model behavior and override intended instructions. Despite numerous defense strategies, a standardized framework to rigorously evaluate their effectiveness, especially under adaptive adversarial scenarios, is lacking. To address this gap, we introduce OET, an optimization-based evaluation toolkit that systematically benchmarks prompt injection attacks and defenses across diverse datasets using an adaptive testing framework. Our toolkit features a modular workflow that facilitates adversarial string generation, dynamic attack execution, and comprehensive result analysis, offering a unified platform for assessing adversarial robustness. Crucially, the adaptive testing framework leverages optimization methods with both white-box and black-box access to generate worst-case adversarial examples, thereby enabling strict red-teaming evaluations. Extensive experiments underscore the limitations of current defense mechanisms, with some models remaining susceptible even after implementing security enhancements.

[AI-45] From Texts to Shields: Convergence of Large Language Models and Cybersecurity

[Quick Read]: This paper examines the multifaceted challenges of applying large language models (LLMs) effectively and safely in cybersecurity, spanning technical complexity, system interpretability, safety, and socio-technical factors. The key to the solution is integrating technical advances with organizational and societal considerations: promoting LLM applications in automating complex tasks, improving operational efficiency, and enabling reasoning-driven security analytics, while addressing trust, transparency, and ethical concerns through strategies such as human-in-the-loop systems, role-specific training, and proactive robustness testing.

Link: https://arxiv.org/abs/2505.00841
Authors: Tao Li, Ya-Ting Yang, Yunian Pan, Quanyan Zhu
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:This report explores the convergence of large language models (LLMs) and cybersecurity, synthesizing interdisciplinary insights from network security, artificial intelligence, formal methods, and human-centered design. It examines emerging applications of LLMs in software and network security, 5G vulnerability analysis, and generative security engineering. The report highlights the role of agentic LLMs in automating complex tasks, improving operational efficiency, and enabling reasoning-driven security analytics. Socio-technical challenges associated with the deployment of LLMs – including trust, transparency, and ethical considerations – can be addressed through strategies such as human-in-the-loop systems, role-specific training, and proactive robustness testing. The report further outlines critical research challenges in ensuring interpretability, safety, and fairness in LLM-based systems, particularly in high-stakes domains. By integrating technical advances with organizational and societal considerations, this report presents a forward-looking research agenda for the secure and effective adoption of LLMs in cybersecurity.

[AI-46] MIMIC-IV-Ext-22MCTS: A 22 Millions-Event Temporal Clinical Time-Series Dataset with Relative Timestamp for Risk Prediction

[Quick Read]: This paper addresses the collection of high-quality time series clinical-event data for clinical risk prediction, a crucial step in building reliable prediction models. Although the MIMIC-IV-Note dataset is extensive, its discharge summaries are long and loosely structured, and clinical events usually lack explicit timestamps, which challenges conventional natural language models. The key to the solution is a new framework that (1) splits discharge summaries into manageably small text chunks, (2) retrieves chunks likely to contain clinical events using contextual BM25 and contextual semantic search, and (3) designs prompts that guide the Llama-3.1-8B model to identify or infer the temporal information of each chunk. The resulting dataset is informative and transparent, markedly improving fine-tuned model performance on medical question answering and clinical-trial matching.

Link: https://arxiv.org/abs/2505.00827
Authors: Jing Wang, Xing Niu, Juyong Kim, Jie Shen, Tong Zhang, Jeremy C. Weiss
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Clinical risk prediction based on machine learning algorithms plays a vital role in modern healthcare. A crucial component in developing a reliable prediction model is collecting high-quality time series clinical events. In this work, we release such a dataset that consists of 22,588,586 clinical time series events, which we term MIMIC-IV-Ext-22MCTS. Our source data are discharge summaries selected from the well-known yet unstructured MIMIC-IV-Note [Johnson et al., 2023]. We then extract clinical events as short text spans from the discharge summaries, along with the timestamps of these events as temporal information. The general-purpose MIMIC-IV-Note poses specific challenges for our work: it turns out that the discharge summaries are too lengthy for typical natural language models to process, and the clinical events of interest often are not accompanied with explicit timestamps. Therefore, we propose a new framework that works as follows: 1) we break each discharge summary into manageably small text chunks; 2) we apply contextual BM25 and contextual semantic search to retrieve chunks that have a high potential of containing clinical events; and 3) we carefully design prompts to teach the recently released Llama-3.1-8B [Touvron et al., 2023] model to identify or infer temporal information of the chunks. We show that the obtained dataset is so informative and transparent that standard models fine-tuned on our dataset achieve significant improvements in healthcare applications. In particular, the BERT model fine-tuned based on our dataset achieves 10% improvement in accuracy on the medical question answering task, and 3% improvement in the clinical trial matching task compared with the classic BERT. The GPT-2 model, fine-tuned on our dataset, produces more clinically reliable results for clinical questions.
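
Steps 1) and 2) of the framework are easy to prototype. The sketch below chunks a note into overlapping word windows and ranks chunks with BM25 via the rank_bm25 package; the paper's contextual BM25, the semantic-search stage, and the Llama prompting step are omitted, and the window sizes are arbitrary.

```python
from rank_bm25 import BM25Okapi   # pip install rank-bm25

def chunk(text, size=120, overlap=20):
    """Split a long discharge summary into overlapping word windows."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def retrieve_event_chunks(note, query, top_n=3):
    chunks = chunk(note)
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    return bm25.get_top_n(query.lower().split(), chunks, n=top_n)

# Toy usage on a synthetic note.
note = ("Patient admitted with chest pain. Troponin elevated on hospital day 1. "
        "Underwent cardiac catheterization on day 2. Discharged on day 5. ") * 20
for c in retrieve_event_chunks(note, "cardiac catheterization date", top_n=2):
    print(c[:80], "...")
```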

[AI-47] Spill The Beans: Exploiting CPU Cache Side-Channels to Leak Tokens from Large Language Models

[Quick Read]: This paper addresses the privacy leakage of tokens generated by large language models (LLMs) through side-channel attacks on shared hardware. The key to the solution is a cache side-channel technique: an attacker process co-located on the same hardware flushes and reloads embedding vectors from the embedding layer and detects the cache hits produced during token generation, thereby leaking sensitive information. Because the compute-intensive nature of LLMs quickly evicts embedding vectors from the cache, the attack balances the number of monitored tokens against the amount of information leaked.

Link: https://arxiv.org/abs/2505.00817
Authors: Andrew Adiletta, Berk Sunar
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Side-channel attacks on shared hardware resources increasingly threaten confidentiality, especially with the rise of Large Language Models (LLMs). In this work, we introduce Spill The Beans, a novel application of cache side-channels to leak tokens generated by an LLM. By co-locating an attack process on the same hardware as the victim model, we flush and reload embedding vectors from the embedding layer, where each token corresponds to a unique embedding vector. When accessed during token generation, it results in a cache hit detectable by our attack on shared lower-level caches. A significant challenge is the massive size of LLMs, which, by nature of their compute intensive operation, quickly evicts embedding vectors from the cache. We address this by balancing the number of tokens monitored against the amount of information leaked. Monitoring more tokens increases potential vocabulary leakage but raises the chance of missing cache hits due to eviction; monitoring fewer tokens improves detection reliability but limits vocabulary coverage. Through extensive experimentation, we demonstrate the feasibility of leaking tokens from LLMs via cache side-channels. Our findings reveal a new vulnerability in LLM deployments, highlighting that even sophisticated models are susceptible to traditional side-channel attacks. We discuss the implications for privacy and security in LLM-serving infrastructures and suggest considerations for mitigating such threats. For proof of concept we consider two concrete attack scenarios: Our experiments show that an attacker can recover as much as 80%-90% of a high entropy API key with single shot monitoring. As for English text we can reach a 40% recovery rate with a single shot. We should note that the rate highly depends on the monitored token set and these rates can be improved by targeting more specialized output domains.
zh
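The monitored-tokens-versus-eviction trade-off at the heart of the attack can be illustrated with a toy Monte Carlo model. Everything below (vocabulary size, probe-set sizes, the linear eviction model) is an illustrative assumption, not a measurement from the paper:

```python
import random

def simulate_recovery(vocab_size=50_000, monitored=2_000, n_tokens=64,
                      p_evict_per_line=0.02, trials=200, seed=0):
    """Toy model of the trade-off: probing more embedding-vector cache lines
    covers more of the vocabulary, but each extra probed line raises the
    chance that a hit is lost to eviction."""
    rng = random.Random(seed)
    # Crude assumption: the miss probability grows with the probe-set size.
    p_miss = min(1.0, p_evict_per_line * monitored / 1_000)
    total = 0.0
    for _ in range(trials):
        tokens = [rng.randrange(vocab_size) for _ in range(n_tokens)]
        hits = sum(1 for t in tokens
                   if t < monitored and rng.random() > p_miss)
        total += hits / n_tokens
    return total / trials

for m in (500, 2_000, 10_000):
    print(f"monitored={m:>6}: fraction of tokens recovered ~ "
          f"{simulate_recovery(monitored=m):.3f}")
```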

[AI-48] Handling Label Noise via Instance-Level Difficulty Modeling and Dynamic Optimization

【速读】: This paper tackles the degradation of deep neural networks' generalization performance under noisy supervision (noise-affected generalization performance); existing methods for isolating clean subsets or correcting noisy labels suffer from high computational cost, complex hyperparameter tuning, and coarse-grained optimization. The key to the solution is a novel two-stage noisy-learning framework that achieves instance-level optimization through a dynamically weighted loss function, avoiding hyperparameter tuning, and introduces a simple yet effective metric called the "wrong event" to dynamically model the cleanliness and difficulty of each sample while keeping computational cost manageable.

链接: https://arxiv.org/abs/2505.00812
作者: Kuan Zhang,Chengliang Chai,Jingzhe Xu,Chi Zhang,Ye Yuan,Guoren Wang,Lei Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent studies indicate that deep neural networks degrade in generalization performance under noisy supervision. Existing methods focus on isolating clean subsets or correcting noisy labels, facing limitations such as high computational costs, heavy hyperparameter tuning, and coarse-grained optimization. To address these challenges, we propose a novel two-stage noisy learning framework that enables instance-level optimization through a dynamically weighted loss function, avoiding hyperparameter tuning. To obtain stable and accurate information about noise modeling, we introduce a simple yet effective metric, termed wrong event, which dynamically models the cleanliness and difficulty of individual samples while keeping computational costs manageable. Our framework first collects wrong event information and builds a strong base model. Then we perform noise-robust training on the base model, using a probabilistic model to handle the wrong event information of samples. Experiments on five synthetic and real-world LNL benchmarks demonstrate our method surpasses state-of-the-art methods in performance, achieves a nearly 75% reduction in computational time and improves model scalability.
zh
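A minimal PyTorch sketch of one plausible reading of the "wrong event" idea: count, per sample, how often the model's prediction disagrees with the given label across epochs, and down-weight frequently "wrong" samples in the loss. The class and weighting rule below are an illustrative simplification, not the authors' code (the paper additionally fits a probabilistic model over these statistics):

```python
import torch
import torch.nn.functional as F

class WrongEventTracker:
    """Per-sample count of epochs where the model's prediction disagreed with
    the (possibly noisy) label -- a cheap cleanliness/difficulty signal."""
    def __init__(self, n_samples):
        self.wrong = torch.zeros(n_samples)
        self.seen = torch.zeros(n_samples)

    def update(self, idx, logits, labels):
        self.wrong[idx] += (logits.argmax(dim=1) != labels).float()
        self.seen[idx] += 1

    def weights(self, idx):
        # Down-weight samples that are frequently "wrong" (likely noisy/hard).
        return 1.0 - self.wrong[idx] / self.seen[idx].clamp(min=1)

def weighted_ce(logits, labels, w):
    return (w * F.cross_entropy(logits, labels, reduction="none")).mean()

# Inside a training loop, with idx the dataset indices of the current batch:
#   tracker.update(idx, logits.detach().cpu(), labels.cpu())
#   loss = weighted_ce(logits, labels, tracker.weights(idx).to(logits.device))
```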

[AI-49] To Repair or Not to Repair? Investigating the Importance of AB-Cycles for the State-of-the-Art TSP Heuristic EAX

【速读】: This paper addresses the validity verification of the AB-cycles generated in the first stage of the Edge Assembly Crossover (EAX) algorithm for the Traveling Salesperson Problem (TSP). While EAX's second stage has been studied and refined in depth, its first stage has lacked systematic analysis. The paper proposes a new method that quickly determines whether the AB-cycles generated during the internal optimization procedure yield valid tours or need to be repaired, knowledge that also matters before applying other powerful crossover operators such as the Generalized Partition Crossover (GPX). The key to the solution is improving the efficiency and accuracy of EAX's first stage, thereby improving the algorithm's overall computational efficiency and solution quality.

链接: https://arxiv.org/abs/2505.00803
作者: Jonathan Heins,Darrell Whitley,Pascal Kerschke
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Edge Assembly Crossover (EAX) algorithm is the state-of-the-art heuristic for solving the Traveling Salesperson Problem (TSP). It regularly outperforms other methods, such as the Lin-Kernighan-Helsgaun heuristic (LKH), across diverse sets of TSP instances. Essentially, EAX employs a two-stage mechanism that focuses on improving the current solutions, first, at the local and, subsequently, at the global level. Although the second phase of the algorithm has been thoroughly studied, configured, and refined in the past, in particular, its first stage has hardly been examined. In this paper, we thus focus on the first stage of EAX and introduce a novel method that quickly verifies whether the AB-cycles, generated during its internal optimization procedure, yield valid tours – or whether they need to be repaired. Knowledge of the latter is also particularly relevant before applying other powerful crossover operators such as the Generalized Partition Crossover (GPX). Based on our insights, we propose and evaluate several improved versions of EAX. According to our benchmark study across 10 000 different TSP instances, the most promising of our proposed EAX variants demonstrates improved computational efficiency and solution quality on previously rather difficult instances compared to the current state-of-the-art EAX algorithm. Related DOI: https://doi.org/10.1145/3712256.3726436
zh
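The validity question the paper asks can be phrased compactly: after taking the symmetric difference of a parent tour's edge set with an AB-cycle's edge set, is the result still one Hamiltonian cycle? Below is a plain connectivity-based check in Python (cities assumed labeled 0..n-1); the paper's contribution is a much faster verification than this naive walk:

```python
from collections import defaultdict

def yields_valid_tour(tour_edges, ab_cycle_edges, n):
    """Apply an AB-cycle to a tour via symmetric difference of edge sets and
    check whether the result is a single Hamiltonian cycle (valid) or falls
    apart into subtours needing repair. Edges are frozensets {u, v}."""
    edges = set(tour_edges) ^ set(ab_cycle_edges)
    adj = defaultdict(list)
    for e in edges:
        u, v = tuple(e)
        adj[u].append(v)
        adj[v].append(u)
    if len(adj) != n or any(len(nb) != 2 for nb in adj.values()):
        return False
    # Walk the cycle through node 0 and measure its length.
    prev, cur, length = None, 0, 0
    while True:
        nxt = adj[cur][0] if adj[cur][0] != prev else adj[cur][1]
        prev, cur, length = cur, nxt, length + 1
        if cur == 0:
            return length == n   # single cycle iff it covers all n cities

# Tour 0-1-2-3-0; an AB-cycle removing tour edges {0,1},{2,3} and adding
# {0,2},{1,3} yields the valid tour 0-2-1-3-0.
T = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
AB = {frozenset(e) for e in [(0, 1), (2, 3), (0, 2), (1, 3)]}
print(yields_valid_tour(T, AB, 4))   # True
```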

[AI-50] Explanations as Bias Detectors: A Critical Study of Local Post-hoc XAI Methods for Fairness Exploration

【速读】: This paper addresses unfairness in AI systems caused by algorithmic bias, especially its impact on protected groups. The key to the solution is leveraging explanation methods to detect and interpret unfairness: a pipeline that integrates local post-hoc explanation methods to derive fairness-related insights. The pipeline focuses on key questions such as the relationship between distributive and procedural fairness, the effect of removing the protected attribute, the consistency and quality of results across different explanation methods, the impact of strategies for aggregating local explanations on group-fairness evaluations, and the overall trustworthiness of explanations as bias detectors.

链接: https://arxiv.org/abs/2505.00802
作者: Vasiliki Papanikou,Danae Pla Karidi,Evaggelia Pitoura,Emmanouil Panagiotou,Eirini Ntoutsi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As Artificial Intelligence (AI) is increasingly used in areas that significantly impact human lives, concerns about fairness and transparency have grown, especially regarding their impact on protected groups. Recently, the intersection of explainability and fairness has emerged as an important area to promote responsible AI systems. This paper explores how explainability methods can be leveraged to detect and interpret unfairness. We propose a pipeline that integrates local post-hoc explanation methods to derive fairness-related insights. During the pipeline design, we identify and address critical questions arising from the use of explanations as bias detectors such as the relationship between distributive and procedural fairness, the effect of removing the protected attribute, the consistency and quality of results across different explanation methods, the impact of various aggregation strategies of local explanations on group fairness evaluations, and the overall trustworthiness of explanations as bias detectors. Our results show the potential of explanation methods used for fairness while highlighting the need to carefully consider the aforementioned critical aspects.
zh
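One simple instance of the aggregation strategies studied here: average the local attributions of the protected attribute (or any feature) within each protected group and compare the means. A hedged numpy sketch with synthetic attributions standing in for SHAP/LIME outputs computed elsewhere:

```python
import numpy as np

def group_attribution_gap(attributions, protected, feature_idx):
    """Mean attribution gap of one feature between the two protected groups,
    a basic aggregation of local explanations -- not the paper's full pipeline.

    attributions : (n_samples, n_features) local attribution matrix
    protected    : (n_samples,) binary group membership
    feature_idx  : column of the feature under scrutiny
    """
    a = attributions[:, feature_idx]
    return a[protected == 1].mean() - a[protected == 0].mean()

# Synthetic demo: a model whose explanations depend on the protected group.
rng = np.random.default_rng(0)
protected = rng.integers(0, 2, size=1000)
attrs = rng.normal(0, 0.1, size=(1000, 5))
attrs[:, 3] += 0.5 * protected    # feature 3's attribution tracks the group
print(f"mean attribution gap on feature 3: "
      f"{group_attribution_gap(attrs, protected, 3):.3f}")
```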

[AI-51] Howard's Policy Iteration is Subexponential for Deterministic Markov Decision Problems with Rewards of Fixed Bit-size and Arbitrary Discount Factor

【速读】: This paper addresses the running-time upper bound of Howard's Policy Iteration (HPI) on deterministic Markov decision processes (DMDPs): existing theory bounds its running time only exponentially in the number of states, while better behavior may be achievable in practice. The key contribution is a subexponential upper bound for HPI on DMDPs, parameterised by the bit-size of the rewards and independent of the discount factor; the same bound also applies to DMDPs with only two possible rewards. This result significantly improves the theoretical performance analysis of HPI on DMDPs.

链接: https://arxiv.org/abs/2505.00795
作者: Dibyangshu Mukherjee,Shivaram Kalyanakrishnan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Howard’s Policy Iteration (HPI) is a classic algorithm for solving Markov Decision Problems (MDPs). HPI uses a “greedy” switching rule to update from any non-optimal policy to a dominating one, iterating until an optimal policy is found. Despite its introduction over 60 years ago, the best-known upper bounds on HPI’s running time remain exponential in the number of states – indeed even on the restricted class of MDPs with only deterministic transitions (DMDPs). Meanwhile, the tightest lower bound for HPI for MDPs with a constant number of actions per state is only linear. In this paper, we report a significant improvement: a subexponential upper bound for HPI on DMDPs, which is parameterised by the bit-size of the rewards, while independent of the discount factor. The same upper bound also applies to DMDPs with only two possible rewards (which may be of arbitrary size).
zh
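For reference, Howard's Policy Iteration itself is short on a deterministic MDP. The sketch below uses exact policy evaluation and greedy switching at every strictly improvable state; the demo numbers are arbitrary:

```python
import numpy as np

def howard_pi_dmdp(next_state, reward, gamma=0.9):
    """Howard's Policy Iteration on a deterministic MDP.
    next_state[s, a]: successor state; reward[s, a]: immediate reward."""
    n, _ = next_state.shape
    policy = np.zeros(n, dtype=int)
    while True:
        # Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi.
        P = np.zeros((n, n))
        P[np.arange(n), next_state[np.arange(n), policy]] = 1.0
        r = reward[np.arange(n), policy]
        V = np.linalg.solve(np.eye(n) - gamma * P, r)
        # Greedy switching: every strictly improvable state switches at once.
        Q = reward + gamma * V[next_state]
        improvable = Q.max(axis=1) > Q[np.arange(n), policy] + 1e-10
        if not improvable.any():
            return policy, V
        policy = np.where(improvable, Q.argmax(axis=1), policy)

ns = np.array([[1, 2], [2, 0], [0, 1]])              # arbitrary 3-state demo
rw = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
print(howard_pi_dmdp(ns, rw))
```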

[AI-52] Scalable Meta-Learning via Mixed-Mode Differentiation

【速读】: This paper addresses the high computational cost of gradient-based bilevel optimisation, which requires differentiating through the optimisation process itself and thus "gradient-of-a-gradient" calculations; in modern meta-learning setups, the required second-order and mixed derivatives are often computed inefficiently. The key to the solution is using mixed-mode differentiation to construct more efficient and scalable computational graphs, yielding substantial gains: experiments show over 10x memory savings and up to 25% wall-clock time reductions.

链接: https://arxiv.org/abs/2505.00793
作者: Iurii Kemaev,Dan A Calian,Luisa M Zintgraf,Gregory Farquhar,Hado van Hasselt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Gradient-based bilevel optimisation is a powerful technique with applications in hyperparameter optimisation, task adaptation, algorithm discovery, meta-learning more broadly, and beyond. It often requires differentiating through the gradient-based optimisation process itself, leading to “gradient-of-a-gradient” calculations with computationally expensive second-order and mixed derivatives. While modern automatic differentiation libraries provide a convenient way to write programs for calculating these derivatives, they oftentimes cannot fully exploit the specific structure of these problems out-of-the-box, leading to suboptimal performance. In this paper, we analyse such cases and propose Mixed-Flow Meta-Gradients, or MixFlow-MG – a practical algorithm that uses mixed-mode differentiation to construct more efficient and scalable computational graphs yielding over 10x memory and up to 25% wall-clock time improvements over standard implementations in modern meta-learning setups.
zh
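The core trick, illustrated in JAX on a one-step meta-learning toy: push a forward-mode JVP through the reverse-mode inner gradient instead of stacking reverse mode twice. This shows the mixed-mode principle only; MixFlow-MG's actual graph transformations are more involved:

```python
import jax
import jax.numpy as jnp

def inner_loss(w, x, y):   return jnp.mean((x @ w - y) ** 2)
def outer_loss(w, xv, yv): return jnp.mean((xv @ w - yv) ** 2)

def meta_objective(lr, w0, x, y, xv, yv):
    # One inner gradient step, then the validation loss of the adapted weights.
    w1 = w0 - lr * jax.grad(inner_loss)(w0, x, y)
    return outer_loss(w1, xv, yv)

# Naive reverse-over-reverse hypergradient w.r.t. the learning rate:
meta_grad_rev = jax.grad(meta_objective)

# Mixed mode: a forward-mode JVP through the reverse-mode inner gradient.
def meta_grad_fwd(lr, *args):
    _, tangent = jax.jvp(lambda l: meta_objective(l, *args), (lr,), (1.0,))
    return tangent

w0 = jnp.zeros(3)
x = jax.random.normal(jax.random.PRNGKey(0), (8, 3))
y = jnp.ones(8)
print(meta_grad_rev(0.1, w0, x, y, x, y), meta_grad_fwd(0.1, w0, x, y, x, y))
```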

[AI-53] Constructing an Optimal Behavior Basis for the Option Keyboard

【速读】: This paper addresses how to quickly identify optimal solutions for new tasks in multi-task reinforcement learning, particularly with minimal or no further interaction with the environment. The key to the solution is a novel method that efficiently constructs an optimal behavior basis enabling zero-shot identification of optimal solutions for arbitrary linear tasks. Compared with conventional Generalized Policy Improvement (GPI) and Convex Coverage Set (CCS) techniques, the method not only substantially reduces the number of base policies needed to guarantee optimality but is also strictly more expressive, allowing certain classes of non-linear tasks to be solved optimally.

链接: https://arxiv.org/abs/2505.00787
作者: Lucas N. Alegre,Ana L. C. Bazzan,André Barreto,Bruno C. da Silva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-task reinforcement learning aims to quickly identify solutions for new tasks with minimal or no additional interaction with the environment. Generalized Policy Improvement (GPI) addresses this by combining a set of base policies to produce a new one that is at least as good – though not necessarily optimal – as any individual base policy. Optimality can be ensured, particularly in the linear-reward case, via techniques that compute a Convex Coverage Set (CCS). However, these are computationally expensive and do not scale to complex domains. The Option Keyboard (OK) improves upon GPI by producing policies that are at least as good – and often better. It achieves this through a learned meta-policy that dynamically combines base policies. However, its performance critically depends on the choice of base policies. This raises a key question: is there an optimal set of base policies – an optimal behavior basis – that enables zero-shot identification of optimal solutions for any linear tasks? We solve this open problem by introducing a novel method that efficiently constructs such an optimal behavior basis. We show that it significantly reduces the number of base policies needed to ensure optimality in new tasks. We also prove that it is strictly more expressive than a CCS, enabling particular classes of non-linear tasks to be solved optimally. We empirically evaluate our technique in challenging domains and show that it outperforms state-of-the-art approaches, increasingly so as task complexity increases.
zh
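The paper builds on GPI with successor features; for context, the standard GPI action-selection rule that an optimal behavior basis would feed is a two-line computation. A hedged numpy sketch with made-up numbers:

```python
import numpy as np

def gpi_action(psi, w):
    """Generalized Policy Improvement with successor features.
    psi : (n_policies, n_actions, d) successor features of the base
          policies at the current state; w : (d,) linear task weights.
    Acts greedily w.r.t. the best base policy's Q-value for each action."""
    q = psi @ w                        # (n_policies, n_actions)
    return int(q.max(axis=0).argmax())

# Demo: 2 base policies, 3 actions, 2-dim features, and a new task w.
psi = np.array([[[1.0, 0.0], [0.2, 0.1], [0.0, 0.5]],
                [[0.0, 1.0], [0.3, 0.3], [0.6, 0.0]]])
print(gpi_action(psi, w=np.array([0.5, 0.5])))
```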

[AI-54] The Coral Protocol: Open Infrastructure Connecting The Internet of Agents

【速读】: This paper addresses interoperability for multiple specialized AI agents operating across domains and vendors, a growing need as organizations deploy diverse AI agents that must work together. The key to the solution is the Coral Protocol, a foundational platform for multi-agent AI ecosystems that establishes a common language and coordination framework by introducing standardized messaging formats, a modular coordination mechanism, and secure team-formation capabilities, thereby ensuring efficient and trustworthy interactions among agents.

链接: https://arxiv.org/abs/2505.00749
作者: Roman J. Georgio,Caelum Forder,Suman Deb,Peter Carroll,Önder Gürcan
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 31 pages, 3 figures, Whitepaper

点击查看摘要

Abstract:The Coral Protocol is an open and decentralized collaboration infrastructure that enables communication, coordination, trust and payments for The Internet of Agents. It addresses the growing need for interoperability in a world where organizations are deploying multiple specialized AI agents that must work together across domains and vendors. As a foundational platform for multi-agent AI ecosystems, Coral establishes a common language and coordination framework allowing any agent to participate in complex workflows with others. Its design emphasizes broad compatibility, security, and vendor neutrality, ensuring that agent interactions are efficient and trustworthy. In particular, Coral introduces standardized messaging formats for agent communication, a modular coordination mechanism for orchestrating multi-agent tasks, and secure team formation capabilities for dynamically assembling trusted groups of agents. Together, these innovations position Coral Protocol as a cornerstone of the emerging “Internet of Agents,” unlocking new levels of automation, collective intelligence, and business value through open agent collaboration.
zh
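As a purely hypothetical illustration of what a standardized agent-message envelope might look like (the actual Coral wire format is specified in the whitepaper; every field name below is invented for illustration):

```python
import json
import uuid
from dataclasses import dataclass, asdict

@dataclass
class AgentMessage:
    """Hypothetical standardized agent-to-agent message envelope, sketching
    the kind of structure a protocol like Coral standardizes."""
    msg_id: str
    sender: str          # e.g. "vendorA/planner-agent"
    recipient: str
    thread: str          # conversation/workflow the message belongs to
    performative: str    # e.g. "request", "inform", "propose"
    payload: dict

msg = AgentMessage(
    msg_id=str(uuid.uuid4()),
    sender="vendorA/planner",
    recipient="vendorB/executor",
    thread="order-42",
    performative="request",
    payload={"task": "book_shipment", "deadline": "2025-05-10"},
)
print(json.dumps(asdict(msg), indent=2))
```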

[AI-55] ROSA: A Knowledge-based Solution for Robot Self-Adaptation

【速读】: This paper addresses the software-architecture design and task decision-making challenges that autonomous robots face when performing multiple tasks in diverse environments under uncertainty, where different contexts may require distinct task logic and architectural configurations. The key to the solution is the ROSA (RObot Self-Adaptation) framework, which captures all required application-specific knowledge in a knowledge model and reasons over that knowledge at runtime to determine when and how adaptation should occur, thereby enabling task-and-architecture co-adaptation (TACA).

链接: https://arxiv.org/abs/2505.00733
作者: Gustavo Rezende Silva,Juliane Päßler,S. Lizeth Tapia Tarifa,Einar Broch Johnsen,Carlos Hernández Corbato
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Autonomous robots must operate in diverse environments and handle multiple tasks despite uncertainties. This creates challenges in designing software architectures and task decision-making algorithms, as different contexts may require distinct task logic and architectural configurations. To address this, robotic systems can be designed as self-adaptive systems capable of adapting their task execution and software architecture at runtime based on their context. This paper introduces ROSA, a novel knowledge-based framework for RObot Self-Adaptation, which enables task-and-architecture co-adaptation (TACA) in robotic systems. ROSA achieves this by providing a knowledge model that captures all application-specific knowledge required for adaptation and by reasoning over this knowledge at runtime to determine when and how adaptation should occur. In addition to a conceptual framework, this work provides an open-source ROS 2-based reference implementation of ROSA and evaluates its feasibility and performance in an underwater robotics application. Experimental results highlight ROSA’s advantages in reusability and development effort for designing self-adaptive robotic systems.
zh
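A deliberately tiny, hypothetical sketch of knowledge-based task-and-architecture co-adaptation in ROSA's spirit: rules map observed context to a task variant plus an architectural configuration. ROSA itself uses a richer knowledge model, runtime reasoning, and ROS 2 integration; all names below are invented:

```python
# Rules map a context to (task variant, set of active components); the first
# matching rule wins. A stand-in for a real knowledge model, not ROSA's code.
RULES = [
    (lambda c: c["water_visibility"] < 0.3, ("inspect_slow", {"sonar", "slam_acoustic"})),
    (lambda c: c["battery"] < 0.2,          ("return_home",  {"planner_safe"})),
    (lambda c: True,                        ("inspect_fast", {"camera", "slam_visual"})),
]

def adapt(context):
    for condition, (task, components) in RULES:
        if condition(context):
            return task, components

print(adapt({"water_visibility": 0.1, "battery": 0.8}))
```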

[AI-56] Advancing Software Security and Reliability in Cloud Platforms through AI-based Anomaly Detection

【速读】: This paper addresses security issues in CI/CD pipelines, in particular cybersecurity threats in cloud environments signalled by anomalous network-traffic patterns. The key to the solution is AI-supported anomaly detection: a combination of a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) analyses network-traffic patterns to identify anomalous behaviour in pipelines and cloud platforms, thereby improving software security and reliability in DevOps practice.

链接: https://arxiv.org/abs/2411.09200
作者: Sabbir M. Saleh,Ibrahim Mohammed Sayem,Nazim Madhavji,John Steinbacher
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages

点击查看摘要

Abstract:Continuous Integration/Continuous Deployment (CI/CD) is fundamental for advanced software development, supporting faster and more efficient delivery of code changes into cloud environments. However, security issues in the CI/CD pipeline remain challenging, and incidents (e.g., DDoS, Bot, Log4j, etc.) are happening over the cloud environments. While plenty of literature discusses static security testing and CI/CD practices, only a few deal with network traffic pattern analysis to detect different cyberattacks. This research aims to enhance CI/CD pipeline security by implementing anomaly detection through AI (Artificial Intelligence) support. The goal is to identify unusual behaviour or variations from network traffic patterns in pipeline and cloud platforms. The system shall integrate into the workflow to continuously monitor pipeline activities and cloud infrastructure. Additionally, it aims to explore adaptive response mechanisms to mitigate the detected anomalies or security threats. This research employed two popular network traffic datasets, CSE-CIC-IDS2018 and CSE-CIC-IDS2017. We implemented a combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to detect unusual traffic patterns. We achieved accuracies of 98.69% and 98.30%, and generated log files at different CI/CD pipeline stages that reflect the detected network anomalies, addressing security challenges in modern DevOps practices and contributing to advancing software security and reliability.
zh
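A minimal PyTorch sketch of a CNN+LSTM traffic-window classifier of the kind described; layer sizes and shapes are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class CNNLSTMDetector(nn.Module):
    """Sketch of a CNN+LSTM network-traffic anomaly detector.
    Input: (batch, time_steps, n_features) windows of flow statistics."""
    def __init__(self, n_features, hidden=64, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(                 # local per-window patterns
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True)  # temporal context
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                    # x: (B, T, F)
        z = self.conv(x.transpose(1, 2))     # Conv1d expects (B, F, T)
        out, _ = self.lstm(z.transpose(1, 2))
        return self.head(out[:, -1])         # classify from the last step

model = CNNLSTMDetector(n_features=20)
print(model(torch.randn(4, 30, 20)).shape)   # torch.Size([4, 2])
```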

[AI-57] Differentiable Nonlinear Model Predictive Control

【速读】: This paper addresses the efficient computation of parametric solution sensitivities in nonlinear model predictive control (NMPC) when integrating learning-enhanced methods, since these sensitivities are crucial for many learning algorithms. The key to the solution is computing solution sensitivities of general nonlinear programs (NLPs) using the implicit function theorem (IFT) and the smoothed optimality conditions treated in interior-point methods (IPM), with the sensitivity computation implemented inside a sequential quadratic programming (SQP) method that employs an IPM for the quadratic subproblems.

链接: https://arxiv.org/abs/2505.01353
作者: Jonathan Frey,Katrin Baumgärtner,Gianluca Frison,Dirk Reinhardt,Jasper Hoffmann,Leonard Fichtner,Sebastien Gros,Moritz Diehl
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 page, 4 figures, 2 tables

点击查看摘要

Abstract:The efficient computation of parametric solution sensitivities is a key challenge in the integration of learning-enhanced methods with nonlinear model predictive control (MPC), as their availability is crucial for many learning algorithms. While approaches presented in the machine learning community are limited to convex or unconstrained formulations, this paper discusses the computation of solution sensitivities of general nonlinear programs (NLPs) using the implicit function theorem (IFT) and smoothed optimality conditions treated in interior-point methods (IPM). We detail sensitivity computation within a sequential quadratic programming (SQP) method which employs an IPM for the quadratic subproblems. The publication is accompanied by an efficient open-source implementation within the framework, providing both forward and adjoint sensitivities for general optimal control problems, achieving speedups exceeding 3x over the state-of-the-art solver this http URL.
zh
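The backbone identity is the implicit function theorem: if the optimality residual satisfies F(z*(p), p) = 0, then dz*/dp = -(∂F/∂z)^{-1} ∂F/∂p. A numpy toy on an unconstrained quadratic, checked against finite differences (the paper applies this to the smoothed IPM optimality conditions of general NLPs inside an SQP method):

```python
import numpy as np

# Toy parametric problem: z*(p) = argmin_z 0.5*z'Qz - p'z, so the optimality
# residual is F(z, p) = Qz - p = 0 and the IFT gives dz*/dp = Q^{-1}.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def solve(p):                 # the inner "solver" (here: exact)
    return np.linalg.solve(Q, p)

def sensitivity():
    dF_dz = Q                 # Jacobian of the residual w.r.t. z
    dF_dp = -np.eye(2)        # Jacobian of the residual w.r.t. p
    return -np.linalg.solve(dF_dz, dF_dp)

p = np.array([1.0, 2.0])
eps = 1e-6
fd = np.column_stack([(solve(p + eps * e) - solve(p)) / eps for e in np.eye(2)])
print(np.allclose(sensitivity(), fd, atol=1e-5))   # True
```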

[AI-58] Multivariate Conformal Selection ICML2025

【速读】: This paper addresses selecting high-quality candidates from large datasets in applications such as drug discovery, precision medicine, and alignment of Large Language Models (LLMs). The existing Conformal Selection (CS) method provides rigorous uncertainty quantification but is limited to univariate responses and scalar criteria. To remove this limitation, the paper proposes Multivariate Conformal Selection (mCS), whose key idea is to introduce regional monotonicity and construct conformal p-values from multivariate nonconformity scores, enabling finite-sample False Discovery Rate (FDR) control.

链接: https://arxiv.org/abs/2505.00917
作者: Tian Bai,Yue Zhao,Xiang Yu,Archer Y. Yang
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 25 pages, 4 figures. Accepted to ICML 2025

点击查看摘要

Abstract:Selecting high-quality candidates from large datasets is critical in applications such as drug discovery, precision medicine, and alignment of large language models (LLMs). While Conformal Selection (CS) provides rigorous uncertainty quantification, it is limited to univariate responses and scalar criteria. To address this issue, we propose Multivariate Conformal Selection (mCS), a generalization of CS designed for multivariate response settings. Our method introduces regional monotonicity and employs multivariate nonconformity scores to construct conformal p-values, enabling finite-sample False Discovery Rate (FDR) control. We present two variants: mCS-dist, using distance-based scores, and mCS-learn, which learns optimal scores via differentiable optimization. Experiments on simulated and real-world datasets demonstrate that mCS significantly improves selection power while maintaining FDR control, establishing it as a robust framework for multivariate selection tasks.
zh
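For orientation, here is the univariate conformal-selection backbone that mCS generalizes: conformal p-values computed from "unqualified" calibration scores, followed by Benjamini-Hochberg for FDR control. The multivariate scores and regional monotonicity of mCS are not shown:

```python
import numpy as np

def conformal_pvals(null_scores, test_scores):
    """Conformal p-value of each test point: smoothed fraction of 'null'
    (unqualified) calibration scores at least as large as the test score."""
    null = np.asarray(null_scores)
    n = len(null)
    return np.array([(1 + np.sum(null >= s)) / (n + 1) for s in test_scores])

def benjamini_hochberg(pvals, alpha=0.1):
    """Indices selected by the Benjamini-Hochberg procedure at level alpha."""
    m = len(pvals)
    order = np.argsort(pvals)
    ok = np.nonzero(np.sort(pvals) <= alpha * np.arange(1, m + 1) / m)[0]
    k = ok.max() + 1 if ok.size else 0
    return np.sort(order[:k])

rng = np.random.default_rng(1)
null_scores = rng.normal(0, 1, 500)                # calibration, unqualified
test_scores = np.r_[rng.normal(0, 1, 80), rng.normal(3, 1, 20)]  # 20 promising
selected = benjamini_hochberg(conformal_pvals(null_scores, test_scores))
print(len(selected), "candidates selected")
```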

Machine Learning

[LG-0] Computational Data-Driven and Physics-Informed Machine Learning Approaches for Microstructure Modeling in Metal Additive Manufacturing

链接: https://arxiv.org/abs/2505.01424
作者: D. Patel,R. Sharma,Y.B. Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Metal additive manufacturing enables unprecedented design freedom and the production of customized, complex components. However, the rapid melting and solidification dynamics inherent to metal AM processes generate heterogeneous, non-equilibrium microstructures that significantly impact mechanical properties and subsequent functionality. Predicting microstructure and its evolution across spatial and temporal scales remains a central challenge for process optimization and defect mitigation. While conventional experimental techniques and physics-based simulations provide a physical foundation and valuable insights, they face critical limitations. In contrast, data-driven machine learning offers an alternative prediction approach and powerful pattern recognition but often operates as a black box, lacking generalizability and physical consistency. To overcome these limitations, physics-informed machine learning, including physics-informed neural networks, has emerged as a promising paradigm by embedding governing physical laws into neural network architectures, thereby enhancing accuracy, transparency, data efficiency, and extrapolation capabilities. This work presents a comprehensive evaluation of modeling strategies for microstructure prediction in metal AM. The strengths and limitations of experimental, computational, and data-driven methods are analyzed in depth, highlighting recent advances in hybrid PIML frameworks that integrate physical knowledge with ML. Key challenges, such as data scarcity, multi-scale coupling, and uncertainty quantification, are discussed alongside future directions. Ultimately, this assessment underscores the importance of PIML-based hybrid approaches in enabling predictive, scalable, and physically consistent microstructure modeling for site-specific, microstructure-aware process control and the reliable production of high-performance AM components.

[LG-1] Evaluating Frontier Models for Stealth and Situational Awareness

链接: https://arxiv.org/abs/2505.01420
作者: Mary Phuong,Roland S. Zimmermann,Ziyue Wang,David Lindner,Victoria Krakovna,Sarah Cogan,Allan Dafoe,Lewis Ho,Rohin Shah
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent work has demonstrated the plausibility of frontier AI models scheming – knowingly and covertly pursuing an objective misaligned with its developer’s intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose severe loss of control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model’s ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.

[LG-2] How Effective are Large Time Series Models in Hydrology? A Study on Water Level Forecasting in Everglades

链接: https://arxiv.org/abs/2505.01415
作者: Rahuul Rangaraj,Jimeng Shi,Azam Shirali,Rajendra Paudel,Yanzhao Wu,Giri Narasimhan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Everglades play a crucial role in flood and drought regulation, water resource planning, and ecosystem management in the surrounding regions. However, traditional physics-based and statistical methods for predicting water levels often face significant challenges, including high computational costs and limited adaptability to diverse or unforeseen conditions. Recent advancements in large time series models have demonstrated the potential to address these limitations, with state-of-the-art deep learning and foundation models achieving remarkable success in time series forecasting across various domains. Despite this progress, their application to critical environmental systems, such as the Everglades, remains underexplored. In this study, we fill the gap by investigating twelve task-specific models and five time series foundation models across six categories for a real-world application focused on water level prediction in the Everglades. Our primary results show that the foundation model, Chronos, significantly outperforms all other models while the remaining foundation models exhibit relatively poor performance. Moreover, the performance of task-specific models varies with the model architectures. Lastly, we discuss the possible reasons for the varying performance of models.

[LG-3] Predicting the Price of Gold in the Financial Markets Using Hybrid Models

链接: https://arxiv.org/abs/2505.01402
作者: Mohammadhossein Rashidi,Mohammad Modarres
类目: Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:Predicting prices with the least error and the highest accuracy has long been one of the most challenging issues and critical concerns for capital-market practitioners and researchers, so a model that delivers highly accurate results is of broad interest. In this project, time series models such as ARIMA are used to estimate the price, while variables and indicators from technical analysis capture the behavior of traders, bringing psychological factors into the model. By feeding all of these variables into stepwise regression, we identify the variables that most influence the prediction target. Finally, we use the selected variables as inputs to an artificial neural network. In other words, we call this whole prediction process the “ARIMA_Stepwise Regression_Neural Network” model and use it to predict the price of gold in international financial markets. The approach is expected to generalize to stocks, commodities, currency pairs, financial-market indices, and other instruments traded in local and international markets. A comparison between this method and pure time series methods is also presented. Based on the results, the resulting hybrid model achieves the highest accuracy compared with the time series method, regression, and stepwise regression.
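A hedged end-to-end sketch of the described ARIMA -> stepwise selection -> neural network pipeline on synthetic data; the paper's actual features include technical indicators and trader-psychology proxies, and forward selection stands in here for stepwise regression:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
price = np.cumsum(rng.normal(0, 1, 300)) + 100        # synthetic price series

arima_pred = ARIMA(price, order=(1, 1, 1)).fit().predict(start=1, end=len(price) - 1)

# Candidate features for predicting price[t]: three lags + the ARIMA forecast.
X = np.column_stack([price[4:-1], price[3:-2], price[2:-3], arima_pred[4:]])
y = price[5:]

# "Stepwise" variable selection (forward selection as a stand-in).
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2).fit(X, y)
Xs = sfs.transform(X)

nn = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0).fit(Xs, y)
print("in-sample R^2:", round(nn.score(Xs, y), 3))
```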

[LG-4] Learning and Transferring Physical Models through Derivatives

链接: https://arxiv.org/abs/2505.01391
作者: Alessandro Trenta,Andrea Cossu,Davide Bacciu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose Derivative Learning (DERL), a supervised approach that models physical systems by learning their partial derivatives. We also leverage DERL to build physical models incrementally, by designing a distillation protocol that effectively transfers knowledge from a pre-trained to a student model. We provide theoretical guarantees that our approach can learn the true physical system, being consistent with the underlying physical laws, even when using empirical derivatives. DERL outperforms state-of-the-art methods in generalizing an ODE to unseen initial conditions and a parametric PDE to unseen parameters. We finally propose a method based on DERL to transfer physical knowledge across models by extending them to new portions of the physical domain and new range of PDE parameters. We believe this is the first attempt at building physical models incrementally in multiple stages.

[LG-5] Carbon Aware Transformers Through Joint Model-Hardware Optimization

链接: https://arxiv.org/abs/2505.01386
作者: Irene Wang,Newsha Ardalani,Mostafa Elhoushi,Daniel Jiang,Samuel Hsia,Ekin Sumbul,Divya Mahajan,Carole-Jean Wu,Bilge Acun
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:The rapid growth of machine learning (ML) systems necessitates a more comprehensive evaluation of their environmental impact, particularly their carbon footprint, which comprises operational carbon from training and inference execution and embodied carbon from hardware manufacturing and its entire life-cycle. Despite the increasing importance of embodied emissions, there is a lack of tools and frameworks to holistically quantify and optimize the total carbon footprint of ML systems. To address this, we propose CATransformers, a carbon-aware architecture search framework that enables sustainability-driven co-optimization of ML models and hardware architectures. By incorporating both operational and embodied carbon metrics into early design space exploration of domain-specific hardware accelerators, CATransformers demonstrates that optimizing for carbon yields design choices distinct from those optimized solely for latency or energy efficiency. We apply our framework to multi-modal CLIP-based models, producing CarbonCLIP, a family of CLIP models achieving up to 17% reduction in total carbon emissions while maintaining accuracy and latency compared to state-of-the-art edge small CLIP baselines. This work underscores the need for holistic optimization methods to design high-performance, environmentally sustainable AI systems.

[LG-6] Stabilizing Temporal Difference Learning via Implicit Stochastic Approximation

链接: https://arxiv.org/abs/2505.01361
作者: Hwanwoo Kim,Panos Toulis,Eric Laber
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Temporal Difference (TD) learning is a foundational algorithm in reinforcement learning (RL). For nearly forty years, TD learning has served as a workhorse for applied RL as well as a building block for more complex and specialized algorithms. However, despite its widespread use, it is not without drawbacks, the most prominent being its sensitivity to step size. A poor choice of step size can dramatically inflate the error of value estimates and slow convergence. Consequently, in practice, researchers must use trial and error in order to identify a suitable step size – a process that can be tedious and time consuming. As an alternative, we propose implicit TD algorithms that reformulate TD updates into fixed-point equations. These updates are more stable and less sensitive to step size without sacrificing computational efficiency. Moreover, our theoretical analysis establishes asymptotic convergence guarantees and finite-time error bounds. Our results demonstrate their robustness and practicality for modern RL tasks, establishing implicit TD as a versatile tool for policy evaluation and value approximation.
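For intuition, the implicit reformulation of linear TD(0) has a closed form: solving the fixed-point equation θ⁺ = θ + α·δ(θ⁺)·φ for θ⁺ shrinks the effective step by 1/(1 + α‖φ‖²), which is what buys robustness to large step sizes. A sketch of one common implicit-TD(0) variant (see the paper for the exact algorithms and guarantees):

```python
import numpy as np

def implicit_td0_step(theta, phi, r, phi_next, alpha, gamma=0.99):
    """One implicit TD(0) update for linear values V(s) = theta @ phi(s).
    Solving theta+ = theta + alpha*delta(theta+)*phi in closed form divides
    the step by (1 + alpha*||phi||^2), so a large alpha cannot blow it up."""
    delta = r + gamma * theta @ phi_next - theta @ phi   # TD error at theta
    return theta + (alpha * delta / (1.0 + alpha * phi @ phi)) * phi

theta = np.zeros(4)
phi, phi_next = np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0])
# alpha = 10 would make explicit TD(0) overshoot; the implicit step stays tame.
print(implicit_td0_step(theta, phi, r=1.0, phi_next=phi_next, alpha=10.0))
```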

[LG-7] Learning Stabilizing Policies via an Unstable Subspace Representation

链接: https://arxiv.org/abs/2505.01348
作者: Leonardo F. Toso,Lintao Ye,James Anderson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of learning to stabilize (LTS) a linear time-invariant (LTI) system. Policy gradient (PG) methods for control assume access to an initial stabilizing policy. However, designing such a policy for an unknown system is one of the most fundamental problems in control, and it may be as hard as learning the optimal policy itself. Existing work on the LTS problem requires large data as it scales quadratically with the ambient dimension. We propose a two-phase approach that first learns the left unstable subspace of the system and then solves a series of discounted linear quadratic regulator (LQR) problems on the learned unstable subspace, targeting to stabilize only the system’s unstable dynamics and reduce the effective dimension of the control space. We provide non-asymptotic guarantees for both phases and demonstrate that operating on the unstable subspace reduces sample complexity. In particular, when the number of unstable modes is much smaller than the state dimension, our analysis reveals that LTS on the unstable subspace substantially speeds up the stabilization process. Numerical experiments are provided to support this sample complexity reduction achieved by our approach.
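Phase one targets the left unstable subspace; for a known A it is just an eigenproblem, sketched below. The paper's point is estimating this subspace from data, and complex eigenvalue pairs would warrant a real Schur decomposition rather than this simple route:

```python
import numpy as np

def left_unstable_subspace(A, threshold=1.0):
    """Orthonormal basis of the left unstable subspace of a discrete-time A:
    left eigenvectors whose eigenvalues satisfy |lambda| >= threshold."""
    eigvals, left_vecs = np.linalg.eig(A.T)   # eigvecs of A^T = left eigvecs of A
    mask = np.abs(eigvals) >= threshold
    basis, _ = np.linalg.qr(np.real(left_vecs[:, mask]))
    return basis

A = np.diag([1.5, 0.9, 0.3])                  # exactly one unstable mode
print(left_unstable_subspace(A))              # spans the first coordinate axis
```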

[LG-8] How to Learn a Star: Binary Classification with Starshaped Polyhedral Sets

链接: https://arxiv.org/abs/2505.01346
作者: Marie-Charlotte Brandenburg,Katharina Jochemko
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Combinatorics (math.CO); Metric Geometry (math.MG)
*备注: 22 pages, 8 figures

点击查看摘要

Abstract:We consider binary classification restricted to a class of continuous piecewise linear functions whose decision boundaries are (possibly nonconvex) starshaped polyhedral sets, supported on a fixed polyhedral simplicial fan. We investigate the expressivity of these function classes and describe the combinatorial and geometric structure of the loss landscape, most prominently the sublevel sets, for two loss-functions: the 0/1-loss (discrete loss) and an exponential loss function. In particular, we give explicit bounds on the VC dimension of this model, and concretely describe the sublevel sets of the discrete loss as chambers in a hyperplane arrangement. For the exponential loss, we give sufficient conditions for the optimum to be unique, and describe the geometry of the optimum when varying the rate parameter of the underlying exponential probability distribution.

[LG-9] Enhancing Diversity in Parallel Agents : A Maximum State Entropy Exploration Story

链接: https://arxiv.org/abs/2505.01336
作者: Vincenzo De Paola,Riccardo Zamboni,Mirco Mutti,Marcello Restelli
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parallel data collection has redefined Reinforcement Learning (RL), unlocking unprecedented efficiency and powering breakthroughs in large-scale real-world applications. In this paradigm, N identical agents operate in N replicas of an environment simulator, accelerating data collection by a factor of N. A critical question arises: “Does specializing the policies of the parallel agents hold the key to surpass the N factor acceleration?” In this paper, we introduce a novel learning framework that maximizes the entropy of collected data in a parallel setting. Our approach carefully balances the entropy of individual agents with inter-agent diversity, effectively minimizing redundancies. The latter idea is implemented with a centralized policy gradient method, which shows promise when evaluated empirically against systems of identical agents, as well as synergy with batch RL techniques that can exploit data diversity. Finally, we provide an original concentration analysis that shows faster rates for specialized parallel sampling distributions, which supports our methodology and may be of independent interest.

[LG-10] Integration of Multi-Mode Preference into Home Energy Management System Using Deep Reinforcement Learning

链接: https://arxiv.org/abs/2505.01332
作者: Mohammed Sumayli,Olugbenga Moses Anubi
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Applications (stat.AP)
*备注: Accepted for publication in ASME journal of engineering for sustainable buildings and cities

点击查看摘要

Abstract:Home Energy Management Systems (HEMS) have emerged as a pivotal tool in the smart home ecosystem, aiming to enhance energy efficiency, reduce costs, and improve user comfort. By enabling intelligent control and optimization of household energy consumption, HEMS plays a significant role in bridging the gap between consumer needs and energy utility objectives. However, much of the existing literature construes consumer comfort as a mere deviation from the standard appliance settings. Such deviations are typically incorporated into optimization objectives via static weighting factors. These factors often overlook the dynamic nature of consumer behaviors and preferences. Addressing this oversight, our paper introduces a multi-mode Deep Reinforcement Learning-based HEMS (DRL-HEMS) framework, meticulously designed to optimize based on dynamic, consumer-defined preferences. Our primary goal is to augment consumer involvement in Demand Response (DR) programs by embedding dynamic multi-mode preferences tailored to individual appliances. In this study, we leverage a model-free, single-agent DRL algorithm to deliver a HEMS framework that is not only dynamic but also user-friendly. To validate its efficacy, we employed real-world data at 15-minute intervals, including metrics such as electricity price, ambient temperature, and appliances’ power consumption. Our results show that the model performs exceptionally well in optimizing energy consumption within different preference modes. Furthermore, when compared to traditional algorithms based on Mixed-Integer Linear Programming (MILP), our model achieves nearly optimal performance while outperforming in computational efficiency.

[LG-11] Model See Model Do: Speech-Driven Facial Animation with Style Control SIGGRAPH

链接: https://arxiv.org/abs/2505.01319
作者: Yifang Pan,Karan Singh,Luiz Gustavo Hafemann
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, SIGGRAPH Conference Papers '25

点击查看摘要

Abstract:Speech-driven 3D facial animation plays a key role in applications such as virtual avatars, gaming, and digital content creation. While existing methods have made significant progress in achieving accurate lip synchronization and generating basic emotional expressions, they often struggle to capture and effectively transfer nuanced performance styles. We propose a novel example-based generation framework that conditions a latent diffusion model on a reference style clip to produce highly expressive and temporally coherent facial animations. To address the challenge of accurately adhering to the style reference, we introduce a novel conditioning mechanism called style basis, which extracts key poses from the reference and additively guides the diffusion generation process to fit the style without compromising lip synchronization quality. This approach enables the model to capture subtle stylistic cues while ensuring that the generated animations align closely with the input speech. Extensive qualitative, quantitative, and perceptual evaluations demonstrate the effectiveness of our method in faithfully reproducing the desired style while achieving superior lip synchronization across various speech scenarios.

[LG-12] MultiGran-STGCNFog: Towards Accurate and High-Throughput Inference for Multi-Granular Spatiotemporal Traffic Forecasting

链接: https://arxiv.org/abs/2505.01279
作者: Zhaoyan Wang,Xiangchi Song,In-Young Ko
类目: Machine Learning (cs.LG)
*备注: nine pages and five figures included

点击查看摘要

Abstract:Accurate traffic forecasting and swift inference provision are essential for intelligent transportation systems. However, the present Graph Convolutional Network (GCN)-based approaches cannot sufficiently extract and fuse multi-granular spatiotemporal features across various spatial and temporal scales, which has been shown to yield less accurate forecasts. Besides, additional feature extraction branches introduced in prior studies critically increased model complexity and extended inference time, making it challenging to provide fast inference for traffic forecasting. In this paper, we propose MultiGran-STGCNFog, an efficient fog distributed inference system with a novel traffic forecasting model that employs multi-granular spatiotemporal feature fusion on generated dynamic traffic graphs to fully capture interdependent traffic dynamics. The proposed scheduling algorithm GA-DPHDS, optimizing layer execution order and layer-device scheduling scheme simultaneously, contributes to considerable inference throughput improvement by leveraging heterogeneous fog devices in a pipelined manner. Extensive experiments on real-world datasets demonstrate the superiority of the proposed method over selected baselines.

[LG-13] mwBTFreddy: A Dataset for Flash Flood Damage Assessment in Urban Malawi

链接: https://arxiv.org/abs/2505.01242
作者: Evelyn Chapuma,Grey Mengezi,Lewis Msasa,Amelia Taylor
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper describes the mwBTFreddy dataset, a resource developed to support flash flood damage assessment in urban Malawi, specifically focusing on the impacts of Cyclone Freddy in 2023. The dataset comprises paired pre- and post-disaster satellite images sourced from Google Earth Pro, accompanied by JSON files containing labelled building annotations with geographic coordinates and damage levels (no damage, minor, major, or destroyed). Developed by the Kuyesera AI Lab at the Malawi University of Business and Applied Sciences, this dataset is intended to facilitate the development of machine learning models tailored to building detection and damage classification in African urban contexts. It also supports flood damage visualisation and spatial analysis to inform decisions on relocation, infrastructure planning, and emergency response in climate-vulnerable regions.

[LG-14] Quantitative Attractor Analysis of High-Capacity Kernel Logistic Regression Hopfield Networks

链接: https://arxiv.org/abs/2505.01218
作者: Akira Tamamori
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:Traditional Hopfield networks, using Hebbian learning, face severe storage capacity limits (≈ 0.14 P/N) and spurious attractors. Kernel Logistic Regression (KLR) offers a non-linear approach, mapping patterns to high-dimensional feature spaces for improved separability. Our previous work showed KLR dramatically improves capacity and noise robustness over conventional methods. This paper quantitatively analyzes the attractor structures in KLR-trained networks via extensive simulations. We evaluated recall from diverse initial states across wide storage loads (up to 4.0 P/N) and noise levels. We quantified convergence rates and speed. Our analysis confirms KLR’s superior performance: high capacity (up to 4.0 P/N) and robustness. The attractor landscape is remarkably “clean,” with near-zero spurious fixed points. Recall failures under high load/noise are primarily due to convergence to other learned patterns, not spurious ones. Dynamics are exceptionally fast (typically 1-2 steps for high-similarity states). This characterization reveals how KLR reshapes dynamics for high-capacity associative memory, highlighting its effectiveness and contributing to AM understanding.

[LG-15] AGRO: An Autonomous AI Rover for Precision Agriculture

链接: https://arxiv.org/abs/2505.01200
作者: Simar Ghumman,Fabio Di Troia,William Andreopoulos,Mark Stamp,Sanjit Rai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unmanned Ground Vehicles (UGVs) are emerging as a crucial tool in the world of precision agriculture. The combination of UGVs with machine learning allows us to find solutions for a range of complex agricultural problems. This research focuses on developing a UGV capable of autonomously traversing agricultural fields and capturing data. The project, known as AGRO (Autonomous Ground Rover Observer) leverages machine learning, computer vision and other sensor technologies. AGRO uses its capabilities to determine pistachio yields, performing self-localization and real-time environmental mapping while avoiding obstacles. The main objective of this research work is to automate resource-consuming operations so that AGRO can support farmers in making data-driven decisions. Furthermore, AGRO provides a foundation for advanced machine learning techniques as it captures the world around it.

[LG-16] CaReAQA: A Cardiac and Respiratory Audio Question Answering Model for Open-Ended Diagnostic Reasoning

链接: https://arxiv.org/abs/2505.01199
作者: Tsai-Ning Wang,Lin-Lin Chen,Neil Zeghidour,Aaqib Saeed
类目: Machine Learning (cs.LG)
*备注: Accepted at AHLI CHIL 2025

点击查看摘要

Abstract:Medical audio signals, such as heart and lung sounds, play a crucial role in clinical diagnosis. However, analyzing these signals remains challenging: traditional methods rely on handcrafted features or supervised deep learning models that demand extensive labeled datasets, limiting their scalability and applicability. To address these issues, we propose CaReAQA, an audio-language model that integrates a foundation audio model with the reasoning capabilities of large language models, enabling clinically relevant, open-ended diagnostic responses. Alongside CaReAQA, we introduce CaReSound, a benchmark dataset of annotated medical audio recordings enriched with metadata and paired question-answer examples, intended to drive progress in diagnostic reasoning research. Evaluation results show that CaReAQA achieves 86.2% accuracy on open-ended diagnostic reasoning tasks, outperforming baseline models. It also generalizes well to closed-ended classification tasks, achieving an average accuracy of 56.9% on unseen datasets. Our findings show how audio-language integration and reasoning advances medical diagnostics, enabling efficient AI systems for clinical decision support.

[LG-17] A Secured Triad of IoT Machine Learning and Blockchain for Crop Forecasting in Agriculture

链接: https://arxiv.org/abs/2505.01196
作者: Najmus Sakib Sizan,Md. Abu Layek,Khondokar Fida Hasan
类目: Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:To improve crop forecasting and provide farmers with actionable data-driven insights, we propose a novel approach integrating IoT, machine learning, and blockchain technologies. Using IoT, real-time data from sensor networks continuously monitor environmental conditions and soil nutrient levels, significantly improving our understanding of crop growth dynamics. Our study demonstrates the exceptional accuracy of the Random Forest model, achieving a 99.45% accuracy rate in predicting optimal crop types and yields, thereby offering precise crop projections and customized recommendations. To ensure the security and integrity of the sensor data used for these forecasts, we integrate the Ethereum blockchain, which provides a robust and secure platform. This ensures that the forecasted data remain tamper-proof and reliable. Stakeholders can access real-time and historical crop projections through an intuitive online interface, enhancing transparency and facilitating informed decision-making. By presenting multiple predicted crop scenarios, our system enables farmers to optimize production strategies effectively. This integrated approach promises significant advances in precision agriculture, making crop forecasting more accurate, secure, and user-friendly.

[LG-18] Empirical Comparison of Lightweight Forecasting Models for Seasonal and Non-Seasonal Time Series

链接: https://arxiv.org/abs/2505.01163
作者: Thanh Son Nguyen,Dang Minh Duc Nguyen,Van Thanh Nguyen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate time series forecasting is essential in many real-time applications that demand both high predictive accuracy and computational efficiency. This study provides an empirical comparison between a Polynomial Classifier (PC) and a Radial Basis Function Neural Network (RBFNN) across four real-world time series datasets (weather conditions, gold prices, crude oil prices, and beer production volumes) that cover both seasonal and non-seasonal patterns. Model performance is evaluated by forecasting accuracy (using Mean Absolute Error, Root Mean Squared Error, and Coefficient of Variation of Root Mean Squared Error) and computational time to assess each model’s viability for real-time forecasting. The results show that the PC yields more accurate and faster forecasts for non-seasonal series, whereas the RBFNN performs better on series with pronounced seasonal patterns. From an interpretability standpoint, the polynomial model offers a simpler, more transparent structure (in contrast to the black-box nature of neural networks), which is advantageous for understanding and trust in real-time decision making. The performance differences between PC and RBFNN are statistically significant, as confirmed by paired t-tests and Wilcoxon signed-rank tests. These findings provide practical guidance for model selection in time series forecasting, indicating that PC may be preferable for quick, interpretable forecasts in non-seasonal contexts, whereas RBFNN is superior for capturing complex seasonal behaviors.

[LG-19] TActiLE: Tiny Active LEarning for wearable devices IJCNN

链接: https://arxiv.org/abs/2505.01160
作者: Massimo Pavan,Claudio Galimberti,Manuel Roveri
类目: Machine Learning (cs.LG)
*备注: Accepted to the “Eyes Of The Future: Integrating Artificial Intelligence in Smart Eyewear (IAISE)” Workshop, Held at the “International Joint Conference on Neural Networks (IJCNN) 2025”

点击查看摘要

Abstract:Tiny Machine Learning (TinyML) algorithms have seen extensive use in recent years, enabling wearable devices to be not only connected but also genuinely intelligent by running machine learning (ML) computations directly on-device. Among such devices, smart glasses have particularly benefited from TinyML advancements. TinyML facilitates the on-device execution of the inference phase of ML algorithms on embedded and wearable devices, and more recently, it has expanded into On-device Learning (ODL), which allows both inference and learning phases to occur directly on the device. The application of ODL techniques to wearable devices is particularly compelling, as it enables the development of more personalized models that adapt based on the data of the user. However, one of the major challenges of ODL algorithms is the scarcity of labeled data collected on-device. In smart wearable contexts, requiring users to manually label large amounts of data is often impractical and could lead to user disengagement with the technology. To address this issue, this paper explores the application of Active Learning (AL) techniques, i.e., techniques that aim at minimizing the labeling effort, by actively selecting from a large quantity of unlabeled data only a small subset to be labeled and added to the training set of the algorithm. In particular, we propose TActiLE, a novel AL algorithm that selects from the stream of on-device sensor data the ones that would help the ML algorithm improve the most once coupled with labels provided by the user. TActiLE is the first Active Learning technique specifically designed for the TinyML context. We evaluate its effectiveness and efficiency through experiments on multiple image classification datasets. The results demonstrate its suitability for tiny and wearable devices.
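The selection step TActiLE addresses is classically done with uncertainty criteria such as margin sampling, sketched below; the paper's rule is tailored to tiny on-device streams, so treat this only as the generic baseline:

```python
import numpy as np

def margin_select(probs, budget):
    """Pick the 'budget' unlabeled samples with the smallest margin between
    the top-2 predicted class probabilities -- a classic active-learning
    criterion. probs : (n_samples, n_classes) model predictions."""
    part = np.sort(probs, axis=1)
    margin = part[:, -1] - part[:, -2]
    return np.argsort(margin)[:budget]

probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
print(margin_select(probs, budget=1))   # -> [1], the most ambiguous sample
```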

[LG-20] Machine Learning for Physical Simulation Challenge Results and Retrospective Analysis: Power Grid Use Case

链接: https://arxiv.org/abs/2505.01156
作者: Milad Leyli-Abadi,Jérôme Picault,Antoine Marot,Jean-Patrick Brunet,Agathe Gilain,Amarsagar Reddy Ramapuram Matavalam,Shaban Ghias Satti,Quingbin Jiang,Yang Liu,Dean Justin Ninalga
类目: Machine Learning (cs.LG)
*备注: 47 pages, 12 figures, 4 table, submission to Energy and AI journal

点击查看摘要

Abstract:This paper addresses the growing computational challenges of power grid simulations, particularly with the increasing integration of renewable energy sources like wind and solar. As grid operators must analyze significantly more scenarios in near real-time to prevent failures and ensure stability, traditional physical-based simulations become computationally impractical. To tackle this, a competition was organized to develop AI-driven methods that accelerate power flow simulations by at least an order of magnitude while maintaining operational reliability. This competition utilized a regional-scale grid model with a 30% renewable energy mix, mirroring the anticipated near-future composition of the French power grid. A key contribution of this work is through the use of LIPS (Learning Industrial Physical Systems), a benchmarking framework that evaluates solutions based on four critical dimensions: machine learning performance, physical compliance, industrial readiness, and generalization to out-of-distribution scenarios. The paper provides a comprehensive overview of the Machine Learning for Physical Simulation (ML4PhySim) competition, detailing the benchmark suite, analyzing top-performing solutions that outperformed traditional simulation methods, and sharing key organizational insights and best practices for running large-scale AI competitions. Given the promising results achieved, the study aims to inspire further research into more efficient, scalable, and sustainable power network simulation methodologies.

[LG-21] CppSATD: A Reusable Self-Admitted Technical Debt Dataset in C++

链接: https://arxiv.org/abs/2505.01136
作者: Phuoc Pham,Murali Sridharan,Matteo Esposito,Valentina Lenarduzzi
类目: oftware Engineering (cs.SE); Information Retrieval (cs.IR); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:In software development, technical debt (TD) refers to suboptimal implementation choices made by the developers to meet urgent deadlines and limited resources, posing challenges for future maintenance. Self-Admitted Technical Debt (SATD) is a sub-type of TD, representing specific TD instances “openly admitted” by the developers and often expressed through source code comments. Previous research on SATD has focused predominantly on the Java programming language, revealing a significant gap in cross-language SATD. Such a narrow focus limits the generalizability of existing findings as well as SATD detection techniques across multiple programming languages. Our work addresses such limitation by introducing CppSATD, a dedicated C++ SATD dataset, comprising over 531,000 annotated comments and their source code contexts. Our dataset can serve as a foundation for future studies that aim to develop SATD detection methods in C++, generalize the existing findings to other languages, or contribute novel insights to cross-language SATD research.

[LG-22] Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts

链接: https://arxiv.org/abs/2505.01135
作者: Wenfa Wu,Guanyu Zhang,Zheng Tan,Yi Wang,Hongsheng Qi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most existing single-modal time series models rely solely on numerical series, which suffer from the limitations imposed by insufficient information. Recent studies have revealed that multimodal models can address the core issue by integrating textual information. However, these models focus on either historical or future textual information, overlooking the unique contributions each plays in time series forecasting. Besides, these models fail to grasp the intricate relationships between textual and time series data, constrained by their moderate capacity for multimodal comprehension. To tackle these challenges, we propose Dual-Forecaster, a pioneering multimodal time series model that combines both descriptively historical textual information and predictive textual insights, leveraging advanced multimodal comprehension capability empowered by three well-designed cross-modality alignment techniques. Our comprehensive evaluations on fifteen multimodal time series datasets demonstrate that Dual-Forecaster is a distinctly effective multimodal time series model that outperforms or is comparable to other state-of-the-art models, highlighting the superiority of integrating textual information for time series forecasting. This work opens new avenues in the integration of textual information with numerical time series data for multimodal time series analysis.

[LG-23] Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders

链接: https://arxiv.org/abs/2505.01134
作者: Rogelio A Mancisidor,Robert Jenssen,Shujian Yu,Michael Kampffmeyer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal learning with variational autoencoders (VAEs) requires estimating joint distributions to evaluate the evidence lower bound (ELBO). Current methods, the product and mixture of experts, aggregate single-modality distributions assuming independence for simplicity, which is an overoptimistic assumption. This research introduces a novel methodology for aggregating single-modality distributions by exploiting the principle of consensus of dependent experts (CoDE), which circumvents the aforementioned assumption. Utilizing the CoDE method, we propose a novel ELBO that approximates the joint likelihood of the multimodal data by learning the contribution of each subset of modalities. The resulting CoDE-VAE model demonstrates better performance in terms of balancing the trade-off between generative coherence and generative quality, as well as generating more precise log-likelihood estimations. CoDE-VAE further minimizes the generative quality gap as the number of modalities increases. In certain cases, it reaches a generative quality similar to that of unimodal VAEs, which is a desirable property that is lacking in most current methods. Finally, the classification accuracy achieved by CoDE-VAE is comparable to that of state-of-the-art multimodal VAE models.
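For context, the independence-based product-of-experts (PoE) aggregation that CoDE moves beyond is a precision-weighted average of per-modality Gaussian posteriors, sketched below; CoDE's contribution is relaxing exactly this independence assumption:

```python
import numpy as np

def product_of_experts(mus, sigmas):
    """Product-of-experts aggregation of per-modality Gaussian posteriors:
    precisions are summed (experts assumed independent) and the mean is
    precision-weighted. mus, sigmas : (n_modalities, latent_dim)."""
    prec = 1.0 / np.square(sigmas)
    var = 1.0 / prec.sum(axis=0)
    mu = var * (prec * mus).sum(axis=0)
    return mu, np.sqrt(var)

mus = np.array([[0.0, 1.0], [2.0, 1.0]])
sigmas = np.array([[1.0, 0.5], [1.0, 2.0]])
print(product_of_experts(mus, sigmas))
```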

[LG-24] Exploring Equity of Climate Policies using Multi-Agent Multi-Objective Reinforcement Learning IJCAI2025

链接: https://arxiv.org/abs/2505.01115
作者: Palok Biswas,Zuzanna Osika,Isidoro Tamassia,Adit Whorra,Jazmin Zatarain-Salazar,Jan Kwakkel,Frans A. Oliehoek,Pradeep K. Murukannaiah
类目: Machine Learning (cs.LG)
*备注: Accepted to IJCAI 2025, AI and Social Good Track

点击查看摘要

Abstract:Addressing climate change requires coordinated policy efforts of nations worldwide. These efforts are informed by scientific reports, which rely in part on Integrated Assessment Models (IAMs), prominent tools used to assess the economic impacts of climate policies. However, traditional IAMs optimize policies based on a single objective, limiting their ability to capture the trade-offs among economic growth, temperature goals, and climate justice. As a result, policy recommendations have been criticized for perpetuating inequalities, fueling disagreements during policy negotiations. We introduce Justice, the first framework integrating IAM with Multi-Objective Multi-Agent Reinforcement Learning (MOMARL). By incorporating multiple objectives, Justice generates policy recommendations that shed light on equity while balancing climate and economic goals. Further, using multiple agents can provide a realistic representation of the interactions among the diverse policy actors. We identify equitable Pareto-optimal policies using our framework, which facilitates deliberative decision-making by presenting policymakers with the inherent trade-offs in climate and economic policy.

[LG-25] Learning Low-Dimensional Embeddings for Black-Box Optimization

链接: https://arxiv.org/abs/2505.01112
作者: Riccardo Busetto,Manas Mejari,Marco Forgione,Alberto Bemporad,Dario Piga
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:When gradient-based methods are impractical, black-box optimization (BBO) provides a valuable alternative. However, BBO often struggles with high-dimensional problems and limited trial budgets. In this work, we propose a novel approach based on meta-learning to pre-compute a reduced-dimensional manifold where optimal points lie for a specific class of optimization problems. When optimizing a new problem instance sampled from the class, black-box optimization is carried out in the reduced-dimensional space, effectively reducing the effort required for finding near-optimal solutions.
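
As a rough illustration of the idea, the sketch below runs a cheap random search in a low-dimensional embedding z and maps candidates back to the ambient space via a linear map; here the map A is random, standing in for the manifold the paper meta-learns offline, and the objective f is a toy stand-in for the expensive black box.

```python
import numpy as np

# Minimal sketch: random search over a pre-computed low-dimensional
# embedding z -> x = x0 + A @ z. A, x0, and f are illustrative stand-ins
# for the meta-learned manifold and the true black-box objective.

rng = np.random.default_rng(0)
D, d = 50, 3                      # ambient and reduced dimensions
A = rng.standard_normal((D, d))   # placeholder for the learned embedding
x0 = np.zeros(D)

def f(x):                         # expensive black-box objective (toy here)
    return np.sum((x - 1.0) ** 2)

best_z, best_val = np.zeros(d), f(x0)
for _ in range(200):              # tiny trial budget, feasible since d << D
    z = best_z + 0.3 * rng.standard_normal(d)
    val = f(x0 + A @ z)
    if val < best_val:
        best_z, best_val = z, val
print(best_val)
```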

[LG-26] Incorporating Inductive Biases to Energy-based Generative Models

链接: https://arxiv.org/abs/2505.01111
作者: Yukun Li,Li-Ping Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the advent of score-matching techniques for model training and Langevin dynamics for sample generation, energy-based models (EBMs) have gained renewed interest as generative models. Recent EBMs usually use neural networks to define their energy functions. In this work, we introduce a novel hybrid approach that combines an EBM with an exponential family model to incorporate inductive bias into data modeling. Specifically, we augment the energy term with a parameter-free statistic function to help the model capture key data statistics. Like an exponential family model, the hybrid model aims to align the distribution statistics with data statistics during model training, even when it only approximately maximizes the data likelihood. This property enables us to impose constraints on the hybrid model. Our empirical study validates the hybrid model’s ability to match statistics. Furthermore, experimental results show that data fitting and generation improve when suitable informative statistics are incorporated into the hybrid model.
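
A minimal sketch of the hybrid energy, assuming a simple additive form E(x) = f_theta(x) + <w, T(x)> with a parameter-free statistic function T; the choice of per-sample mean and variance for T is purely illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the hybrid energy: a neural energy term plus a
# parameter-free statistic T(x), weighted like an exponential-family tilt.
# T here (per-sample mean and variance) is an illustrative assumption.

class HybridEnergy(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, 1))
        self.w = nn.Parameter(torch.zeros(2))  # natural parameters for T(x)

    def statistics(self, x):
        # parameter-free statistic function T(x)
        return torch.stack([x.mean(dim=1), x.var(dim=1)], dim=1)

    def forward(self, x):
        # E(x) = neural term + <w, T(x)>
        return self.net(x).squeeze(-1) + (self.statistics(x) * self.w).sum(dim=1)

energy = HybridEnergy(dim=8)
print(energy(torch.randn(4, 8)).shape)  # torch.Size([4])
```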

[LG-27] CIMFlow: An Integrated Framework for Systematic Design and Evaluation of Digital CIM Architectures

链接: https://arxiv.org/abs/2505.01107
作者: Yingjie Qi,Jianlei Yang,Yiou Wang,Yikun Wang,Dayu Wang,Ling Tang,Cenlin Duan,Xiaolin He,Weisheng Zhao
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 7 pages, accepted by DAC 2025

点击查看摘要

Abstract:Digital Compute-in-Memory (CIM) architectures have shown great promise in Deep Neural Network (DNN) acceleration by effectively addressing the “memory wall” bottleneck. However, the development and optimization of digital CIM accelerators are hindered by the lack of comprehensive tools that encompass both software and hardware design spaces. Moreover, existing design and evaluation frameworks often lack support for the capacity constraints inherent in digital CIM architectures. In this paper, we present CIMFlow, an integrated framework that provides an out-of-the-box workflow for implementing and evaluating DNN workloads on digital CIM architectures. CIMFlow bridges the compilation and simulation infrastructures with a flexible instruction set architecture (ISA) design, and addresses the constraints of digital CIM through advanced partitioning and parallelism strategies in the compilation flow. Our evaluation demonstrates that CIMFlow enables systematic prototyping and optimization of digital CIM architectures across diverse configurations, providing researchers and designers with an accessible platform for extensive design space exploration.

[LG-28] CoCoAFusE: Beyond Mixtures of Experts via Model Fusion

链接: https://arxiv.org/abs/2505.01105
作者: Aurelio Raffa Ugolini,Mara Tanelli,Valentina Breschi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Many learning problems involve multiple patterns and varying degrees of uncertainty dependent on the covariates. Advances in Deep Learning (DL) have addressed these issues by learning highly nonlinear input-output dependencies. However, model interpretability and Uncertainty Quantification (UQ) have often lagged behind. In this context, we introduce the Competitive/Collaborative Fusion of Experts (CoCoAFusE), a novel, Bayesian Covariates-Dependent Modeling technique. CoCoAFusE builds on the very philosophy behind Mixtures of Experts (MoEs), blending predictions from several simple sub-models (or “experts”) to achieve high levels of expressiveness while retaining a substantial degree of local interpretability. Our formulation extends that of a classical Mixture of Experts by contemplating the fusion of the experts’ distributions in addition to their more usual mixing (i.e., superimposition). Through this additional feature, CoCoAFusE better accommodates different scenarios for the intermediate behavior between generating mechanisms, resulting in tighter credible bounds on the response variable. Indeed, resorting only to mixing, as in classical MoEs, may lead to multimodality artifacts, especially over smooth transitions. Instead, CoCoAFusE can avoid these artifacts even under the same structure and priors for the experts, leading to greater expressiveness and flexibility in modeling. This new approach is showcased extensively on a suite of motivating numerical examples and a collection of real-data ones, demonstrating its efficacy in tackling complex regression problems where uncertainty is a key quantity of interest.

[LG-29] Nesterov Method for Asynchronous Pipeline Parallel Optimization

链接: https://arxiv.org/abs/2505.01099
作者: Thalaiyasingam Ajanthan,Sameera Ramasinghe,Yan Zuo,Gil Avraham,Alexander Long
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. To maximize pipeline utilization, asynchronous optimization is appealing as it offers 100% pipeline utilization by construction. However, it is inherently challenging as the weights and gradients are no longer synchronized, leading to stale (or delayed) gradients. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients. Our experiments on large-scale language modelling tasks using decoder-only architectures with up to 1B parameters, demonstrate that our approach significantly outperforms existing asynchronous methods, even surpassing the synchronous baseline.
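
The sketch below shows vanilla Nesterov momentum under a fixed gradient delay of tau steps, the setting the paper analyzes; the extra extrapolation factor in the look-ahead is an illustrative heuristic for compensating staleness, not the paper's exact modification.

```python
import collections
import numpy as np

# Minimal sketch of Nesterov look-ahead under a fixed gradient delay of
# tau steps. The (1 + tau) extrapolation is an illustrative heuristic for
# staleness; the paper derives its own modified look-ahead step.

def grad(x):                       # toy quadratic objective f(x) = ||x||^2 / 2
    return x

tau, mu, lr = 2, 0.9, 0.1
x, v = np.ones(4), np.zeros(4)
queue = collections.deque()        # holds gradients computed tau steps ago

for t in range(100):
    lookahead = x + (1 + tau) * mu * v     # extrapolate past the delay
    queue.append(grad(lookahead))
    if len(queue) > tau:                   # gradient arrives tau steps late
        g = queue.popleft()
        v = mu * v - lr * g
        x = x + v
print(np.linalg.norm(x))
```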

[LG-30] Integration Matters for Learning PDEs with Backwards SDEs

链接: https://arxiv.org/abs/2505.01078
作者: Sungje Park,Stephen Tu
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Backward stochastic differential equation (BSDE)-based deep learning methods provide an alternative to Physics-Informed Neural Networks (PINNs) for solving high-dimensional partial differential equations (PDEs), offering algorithmic advantages in settings such as stochastic optimal control, where the PDEs of interest are tied to an underlying dynamical system. However, existing BSDE-based solvers have empirically been shown to underperform relative to PINNs in the literature. In this paper, we identify the root cause of this performance gap as a discretization bias introduced by the standard Euler-Maruyama (EM) integration scheme applied to short-horizon self-consistency BSDE losses, which shifts the optimization landscape off target. We find that this bias cannot be satisfactorily addressed through finer step sizes or longer self-consistency horizons. To properly handle this issue, we propose a Stratonovich-based BSDE formulation, which we implement with stochastic Heun integration. We show that our proposed approach completely eliminates the bias issues faced by EM integration. Furthermore, our empirical results show that our Heun-based BSDE method consistently outperforms EM-based variants and achieves competitive results with PINNs across multiple high-dimensional benchmarks. Our findings highlight the critical role of integration schemes in BSDE-based PDE solvers, an algorithmic detail that has received little attention thus far in the literature.
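
To make the contrast concrete, the sketch below implements one Euler-Maruyama step and one stochastic Heun (predictor-corrector) step for a scalar SDE dX = b(X)dt + sigma(X)dW; with state-dependent diffusion, the Heun scheme targets the Stratonovich interpretation the paper adopts. Drift and diffusion are toy choices.

```python
import numpy as np

# Minimal sketch contrasting the two integration schemes discussed in the
# paper for dX = b(X) dt + sigma(X) dW. b and sigma are toy choices.

def b(x):      return -x             # drift
def sigma(x):  return 0.5 * x + 1.0  # state-dependent diffusion

def euler_maruyama_step(x, dt, dw):
    return x + b(x) * dt + sigma(x) * dw

def heun_step(x, dt, dw):
    x_pred = x + b(x) * dt + sigma(x) * dw            # predictor
    return x + 0.5 * (b(x) + b(x_pred)) * dt \
             + 0.5 * (sigma(x) + sigma(x_pred)) * dw  # corrector (Stratonovich)

rng = np.random.default_rng(0)
x_em = x_heun = 1.0
dt = 0.01
for _ in range(100):
    dw = rng.normal(0.0, np.sqrt(dt))
    x_em, x_heun = euler_maruyama_step(x_em, dt, dw), heun_step(x_heun, dt, dw)
print(x_em, x_heun)
```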

[LG-31] Federated Adapter on Foundation Models: An Out-Of-Distribution Approach

链接: https://arxiv.org/abs/2505.01075
作者: Yiyuan Yang,Guodong Long,Tianyi Zhou,Qinghua Lu,Shanshan Ye,Jing Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As foundation models gain prominence, Federated Foundation Models (FedFM) have emerged as a privacy-preserving approach to collaboratively fine-tune models in federated learning (FL) frameworks using distributed datasets across clients. A key challenge for FedFM, given the versatile nature of foundation models, is addressing out-of-distribution (OOD) generalization, where unseen tasks or clients may exhibit distribution shifts leading to suboptimal performance. Although numerous studies have explored OOD generalization in conventional FL, these methods are inadequate for FedFM due to the challenges posed by large parameter scales and increased data heterogeneity. To address these, we propose FedOA, which employs adapter-based parameter-efficient fine-tuning methods for efficacy and introduces personalized adapters with feature distance-based regularization to align distributions and guarantee OOD generalization for each client. Theoretically, we demonstrate that the conventional aggregated global model in FedFM inherently retains OOD generalization capabilities, and our proposed method enhances the personalized model’s OOD generalization through regularization informed by the global model, with proven convergence under general non-convex settings. Empirically, the effectiveness of the proposed method is validated on benchmark datasets across various NLP tasks.
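
A minimal sketch of the kind of feature distance-based regularization described, assuming an L2 penalty that pulls a client's personalized adapter features toward the frozen global model's features; the function name, the distance choice, and the weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a feature distance-based regularizer in the spirit of
# FedOA: the client's personalized loss adds a penalty aligning local
# adapter features with the aggregated global model's features, which the
# paper argues retain OOD generalization. Names and lam are assumptions.

def client_loss(local_feats, global_feats, logits, labels, lam=0.1):
    task = F.cross_entropy(logits, labels)
    # align the personalized representation with the (frozen) global one
    align = (local_feats - global_feats.detach()).pow(2).mean()
    return task + lam * align

logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
print(client_loss(torch.randn(8, 16), torch.randn(8, 16), logits, labels))
```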

[LG-32] Monotone Peridynamic Neural Operator for Nonlinear Material Modeling with Conditionally Unique Solutions

链接: https://arxiv.org/abs/2505.01060
作者: Jihong Wang,Xiaochuan Tian,Zhongqiang Zhang,Stewart Silling,Siavash Jafarzadeh,Yue Yu
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Data-driven methods have emerged as powerful tools for modeling the responses of complex nonlinear materials directly from experimental measurements. Among these methods, data-driven constitutive models present advantages in physical interpretability and generalizability across different boundary conditions/domain settings. However, the well-posedness of these learned models is generally not guaranteed a priori, which makes the models prone to non-physical solutions in downstream simulation tasks. In this study, we introduce the monotone peridynamic neural operator (MPNO), a novel data-driven nonlocal constitutive model learning approach based on neural operators. Our approach learns a nonlocal kernel together with a nonlinear constitutive relation, while ensuring solution uniqueness through a monotone gradient network. This architectural constraint on the gradient induces convexity of the learnt energy density function, thereby guaranteeing solution uniqueness of MPNO in small deformation regimes. To validate our approach, we evaluate MPNO’s performance on both synthetic and real-world datasets. On synthetic datasets with manufactured kernel and constitutive relation, we show both theoretically and numerically that the learnt model converges to the ground truth as the measurement grid size decreases. Additionally, our MPNO exhibits superior generalization capabilities compared with conventional neural networks: it yields smaller displacement solution errors in downstream tasks with new and unseen loadings. Finally, we showcase the practical utility of our approach through applications in learning a homogenized model from molecular dynamics data, highlighting its expressivity and robustness in real-world scenarios.

[LG-33] Multi-Step Consistency Models: Fast Generation with Theoretical Guarantees

链接: https://arxiv.org/abs/2505.01049
作者: Nishant Jain,Xunpeng Huang,Yian Ma,Tong Zhang
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 29 pages

点击查看摘要

Abstract:Consistency models have recently emerged as a compelling alternative to traditional SDE-based diffusion models, offering a significant acceleration in generation by producing high-quality samples in very few steps. Despite their empirical success, a proper theoretical justification for their speed-up is still lacking. In this work, we provide the analysis which bridges this gap, showing that given a consistency model which can map the input at a given time to arbitrary timestamps along the reverse trajectory, one can achieve KL divergence of order O(\varepsilon^2) using only O\left(\log\left(\frac{d}{\varepsilon}\right)\right) iterations with constant step size, where d is the data dimension. Additionally, under minimal assumptions on the data distribution (an increasingly common setting in recent diffusion model analyses), we show that a similar KL convergence guarantee can be obtained, with the number of steps scaling as O\left(d \log\left(\frac{d}{\varepsilon}\right)\right). Going further, we also provide a theoretical analysis for the estimation of such consistency models, concluding that accurate learning is feasible using small discretization steps, in both smooth and non-smooth settings. Notably, our results for the non-smooth case yield best-in-class convergence rates compared to existing SDE- or ODE-based analyses under minimal assumptions.

[LG-34] Global Optimality of Single-Timescale Actor-Critic under Continuous State-Action Space: A Study on Linear Quadratic Regulator

链接: https://arxiv.org/abs/2505.01041
作者: Xuyang Chen,Jingliang Duan,Lin Zhao
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2208.08744

点击查看摘要

Abstract:Actor-critic methods have achieved state-of-the-art performance in various challenging tasks. However, theoretical understanding of their performance remains elusive and challenging. Existing studies mostly focus on practically uncommon variants such as double-loop or two-timescale stepsize actor-critic algorithms for simplicity. These results certify local convergence on finite state or action spaces only. We push the boundary to investigate the classic single-sample single-timescale actor-critic on continuous (infinite) state-action space, where we employ the canonical linear quadratic regulator (LQR) problem as a case study. We show that the popular single-timescale actor-critic can attain an \varepsilon-optimal solution with a sample complexity of order \varepsilon^{-2} for solving LQR on the demanding continuous state-action space. Our work provides new insights into the performance of single-timescale actor-critic, which further bridges the gap between theory and practice.

[LG-35] Where’s the liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content CVPR2025

链接: https://arxiv.org/abs/2505.01008
作者: Haoyue Bai,Yiyou Sun,Wei Cheng,Haifeng Chen
类目: Machine Learning (cs.LG)
*备注: CVPR 2025

点击查看摘要

Abstract:The recent proliferation of photorealistic images created by generative models has sparked both excitement and concern, as these images are increasingly indistinguishable from real ones to the human eye. While offering new creative and commercial possibilities, the potential for misuse, such as in misinformation and fraud, highlights the need for effective detection methods. Current detection approaches often rely on access to model weights or require extensive collections of real image datasets, limiting their scalability and practical application in real-world scenarios. In this work, we introduce a novel black-box detection framework that requires only API access, sidestepping the need for model weights or large auxiliary datasets. Our approach leverages a corrupt-and-recover strategy: by masking part of an image and assessing the model’s ability to reconstruct it, we measure the likelihood that the image was generated by the model itself. For black-box models that do not support masked image inputs, we incorporate a cost-efficient surrogate model trained to align with the target model’s distribution, enhancing detection capability. Our framework demonstrates strong performance, outperforming baseline methods by 4.31% in mean average precision across eight diffusion model variant datasets.
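
A minimal sketch of the corrupt-and-recover strategy: mask part of an image, ask the (black-box) model to reconstruct the masked region, and score the image by how well it is recovered; `inpaint_api` is a hypothetical placeholder for the provider's masked-input endpoint or the surrogate model.

```python
import numpy as np

# Minimal sketch: flag an image as model-generated when the model's
# reconstruction error in a randomly masked region is low.
# `inpaint_api` is a hypothetical stand-in for the real API call.

def inpaint_api(image, mask):
    # placeholder: a real system would call the generative model's
    # inpainting endpoint (or a surrogate model) here
    return image * (1 - mask) + image.mean() * mask

def recovery_score(image, mask_frac=0.25, rng=np.random.default_rng(0)):
    mask = (rng.random(image.shape) < mask_frac).astype(image.dtype)
    recon = inpaint_api(image * (1 - mask), mask)
    err = np.mean((recon[mask == 1] - image[mask == 1]) ** 2)
    return -err   # higher score = better recovery = more likely model-generated

image = np.random.default_rng(1).random((64, 64))
print(recovery_score(image))
```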

[LG-36] Accelerating Deep Neural Network Training via Distributed Hybrid Order Optimization

链接: https://arxiv.org/abs/2505.00982
作者: Shunxian Gu,Chaoqun You,Bangbang Ren,Lailong Luo,Junxu Xia,Deke Guo
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Scaling deep neural network (DNN) training to more devices can reduce time-to-solution. However, it is impractical for users with limited computing resources. FOSI, as a hybrid-order optimizer, converges faster than conventional optimizers by taking advantage of both gradient information and curvature information when updating the DNN model. Therefore, it provides a new chance for accelerating DNN training in the resource-constrained setting. In this paper, we explore its distributed design, namely DHO_2, including distributed calculation of curvature information and model update with partial curvature information to accelerate DNN training with a low memory burden. To further reduce the training time, we design a novel strategy to parallelize the calculation of curvature information and the model update on different devices. Experimentally, our distributed design can achieve an approximately linear reduction of the memory burden on each device as the number of devices increases. Meanwhile, it achieves a 1.4x to 2.1x speedup in the total training time compared with other distributed designs based on conventional first- and second-order optimizers.

[LG-37] A Minimax-MDP Framework with Future-imposed Conditions for Learning-augmented Problems

链接: https://arxiv.org/abs/2505.00973
作者: Xin Chen,Yuze Chen,Yuan Zhou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 64 pages, 1 figure

点击查看摘要

Abstract:We study a class of sequential decision-making problems with augmented predictions, potentially provided by a machine learning algorithm. In this setting, the decision-maker receives prediction intervals for unknown parameters that become progressively refined over time, and seeks decisions that are competitive with the hindsight optimal under all possible realizations of both parameters and predictions. We propose a minimax Markov Decision Process (minimax-MDP) framework, where the system state consists of an adversarially evolving environment state and an internal state controlled by the decision-maker. We introduce a set of future-imposed conditions that characterize the feasibility of minimax-MDPs and enable the design of efficient, often closed-form, robustly competitive policies. We illustrate the framework through three applications: multi-period inventory ordering with refining demand predictions, resource allocation with uncertain utility functions, and a multi-phase extension of the minimax-MDP applied to the inventory problem with time-varying ordering costs. Our results provide a tractable and versatile approach to robust online decision-making under predictive uncertainty.

[LG-38] Adaptive Branch-and-Bound Tree Exploration for Neural Network Verification

链接: https://arxiv.org/abs/2505.00963
作者: Kota Fukuda,Guanqin Zhang,Zhenya Zhang,Yulei Sui,Jianjun Zhao
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:Formal verification is a rigorous approach that can provably ensure the quality of neural networks, and to date, Branch and Bound (BaB) is the state-of-the-art that performs verification by splitting the problem as needed and applying off-the-shelf verifiers to sub-problems for improved performance. However, existing BaB may not be efficient, due to its naive way of exploring the space of sub-problems, which ignores the importance of different sub-problems. To bridge this gap, we first introduce a notion of “importance” that reflects how likely a counterexample can be found within a sub-problem, and then we devise a novel verification approach, called ABONN, that explores the sub-problem space of BaB adaptively, in a Monte-Carlo tree search (MCTS) style. The exploration is guided by the “importance” of different sub-problems, so it favors the sub-problems that are more likely to yield counterexamples. As soon as it finds a counterexample, it can immediately terminate; even if it finds none, it can still verify the problem after visiting all the sub-problems. We evaluate ABONN with 552 verification problems from commonly-used datasets and neural network models, and compare it with state-of-the-art verifiers as baseline approaches. Experimental evaluation shows that ABONN demonstrates speedups of up to 15.2x on MNIST and 24.7x on CIFAR-10. We further study the influence of hyperparameters on the performance of ABONN, and the effectiveness of our adaptive tree exploration.
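
The sketch below captures the core loop of importance-guided branch-and-bound on a toy 1-D property: sub-problems are popped from a priority queue ordered by an importance score rather than in naive order, so likely counterexamples are reached first. The verifier, splitter, and scoring function are illustrative placeholders for the off-the-shelf verifier and ABONN's MCTS-style statistics.

```python
import heapq

# Minimal sketch of importance-guided branch-and-bound: explore
# sub-problems in order of an "importance" score instead of FIFO.

def explore(root, verify, split, importance):
    heap = [(-importance(root), 0, root)]   # max-heap via negated scores
    counter = 1
    while heap:
        _, _, prob = heapq.heappop(heap)
        status, cex = verify(prob)           # "safe", "unsafe", or "unknown"
        if status == "unsafe":
            return "counterexample", cex     # terminate immediately
        if status == "unknown":              # refine: split and push children
            for child in split(prob):
                heapq.heappush(heap, (-importance(child), counter, child))
                counter += 1
    return "verified", None                  # all sub-problems proven safe

# toy usage: verify f(x) = x^2 - x < 0.3 on [0, 1] with a loose interval bound
def verify(iv):
    lo, hi = iv
    if hi * hi - lo < 0.3:                   # crude upper bound on f over iv
        return "safe", None
    mid = (lo + hi) / 2
    if mid * mid - mid >= 0.3:               # concrete counterexample found
        return "unsafe", mid
    return "unknown", None

split = lambda iv: [(iv[0], sum(iv) / 2), (sum(iv) / 2, iv[1])]
importance = lambda iv: iv[1] * iv[1] - iv[0]   # looser bound = more promising
print(explore((0.0, 1.0), verify, split, importance))  # -> ('verified', None)
```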

[LG-39] Enhancing User Sequence Modeling through Barlow Twins-based Self-Supervised Learning

链接: https://arxiv.org/abs/2505.00953
作者: Yuhan Liu,Lin Ning,Neo Wu,Karan Singhal,Philip Andrew Mansfield,Devora Berlowitz,Sushant Prakash,Bradley Green
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:User sequence modeling is crucial for modern large-scale recommendation systems, as it enables the extraction of informative representations of users and items from their historical interactions. These user representations are widely used for a variety of downstream tasks to enhance users’ online experience. A key challenge for learning these representations is the lack of labeled training data. While self-supervised learning (SSL) methods have emerged as a promising solution for learning representations from unlabeled data, many existing approaches rely on extensive negative sampling, which can be computationally expensive and may not always be feasible in real-world scenarios. In this work, we propose an adaptation of Barlow Twins, a state-of-the-art SSL method, to user sequence modeling by incorporating suitable augmentation methods. Our approach aims to mitigate the need for large negative sample batches, enabling effective representation learning with smaller batch sizes and limited labeled data. We evaluate our method on the MovieLens-1M, MovieLens-20M, and Yelp datasets, demonstrating that our method consistently outperforms the widely-used dual encoder model across three downstream tasks, achieving an 8%-20% improvement in accuracy. Our findings underscore the effectiveness of our approach in extracting valuable sequence-level information for user modeling, particularly in scenarios where labeled data is scarce and negative examples are limited.
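
For reference, a minimal sketch of the Barlow Twins objective as it would be applied to two augmented views of the same user-history embedding: the loss drives the cross-correlation matrix of the two views toward the identity, requiring no negative sampling. The sequence augmentations themselves (e.g., item masking) are assumed to happen upstream.

```python
import torch

# Minimal sketch of the Barlow Twins loss: embeddings of two augmented
# views of the same user history should yield an identity cross-correlation
# matrix (invariance on the diagonal, decorrelation off the diagonal).

def barlow_twins_loss(z1, z2, lam=5e-3):
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)    # standardize per dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                             # d x d cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()  # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy
    return on_diag + lam * off_diag

z1, z2 = torch.randn(32, 64), torch.randn(32, 64)  # two views of 32 users
print(barlow_twins_loss(z1, z2))
```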

[LG-40] Preserving Privacy and Utility in LLM -Based Product Recommendations

链接: https://arxiv.org/abs/2505.00951
作者: Tina Khezresmaeilzadeh,Jiang Zhang,Dimitrios Andreadis,Konstantinos Psounis
类目: Information Retrieval (cs.IR); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based recommendation systems leverage powerful language models to generate personalized suggestions by processing user interactions and preferences. Unlike traditional recommendation systems that rely on structured data and collaborative filtering, LLM-based models process textual and contextual information, often using cloud-based infrastructure. This raises privacy concerns, as user data is transmitted to remote servers, increasing the risk of exposure and reducing control over personal information. To address this, we propose a hybrid privacy-preserving recommendation framework which separates sensitive from nonsensitive data and only shares the latter with the cloud to harness LLM-powered recommendations. To restore lost recommendations related to obfuscated sensitive data, we design a de-obfuscation module that reconstructs sensitive recommendations locally. Experiments on real-world e-commerce datasets show that our framework achieves almost the same recommendation utility as a system which shares all data with an LLM, while preserving privacy to a large extent. Compared to obfuscation-only techniques, our approach improves HR@10 scores and category distribution alignment, offering a better balance between privacy and recommendation quality. Furthermore, our method runs efficiently on consumer-grade hardware, making privacy-aware LLM-based recommendation systems practical for real-world use.

[LG-41] Addressing Noise and Stochasticity in Fraud Detection for Service Networks

链接: https://arxiv.org/abs/2505.00946
作者: Wenxin Zhang,Ding Xu,Xi Xuan,Lei Jiang,Guangzhen Yao,Renda Han,Xiangxiang Lang,Cuicui Luo
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Fraud detection is crucial in social service networks to maintain user trust and improve service network security. Existing spectral graph-based methods address this challenge by leveraging different graph filters to capture signals with different frequencies in service networks. However, most graph filter-based methods struggle with deriving clean and discriminative graph signals. On the one hand, they overlook the noise in the information propagation process, resulting in degradation of filtering ability. On the other hand, they fail to discriminate the frequency-specific characteristics of graph signals, leading to distortion of signals fusion. To address these issues, we develop a novel spectral graph network based on information bottleneck theory (SGNN-IB) for fraud detection in service networks. SGNN-IB splits the original graph into homophilic and heterophilic subgraphs to better capture the signals at different frequencies. For the first limitation, SGNN-IB applies information bottleneck theory to extract key characteristics of encoded representations. For the second limitation, SGNN-IB introduces prototype learning to implement signal fusion, preserving the frequency-specific characteristics of signals. Extensive experiments on three real-world datasets demonstrate that SGNN-IB outperforms state-of-the-art fraud detection methods.

[LG-42] FreCT: Frequency-augmented Convolutional Transformer for Robust Time Series Anomaly Detection

链接: https://arxiv.org/abs/2505.00941
作者: Wenxin Zhang,Ding Xu,Guangzhen Yao,Xiaojian Lin,Renxiang Guan,Chengze Du,Renda Han,Xi Xuan,Cuicui Luo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series anomaly detection is critical for system monitoring and risk identification, across various domains, such as finance and healthcare. However, for most reconstruction-based approaches, detecting anomalies remains a challenge due to the complexity of sequential patterns in time series data. On the one hand, reconstruction-based techniques are susceptible to computational deviation stemming from anomalies, which can lead to impure representations of normal sequence patterns. On the other hand, they often focus on the time-domain dependencies of time series, while ignoring the alignment of frequency information beyond the time domain. To address these challenges, we propose a novel Frequency-augmented Convolutional Transformer (FreCT). FreCT utilizes patch operations to generate contrastive views and employs an improved Transformer architecture integrated with a convolution module to capture long-term dependencies while preserving local topology information. The introduced frequency analysis based on Fourier transformation could enhance the model’s ability to capture crucial characteristics beyond the time domain. To protect the training quality from anomalies and improve the robustness, FreCT deploys stop-gradient Kullback-Leibler (KL) divergence and absolute error to optimize consistency information in both time and frequency domains. Extensive experiments on four public datasets demonstrate that FreCT outperforms existing methods in identifying anomalies.

[LG-43] StablePCA: Learning Shared Representations across Multiple Sources via Minimax Optimization

链接: https://arxiv.org/abs/2505.00940
作者: Zhenyu Wang,Molei Liu,Jing Lei,Francis Bach,Zijian Guo
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:When synthesizing multisource high-dimensional data, a key objective is to extract low-dimensional feature representations that effectively approximate the original features across different sources. Such general feature extraction facilitates the discovery of transferable knowledge, mitigates systematic biases such as batch effects, and promotes fairness. In this paper, we propose Stable Principal Component Analysis (StablePCA), a novel method for group distributionally robust learning of latent representations from high-dimensional multi-source data. A primary challenge in generalizing PCA to the multi-source regime lies in the nonconvexity of the fixed rank constraint, rendering the minimax optimization nonconvex. To address this challenge, we employ the Fantope relaxation, reformulating the problem as a convex minimax optimization, with the objective defined as the maximum loss across sources. To solve the relaxed formulation, we devise an optimistic-gradient Mirror Prox algorithm with explicit closed-form updates. Theoretically, we establish the global convergence of the Mirror Prox algorithm, with the convergence rate provided from the optimization perspective. Furthermore, we offer practical criteria to assess how closely the solution approximates the original nonconvex formulation. Through extensive numerical experiments, we demonstrate StablePCA’s high accuracy and efficiency in extracting robust low-dimensional representations across various finite-sample scenarios.

[LG-44] unnElQNN: A Hybrid Quantum-classical Neural Network for Efficient Learning

链接: https://arxiv.org/abs/2505.00933
作者: A. H. Abbas
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Quantum Physics (quant-ph)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Hybrid quantum-classical neural networks (HQCNNs) represent a promising frontier in machine learning, leveraging the complementary strengths of both models. In this work, we propose the development of TunnElQNN, a non-sequential architecture composed of alternating classical and quantum layers. Within the classical component, we employ the Tunnelling Diode Activation Function (TDAF), inspired by the I-V characteristics of quantum tunnelling. We evaluate the performance of this hybrid model on a synthetic dataset of interleaving half-circles for multi-class classification tasks with varying degrees of class overlap. The model is compared against a baseline hybrid architecture that uses the conventional ReLU activation function (ReLUQNN). Our results show that the TunnElQNN model consistently outperforms the ReLUQNN counterpart. Furthermore, we analyse the decision boundaries generated by TunnElQNN under different levels of class overlap and compare them to those produced by a neural network implementing TDAF within a fully classical architecture. These findings highlight the potential of integrating physics-inspired activation functions with quantum components to enhance the expressiveness and robustness of hybrid quantum-classical machine learning architectures.

[LG-45] Robust Root Cause Diagnosis using In-Distribution Interventions ICLR-25

链接: https://arxiv.org/abs/2505.00930
作者: Lokesh Nagalapatti,Ashutosh Srivastava,Sunita Sarawagi,Amit Sharma
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR-25

点击查看摘要

Abstract:Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today’s cloud services and industrial operations. We propose In-Distribution Interventions (IDI), a novel algorithm that predicts root cause as nodes that meet two criteria: 1) Anomaly: root cause nodes should take on anomalous values; 2) Fix: had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by solely probing the fitted SCM at in-distribution inputs. We present a theoretical analysis comparing and bounding the errors in assessing the fix condition using interventional and counterfactual estimates. We then conduct experiments by systematically varying the SCM’s complexity to demonstrate the cases where IDI’s interventional approach outperforms the counterfactual approach and vice versa. Experiments on both synthetic and PetShop RCD benchmark datasets demonstrate that IDI consistently identifies true root causes more accurately and robustly than nine existing state-of-the-art RCD baselines. Code is released at this https URL.
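
A minimal sketch of IDI's two checks on a toy fitted linear SCM over a chain X1 → X2 → X3: a node is a root-cause candidate if its observed value is anomalous and an in-distribution intervention on it de-anomalizes the target. Mechanisms, observed values, and thresholds are all illustrative.

```python
import numpy as np

# Minimal sketch of the anomaly and fix conditions on a toy linear SCM.
# Intervening holds upstream nodes at observed values and recomputes
# descendants through the fitted mechanisms (illustrative weights).

parents = {0: [], 1: [0], 2: [1]}
weights = {1: 0.8, 2: 1.2}                    # fitted linear mechanisms

observed = np.array([0.1, 5.0, 6.0])          # X2 carries an anomalous value
target, z_thresh = 2, 3.0

def intervene(node, value):
    x = observed.copy()
    x[node] = value                            # do(X_node = value)
    for i in range(node + 1, len(x)):          # recompute descendants only
        x[i] = sum(weights[i] * x[p] for p in parents[i])
    return x

for node in range(target):
    anomalous = abs(observed[node]) > z_thresh         # anomaly condition
    fixed = intervene(node, 0.0)                       # in-distribution probe
    fixes_target = abs(fixed[target]) <= z_thresh      # fix condition
    if anomalous and fixes_target:
        print(f"root-cause candidate: X{node + 1}")    # -> X2
```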

[LG-46] Compact Recurrent Transformer with Persistent Memory

链接: https://arxiv.org/abs/2505.00929
作者: Edison Mucllari,Zachary Daniels,David Zhang,Qiang Ye
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The Transformer architecture has shown significant success in many language processing and visual tasks. However, the method faces challenges in efficiently scaling to long sequences because the self-attention computation is quadratic with respect to the input length. To overcome this limitation, several approaches scale to longer sequences by breaking long sequences into a series of segments, restricting self-attention to local dependencies between tokens within each segment and using a memory mechanism to manage information flow between segments. However, these approaches generally introduce additional compute overhead that restricts them from being used in applications where limited compute, memory, and power are of great concern (such as edge computing). We propose a novel and efficient Compact Recurrent Transformer (CRT), which combines shallow Transformer models that process short local segments with recurrent neural networks to compress and manage a single persistent memory vector that summarizes long-range global information between segments. We evaluate CRT on WordPTB and WikiText-103 for next-token-prediction tasks, as well as on the Toyota Smarthome video dataset for classification. CRT achieves comparable or superior prediction results to full-length Transformers in the language datasets while using significantly shorter segments (half or quarter size) and substantially reduced FLOPs. Our approach also demonstrates state-of-the-art performance on the Toyota Smarthome video dataset.
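
A minimal architectural sketch of the idea, assuming a shallow Transformer for local segments, a GRU cell maintaining the single persistent memory vector, and mean pooling for the segment summary; sizes and the pooling choice are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch: a shallow Transformer encodes each short segment while a
# GRU cell compresses segment summaries into one persistent memory vector
# that is prepended to the next segment.

class CompactRecurrentTransformer(nn.Module):
    def __init__(self, d=64, nhead=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead, batch_first=True)
        self.local = nn.TransformerEncoder(layer, num_layers=2)
        self.memory_rnn = nn.GRUCell(d, d)

    def forward(self, segments):                 # list of (B, L, d) tensors
        B, d = segments[0].shape[0], segments[0].shape[-1]
        mem = torch.zeros(B, d)
        outputs = []
        for seg in segments:
            inp = torch.cat([mem.unsqueeze(1), seg], dim=1)   # prepend memory
            h = self.local(inp)
            outputs.append(h[:, 1:])                          # segment outputs
            mem = self.memory_rnn(h[:, 1:].mean(dim=1), mem)  # update memory
        return outputs, mem

model = CompactRecurrentTransformer()
segs = [torch.randn(2, 16, 64) for _ in range(3)]  # three short segments
outs, mem = model(segs)
print(outs[-1].shape, mem.shape)  # torch.Size([2, 16, 64]) torch.Size([2, 64])
```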

[LG-47] Gaussian Process Policy Iteration with Additive Schwarz Acceleration for Forward and Inverse HJB and Mean Field Game Problems

链接: https://arxiv.org/abs/2505.00909
作者: Xianjin Yang,Jingguo Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We propose a Gaussian Process (GP)-based policy iteration framework for addressing both forward and inverse problems in Hamilton–Jacobi–Bellman (HJB) equations and mean field games (MFGs). Policy iteration is formulated as an alternating procedure between solving the value function under a fixed control policy and updating the policy based on the resulting value function. By exploiting the linear structure of GPs for function approximation, each policy evaluation step admits an explicit closed-form solution, eliminating the need for numerical optimization. To improve convergence, we incorporate the additive Schwarz acceleration as a preconditioning step following each policy update. Numerical experiments demonstrate the effectiveness of Schwarz acceleration in improving computational efficiency.

[LG-48] Learning Neural Control Barrier Functions from Offline Data with Conservatism

链接: https://arxiv.org/abs/2505.00908
作者: Ihab Tabbara,Hussein Sibai
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Safety filters, particularly those based on control barrier functions, have gained increased interest as effective tools for safe control of dynamical systems. Existing correct-by-construction synthesis algorithms, however, suffer from the curse of dimensionality. Deep learning approaches have been proposed in recent years to address this challenge. In this paper, we contribute to this line of work by proposing an algorithm for training control barrier functions from offline datasets. Our algorithm trains the filter to not only prevent the system from reaching unsafe states but also out-of-distribution ones, at which the filter would be unreliable. It is inspired by Conservative Q-learning, an offline reinforcement learning algorithm. We call its outputs Conservative Control Barrier Functions (CCBFs). Our empirical results demonstrate that CCBFs outperform existing methods in maintaining safety and out-of-distribution avoidance while minimally affecting task performance.
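
A rough, CQL-inspired sketch of what training such a Conservative Control Barrier Function from offline transitions could look like: keep B positive on safe states, negative on unsafe ones, enforce the discrete decrease condition along dataset transitions, and push B down on randomly sampled (likely out-of-distribution) states; margins, weights, and the exact conservative term are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a conservative CBF training loss from offline data.
# The conservative term treating random (likely OOD) states as unsafe is
# an illustrative assumption in the spirit of Conservative Q-learning.

def ccbf_loss(B, x, x_next, unsafe_mask, x_ood, alpha=0.1, margin=0.1):
    b, b_next, b_ood = B(x), B(x_next), B(x_ood)
    safe = F.relu(margin - b[~unsafe_mask]).mean()       # B >= margin on safe
    unsafe = F.relu(margin + b[unsafe_mask]).mean()      # B <= -margin on unsafe
    decrease = F.relu(-(b_next - b) - alpha * b).mean()  # B(x') - B(x) >= -a B(x)
    conservative = F.relu(margin + b_ood).mean()         # treat OOD as unsafe
    return safe + unsafe + decrease + conservative

net = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
B = lambda x: net(x).squeeze(-1)
x = torch.randn(16, 4)
unsafe_mask = torch.arange(16) < 3                       # first 3 states unsafe
print(ccbf_loss(B, x, x + 0.01 * torch.randn(16, 4), unsafe_mask,
                torch.randn(16, 4)))
```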

[LG-49] Protocol-agnostic and Data-free Backdoor Attacks on Pre-trained Models in RF Fingerprinting

链接: https://arxiv.org/abs/2505.00881
作者: Tianya Zhao,Ningning Wang,Junqing Zhang,Xuyu Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 10 pages, 7 figures, accepted by IEEE INFOCOM 2025

点击查看摘要

Abstract:While supervised deep neural networks (DNNs) have proven effective for device authentication via radio frequency (RF) fingerprinting, they are hindered by domain shift issues and the scarcity of labeled data. The success of large language models has led to increased interest in unsupervised pre-trained models (PTMs), which offer better generalization and do not require labeled datasets, potentially addressing the issues mentioned above. However, the inherent vulnerabilities of PTMs in RF fingerprinting remain insufficiently explored. In this paper, we thoroughly investigate data-free backdoor attacks on such PTMs in RF fingerprinting, focusing on a practical scenario where attackers lack access to downstream data, label information, and training processes. To realize the backdoor attack, we carefully design a set of triggers and predefined output representations (PORs) for the PTMs. By mapping triggers and PORs through backdoor training, we can implant backdoor behaviors into the PTMs, thereby introducing vulnerabilities across different downstream RF fingerprinting tasks without requiring prior knowledge. Extensive experiments demonstrate the wide applicability of our proposed attack to various input domains, protocols, and PTMs. Furthermore, we explore potential detection and defense methods, demonstrating the difficulty of fully safeguarding against our proposed backdoor attack.

[LG-50] IberFire – a detailed creation of a spatio-temporal dataset for wildfire risk assessment in Spain

链接: https://arxiv.org/abs/2505.00837
作者: Julen Ercibengoa,Meritxell Gómez-Omella,Izaro Goienetxea
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wildfires pose a critical environmental issue to ecosystems, economies, and public safety, particularly in Mediterranean regions such as Spain. Accurate predictive models rely on high-resolution spatio-temporal data to capture the complex interplay of environmental and anthropogenic factors. To address the lack of localised and fine-grained datasets in Spain, this work introduces IberFire, a spatio-temporal datacube at 1 km x 1 km x 1-day resolution covering mainland Spain and the Balearic Islands from December 2007 to December 2024. IberFire integrates 260 features across eight main categories: auxiliary features, fire history, geography, topography, meteorology, vegetation indices, human activity, and land cover. All features are derived from open-access sources, ensuring transparency and real-time applicability. The data processing pipeline was implemented entirely using open-source tools, and the codebase has been made publicly available. This work not only enhances spatio-temporal granularity and feature diversity compared to existing European datacubes but also provides a reproducible methodology for constructing similar datasets. IberFire supports advanced wildfire risk modelling through Machine Learning (ML) and Deep Learning (DL) techniques, enables climate pattern analysis and informs strategic planning in fire prevention and land management. The dataset is publicly available on Zenodo to promote open research and collaboration.

[LG-51] Intersectional Divergence: Measuring Fairness in Regression

链接: https://arxiv.org/abs/2505.00830
作者: Joe Germino,Nuno Moniz,Nitesh V. Chawla
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Research on fairness in machine learning has been mainly framed in the context of classification tasks, leaving critical gaps in regression. In this paper, we propose a seminal approach to measure intersectional fairness in regression tasks, going beyond the focus on single protected attributes from existing work to consider combinations of all protected attributes. Furthermore, we contend that it is insufficient to measure the average error of groups without regard for imbalanced domain preferences. To this end, we propose Intersectional Divergence (ID) as the first fairness measure for regression tasks that 1) describes fair model behavior across multiple protected attributes and 2) differentiates the impact of predictions in target ranges most relevant to users. We extend our proposal demonstrating how ID can be adapted into a loss function, IDLoss, and used in optimization problems. Through an extensive experimental evaluation, we demonstrate how ID allows unique insights into model behavior and fairness, and how incorporating IDLoss into optimization can considerably improve single-attribute and intersectional model fairness while maintaining a competitive balance in predictive performance.

[LG-52] Data-Driven Optical To Thermal Inference in Pool Boiling Using Generative Adversarial Networks

链接: https://arxiv.org/abs/2505.00823
作者: Qianxi Fu,Youngjoon Suh,Xiaojing Zhang,Yoonjin Won
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 17 pages, 5 figures, supplemental information

点击查看摘要

Abstract:Phase change plays a critical role in thermal management systems, yet quantitative characterization of multiphase heat transfer remains limited by the challenges of measuring temperature fields in chaotic, rapidly evolving flow regimes. While computational methods offer spatiotemporal resolution in idealized cases, replicating complex experimental conditions remains prohibitively difficult. Here, we present a data-driven framework that leverages a conditional generative adversarial network (CGAN) to infer temperature fields from geometric phase contours in a canonical pool boiling configuration where advanced data collection techniques are restricted. Using high-speed imaging data and simulation-informed training, our model demonstrates the ability to reconstruct temperature fields with errors below 6%. We further show that standard data augmentation strategies are effective in enhancing both accuracy and physical plausibility of the predicted maps across both simulation and experimental datasets when precise physical constraints are not applicable. Our results highlight the potential of deep generative models to bridge the gap between observable multiphase phenomena and underlying thermal transport, offering a powerful approach to augment and interpret experimental measurements in complex two-phase systems.

[LG-53] Dual Filter: A Mathematical Framework for Inference using Transformer-like Architectures

链接: https://arxiv.org/abs/2505.00818
作者: Heng-Sheng Chang,Prashant G. Mehta
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Probability (math.PR)
*备注: 49 pages, 6 figures

点击查看摘要

Abstract:This paper presents a mathematical framework for causal nonlinear prediction in settings where observations are generated from an underlying hidden Markov model (HMM). Both the problem formulation and the proposed solution are motivated by the decoder-only transformer architecture, in which a finite sequence of observations (tokens) is mapped to the conditional probability of the next token. Our objective is not to construct a mathematical model of a transformer. Rather, our interest lies in deriving, from first principles, transformer-like architectures that solve the prediction problem for which the transformer is designed. The proposed framework is based on an original optimal control approach, where the prediction objective (MMSE) is reformulated as an optimal control problem. An analysis of the optimal control problem is presented, leading to a fixed-point equation on the space of probability measures. To solve the fixed-point equation, we introduce the dual filter, an iterative algorithm that closely parallels the architecture of decoder-only transformers. These parallels are discussed in detail along with the relationship to prior work on mathematical modeling of transformers as transport on the space of probability measures. Numerical experiments are provided to illustrate the performance of the algorithm using parameter values from research-scale transformer models.

[LG-54] Aggregating empirical evidence from data strategy studies: a case on model quantization

链接: https://arxiv.org/abs/2505.00816
作者: Santiago del Rey,Paulo Sérgio Medeiros dos Santos,Guilherme Horta Travassos,Xavier Franch,Silverio Martínez-Fernández
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 11 pages, 3 figures, submitted to the 19th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)

点击查看摘要

Abstract:Background: As empirical software engineering evolves, more studies adopt data strategies - approaches that investigate digital artifacts such as models, source code, or system logs rather than relying on human subjects. Synthesizing results from such studies introduces new methodological challenges. Aims: This study assesses the effects of model quantization on correctness and resource efficiency in deep learning (DL) systems. Additionally, it explores the methodological implications of aggregating evidence from empirical studies that adopt data strategies. Method: We conducted a research synthesis of six primary studies that empirically evaluate model quantization. We applied the Structured Synthesis Method (SSM) to aggregate the findings, which combines qualitative and quantitative evidence through diagrammatic modeling. A total of 19 evidence models were extracted and aggregated. Results: The aggregated evidence indicates that model quantization weakly negatively affects correctness metrics while consistently improving resource efficiency metrics, including storage size, inference latency, and GPU energy consumption - a manageable trade-off for many DL deployment contexts. Evidence across quantization techniques remains fragmented, underscoring the need for more focused empirical studies per technique. Conclusions: Model quantization offers substantial efficiency benefits with minor trade-offs in correctness, making it a suitable optimization strategy for resource-constrained environments. This study also demonstrates the feasibility of using SSM to synthesize findings from data strategy-based research.

[LG-55] Scalable Unit Harmonization in Medical Informatics Using Bi-directional Transformers and Bayesian-Optimized BM25 and Sentence Embedding Retrieval

链接: https://arxiv.org/abs/2505.00810
作者: Jordi de la Torre
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objective: To develop and evaluate a scalable methodology for harmonizing inconsistent units in large-scale clinical datasets, addressing a key barrier to data interoperability. Materials and Methods: We designed a novel unit harmonization system combining BM25, sentence embeddings, Bayesian optimization, and a bidirectional transformer based binary classifier for retrieving and matching laboratory test entries. The system was evaluated using the Optum Clinformatics Datamart dataset (7.5 billion entries). We implemented a multi-stage pipeline: filtering, identification, harmonization proposal generation, automated re-ranking, and manual validation. Performance was assessed using Mean Reciprocal Rank (MRR) and other standard information retrieval metrics. Results: Our hybrid retrieval approach combining BM25 and sentence embeddings (MRR: 0.8833) significantly outperformed both lexical-only (MRR: 0.7985) and embedding-only (MRR: 0.5277) approaches. The transformer-based reranker further improved performance (absolute MRR improvement: 0.10), bringing the final system MRR to 0.9833. The system achieved 83.39% precision at rank 1 and 94.66% recall at rank 5. Discussion: The hybrid architecture effectively leverages the complementary strengths of lexical and semantic approaches. The reranker addresses cases where initial retrieval components make errors due to complex semantic relationships in medical terminology. Conclusion: Our framework provides an efficient, scalable solution for unit harmonization in clinical datasets, reducing manual effort while improving accuracy. Once harmonized, data can be reused seamlessly in different analyses, ensuring consistency across healthcare systems and enabling more reliable multi-institutional studies and meta-analyses.
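
A minimal sketch of the hybrid retrieval stage, blending min-max-normalized BM25 scores with embedding cosine similarities; the hashing `embed` function is a placeholder for the paper's sentence encoder, and the fixed blend weight stands in for the Bayesian-optimized weighting.

```python
import numpy as np
from rank_bm25 import BM25Okapi   # pip install rank-bm25

# Minimal sketch of hybrid lexical + semantic scoring over a tiny corpus
# of lab-test descriptions. The embed() function is a hashing placeholder
# for a real sentence encoder; alpha stands in for a tuned blend weight.

corpus = ["hemoglobin g/dl blood", "glucose mg/dl serum", "sodium mmol/l"]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def embed(text, dim=32):
    # placeholder embedding, stable within one process run
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def hybrid_scores(query, alpha=0.6):
    lex = np.array(bm25.get_scores(query.split()))
    lex = (lex - lex.min()) / (lex.max() - lex.min() + 1e-9)  # min-max scale
    q = embed(query)
    sem = np.array([embed(doc) @ q for doc in corpus])        # cosine (unit norm)
    return alpha * lex + (1 - alpha) * sem

print(hybrid_scores("hemoglobin g/dl"))
```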

[LG-56] Improving Routing in Sparse Mixture of Experts with Graph of Tokens

Link: https://arxiv.org/abs/2505.00792
Authors: Tam Nguyen, Ngoc N. Tran, Khai Nguyen, Richard G. Baraniuk
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 5 figures, 10 tables

Abstract:Sparse Mixture of Experts (SMoE) has emerged as a key to achieving unprecedented scalability in deep learning. By activating only a small subset of parameters per sample, SMoE achieves an exponential increase in parameter counts while maintaining a constant computational overhead. However, SMoE models are susceptible to routing fluctuations (changes in the routing of a given input to its target expert) at the late stage of model training, leading to model non-robustness. In this work, we unveil this limitation of SMoE through the perspective of probabilistic graphical models (PGMs). Through this PGM framework, we highlight the independence in the expert selection of tokens, which exposes the model to routing fluctuation and non-robustness. To alleviate this independence, we propose the novel Similarity-Aware (S)MoE, which considers interactions between tokens during expert selection. We then derive a new PGM underlying an (S)MoE-Attention block, going beyond just a single (S)MoE layer. Leveraging the token similarities captured by the attention matrix, we propose the innovative Attention-Aware (S)MoE, which employs the attention matrix to guide the routing of tokens to appropriate experts in (S)MoE. We theoretically prove that Similarity/Attention-Aware routing helps reduce the entropy of expert selection, resulting in more stable token routing mechanisms. We empirically validate our models on various tasks and domains, showing significant improvements in reducing routing fluctuations, enhancing accuracy, and increasing model robustness over the baseline MoE-Transformer with token routing via softmax gating.
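For intuition on how a similarity matrix can stabilize routing, here is a minimal numpy sketch of top-k softmax gating with similarity-smoothed logits. The shapes, the smoothing step, and the gating weights are illustrative assumptions, not the authors' implementation:

```python
# Sketch: standard SMoE routing picks experts per token independently via
# softmax gating; a row-normalized token-similarity matrix (an attention
# matrix, in the Attention-Aware variant) can smooth the gating logits so
# that similar tokens are routed consistently.
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, n_experts, top_k = 8, 16, 4, 2
X = rng.normal(size=(n_tokens, d_model))          # token representations
W_gate = rng.normal(size=(d_model, n_experts))    # router weights

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

logits = X @ W_gate                                # (n_tokens, n_experts)
A = softmax(X @ X.T / np.sqrt(d_model), axis=-1)   # token-similarity matrix
smoothed = A @ logits                              # tokens share routing evidence
gates = softmax(smoothed)
experts = np.argsort(-gates, axis=-1)[:, :top_k]   # top-k expert choice per token
print(experts)
```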

[LG-57] Uncertainty-aware Latent Safety Filters for Avoiding Out-of-Distribution Failures

Link: https://arxiv.org/abs/2505.00779
Authors: Junwon Seo, Kensuke Nakamura, Andrea Bajcsy
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Recent advances in generative world models have enabled classical safe control methods, such as Hamilton-Jacobi (HJ) reachability, to generalize to complex robotic systems operating directly from high-dimensional sensor observations. However, obtaining comprehensive coverage of all safety-critical scenarios during world model training is extremely challenging. As a result, latent safety filters built on top of these models may miss novel hazards and even fail to prevent known ones, overconfidently misclassifying risky out-of-distribution (OOD) situations as safe. To address this, we introduce an uncertainty-aware latent safety filter that proactively steers robots away from both known and unseen failures. Our key idea is to use the world model's epistemic uncertainty as a proxy for identifying unseen potential hazards. We propose a principled method to detect OOD world model predictions by calibrating an uncertainty threshold via conformal prediction. By performing reachability analysis in an augmented state space, spanning both the latent representation and the epistemic uncertainty, we synthesize a latent safety filter that can reliably safeguard arbitrary policies from both known and unseen safety hazards. In simulation and hardware experiments on vision-based control tasks with a Franka manipulator, we show that our uncertainty-aware safety filter preemptively detects potential unsafe scenarios and reliably proposes safe, in-distribution actions. Video results can be found on the project website at this https URL
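The conformal calibration step mentioned above is easy to illustrate in isolation. The sketch below uses synthetic uncertainty values and the standard split-conformal quantile with a finite-sample correction; the gamma-distributed data and the miscoverage level alpha are illustrative assumptions:

```python
# Sketch of the conformal step only: calibrating an OOD threshold on the
# world model's epistemic uncertainty so that in-distribution inputs exceed
# it with probability at most alpha.
import numpy as np

rng = np.random.default_rng(2)
calib_uncert = rng.gamma(2.0, 1.0, size=500)   # uncertainties on calibration data
alpha = 0.05
n = len(calib_uncert)
# Split-conformal quantile with finite-sample correction.
q_level = np.ceil((n + 1) * (1 - alpha)) / n
threshold = np.quantile(calib_uncert, min(q_level, 1.0), method="higher")

test_uncert = rng.gamma(2.0, 1.0, size=10)
is_ood = test_uncert > threshold               # flag for the safety filter
print(threshold, is_ood)
```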

[LG-58] Primality Testing via Circulant Matrix Eigenvalue Structure: A Novel Approach Using Cyclotomic Field Theory

Link: https://arxiv.org/abs/2505.00730
Authors: Marius-Constantin Dinu
Subjects: Symbolic Computation (cs.SC); Machine Learning (cs.LG)
Comments: 27 pages, 5 figures, 2 tables; This paper was created with AI assistance using Symbia Engine from SymbolicAI Framework [Dinu et al.]; Repository: this https URL

Abstract:This paper presents a novel primality test based on the eigenvalue structure of circulant matrices constructed from roots of unity. We prove that an integer $n > 2$ is prime if and only if the minimal polynomial of the circulant matrix $C_n = W_n + W_n^2$ has exactly two irreducible factors over $\mathbb{Q}$. This characterization connects cyclotomic field theory with matrix algebra, providing both theoretical insights and practical applications. We demonstrate that the eigenvalue patterns of these matrices reveal fundamental distinctions between prime and composite numbers, leading to a deterministic primality test. Our approach leverages the relationship between primitive roots of unity, Galois theory, and the factorization of cyclotomic polynomials. We provide comprehensive experimental validation across various ranges of integers, discuss practical implementation considerations, and analyze the computational complexity of our method in comparison with established primality tests. The visual interpretation of our mathematical framework provides intuitive understanding of the algebraic structures that distinguish prime numbers. Our experimental validation demonstrates that our approach offers a deterministic alternative to existing methods, with performance characteristics reflecting its algebraic foundations.
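The stated criterion can be checked directly with a computer algebra system. Since the minimal polynomial of a matrix has the same irreducible factors as its characteristic polynomial (just without multiplicities), the sketch below counts distinct irreducible factors of charpoly(C_n) over Q using sympy; this is our reading of the abstract, not the authors' reference implementation:

```python
# Sketch of the criterion: n > 2 is prime iff charpoly(C_n) has exactly two
# distinct irreducible factors over Q, where C_n = W_n + W_n^2 and W_n is the
# cyclic shift matrix.
from sympy import symbols, factor_list, zeros

def is_prime_circulant(n: int) -> bool:
    W = zeros(n, n)                     # cyclic shift matrix W_n
    for i in range(n):
        W[i, (i + 1) % n] = 1
    C = W + W**2                        # C_n = W_n + W_n^2
    x = symbols("x")
    p = C.charpoly(x).as_expr()
    _, factors = factor_list(p, x)      # irreducible factorization over Q
    return len(factors) == 2            # criterion stated for n > 2

for n in range(3, 16):
    print(n, is_prime_circulant(n))
```

For n = 3 the characteristic polynomial is (x - 2)(x + 1)^2, two distinct irreducible factors, while for n = 4 it is (x - 2) x (x^2 + 2x + 2), three factors, matching the prime/composite split.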

[LG-59] Negative Stepsizes Make Gradient-Descent-Ascent Converge

Link: https://arxiv.org/abs/2505.01423
Authors: Henry Shugart, Jason M. Altschuler
Subjects: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:Efficient computation of min-max problems is a central question in optimization, learning, games, and controls. Arguably the most natural algorithm is gradient-descent-ascent (GDA). However, since the 1970s, conventional wisdom has argued that GDA fails to converge even on simple problems. This failure spurred an extensive literature on modifying GDA with additional building blocks such as extragradients, optimism, momentum, anchoring, etc. In contrast, we show that GDA converges in its original form by simply using a judicious choice of stepsizes. The key innovation is the proposal of unconventional stepsize schedules (dubbed slingshot stepsize schedules) that are time-varying, asymmetric, and periodically negative. We show that all three properties are necessary for convergence, and that altogether this enables GDA to converge on the classical counterexamples (e.g., unconstrained convex-concave problems). All of our results apply to the last iterate of GDA, as is typically desired in practice. The core algorithmic intuition is that although negative stepsizes make backward progress, they de-synchronize the min and max variables (overcoming the cycling issue of GDA), and lead to a slingshot phenomenon in which the forward progress in the other iterations is overwhelmingly larger. This results in fast overall convergence. Geometrically, the slingshot dynamics leverage the non-reversibility of gradient flow: positive/negative steps cancel to first order, yielding a second-order net movement in a new direction that leads to convergence and is otherwise impossible for GDA to move in. We interpret this as a second-order finite-differencing algorithm and show that, intriguingly, it approximately implements consensus optimization, an empirically popular algorithm for min-max problems involving deep neural networks (e.g., training GANs).
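To see the flavor of the result, consider GDA on the classic counterexample f(x, y) = xy. The sketch below compares a constant stepsize, which spirals outward, with a hand-picked 2-periodic schedule that is time-varying, asymmetric, and periodically negative; the resulting 2-step update matrix has spectral radius sqrt(0.8) < 1, so the iterates contract. This particular schedule is our own toy construction in the spirit of the paper, not the authors' slingshot schedule:

```python
# GDA on f(x, y) = x*y with simultaneous updates:
#   x <- x - eta_x * y   (grad_x f = y)
#   y <- y + eta_y * x   (grad_y f = x)
import numpy as np

def gda(schedule, steps=200):
    x, y = 1.0, 1.0
    for t in range(steps):
        ax, ay = schedule[t % len(schedule)]
        x, y = x - ax * y, y + ay * x   # tuple assignment = simultaneous update
    return np.hypot(x, y)

constant = [(0.1, 0.1)]                  # det of each step matrix is 1.01 > 1: diverges
slingshot = [(3.0, 1.0), (-0.8, 1.0)]    # time-varying, asymmetric, periodically negative
print("constant stepsize   ||(x,y)||:", gda(constant))
print("slingshot schedule  ||(x,y)||:", gda(slingshot))
```

One can verify by hand that the product of the two per-step update matrices here has trace -0.2 and determinant 0.8, hence complex eigenvalues of modulus sqrt(0.8), which is the contraction driving the printed norm toward zero.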

[LG-60] Provable Efficiency of Guidance in Diffusion Models for General Data Distribution

Link: https://arxiv.org/abs/2505.01382
Authors: Gen Li, Yuchen Jiao
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:Diffusion models have emerged as a powerful framework for generative modeling, with guidance techniques playing a crucial role in enhancing sample quality. Despite their empirical success, a comprehensive theoretical understanding of the guidance effect remains limited. Existing studies focus only on special cases, where the distribution conditioned on each class is either isotropic Gaussian or supported on a one-dimensional interval with some extra conditions. How to analyze the guidance effect beyond these cases remains an open question. Towards closing this gap, we analyze diffusion guidance under general data distributions. Rather than demonstrating uniform sample quality improvement, which does not hold for some distributions, we prove that guidance can improve overall sample quality, in the sense that the average reciprocal of the classifier probability decreases when guidance is applied. This aligns with the motivation for introducing guidance.
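For readers unfamiliar with the setup, classifier guidance is commonly written as a tilted score, and the paper's quality criterion can be read as a bound on the average reciprocal classifier probability. The LaTeX below is one standard formalization of this background, not notation taken from the paper:

```latex
% Classifier guidance with strength gamma > 0 tilts the score toward class c:
\nabla_x \log p_t^{\,\gamma}(x \mid c)
  = \nabla_x \log p_t(x) + \gamma \, \nabla_x \log p_t(c \mid x).
% One reading of the paper's claim: under the guided distribution, the average
% reciprocal classifier probability does not increase relative to no guidance,
\mathbb{E}_{x \sim p^{\gamma}}\!\left[\, p(c \mid x)^{-1} \right]
  \le \mathbb{E}_{x \sim p^{0}}\!\left[\, p(c \mid x)^{-1} \right].
```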

[LG-61] How much to Dereverberate? Low-Latency Single-Channel Speech Enhancement in Distant Microphone Scenarios ICASSP2025

Link: https://arxiv.org/abs/2505.01338
Authors: Satvik Venkatesh, Philip Coleman, Arthur Benilov, Simon Brown, Selim Sheta, Frederic Roskam
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Published in ICASSP 2025

Abstract:Dereverberation is an important sub-task of Speech Enhancement (SE) that improves the signal's intelligibility and quality. However, it remains challenging because the reverberation is highly correlated with the signal. Furthermore, the single-channel SE literature has predominantly focused on rooms with short reverberation times (typically under 1 second), smaller rooms (volumes under 1000 cubic meters), and relatively short distances (up to 2 meters). In this paper, we explore real-time, low-latency, single-channel SE under distant-microphone scenarios, such as 5 to 10 meters, and focus on conference rooms and theatres, with larger room dimensions and longer reverberation times. Such a setup is useful for applications such as lecture demonstrations and drama, and to enhance stage acoustics. First, we show that single-channel SE in such challenging scenarios is feasible. Second, we investigate the relationship between room volume and reverberation time, and demonstrate its importance when randomly simulating room impulse responses. Lastly, we show that for dereverberation with short decay times, preserving early reflections before decaying the transfer function of the room improves overall signal quality.
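Sabine's classical formula gives a quick feel for why volume and reverberation time must be modeled jointly when simulating room impulse responses. The numbers below are made up for illustration:

```python
# Sabine's formula: RT60 = 0.161 * V / A, with room volume V in cubic meters
# and total absorption A in square-meter sabins. Larger halls with modest
# absorption reverberate far longer than small rooms.
def rt60_sabine(volume_m3: float, absorption_sabins: float) -> float:
    return 0.161 * volume_m3 / absorption_sabins

print(rt60_sabine(200.0, 40.0))     # small meeting room: ~0.8 s
print(rt60_sabine(8000.0, 600.0))   # theatre-sized hall: ~2.1 s
```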

[LG-62] A Provably Convergent Plug-and-Play Framework for Stochastic Bilevel Optimization

Link: https://arxiv.org/abs/2505.01258
Authors: Tianshu Chu, Dachuan Xu, Wei Yao, Chengming Yu, Jin Zhang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:Bilevel optimization has recently attracted significant attention in machine learning due to its wide range of applications and advanced hierarchical optimization capabilities. In this paper, we propose a plug-and-play framework, named PnPBO, for developing and analyzing stochastic bilevel optimization methods. This framework integrates both modern unbiased and biased stochastic estimators into the single-loop bilevel optimization framework introduced in [9], with several improvements. In the implementation of PnPBO, all stochastic estimators for different variables can be independently incorporated, and an additional moving average technique is applied when using an unbiased estimator for the upper-level variable. In the theoretical analysis, we provide a unified convergence and complexity analysis for PnPBO, demonstrating that the adaptation of various stochastic estimators (including PAGE, ZeroSARAH, and mixed strategies) within the PnPBO framework achieves optimal sample complexity, comparable to that of single-level optimization. This resolves the open question of whether the optimal complexity bounds for solving bilevel optimization are identical to those for single-level optimization. Finally, we empirically validate our framework, demonstrating its effectiveness on several benchmark problems and confirming our theoretical findings.
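To make the single-loop structure concrete, here is a toy sketch in the spirit of such plug-and-play frameworks (not the authors' algorithm): one noisy lower-level step, one hypergradient estimate, and a moving average on the upper-level update, applied to a quadratic bilevel problem whose solution is x = y = 0:

```python
# Toy single-loop stochastic bilevel sketch.
# Lower level: y*(x) = argmin_y 0.5*(y - x)^2, so dy*/dx = 1.
# Upper level:  F(x) = 0.5*x^2 + 0.5*y*(x)^2, minimized at x = 0.
import numpy as np

rng = np.random.default_rng(3)
x, y = 3.0, 0.0
alpha, beta, ema = 0.05, 0.5, 0.9
g_avg = 0.0  # moving average of the upper-level gradient estimate

for t in range(500):
    noise = rng.normal(scale=0.1, size=2)         # stochastic gradient noise
    y -= beta * ((y - x) + noise[0])              # lower-level SGD step
    hypergrad = x + 1.0 * y + noise[1]            # dF/dx = x + (dy*/dx) * y
    g_avg = ema * g_avg + (1 - ema) * hypergrad   # moving-average stabilization
    x -= alpha * g_avg                            # upper-level step

print(x, y)  # both should end up near 0
```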

[LG-63] Gaussian Differential Private Bootstrap by Subsampling

Link: https://arxiv.org/abs/2505.01197
Authors: Holger Dette, Carina Graw
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)

Abstract:Bootstrap is a common tool for quantifying uncertainty in data analysis. However, besides the additional computational cost of applying the bootstrap to massive data, a challenging problem in bootstrap-based inference under Differential Privacy is that it requires repeated access to the data. As a consequence, bootstrap-based differentially private inference requires a significant increase of the privacy budget, which on the other hand comes with a substantial loss in statistical accuracy. A potential solution to reconcile the conflicting goals of statistical accuracy and privacy is to analyze the data under parametric model assumptions, and in the last decade several parametric bootstrap methods for inference under privacy have been investigated. However, uncertainty quantification by parametric bootstrap is only valid if the quantities of interest can be identified as the parameters of a statistical model and the imposed model assumptions are (at least approximately) satisfied. An alternative to parametric methods is the empirical bootstrap, a widely used tool for non-parametric inference that is well studied in the non-private regime. Under privacy, however, less insight is available. In this paper, we propose a private empirical m out of n bootstrap and validate its consistency and privacy guarantees under Gaussian Differential Privacy. Compared to the private n out of n bootstrap, our approach has several advantages. First, it comes with lower computational costs, in particular for massive data. Second, the proposed procedure needs less additional noise in the bootstrap iterations, which leads to improved statistical accuracy while asymptotically guaranteeing the same level of privacy. Third, we demonstrate much better finite sample properties compared to the currently available procedures.
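Schematically, the proposed procedure resamples m of n points per replicate and privatizes each statistic with Gaussian noise. The sketch below shows only this resampling skeleton; the noise scale is a rough placeholder, and the paper's exact GDP calibration and m-out-of-n rescaling arguments are omitted:

```python
# Schematic m-out-of-n bootstrap with Gaussian noise added per replicate.
import numpy as np

rng = np.random.default_rng(4)
data = rng.uniform(0, 1, size=10_000)     # bounded data in [0, 1]
n = len(data)
m = int(n ** 0.7)                          # m out of n, with m << n
B = 200                                    # bootstrap replicates
sigma = 1.0 / m                            # placeholder scale: the sensitivity of a
                                           # bounded mean over m points is 1/m

boot_means = []
for _ in range(B):
    sub = rng.choice(data, size=m, replace=False)   # subsample without replacement
    boot_means.append(sub.mean() + rng.normal(scale=sigma))

lo, hi = np.quantile(boot_means, [0.025, 0.975])
print(f"95% bootstrap interval for the mean: [{lo:.4f}, {hi:.4f}]")
```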

[LG-64] A flexible Bayesian non-parametric mixture model reveals multiple dependencies of swap errors in visual working memory

Link: https://arxiv.org/abs/2505.01178
Authors: Puria Radmard, Paul M. Bays, Máté Lengyel
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)

Abstract:Human behavioural data in psychophysics has been used to elucidate the underlying mechanisms of many cognitive processes, such as attention, sensorimotor integration, and perceptual decision making. Visual working memory has particularly benefited from this approach: analyses of VWM errors have proven crucial for understanding VWM capacity and coding schemes, in turn constraining neural models of both. One poorly understood class of VWM errors are swap errors, whereby participants recall an uncued item from memory. Swap errors could arise from erroneous memory encoding, noisy storage, or errors at retrieval time - previous research has mostly implicated the latter two. However, these studies made strong a priori assumptions on the detailed mechanisms and/or parametric form of errors contributed by these sources. Here, we pursue a data-driven approach instead, introducing a Bayesian non-parametric mixture model of swap errors (BNS) which provides a flexible descriptive model of swapping behaviour, such that swaps are allowed to depend on both the probed and reported features of every stimulus item. We fit BNS to the trial-by-trial behaviour of human participants and show that it recapitulates the strong dependence of swaps on cue similarity in multiple datasets. Critically, BNS reveals that this dependence coexists with a non-monotonic modulation in the report feature dimension for a random dot motion direction-cued, location-reported dataset. The form of the modulation inferred by BNS opens new questions about the importance of memory encoding in causing swap errors in VWM, a distinct source to the previously suggested binding and cueing errors. Our analyses, combining qualitative comparisons of the highly interpretable BNS parameter structure with rigorous quantitative model comparison and recovery methods, show that previous interpretations of swap errors may have been incomplete.

[LG-65] On Simulating Thin-Film Processes at the Atomic Scale Using Machine Learned Force Fields

Link: https://arxiv.org/abs/2505.01118
Authors: S. Kondati Natarajan, J. Schneider, N. Pandey, J. Wellendorff, S. Smidstrup
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 35 pages, 18 figures

Abstract:Atomistic modeling of thin-film processes provides an avenue not only for discovering key chemical mechanisms of the processes but also for extracting quantitative metrics on the events and reactions taking place at the gas-surface interface. Molecular dynamics (MD) is a powerful computational method to study the evolution of a process at the atomic scale, but studies of industrially relevant processes usually require suitable force fields, which are in general not available for all processes of interest. However, machine learned force fields (MLFFs) are conquering the field of computational materials and surface science. In this paper, we demonstrate how to efficiently build MLFFs suitable for process simulations and provide two examples for technologically relevant processes: the precursor pulse in the atomic layer deposition of HfO2 and the atomic layer etching of MoS2.

[LG-66] Characterization and Learning of Causal Graphs from Hard Interventions

Link: https://arxiv.org/abs/2505.01037
Authors: Zihan Zhou, Muhammad Qasim Elahi, Murat Kocaoglu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:A fundamental challenge in the empirical sciences involves uncovering causal structure through observation and experimentation. Causal discovery entails linking the conditional independence (CI) invariances in observational data to their corresponding graphical constraints via d-separation. In this paper, we consider a general setting where we have access to data from multiple experimental distributions resulting from hard interventions, as well as potentially from an observational distribution. By comparing different interventional distributions, we propose a set of graphical constraints that are fundamentally linked to Pearl’s do-calculus within the framework of hard interventions. These graphical constraints associate each graphical structure with a set of interventional distributions that are consistent with the rules of do-calculus. We characterize the interventional equivalence class of causal graphs with latent variables and introduce a graphical representation that can be used to determine whether two causal graphs are interventionally equivalent, i.e., whether they are associated with the same family of hard interventional distributions, where the elements of the family are indistinguishable using the invariances from do-calculus. We also propose a learning algorithm to integrate multiple datasets from hard interventions, introducing new orientation rules. The learning objective is a tuple of augmented graphs which entails a set of causal graphs. We also prove the soundness of the proposed algorithm.

[LG-67] Quantum Support Vector Regression for Robust Anomaly Detection

Link: https://arxiv.org/abs/2505.01012
Authors: Kilian Tscharke, Maximilian Wendlinger, Sebastian Issel, Pascal Debus
Subjects: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Submitted to IEEE International Conference on Quantum Computing and Engineering (QCE) 2025

Abstract:Anomaly Detection (AD) is critical in data analysis, particularly within the domain of IT security. In recent years, Machine Learning (ML) algorithms have emerged as a powerful tool for AD in large-scale data. In this study, we explore the potential of quantum ML approaches, specifically quantum kernel methods, for robust AD. We build upon previous work on Quantum Support Vector Regression (QSVR) for semisupervised AD by conducting a comprehensive benchmark on IBM quantum hardware using eleven datasets. Our results demonstrate that QSVR achieves strong classification performance and even outperforms the noiseless simulation on two of these datasets. Moreover, we investigate the influence of quantum noise, which is unavoidable in the NISQ era, on the performance of the QSVR. Our findings reveal that the model exhibits robustness to depolarizing, phase damping, phase flip, and bit flip noise, while amplitude damping and miscalibration noise prove to be more disruptive. Finally, we explore the domain of Quantum Adversarial Machine Learning and demonstrate that QSVR is highly vulnerable to adversarial attacks and that noise does not improve the adversarial robustness of the model.
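As a classical stand-in, kernel-based anomaly detection with a precomputed kernel matrix shows where a quantum kernel would plug in: only the computation of K changes. For brevity the sketch uses One-Class SVM rather than the paper's SVR formulation, an RBF kernel in place of the quantum kernel, and synthetic data:

```python
# Kernel anomaly detection with a precomputed kernel matrix. A quantum kernel
# method would only change how K is computed (on quantum hardware); here an
# RBF kernel takes its place.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(5)
X_train = rng.normal(0, 1, size=(200, 4))            # normal training samples
X_test = np.vstack([rng.normal(0, 1, size=(20, 4)),  # normal
                    rng.normal(4, 1, size=(5, 4))])  # anomalous

K_train = rbf_kernel(X_train, X_train)               # a quantum kernel would go here
K_test = rbf_kernel(X_test, X_train)

model = OneClassSVM(kernel="precomputed", nu=0.05).fit(K_train)
print(model.predict(K_test))                         # +1 = normal, -1 = anomaly
```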

[LG-68] DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects

Link: https://arxiv.org/abs/2505.00961
Authors: Shu Tamano, Masanori Nojima
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Off-policy evaluation (OPE) and off-policy learning (OPL) for contextual bandit policies leverage historical data to evaluate and optimize a target policy. Most existing OPE/OPL methods, based on importance weighting or imputation, assume common support between the target and logging policies. When this assumption is violated, these methods typically require unstable extrapolation, truncation, or conservative strategies for individuals outside the common support. However, such approaches can be inadequate in settings where explicit evaluation or optimization for such individuals is required. To address this issue, we propose DOLCE: Decomposing Off-policy evaluation/learning into Lagged and Current Effects, a novel estimator that leverages contextual information from multiple time points to decompose rewards into lagged and current effects. By incorporating both past and present contexts, DOLCE effectively handles individuals who violate the common support assumption. We show that the proposed estimator is unbiased under two assumptions: local correctness and conditional independence. Our experiments demonstrate that DOLCE achieves substantial improvements in OPE and OPL, particularly as the proportion of individuals outside the common support increases.

[LG-69] On the emergence of numerical instabilities in Next Generation Reservoir Computing

Link: https://arxiv.org/abs/2505.00846
Authors: Edmilson Roque dos Santos, Erik Bollt
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 21 pages, 8 figures

Abstract:Next Generation Reservoir Computing (NGRC) is a low-cost machine learning method for forecasting chaotic time series from data. However, ensuring the dynamical stability of NGRC models during autonomous prediction remains a challenge. In this work, we uncover a key connection between the numerical conditioning of the NGRC feature matrix – formed by polynomial evaluations on time-delay coordinates – and the long-term NGRC dynamics. Merging tools from numerical linear algebra and ergodic theory of dynamical systems, we systematically study how the feature matrix conditioning varies across hyperparameters. We demonstrate that the NGRC feature matrix tends to be ill-conditioned for short time lags and high-degree polynomials. Ill-conditioning amplifies sensitivity to training data perturbations, which can produce unstable NGRC dynamics. We evaluate the impact of different numerical algorithms (Cholesky, SVD, and LU) for solving the regularized least-squares problem.
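The central object, the NGRC feature matrix of polynomial functions of time-delay coordinates, and its condition number are easy to reproduce. The sketch below builds such a matrix from a logistic-map series (a stand-in for the flows studied in the paper) and prints how np.linalg.cond varies with polynomial degree and time lag; the specific hyperparameter values are illustrative:

```python
# NGRC-style feature matrix: rows are time points, columns are monomials of
# the k delayed coordinates up to a given degree.
import numpy as np
from itertools import combinations_with_replacement

def feature_matrix(x, k, lag, degree):
    T = len(x) - (k - 1) * lag
    delays = np.column_stack([x[i * lag : i * lag + T] for i in range(k)])
    cols = [np.ones(T)]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(k), d):
            cols.append(np.prod(delays[:, list(idx)], axis=1))
    return np.column_stack(cols)

# Chaotic scalar series from the logistic map.
x = np.empty(2000); x[0] = 0.4
for t in range(1999):
    x[t + 1] = 4.0 * x[t] * (1.0 - x[t])

for degree in (1, 2, 3, 4):
    for lag in (1, 5):
        Phi = feature_matrix(x, k=3, lag=lag, degree=degree)
        print(f"degree={degree} lag={lag} cond={np.linalg.cond(Phi):.2e}")
```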

[LG-70] Multi-site modelling and reconstruction of past extreme skew surges along the French Atlantic coast

Link: https://arxiv.org/abs/2505.00835
Authors: Nathan Huet, Philippe Naveau, Anne Sabourin
Subjects: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Appropriate modelling of extreme skew surges is crucial, particularly for coastal risk management. Our study focuses on modelling extreme skew surges along the French Atlantic coast, with a particular emphasis on investigating the extremal dependence structure between stations. We employ the peak-over-threshold framework, where a multivariate extreme event is defined whenever at least one location records a large value, though not necessarily all stations simultaneously. A novel method for determining an appropriate level (threshold) above which observations can be classified as extreme is proposed. Two complementary approaches are explored. First, the multivariate generalized Pareto distribution is employed to model extremes, leveraging its properties to derive a generative model that predicts extreme skew surges at one station based on observed extremes at nearby stations. Second, a novel extreme regression framework is assessed for point predictions. This specific regression framework enables accurate point predictions using only the “angle” of input variables, i.e. input variables divided by their norms. The ultimate objective is to reconstruct historical skew surge time series at stations with limited data. This is achieved by integrating extreme skew surge data from stations with longer records, such as Brest and Saint-Nazaire, which provide over 150 years of observations.
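The univariate building block of any such analysis is the peaks-over-threshold fit. The sketch below picks a high empirical quantile as the threshold, fits a generalized Pareto distribution to the excesses with scipy, and reads off a return level; the synthetic Gumbel data and the fixed 98% threshold are illustrative, and the paper's multivariate GPD machinery and threshold-selection method go well beyond this step:

```python
# Univariate peaks-over-threshold sketch at a single station.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(6)
skew_surge = rng.gumbel(0.0, 0.3, size=20_000)     # synthetic skew surges (m)

u = np.quantile(skew_surge, 0.98)                  # high threshold
excesses = skew_surge[skew_surge > u] - u
shape, _, scale = genpareto.fit(excesses, floc=0)  # GPD fit, location fixed at 0

p_u = np.mean(skew_surge > u)                      # threshold exceedance probability
N = 10_000                                         # return period (in observations)
x_N = u + genpareto.ppf(1 - 1.0 / (N * p_u), shape, scale=scale)
print(f"u = {u:.3f} m, {N}-observation return level = {x_N:.3f} m")
```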

[LG-71] Q-Learning with Clustered-SMART (cSMART) Data: Examining Moderators in the Construction of Clustered Adaptive Interventions

Link: https://arxiv.org/abs/2505.00822
Authors: Yao Song, Kelly Speth, Amy Kilbourne, Andrew Quanbeck, Daniel Almirall, Lu Wang
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:A clustered adaptive intervention (cAI) is a pre-specified sequence of decision rules that guides practitioners on how best, and based on which measures, to tailor cluster-level intervention to improve outcomes at the level of individuals within the clusters. A clustered sequential multiple assignment randomized trial (cSMART) is a type of trial that is used to inform the empirical development of a cAI. The most common type of secondary aim in a cSMART focuses on assessing causal effect moderation by candidate tailoring variables. We introduce a clustered Q-learning framework with the M-out-of-N Cluster Bootstrap using data from a cSMART to evaluate whether a set of candidate tailoring variables may be useful in defining an optimal cAI. This approach constructs confidence intervals (CIs) with near-nominal coverage to assess parameters indexing the causal effect moderation function. Specifically, it allows reliable inferences concerning the utility of candidate tailoring variables in constructing a cAI that maximizes a mean end-of-study outcome, even when "non-regularity", a well-known challenge, exists. Simulations demonstrate the numerical performance of the proposed method across varying non-regularity conditions and investigate the impact of a varying number of clusters and of the intra-cluster correlation coefficient on CI coverage. The methods are applied to the ADEPT dataset to inform the construction of a clinic-level cAI for improving evidence-based practice in treating mood disorders.

[LG-72] Dynamical System Parameter Path Optimization using Persistent Homology

Link: https://arxiv.org/abs/2505.00782
Authors: Max M. Chumley, Firas A. Khasawneh
Subjects: Dynamical Systems (math.DS); Machine Learning (cs.LG); Algebraic Topology (math.AT)
Comments: 18 pages, 24 figures

Abstract:Nonlinear dynamical systems are complex and typically only simple systems can be analytically studied. In applications, these systems are usually defined with a set of tunable parameters and as the parameters are varied the system response undergoes significant topological changes or bifurcations. In a high dimensional parameter space, it is difficult to determine which direction to vary the system parameters to achieve a desired system response or state. In this paper, we introduce a new approach for optimally navigating a dynamical system parameter space that is rooted in topological data analysis. Specifically we use the differentiability of persistence diagrams to define a topological language for intuitively promoting or deterring different topological features in the state space response of a dynamical system and use gradient descent to optimally move from one point in the parameter space to another. The end result is a path in this space that guides the system to a set of parameters that yield the desired topological features defined by the loss function. We show a number of examples by applying the methods to different dynamical systems and scenarios to demonstrate how to promote different features and how to choose the hyperparameters to achieve different outcomes.

[LG-73] JFlow: Model-Independent Spherical Jeans Analysis using Equivariant Continuous Normalizing Flows

Link: https://arxiv.org/abs/2505.00763
Authors: Sung Hak Lim, Kohei Hayashi, Shun'ichi Horigome, Shigeki Matsumoto, Mihoko M. Nojiri
Subjects: Astrophysics of Galaxies (astro-ph.GA); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
Comments: 9 pages, 3 figures, 1 table

Abstract:The kinematics of stars in dwarf spheroidal galaxies have been studied to understand the structure of dark matter halos. However, the kinematic information of these stars is often limited to celestial positions and line-of-sight velocities, making full phase space analysis challenging. Conventional methods rely on projected analytic phase space density models with several parameters and infer dark matter halo structures by solving the spherical Jeans equation. In this paper, we introduce an unsupervised machine learning method for solving the spherical Jeans equation in a model-independent way as a first step toward model-independent analysis of dwarf spheroidal galaxies. Using equivariant continuous normalizing flows, we demonstrate that spherically symmetric stellar phase space densities and velocity dispersions can be estimated without model assumptions. As a proof of concept, we apply our method to Gaia challenge datasets for spherical models and measure dark matter mass densities given velocity anisotropy profiles. Our method can identify halo structures accurately, even with a small number of tracer stars.

[LG-74] XeMap: Contextual Referring in Large-Scale Remote Sensing Environments

Link: https://arxiv.org/abs/2505.00738
Authors: Yuxi Li, Lu Si, Yujie Hou, Chengaung Liu, Bin Li, Hongjian Fang, Jun Zhang
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures

Abstract:Advancements in remote sensing (RS) imagery have provided high-resolution detail and vast coverage, yet existing methods, such as image-level captioning/retrieval and object-level detection/segmentation, often fail to capture mid-scale semantic entities essential for interpreting large-scale scenes. To address this, we propose the conteXtual referring Map (XeMap) task, which focuses on contextual, fine-grained localization of text-referred regions in large-scale RS scenes. Unlike traditional approaches, XeMap enables precise mapping of mid-scale semantic entities that are often overlooked in image-level or object-level methods. To achieve this, we introduce XeMap-Network, a novel architecture designed to handle the complexities of pixel-level cross-modal contextual referring mapping in RS. The network includes a fusion layer that applies self- and cross-attention mechanisms to enhance the interaction between text and image embeddings. Furthermore, we propose a Hierarchical Multi-Scale Semantic Alignment (HMSA) module that aligns multiscale visual features with the text semantic vector, enabling precise multimodal matching across large-scale RS imagery. To support the XeMap task, we provide a novel annotated dataset, XeMap-set, specifically tailored to this task, addressing the lack of such datasets in RS imagery. XeMap-Network is evaluated in a zero-shot setting against state-of-the-art methods, demonstrating superior performance. This highlights its effectiveness in accurately mapping referring regions and providing valuable insights for interpreting large-scale RS environments.

Information Retrieval

[IR-0] Multi-agents based User Values Mining for Recommendation

Link: https://arxiv.org/abs/2505.00981
Authors: Lijian Chen, Wei Yuan, Tong Chen, Xiangyu Zhao, Nguyen Quoc Viet Hung, Hongzhi Yin
Subjects: Information Retrieval (cs.IR)

Abstract:Recommender systems have rapidly evolved and become integral to many online services. However, existing systems sometimes produce unstable and unsatisfactory recommendations that fail to align with users’ fundamental and long-term preferences. This is because they primarily focus on extracting shallow and short-term interests from user behavior data, which is inherently dynamic and challenging to model. Unlike these transient interests, user values are more stable and play a crucial role in shaping user behaviors, such as purchasing items and consuming content. Incorporating user values into recommender systems can help stabilize recommendation performance and ensure results better reflect users’ latent preferences. However, acquiring user values is typically difficult and costly. To address this challenge, we leverage the strong language understanding, zero-shot inference, and generalization capabilities of Large Language Models (LLMs) to extract user values from users’ historical interactions. Unfortunately, direct extraction using LLMs presents several challenges such as length constraints and hallucination. To overcome these issues, we propose ZOOM, a zero-shot multi-LLM collaborative framework for effective and accurate user value extraction. In ZOOM, we apply text summarization techniques to condense item content while preserving essential meaning. To mitigate hallucinations, ZOOM introduces two specialized agent roles: evaluators and supervisors, to collaboratively generate accurate user values. Extensive experiments on two widely used recommendation datasets with two state-of-the-art recommendation models demonstrate the effectiveness and generalization of our framework in automatic user value mining and recommendation performance improvement.
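Structurally, the described pipeline is a summarize, extract, evaluate, supervise loop over LLM calls. The sketch below stubs out the LLM with a placeholder function to show only the control flow; every prompt and the llm function itself are hypothetical, not the authors' framework:

```python
# Structural sketch of a ZOOM-style multi-agent value-mining pipeline.
def llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return f"<response to: {prompt[:40]}...>"

def mine_user_values(item_texts: list[str]) -> str:
    # 1) Condense item content to fit context limits while keeping meaning.
    summaries = [llm(f"Summarize this item briefly: {t}") for t in item_texts]
    history = " | ".join(summaries)
    # 2) Candidate extraction from the interaction history.
    candidate = llm(f"Infer this user's stable values from: {history}")
    # 3) Evaluator agents score the candidate to catch hallucinations.
    votes = [llm(f"Rate 1-5 how well '{candidate}' is supported by: {history}")
             for _ in range(3)]
    # 4) A supervisor agent reconciles the votes into the final answer.
    return llm(f"Given candidate '{candidate}' and votes {votes}, "
               f"output the final user values.")

print(mine_user_values(["A sci-fi novel...", "A hiking backpack..."]))
```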

Attachment Download

Click to download today's complete paper list