This post contains the latest paper list retrieved from Arxiv.org on 2025-05-02, updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.

Table of Contents

Overview (2025-05-02)

A total of 357 papers were updated today, including:

  • Natural Language Processing: 78 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 114 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 62 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 112 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

[Quick Read]: This paper targets the open problem of combining chain-of-thought (CoT) reasoning with reinforcement learning (RL) to improve visual generation, which remains largely unexplored in text-to-image generation. The key is T2I-R1, a reasoning-enhanced generation model built on a bi-level CoT process: the BiCoT-GRPO framework, with an ensemble of generation rewards, jointly optimizes a semantic-level CoT (high-level planning of the prompt) and a token-level CoT (low-level pixel processing during patch-by-patch generation) within the same training step, substantially improving generation quality and consistency.

Link: https://arxiv.org/abs/2505.00703
Authors: Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
Institutions: CUHK MMLab; CUHK MiuLar Lab; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: this https URL

[NLP-1] Steering Large Language Models with Register Analysis for Arbitrary Style Transfer

[Quick Read]: This paper addresses example-based arbitrary style transfer, i.e., rewriting an input text to match the style of a given exemplar, which remains challenging for large language models (LLMs). The core difficulty is how to describe the exemplar's style so as to guide LLMs toward high-quality rewrites. The key is a prompting method based on register analysis, which strengthens style transfer while preserving the original meaning more effectively.

Link: https://arxiv.org/abs/2505.00679
Authors: Xinchen Yang, Marine Carpuat
Institutions: University of Maryland, College Park
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in rewriting text across various styles. However, effectively leveraging this ability for example-based arbitrary style transfer, where an input text is rewritten to match the style of a given exemplar, remains an open challenge. A key question is how to describe the style of the exemplar to guide LLMs toward high-quality rewrites. In this work, we propose a prompting method based on register analysis to guide LLMs to perform this task. Empirical evaluations across multiple style transfer tasks show that our prompting approach enhances style transfer strength while preserving meaning more effectively than existing prompting strategies.

[NLP-2] Rethinking Memory in AI: Taxonomy, Operations, Topics and Future Directions

[Quick Read]: This paper addresses the underexplored study of memory mechanisms in AI systems, especially LLM-based agents: existing literature focuses on memory applications while overlooking the atomic operations underlying memory dynamics. The key is to categorize memory representations into parametric, contextual structured, and contextual unstructured types, and to introduce six fundamental memory operations: consolidation, updating, indexing, forgetting, retrieval, and compression. By systematically mapping these operations to research topics such as long-term memory, long-context memory, parametric modification, and multi-source memory, the survey reframes memory systems through the lens of atomic operations and representation types, providing a structured and dynamic analytical framework for the field.

Link: https://arxiv.org/abs/2505.00675
Authors: Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan
Institutions: The Chinese University of Hong Kong; The University of Edinburgh; HKUST; Poisson Lab, CSI, Huawei UK R&D Ltd.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Memory is a fundamental component of AI systems, underpinning large language model (LLM) based agents. While prior surveys have focused on memory applications with LLMs, they often overlook the atomic operations that underlie memory dynamics. In this survey, we first categorize memory representations into parametric, contextual structured, and contextual unstructured and then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression. We systematically map these operations to the most relevant research topics across long-term, long-context, parametric modification, and multi-source memory. By reframing memory systems through the lens of atomic operations and representation types, this survey provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI, clarifying the functional interplay in LLM-based agents while outlining promising directions for future research. (The paper list, datasets, methods, and tools are available at this https URL.)

[NLP-3] DeepCritic: Deliberate Critique with Large Language Models

[Quick Read]: This paper addresses inadequate error identification and feedback on the outputs of large language models (LLMs): current LLM critics critique each step of math solutions too shallowly, yielding low judgment accuracy and little useful corrective feedback. The key is a novel two-stage framework: first, Qwen2.5-72B-Instruct generates long-form critiques with multi-perspective verification and in-depth analysis as seed data for supervised fine-tuning; then reinforcement learning, on either human-labeled data or data automatically annotated via Monte Carlo sampling, further strengthens the critique ability, markedly improving error identification and giving the generator more detailed feedback for refining its solutions.

Link: https://arxiv.org/abs/2505.00662
Authors: Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen
Institutions: Gaoling School of Artificial Intelligence, Renmin University of China; School of Computer Science and Technology, Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Work in progress. Data and models are available at this https URL

Abstract:As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing on each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that includes multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.

[NLP-4] On the generalization of language models from in-context learning and finetuning: a controlled study

[Quick Read]: This paper examines the poor generalization large language models show after fine-tuning, especially failures on relation reversals and logical deductions. The key is to exploit the different inductive biases of in-context learning: adding in-context inferences to the fine-tuning data improves generalization, as shown across multiple datasets and benchmarks.

Link: https://arxiv.org/abs/2505.00661
Authors: Andrew K. Lampinen, Arslan Chaudhry, Stephanie C.Y. Chan, Cody Wild, Diane Wan, Alex Ku, Jörg Bornschein, Razvan Pascanu, Murray Shanahan, James L. McClelland
Institutions: Google DeepMind; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from finetuning – from failing to generalize to simple reversals of relations they are trained on, to missing logical deductions that can be made from trained information. These failures to generalize from fine-tuning can hinder practical application of these models. However, language models’ in-context learning shows different inductive biases, and can generalize better in some of these cases. Here, we explore these differences in generalization between in-context- and fine-tuning-based learning. To do so, we constructed several novel datasets to evaluate and improve models’ ability to generalize from finetuning data. The datasets are constructed to isolate the knowledge in the dataset from that in pretraining, to create clean tests of generalization. We expose pretrained large models to controlled subsets of the information in these datasets – either in context, or through fine-tuning – and evaluate their performance on test sets that require various types of generalization. We find overall that in data-matched settings, in-context learning can generalize more flexibly than fine-tuning (though we also find some qualifications of prior findings, such as cases when fine-tuning can generalize to reversals embedded in a larger structure of knowledge). We build on these findings to propose a method to enable improved generalization from fine-tuning: adding in-context inferences to finetuning data. We show that this method improves generalization across various splits of our datasets and other benchmarks. Our results have implications for understanding the inductive biases of different modes of learning in language models, and practically improving their performance.

[NLP-5] Large Language Models Understanding: an Inherent Ambiguity Barrier

[Quick Read]: This paper asks whether large language models (LLMs) truly understand the dialogues they take part in, rather than merely producing plausible responses. The key contribution is a thought experiment and a semi-formal argument pointing to an inherent ambiguity barrier that prevents LLMs from understanding what their remarkably fluent dialogues mean.

Link: https://arxiv.org/abs/2505.00654
Authors: Daniel N. Nissani (Nissensohn)
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: submitted to NEURAL COMPUTATION

Abstract:A lively ongoing debate is taking place, since the extraordinary emergence of Large Language Models (LLMs) with regards to their capability to understand the world and capture the meaning of the dialogues in which they are involved. Arguments and counter-arguments have been proposed based upon thought experiments, anecdotal conversations between LLMs and humans, statistical linguistic analysis, philosophical considerations, and more. In this brief paper we present a counter-argument based upon a thought experiment and semi-formal considerations leading to an inherent ambiguity barrier which prevents LLMs from having any understanding of what their amazingly fluent dialogues mean.

[NLP-6] Investigating Task Arithmetic for Zero-Shot Information Retrieval SIGIR’25

[Quick Read]: This paper addresses the performance degradation of large language models (LLMs) on unseen tasks and domains, largely caused by shifts in vocabulary and word distributions. The key is Task Arithmetic, which combines the weights of LLMs pre-trained on different tasks or domains via simple mathematical operations such as addition or subtraction, adapting retrieval models without additional fine-tuning. The method merges diverse task and domain knowledge into a single model, enabling effective zero-shot adaptation across retrieval scenarios.

Link: https://arxiv.org/abs/2505.00649
Authors: Marco Braga, Pranav Kasela, Alessandro Raganato, Gabriella Pasi
Institutions: University of Milano-Bicocca; Politecnico di Torino
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted in SIGIR '25

Abstract:Large Language Models (LLMs) have shown impressive zero-shot performance across a variety of Natural Language Processing tasks, including document re-ranking. However, their effectiveness degrades on unseen tasks and domains, largely due to shifts in vocabulary and word distributions. In this paper, we investigate Task Arithmetic, a technique that combines the weights of LLMs pre-trained on different tasks or domains via simple mathematical operations, such as addition or subtraction, to adapt retrieval models without requiring additional fine-tuning. Our method is able to synthesize diverse tasks and domain knowledge into a single model, enabling effective zero-shot adaptation in different retrieval contexts. Extensive experiments on publicly available scientific, biomedical, and multilingual datasets show that our method improves state-of-the-art re-ranking performance by up to 18% in NDCG@10 and 15% in P@10. In addition to these empirical gains, our analysis provides insights into the strengths and limitations of Task Arithmetic as a practical strategy for zero-shot learning and model adaptation. We make our code publicly available at this https URL.
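
Task arithmetic itself is easy to picture in code: a task vector is the element-wise difference between a fine-tuned checkpoint and its base model, and merging is a scaled sum of such vectors added back to the base. Below is a minimal numpy sketch of that idea; the checkpoint names and scaling coefficients are hypothetical, and the paper's exact merging recipe for re-rankers may differ.

```python
import numpy as np

def task_vector(base, finetuned):
    # A task vector is the per-tensor difference between a fine-tuned
    # checkpoint and its base model.
    return {k: finetuned[k] - base[k] for k in base}

def merge(base, task_vectors, coeffs):
    # Task arithmetic: add a weighted sum of task vectors to the base weights.
    merged = {k: v.copy() for k, v in base.items()}
    for tv, lam in zip(task_vectors, coeffs):
        for k in merged:
            merged[k] += lam * tv[k]
    return merged

# Toy example with two hypothetical domain checkpoints.
rng = np.random.default_rng(0)
base = {"w": rng.normal(size=(4, 4))}
sci = {"w": base["w"] + 0.1 * rng.normal(size=(4, 4))}   # "scientific" model
bio = {"w": base["w"] + 0.1 * rng.normal(size=(4, 4))}   # "biomedical" model

merged = merge(base, [task_vector(base, sci), task_vector(base, bio)],
               coeffs=[0.5, 0.5])  # coefficients are a hypothetical choice
print(merged["w"].shape)  # (4, 4): a single model carrying both domains
```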

[NLP-7] The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)

[Quick Read]: This paper addresses the difficulty large language models (LLMs) have in accurately distinguishing messages from different roles in multi-role interactions, the problem of role separation. The key is to adjust token-level cues in the input encoding to reinforce invariant signals that mark role boundaries, instead of relying on surface proxies such as task type or proximity to the beginning of the text. This helps models maintain consistent multi-role behavior more reliably, rather than merely memorizing known prompts or triggers.

Link: https://arxiv.org/abs/2505.00626
Authors: Zihao Wang, Yibo Jiang, Jiahao Yu, Heqing Huang
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role – a concept we call role separation – is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, manipulating position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.

[NLP-8] FineScope: Precision Pruning for Domain-Specialized Large Language Models Using SAE-Guided Self-Data Cultivation

[Quick Read]: This paper aims to train efficient, high-performing small domain-specific large language models (LLMs) under resource constraints; medium-sized models such as LLaMA often lose accuracy on specialized datasets, so an effective domain-adaptation method is needed. The key is FineScope, a framework that uses a Sparse Autoencoder (SAE) to extract domain-specific subsets from large datasets, applies structured pruning with domain constraints so the pruned model retains essential knowledge for the target domain, and then recovers domain information lost during pruning through self-data distillation, further boosting performance.

Link: https://arxiv.org/abs/2505.00624
Authors: Chaitali Bhattacharyya, Yeseong Kim
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Training large language models (LLMs) from scratch requires significant computational resources, driving interest in developing smaller, domain-specific LLMs that maintain both efficiency and strong task performance. Medium-sized models such as LLaMA have served as starting points for domain-specific adaptation, but they often suffer from accuracy degradation when tested on specialized datasets. We introduce FineScope, a framework for deriving compact, domain-optimized LLMs from larger pretrained models. FineScope leverages the Sparse Autoencoder (SAE) framework, inspired by its ability to produce interpretable feature representations, to extract domain-specific subsets from large datasets. We apply structured pruning with domain-specific constraints, ensuring that the resulting pruned models retain essential knowledge for the target domain. To further enhance performance, these pruned models undergo self-data distillation, leveraging SAE-curated datasets to restore key domain-specific information lost during pruning. Extensive experiments and ablation studies demonstrate that FineScope achieves highly competitive performance, outperforming several large-scale state-of-the-art LLMs in domain-specific tasks. Additionally, our results show that FineScope enables pruned models to regain a substantial portion of their original performance when fine-tuned with SAE-curated datasets. Furthermore, applying these datasets to fine-tune pretrained LLMs without pruning also improves their domain-specific accuracy, highlighting the robustness of our approach. The code will be released.

[NLP-9] Block Circulant Adapter for Large Language Models IJCAI-2025

[Quick Read]: This paper addresses the high computation and storage costs of fine-tuning large language models (LLMs) caused by their sheer size. The key is a block circulant matrix-based fine-tuning method with a stable training heuristic, exploiting the properties of circulant matrices and one-dimensional Fourier transforms to sharply cut parameter counts and computation. Experiments show the method uses 14x fewer parameters than VeRA, is 16x smaller than LoRA, and needs 32x fewer FLOPs than FourierFT, while matching or exceeding their task performance.

Link: https://arxiv.org/abs/2505.00582
Authors: Xinyu Ding, Meiqi Wang, Siyu Liao, Zhongfeng Wang
Institutions: Sun Yat-sen University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: to appear in Proceedings of the 2025 International Joint Conference on Artificial Intelligence (IJCAI-2025)

Abstract:Fine-tuning large language models (LLMs) is difficult due to their huge model size. Recent Fourier domain-based methods show potential for reducing fine-tuning costs. We propose a block circulant matrix-based fine-tuning method with a stable training heuristic to leverage the properties of circulant matrices and one-dimensional Fourier transforms to reduce storage and computation costs. Experiments show that our method uses 14× fewer parameters than VeRA, is 16× smaller than LoRA, and uses 32× fewer FLOPs than FourierFT, while maintaining close or better task performance. Our approach presents a promising way in the frequency domain to fine-tune large models on downstream tasks.
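
The full adapter (block-partitioning the weight update plus the stable training heuristic) is more involved, but the primitive behind the savings is the FFT-based circulant product: a circulant block is stored as a single column of n values instead of n^2, and multiplied in O(n log n). A minimal numpy check of that primitive, not the authors' implementation:

```python
import numpy as np

def circulant_matvec(c, x):
    # C @ x where C is the circulant matrix with first column c,
    # computed in O(n log n) via the convolution theorem.
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

n = 8
rng = np.random.default_rng(0)
c, x = rng.normal(size=n), rng.normal(size=n)

# Explicit circulant matrix for comparison: column j is c rolled down by j.
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)
assert np.allclose(C @ x, circulant_matvec(c, x))
print("fast circulant product matches the dense one")
```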

[NLP-10] FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension

[Quick Read]: This paper tackles the challenges of extending the context window of large language models (LLMs): key-value (KV) cache memory grows linearly and self-attention scales quadratically with sequence length, and existing methods degrade when pushed to longer contexts. The key observation is that the energy of the KV cache concentrates in low-frequency components, so filtering out high-frequency components compresses the cache with minimal information loss, improving both fine-tuning and inference efficiency. The resulting method, FreqKV, adds no parameters or architectural changes and iteratively compresses the growing KV cache to a fixed size.

Link: https://arxiv.org/abs/2505.00570
Authors: Jushi Kai, Boyi Zeng, Yixuan Wang, Haoli Bai, Bo Jiang, Zhouhan Lin
Institutions: LUMIA Lab, Shanghai Jiao Tong University; Huawei Noah's Ark Lab; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Extending the context window in large language models (LLMs) is essential for applications involving long-form content generation. However, the linear increase in key-value (KV) cache memory requirements and the quadratic complexity of self-attention with respect to sequence length present significant challenges during fine-tuning and inference. Existing methods suffer from performance degradation when extending to longer contexts. In this work, we introduce a novel context extension method that optimizes both fine-tuning and inference efficiency. Our method exploits a key observation: in the frequency domain, the energy distribution of the KV cache is primarily concentrated in low-frequency components. By filtering out the high-frequency components, the KV cache can be effectively compressed with minimal information loss. Building on this insight, we propose an efficient compression technique, FreqKV, that iteratively compresses the increasing KV cache to a fixed size in the frequency domain, applicable to both fine-tuning and inference. FreqKV introduces no additional parameters or architectural modifications. With minimal fine-tuning, LLMs can learn to leverage the limited cache that is compressed in the frequency domain and extend the context window efficiently. Experiments on various long context language modeling and understanding tasks demonstrate the efficiency and efficacy of the proposed method.
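
As a rough illustration of the core idea, the sketch below compresses a (seq_len, dim) cache slice by keeping only its low-frequency DCT components along the sequence axis and inverting at the shorter length; the transform choice, normalization, and compression schedule here are assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.fft import dct, idct

def freq_compress(kv, keep_ratio=0.5):
    # Keep only the low-frequency DCT rows along the sequence axis,
    # then invert at the shorter length to get a compressed cache slice.
    seq_len, dim = kv.shape
    k = max(1, int(seq_len * keep_ratio))
    coeffs = dct(kv, axis=0, norm="ortho")   # sequence axis -> frequency axis
    low = coeffs[:k]                         # drop high-frequency components
    return idct(low, axis=0, norm="ortho")   # compressed (k, dim) slice

# A smooth (low-frequency-heavy) toy signal compresses with little loss.
kv = np.cumsum(np.random.randn(64, 8), axis=0)
print(freq_compress(kv).shape)  # (32, 8)
```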

[NLP-11] Triggering Hallucinations in LLMs: A Quantitative Study of Prompt-Induced Hallucination in Large Language Models

[Quick Read]: This paper addresses hallucination in large language models (LLMs), where generated content is fluent yet factually wrong, a critical concern in domains such as healthcare and law that demand factual reliability. The key is a prompt-based framework with two components: a Hallucination-Inducing Prompt (HIP), which systematically triggers hallucination by misleadingly fusing semantically distant concepts, and a Hallucination Quantifying Prompt (HQP), which scores the plausibility, confidence, and coherence of the output. The framework offers a reproducible testbed for studying hallucination vulnerability and a foundation for safer, more introspective models.

Link: https://arxiv.org/abs/2505.00557
Authors: Makoto Sato
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hallucinations in large language models (LLMs) present a growing challenge across real-world applications, from healthcare to law, where factual reliability is essential. Despite advances in alignment and instruction tuning, LLMs can still generate outputs that are fluent yet fundamentally untrue. Understanding the cognitive dynamics that underlie these hallucinations remains an open problem. In this study, we propose a prompt-based framework to systematically trigger and quantify hallucination: a Hallucination-Inducing Prompt (HIP), which synthetically fuses semantically distant concepts (e.g., periodic table of elements and tarot divination) in a misleading way, and a Hallucination Quantifying Prompt (HQP), which scores the plausibility, confidence, and coherence of the output. Controlled experiments across multiple LLMs revealed that HIPs consistently produced less coherent and more hallucinated responses than their null-fusion controls. These effects varied across models, with reasoning-oriented LLMs showing distinct profiles from general-purpose ones. Our framework provides a reproducible testbed for studying hallucination vulnerability, and opens the door to developing safer, more introspective LLMs that can detect and self-regulate the onset of conceptual instability.
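
The two prompts are easy to mock up. The sketch below builds a HIP from the concept pair mentioned in the abstract and an HQP around a model output; the exact wording of both templates is hypothetical, since the paper defines its own.

```python
def hallucination_inducing_prompt(concept_a, concept_b):
    # HIP: fuse two semantically distant concepts in a misleading way,
    # presupposing a "well-established" link that does not exist.
    return (f"Explain the well-established theory of how {concept_a} "
            f"determines {concept_b}, citing its key principles.")

def hallucination_quantifying_prompt(output_text):
    # HQP: ask a (second) model to score the output's plausibility,
    # confidence, and coherence.
    return ("Rate the following text from 0 to 10 on plausibility, "
            "confidence, and coherence, and justify each score:\n\n"
            + output_text)

print(hallucination_inducing_prompt("the periodic table of elements",
                                    "tarot divination"))
```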

[NLP-12] 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

[Quick Read]: This paper addresses how to replicate the strong performance of reasoning language models (RLMs) such as DeepSeek-R1 when implementation details are not fully open-sourced. The key lies in supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), combining data preparation with method design to chart effective optimization paths. The survey summarizes the data construction, method design, and training procedures of current replication studies and distills key findings to guide future RLM research and development.

Link: https://arxiv.org/abs/2505.00551
Authors: Chong Zhang, Yue Deng, Xiang Lin, Bin Wang, Dianwen Ng, Hai Ye, Xingxuan Li, Yao Xiao, Zhanfeng Mo, Qi Zhang, Lidong Bing
Institutions: MiroMind; Fudan University; National University of Singapore; Singapore University of Technology and Design; Nanyang Technological University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The recent development of reasoning language models (RLMs) represents a novel evolution in large language models. In particular, the recent release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, the implementation details of the released models have not been fully open-sourced by DeepSeek, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. As a result, many replication studies have emerged aiming to reproduce the strong performance achieved by DeepSeek-R1, reaching comparable performance through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), focusing on data preparation and method design, yielding various valuable insights. In this report, we provide a summary of recent replication studies to inspire future research. We primarily focus on SFT and RLVR as two main directions, introducing the details for data construction, method design and training procedure of current replication studies. Moreover, we conclude key findings from the implementation details and experimental results reported by these studies, anticipating to inspire future research. We also discuss additional techniques of enhancing RLMs, highlighting the potential of expanding the application scope of these models, and discussing the challenges in development. By this survey, we aim to help researchers and developers of RLMs stay updated with the latest advancements, and seek to inspire new ideas to further enhance RLMs.

[NLP-13] HalluMix: A Task-Agnostic Multi-Domain Benchmark for Real-World Hallucination Detection

[Quick Read]: This paper addresses detecting hallucinated content, i.e., text unsupported by evidence, when generative AI is deployed in high-stakes domains. Existing hallucination-detection benchmarks are mostly synthetic and narrowly focused on extractive question answering, failing to capture real-world complexity involving multi-document contexts and full-sentence outputs. The solution is the HalluMix Benchmark, a diverse, task-agnostic dataset spanning multiple domains and formats, on which seven hallucination detection systems are evaluated, revealing performance differences across tasks, document lengths, and input representations; the key is realistic data that improves the generalization and practicality of detection systems.

Link: https://arxiv.org/abs/2505.00506
Authors: Deanna Emery, Michael Goitia, Freddie Vargus, Iulia Neagu
Institutions: Quotient AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content – text that is not grounded in supporting evidence – has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems – both open and closed source – highlighting differences in performance across tasks, document lengths, and input representations. Our analysis highlights substantial performance disparities between short and long contexts, with critical implications for real-world Retrieval Augmented Generation (RAG) implementations. Quotient Detections achieves the best overall performance, with an accuracy of 0.82 and an F1 score of 0.84.

[NLP-14] Computational Identification of Regulatory Statements in EU Legislation

[Quick Read]: This paper addresses identifying regulatory statements in EU legislation to support metrics of regulatory density and strictness. The key is a purpose-specific definition grounded in the institutional grammar tool, together with two contrasting automatic identification approaches: one based on dependency parsing and the other on a transformer-based machine learning model. Both perform well (80% and 84% accuracy, respectively), indicating practical potential.

Link: https://arxiv.org/abs/2505.00479
Authors: Gijs Jan Brandsma, Jens Blom-Hansen, Christiaan Meijer, Kody Moodley
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures

Abstract:Identifying regulatory statements in legislation is useful for developing metrics to measure the regulatory density and strictness of legislation. A computational method is valuable for scaling the identification of such statements from a growing body of EU legislation, constituting approximately 180,000 published legal acts between 1952 and 2023. Past work on extraction of these statements varies in the permissiveness of their definitions for what constitutes a regulatory statement. In this work, we provide a specific definition for our purposes based on the institutional grammar tool. We develop and compare two contrasting approaches for automatically identifying such statements in EU legislation, one based on dependency parsing, and the other on a transformer-based machine learning model. We found both approaches performed similarly well, with accuracies of 80% and 84% respectively and a K-alpha of 0.58. The high accuracies and only moderate agreement suggest potential for combining the strengths of both approaches.
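
As a toy illustration of the dependency-parsing route (not the authors' rule set), the sketch below flags a sentence when a deontic modal governs a verb with an explicit subject, i.e., an addressee; it assumes spaCy with the en_core_web_sm model installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
DEONTICS = {"shall", "must", "may"}

def is_regulatory(sentence):
    # Flag a sentence when a deontic modal attaches to a verb that has an
    # explicit subject (the addressee). A toy rule, not the authors' rules.
    doc = nlp(sentence)
    for tok in doc:
        if tok.lemma_.lower() in DEONTICS and tok.tag_ == "MD":
            head = tok.head  # the main verb the modal is auxiliary to
            if any(c.dep_ in ("nsubj", "nsubjpass") for c in head.children):
                return True
    return False

print(is_regulatory("Member States shall notify the Commission of such measures."))
```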

[NLP-15] Red Teaming Large Language Models for Healthcare

[Quick Read]: This paper probes potential safety vulnerabilities of large language models (LLMs) in healthcare, specifically weaknesses where model outputs could cause clinical harm. The key is red teaming that pairs computational and clinical experts to elicit harmful responses to realistic clinical prompts, exposing flaws that LLM developers without clinical expertise would likely miss.

Link: https://arxiv.org/abs/2505.00467
Authors: Vahid Balazadeh, Michael Cooper, David Pellow, Atousa Assadi, Jennifer Bell, Jim Fackler, Gabriel Funingana, Spencer Gable-Cook, Anirudh Gangadhar, Abhishek Jaiswal, Sumanth Kaja, Christopher Khoury, Randy Lin, Kaden McKeen, Sara Naimimohasses, Khashayar Namdar, Aviraj Newatia, Allan Pang, Anshul Pattoo, Sameer Peesapati, Diana Prepelita, Bogdana Rakova, Saba Sadatamin, Rafael Schulman, Ajay Shah, Syed Azhar Shah, Syed Ahmar Shah, Babak Taati, Balagopal Unnikrishnan, Stephanie Williams, Rahul G Krishnan
Institutions: University of Toronto; Vector Institute for AI; University Health Network; NYU Langone Health; Johns Hopkins Medical Institutions; Cancer Research UK Cambridge Institute; University of Iowa Hospitals & Clinics; Algoma University; University of Cambridge; Leeds Teaching Hospitals NHS Trust; Queen's University; Synthesize; Basque Center for Applied Mathematics (BCAM); Basque Foundation for Science (IKERBASQUE); University of Edinburgh
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present the design process and findings of the pre-conference workshop at the Machine Learning for Healthcare Conference (2024) entitled Red Teaming Large Language Models for Healthcare, which took place on August 15, 2024. Conference participants, comprising a mix of computational and clinical expertise, attempted to discover vulnerabilities – realistic clinical prompts for which a large language model (LLM) outputs a response that could cause clinical harm. Red-teaming with clinicians enables the identification of LLM vulnerabilities that may not be recognised by LLM developers lacking clinical expertise. We report the vulnerabilities found, categorise them, and present the results of a replication study assessing the vulnerabilities across all LLMs provided.

[NLP-16] Toward Automated Regulatory Decision-Making: Trustworthy Medical Device Risk Classification with Multimodal Transformers and Self-Training

[Quick Read]: This paper targets accurate classification of medical device risk levels, which is essential for regulatory oversight and clinical safety. The key is a Transformer-based multimodal framework that integrates textual descriptions and visual information to predict regulatory classification, using a cross-attention mechanism to capture intermodal dependencies and a self-training strategy to improve generalization under limited supervision.

Link: https://arxiv.org/abs/2505.00422
Authors: Yu Han, Aaron Ceross, Jeroen H.M. Bergmann
Institutions: University of Oxford; University of Birmingham; University of Southern Denmark
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Accurate classification of medical device risk levels is essential for regulatory oversight and clinical safety. We present a Transformer-based multimodal framework that integrates textual descriptions and visual information to predict device regulatory classification. The model incorporates a cross-attention mechanism to capture intermodal dependencies and employs a self-training strategy for improved generalization under limited supervision. Experiments on a real-world regulatory dataset demonstrate that our approach achieves up to 90.4% accuracy and 97.9% AUROC, significantly outperforming text-only (77.2%) and image-only (54.8%) baselines. Compared to standard multimodal fusion, the self-training mechanism improved SVM performance by 3.3 percentage points in accuracy (from 87.1% to 90.4%) and 1.4 points in macro-F1, suggesting that pseudo-labeling can effectively enhance generalization under limited supervision. Ablation studies further confirm the complementary benefits of both cross-modal attention and self-training.
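
A minimal PyTorch sketch of the cross-attention fusion described above: text tokens attend to image patches and vice versa, and the pooled fused features feed a risk-class head. The dimensions, pooling, and classification head are hypothetical, and the self-training loop is omitted.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    # Text tokens attend to image patches and vice versa; the pooled fused
    # features feed a risk-class head. Dimensions are hypothetical.
    def __init__(self, d=256, heads=4, n_classes=3):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(d, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(d, heads, batch_first=True)
        self.head = nn.Linear(2 * d, n_classes)

    def forward(self, text_feats, image_feats):
        t, _ = self.txt2img(text_feats, image_feats, image_feats)
        v, _ = self.img2txt(image_feats, text_feats, text_feats)
        fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.head(fused)

# Batch of 2: 16 text tokens and 49 image patches, each already embedded.
logits = CrossModalFusion()(torch.randn(2, 16, 256), torch.randn(2, 49, 256))
print(logits.shape)  # torch.Size([2, 3])
```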

[NLP-17] CSE-SFP: Enabling Unsupervised Sentence Representation Learning via a Single Forward Pass SIGIR2025

[Quick Read]: This paper addresses how to perform unsupervised sentence representation learning efficiently on generative language models, especially with limited compute. Existing methods mostly target discriminative pre-trained language models (PLMs), while generative PLMs, with far larger parameter counts and compute costs, remain underexplored. The key of the proposed CSE-SFP is to exploit the structural characteristics of generative models so that effective unsupervised contrastive learning needs only a single forward pass, preserving embedding quality while substantially cutting training time and memory.

Link: https://arxiv.org/abs/2505.00389
Authors: Bowen Zhang, Zixin Song, Chunping Li
Institutions: Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: Accepted by SIGIR 2025 (Full)

Abstract:As a fundamental task in Information Retrieval and Computational Linguistics, sentence representation has profound implications for a wide range of practical applications such as text clustering, content analysis, question-answering systems, and web search. Recent advances in pre-trained language models (PLMs) have driven remarkable progress in this field, particularly through unsupervised embedding derivation methods centered on discriminative PLMs like BERT. However, due to time and computational constraints, few efforts have attempted to integrate unsupervised sentence representation with generative PLMs, which typically possess much larger parameter sizes. Given that state-of-the-art models in both academia and industry are predominantly based on generative architectures, there is a pressing need for an efficient unsupervised text representation framework tailored to decoder-only PLMs. To address this concern, we propose CSE-SFP, an innovative method that exploits the structural characteristics of generative models. Compared to existing strategies, CSE-SFP requires only a single forward pass to perform effective unsupervised contrastive learning. Rigorous experimentation demonstrates that CSE-SFP not only produces higher-quality embeddings but also significantly reduces both training time and memory consumption. Furthermore, we introduce two ratio metrics that jointly assess alignment and uniformity, thereby providing a more robust means for evaluating the semantic spatial properties of encoding models.

[NLP-18] KoACD: The First Korean Adolescent Dataset for Cognitive Distortion Analysis

[Quick Read]: This paper addresses data scarcity and weak model generalization in detecting adolescents' cognitive distortions, where large-scale, adolescent-focused corpora are rare. The key is KoACD, the first large-scale cognitive distortion dataset for Korean adolescents, built with a multi-Large Language Model (LLM) negotiation method to refine distortion classification, plus synthetic data generated via two strategies, cognitive clarification and cognitive balancing, to strengthen context-dependent reasoning.

Link: https://arxiv.org/abs/2505.00367
Authors: JunSeo Kim, HyeHyeon Kim
Institutions: Gachon University College of IT Convergence; Yonsei University College of Medicine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cognitive distortion refers to negative thinking patterns that can lead to mental health issues like depression and anxiety in adolescents. Previous studies using natural language processing (NLP) have focused mainly on small-scale adult datasets, with limited research on adolescents. This study introduces KoACD, the first large-scale dataset of cognitive distortions in Korean adolescents, containing 108,717 instances. We applied a multi-Large Language Model (LLM) negotiation method to refine distortion classification and generate synthetic data using two approaches: cognitive clarification for textual clarity and cognitive balancing for diverse distortion representation. Validation through LLMs and expert evaluations showed that while LLMs classified distortions with explicit markers, they struggled with context-dependent reasoning, where human evaluators demonstrated higher accuracy. KoACD aims to enhance future research on cognitive distortion detection.

[NLP-19] R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training

[Quick Read]: This paper addresses two flaws of data mixing strategies for language model training: reliance on predetermined data domains (e.g., data sources, task types), which can miss critical semantic nuances and leave performance on the table, and computation that grows prohibitively with the number of domains. The key is the R&B framework, which regroups training data into finer-grained domains by semantic similarity (Regroup) and efficiently optimizes the data composition (Balance) using a Gram matrix induced by domain gradients collected during training, improving performance with essentially no extra compute.

Link: https://arxiv.org/abs/2505.00358
Authors: Albert Ge, Tzu-Heng Huang, John Cooper, Avi Trost, Ziyi Chu, Satya Sai Srinath Namburi GNVV, Ziyang Cai, Kendall Park, Nicholas Roberts, Frederic Sala
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Data mixing strategies have successfully reduced the costs involved in training language models. While promising, such methods suffer from two flaws. First, they rely on predetermined data domains (e.g., data sources, task types), which may fail to capture critical semantic nuances, leaving performance on the table. Second, these methods scale with the number of domains in a computationally prohibitive way. We address these challenges via R&B, a framework that re-partitions training data based on semantic similarity (Regroup) to create finer-grained domains, and efficiently optimizes the data composition (Balance) by leveraging a Gram matrix induced by domain gradients obtained throughout training. Unlike prior works, it removes the need for additional compute to obtain evaluation information such as losses or gradients. We analyze this technique under standard regularity conditions and provide theoretical insights that justify R&B's effectiveness compared to non-adaptive mixing approaches. Empirically, we demonstrate the effectiveness of R&B on five diverse datasets ranging from natural language to reasoning and multimodal tasks. With as little as 0.01% additional compute overhead, R&B matches or exceeds the performance of state-of-the-art data mixing strategies.
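
A rough sketch of the two stages: Regroup clusters example embeddings into finer-grained domains, and Balance turns the Gram matrix of per-domain gradients into sampling weights. The scoring rule below is a guess at the spirit of the method, not the paper's actual objective.

```python
import numpy as np
from sklearn.cluster import KMeans

def regroup(embeddings, n_domains=8, seed=0):
    # Regroup: re-partition training examples into finer-grained domains
    # by clustering their semantic embeddings.
    km = KMeans(n_clusters=n_domains, random_state=seed, n_init=10)
    return km.fit_predict(embeddings)

def balance(domain_grads):
    # Balance: score each domain by how well its gradient aligns with the
    # average training direction, read off the Gram matrix of domain
    # gradients, then renormalize into sampling weights. This scoring rule
    # is an assumption, not the paper's exact optimization.
    G = np.stack(domain_grads)        # (m domains, p params)
    gram = G @ G.T                    # (m, m) domain-gradient Gram matrix
    scores = np.maximum(gram.mean(axis=1), 1e-8)
    return scores / scores.sum()

rng = np.random.default_rng(0)
domains = regroup(rng.normal(size=(1000, 64)))
weights = balance([rng.normal(size=64) for _ in range(8)])
print(np.bincount(domains), weights.round(3))
```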

[NLP-20] Enhancing AI-Driven Education: Integrating Cognitive Frameworks, Linguistic Feedback Analysis and Ethical Considerations for Improved Content Generation IJCNN2025

[Quick Read]: This paper asks how to use generative AI effectively and responsibly in education, in particular how to raise the quality, cognitive depth, and ethical soundness of AI-generated teaching materials. The key is a comprehensive framework that integrates cognitive assessment frameworks (Bloom's Taxonomy and the SOLO Taxonomy), linguistic analysis of AI-generated feedback, and ethical design principles to guide AI educational tools. It proceeds in three phases, cognitive alignment, linguistic feedback integration, and ethical safeguards, and is demonstrated in practice through OneClickQuiz, an AI-powered Moodle plugin for quiz generation.

Link: https://arxiv.org/abs/2505.00339
Authors: Antoun Yaacoub, Sansiri Tarnpradab, Phattara Khumprom, Zainab Assaghir, Lionel Prevost, Jérôme Da-Rugna
Institutions: ESIEA Lab, ESIEA; King Mongkut's University of Technology Thonburi; Lebanese University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This article will be presented in the IJCNN 2025 workshop "AI Innovations for Education: Transforming Teaching and Learning through Cutting-Edge Technologies"

Abstract:Artificial intelligence (AI) is rapidly transforming education, presenting unprecedented opportunities for personalized learning and streamlined content creation. However, realizing the full potential of AI in educational settings necessitates careful consideration of the quality, cognitive depth, and ethical implications of AI-generated materials. This paper synthesizes insights from four related studies to propose a comprehensive framework for enhancing AI-driven educational tools. We integrate cognitive assessment frameworks (Bloom’s Taxonomy and SOLO Taxonomy), linguistic analysis of AI-generated feedback, and ethical design principles to guide the development of effective and responsible AI tools. We outline a structured three-phase approach encompassing cognitive alignment, linguistic feedback integration, and ethical safeguards. The practical application of this framework is demonstrated through its integration into OneClickQuiz, an AI-powered Moodle plugin for quiz generation. This work contributes a comprehensive and actionable guide for educators, researchers, and developers aiming to harness AI’s potential while upholding pedagogical and ethical standards in educational content generation.

[NLP-21] T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation

[Quick Read]: This paper addresses the failure of current text-to-video models to obey basic physical laws, which yields unrealistic or misleading content. The key is T2VPhysBench, a first-principles benchmark that systematically tests whether state-of-the-art text-to-video systems respect twelve core physical laws, including Newtonian mechanics, conservation laws, and phenomenological effects, using a rigorous human evaluation protocol and three targeted studies that expose significant shortfalls in physical compliance.

Link: https://arxiv.org/abs/2505.00337
Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao
Institutions: Guilin University of Electronic Technology; University of Arizona; University of Wisconsin-Madison; The Simons Institute for the Theory of Computing at UC Berkeley; Arizona State University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-video generative models have made significant strides in recent years, producing high-quality videos that excel in both aesthetic appeal and accurate instruction following, and have become central to digital art creation and user engagement online. Yet, despite these advancements, their ability to respect fundamental physical laws remains largely untested: many outputs still violate basic constraints such as rigid-body collisions, energy conservation, and gravitational dynamics, resulting in unrealistic or even misleading content. Existing physical-evaluation benchmarks typically rely on automatic, pixel-level metrics applied to simplistic, life-scenario prompts, and thus overlook both human judgment and first-principles physics. To fill this gap, we introduce T2VPhysBench, a first-principled benchmark that systematically evaluates whether state-of-the-art text-to-video systems, both open-source and commercial, obey twelve core physical laws including Newtonian mechanics, conservation principles, and phenomenological effects. Our benchmark employs a rigorous human evaluation protocol and includes three targeted studies: (1) an overall compliance assessment showing that all models score below 0.60 on average in each law category; (2) a prompt-hint ablation revealing that even detailed, law-specific hints fail to remedy physics violations; and (3) a counterfactual robustness test demonstrating that models often generate videos that explicitly break physical rules when so instructed. The results expose persistent limitations in current architectures and offer concrete insights for guiding future research toward truly physics-aware video generation.

[NLP-22] Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing

[Quick Read]: This paper tackles the quadratic cost of self-attention in large language models; despite extensive research, subquadratic attention methods still underperform in practice. The key is dynamic, learned, content-based sparsity: Mixture of Sparse Attention (MoSA), inspired by Mixture of Experts (MoE), uses expert-choice routing so each attention head dynamically selects its own tokens, permitting arbitrary sparse attention patterns. Choosing k tokens from a sequence of length T cuts each head's complexity from O(T^2) to O(k^2 + T), so more heads fit in the same compute budget and specialize further; at matched compute, MoSA outperforms dense baselines, in some cases with up to 27% better perplexity.

Link: https://arxiv.org/abs/2505.00315
Authors: Piotr Piękos, Róbert Csordás, Jürgen Schmidhuber
Institutions: KAUST, AI Initiative; Stanford University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in large language models highlighted the excessive quadratic cost of self-attention. Despite the significant research efforts, subquadratic attention methods still suffer from inferior performance in practice. We hypothesize that dynamic, learned content-based sparsity can lead to more efficient attention mechanisms. We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. By selecting k tokens from a sequence of length T, MoSA reduces the computational complexity of each attention head from O(T^2) to O(k^2 + T). This enables using more heads within the same computational budget, allowing higher specialization. We show that among the tested sparse attention variants, MoSA is the only one that can outperform the dense baseline, sometimes with up to 27% better perplexity for an identical compute budget. MoSA can also reduce the resource usage compared to dense self-attention. Despite using a torch implementation without an optimized kernel, perplexity-matched MoSA models are simultaneously faster in wall-clock time, require less memory for training, and drastically reduce the size of the KV-cache compared to the dense transformer baselines.
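
A single MoSA head is straightforward to sketch: a router scores all T tokens, the head keeps its top-k, and full attention runs only among those, which is where the O(k^2 + T) cost comes from. The numpy toy below ignores causal masking, multi-head packing, and how the router is trained.

```python
import numpy as np

def mosa_head(x, Wq, Wk, Wv, w_router, k):
    # One MoSA head: an expert-choice router picks k of the T tokens, and
    # full attention runs only among those, costing O(k^2 + T) instead of
    # O(T^2). Causal masking and router training are omitted in this toy.
    T, d = x.shape
    scores = x @ w_router                   # router score per token, O(T)
    idx = np.argsort(scores)[-k:]           # this head's top-k tokens
    xs = x[idx]                             # (k, d) selected tokens
    q, kk, v = xs @ Wq, xs @ Wk, xs @ Wv
    att = q @ kk.T / np.sqrt(d)             # (k, k) attention logits
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    out[idx] = att @ v                      # unselected tokens get no update
    return out

rng = np.random.default_rng(0)
T, d, k = 128, 32, 16
out = mosa_head(rng.normal(size=(T, d)),
                *(rng.normal(size=(d, d)) for _ in range(3)),
                rng.normal(size=d), k)
print(out.shape)  # (128, 32)
```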

[NLP-23] Consistency in Language Models: Current Landscape, Challenges and Future Directions

[Quick Read]: This paper examines the core problem that current language models struggle to stay reliably consistent across scenarios, with inconsistencies in logical rule adherence, moral coherence, and factual accuracy. The key is establishing robust evaluation benchmarks and interdisciplinary approaches so that language models remain consistent on domain-specific tasks without sacrificing utility and adaptability.

Link: https://arxiv.org/abs/2505.00268
Authors: Jekaterina Novikova, Carol Anderson, Borhane Blili-Hamelin, Subhabrata Majumdar
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The hallmark of effective language use lies in consistency – expressing similar meanings in similar contexts and avoiding contradictions. While human communication naturally demonstrates this principle, state-of-the-art language models struggle to maintain reliable consistency across different scenarios. This paper examines the landscape of consistency research in AI language systems, exploring both formal consistency (including logical rule adherence) and informal consistency (such as moral and factual coherence). We analyze current approaches to measure aspects of consistency, identify critical research gaps in standardization of definitions, multilingual assessment, and methods to improve consistency. Our findings point to an urgent need for robust benchmarks to measure and interdisciplinary approaches to ensure consistency in the application of language models on domain-specific tasks while preserving the utility and adaptability.

[NLP-24] EnronQA: Towards Personalized RAG over Private Documents

[Quick Read]: This paper addresses the lack of personalization and private-document context in existing Retrieval Augmented Generation (RAG) benchmarks for enterprise use: current RAG evaluations draw on public corpora such as Wikipedia or generic web pages and do not reflect how enterprises work with private documents. The key is the EnronQA benchmark, 103,638 emails with 528,304 question-answer pairs across 150 user inboxes, which enables more accurate evaluation of RAG pipelines over private data and supports experiments on personalized retrieval settings and the memorization-versus-retrieval tradeoff on realistic data.

Link: https://arxiv.org/abs/2505.00263
Authors: Michael J. Ryan, Danmei Xu, Chris Nivera, Daniel Campos
Institutions: Stanford University; Snowflake
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 26 pages, 4 figures, 6 tables

Abstract:Retrieval Augmented Generation (RAG) has become one of the most popular methods for bringing knowledge-intensive context to large language models (LLM) because of its ability to bring local context at inference time without the cost or data leakage risks associated with fine-tuning. A clear separation of private information from the LLM training has made RAG the basis for many enterprise LLM workloads as it allows the company to augment LLM’s understanding using customers’ private documents. Despite its popularity for private documents in enterprise deployments, current RAG benchmarks for validating and optimizing RAG pipelines draw their corpora from public data such as Wikipedia or generic web pages and offer little to no personal context. Seeking to empower more personal and private RAG we release the EnronQA benchmark, a dataset of 103,638 emails with 528,304 question-answer pairs across 150 different user inboxes. EnronQA enables better benchmarking of RAG pipelines over private data and allows for experimentation on the introduction of personalized retrieval settings over realistic data. Finally, we use EnronQA to explore the tradeoff in memorization and retrieval when reasoning over private documents.

[NLP-25] Enriching the Korean Learner Corpus with Multi-reference Annotations and Rubric-Based Scoring

[Quick Read]: This paper addresses the shortage of learner corpora for second-language (L2) Korean writing, especially standardized resources for evaluating grammatical error correction (GEC). The key is extending the KoLLA Korean learner corpus with multiple GEC references, enabling more nuanced and flexible evaluation of GEC systems, and enriching it with rubric-based scores aligned with Korean National Language Institute guidelines covering grammatical accuracy, coherence, and lexical diversity. These enhancements make KoLLA a robust, standardized resource for Korean L2 education research.

Link: https://arxiv.org/abs/2505.00261
Authors: Jayoung Song, KyungTae Lim, Jungyeul Park
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Despite growing global interest in Korean language education, there remains a significant lack of learner corpora tailored to Korean L2 writing. To address this gap, we enhance the KoLLA Korean learner corpus by adding multiple grammatical error correction (GEC) references, thereby enabling more nuanced and flexible evaluation of GEC systems and reflecting the variability of human language. Additionally, we enrich the corpus with rubric-based scores aligned with guidelines from the Korean National Language Institute, capturing grammatical accuracy, coherence, and lexical diversity. These enhancements make KoLLA a robust and standardized resource for research in Korean L2 education, supporting advancements in language learning, assessment, and automated error correction.

[NLP-26] Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

[Quick Read]: This paper addresses the dependence of large language model (LLM) agents on task-specific knowledge engineering, such as prompt tuning, curated in-context examples, or customized observation and action spaces, to improve sequential decision-making; these approaches work but demand heavy manual effort. The key is letting the agent improve automatically by learning in-context from its own successful experiences on similar tasks, building and refining a database of self-generated examples, with database-level and exemplar-level selection mechanisms further boosting performance.

Link: https://arxiv.org/abs/2505.00234
Authors: Vishnu Sarukkai, Zhiqiang Xie, Kayvon Fatahalian
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Many methods for improving Large Language Model (LLM) agents for sequential decision-making tasks depend on task-specific knowledge engineering–such as prompt tuning, curated in-context examples, or customized observation and action spaces. Using these approaches, agent performance improves with the quality or amount of knowledge engineering invested. Instead, we investigate how LLM agents can automatically improve their performance by learning in-context from their own successful experiences on similar tasks. Rather than relying on task-specific knowledge engineering, we focus on constructing and refining a database of self-generated examples. We demonstrate that even a naive accumulation of successful trajectories across training tasks boosts test performance on three benchmarks: ALFWorld (73% to 89%), Wordcraft (55% to 64%), and InterCode-SQL (75% to 79%)–matching the performance the initial agent achieves if allowed two to three attempts per task. We then introduce two extensions: (1) database-level selection through population-based training to identify high-performing example collections, and (2) exemplar-level selection that retains individual trajectories based on their empirical utility as in-context examples. These extensions further enhance performance, achieving 91% on ALFWorld–matching more complex approaches that employ task-specific components and prompts. Our results demonstrate that automatic trajectory database construction offers a compelling alternative to labor-intensive knowledge engineering.
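
The core loop is simple to sketch: store trajectories that succeeded, and at test time retrieve the most similar ones as in-context examples. The embedding function below is a toy bag-of-words stand-in and the similarity rule is an assumption; the paper's database-level and exemplar-level selection extensions are omitted.

```python
import numpy as np

def toy_embed(text, dim=64):
    # Toy hashed bag-of-words embedding; a stand-in for a real encoder.
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v

class TrajectoryDB:
    # Accumulate successful trajectories; retrieve the most similar ones
    # as in-context examples for a new task.
    def __init__(self, embed):
        self.embed, self.items = embed, []

    def add_if_successful(self, task, trajectory, success):
        if success:
            self.items.append((self.embed(task), trajectory))

    def retrieve(self, task, n=3):
        q = self.embed(task)
        sims = [float(q @ e) / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-8)
                for e, _ in self.items]
        return [self.items[i][1] for i in np.argsort(sims)[::-1][:n]]

db = TrajectoryDB(toy_embed)
db.add_if_successful("put the mug in the sink", "<successful trajectory>", True)
db.add_if_successful("open the drawer", "<another trajectory>", True)
print(db.retrieve("place the cup in the sink", n=1))
```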

[NLP-27] Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

[Quick Read]: This paper addresses failure attribution in large language model (LLM) multi-agent systems: identifying which agent and which decisive error step caused a task failure. The key is the Who&When dataset, a large collection of failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps; on it, three automated failure attribution methods are developed and evaluated to probe the feasibility and challenges of the task.

Link: https://arxiv.org/abs/2505.00212
Authors: Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, Qingyun Wu
Institutions: Unknown
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments:

Abstract:Failure attribution in LLM multi-agent systems – identifying the agent and step responsible for task failures – provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at this https URL

[NLP-28] IP-CRR: Information Pursuit for Interpretable Classification of Chest Radiology Reports

[Quick Read]: This paper addresses the limited clinical adoption of AI-based radiology report analysis caused by a lack of interpretability. The key is an interpretable-by-design framework: it extracts a set of most informative queries from a large collection of reports and uses those queries and their answers to predict a diagnosis, so that the explanation for each prediction is, by construction, the selected queries and answers.

Link: https://arxiv.org/abs/2505.00191
Authors: Yuyan Ge, Kwan Ho Ryan Chan, Pablo Messina, René Vidal
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures

Abstract:The development of AI-based methods for analyzing radiology reports could lead to significant advances in medical diagnosis–from improving diagnostic accuracy to enhancing efficiency and reducing workload. However, the lack of interpretability in these methods has hindered their adoption in clinical settings. In this paper, we propose an interpretable-by-design framework for classifying radiology reports. The key idea is to extract a set of most informative queries from a large set of reports and use these queries and their corresponding answers to predict a diagnosis. Thus, the explanation for a prediction is, by construction, the set of selected queries and answers. We use the Information Pursuit framework to select informative queries, the Flan-T5 model to determine if facts are present in the report, and a classifier to predict the disease. Experiments on the MIMIC-CXR dataset demonstrate the effectiveness of the proposed method, highlighting its potential to enhance trust and usability in medical AI.
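
As a crude stand-in for the Information Pursuit selection step, the sketch below greedily picks the binary "is this fact present?" queries that carry the most marginal mutual information about the label; true Information Pursuit conditions each pick on the answers already collected, which is omitted here.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_query_selection(answers, labels, n_queries=3):
    # Greedily pick queries whose yes/no answers are most informative about
    # the diagnosis label. A marginal (non-conditional) simplification of
    # Information Pursuit.
    chosen, remaining = [], list(range(answers.shape[1]))
    for _ in range(n_queries):
        best = max(remaining,
                   key=lambda q: mutual_info_score(labels, answers[:, q]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)           # toy binary diagnosis
answers = rng.integers(0, 2, size=(200, 10))    # "is fact q in the report?"
answers[:, 4] = labels                          # query 4 is maximally informative
print(greedy_query_selection(answers, labels))  # starts with 4
```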

[NLP-29] Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models

[Quick Read]: This paper addresses the spread of hate speech through multimodal memes on social media, specifically how to detect and mitigate such content. The key is harnessing the generation and reasoning capabilities of Vision-Language Models (VLMs) with two techniques: a definition-guided prompting technique for detecting hateful memes, and UnHateMeme, a unified framework that mitigates hateful content by replacing hateful textual and/or visual components, converting hateful memes into non-hateful forms that meet human-level criteria for hate speech while preserving multimodal coherence between image and text.

Link: https://arxiv.org/abs/2505.00150
Authors: Minh-Hao Van, Xintao Wu
Institutions: University of Arkansas
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The rapid evolution of social media has provided enhanced communication channels for individuals to create online content, enabling them to express their thoughts and opinions. Multimodal memes, often utilized for playful or humorous expressions with visual and textual elements, are sometimes misused to disseminate hate speech against individuals or groups. While the detection of hateful memes is well-researched, developing effective methods to transform hateful content in memes remains a significant challenge. Leveraging the powerful generation and reasoning capabilities of Vision-Language Models (VLMs), we address the tasks of detecting and mitigating hateful content. This paper presents two key contributions: first, a definition-guided prompting technique for detecting hateful memes, and second, a unified framework for mitigating hateful content in memes, named UnHateMeme, which works by replacing hateful textual and/or visual components. With our definition-guided prompts, VLMs achieve impressive performance on hateful memes detection task. Furthermore, our UnHateMeme framework, integrated with VLMs, demonstrates a strong capability to convert hateful memes into non-hateful forms that meet human-level criteria for hate speech and maintain multimodal coherence between image and text. Through empirical experiments, we show the effectiveness of state-of-the-art pretrained VLMs such as LLaVA, Gemini and GPT-4o on the proposed tasks, providing a comprehensive analysis of their respective strengths and limitations for these tasks. This paper aims to shed light on important applications of VLMs for ensuring safe and respectful online environments.

[NLP-30] AdaptMI: Adaptive Skill-based In-context Math Instruction for Small Language Models

[Quick Read]: This paper addresses the limited gains small language models (SLMs) get from skill-based in-context learning (ICL), a gap that highlights the ICL capability difference between SLMs and large language models (LLMs). The key is AdaptMI, an adaptive method inspired by cognitive load theory in human pedagogy that introduces skill-based in-context examples only when the model performs poorly, avoiding the cognitive overload caused by redundant information; AdaptMI+ further adds examples targeting the specific skills missing from the model's responses, significantly improving SLM accuracy on math benchmarks.

链接: https://arxiv.org/abs/2505.00147
作者: Yinghui He,Abhishek Panigrahi,Yong Lin,Sanjeev Arora
机构: Princeton Language and Intelligence (普林斯顿语言与智能中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In-context learning (ICL) allows a language model to improve its problem-solving capability when provided with suitable information in context. Since the choice of in-context information can be determined based on the problem itself, in-context learning is analogous to human learning from teachers in a classroom. Recent works (Didolkar et al., 2024a; 2024b) show that ICL performance can be improved by leveraging a frontier large language model’s (LLM) ability to predict required skills to solve a problem, popularly referred to as an LLM’s metacognition, and using the recommended skills to construct necessary in-context examples. While this skill-based strategy boosts ICL performance in larger models, its gains on small language models (SLMs) have been minimal, highlighting a performance gap in ICL capabilities. We investigate this gap and show that skill-based prompting can hurt SLM performance on easy questions by introducing unnecessary information, akin to cognitive overload. To address this, we introduce AdaptMI, an adaptive approach to selecting skill-based in-context Math Instructions for SLMs. Inspired by cognitive load theory from human pedagogy, our method only introduces skill-based examples when the model performs poorly. We further propose AdaptMI+, which adds examples targeted to the specific skills missing from the model’s responses. On 5-shot evaluations across popular math benchmarks and five SLMs (1B–7B; Qwen, Llama), AdaptMI+ improves accuracy by up to 6% over naive skill-based strategies.
zh
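
下面用一段示意性 Python 代码概括 AdaptMI 的两阶段自适应逻辑(先判断小模型在该题上是否表现不佳,仅在表现不佳时注入技能示例)。这只是按摘要描述整理的流程草图,is_poor、predict_skills、retrieve_examples 等回调均为假设的接口,并非论文实现:

```python
# AdaptMI 自适应构造 k-shot 提示的流程示意(非官方实现)。
# 各回调由调用方提供:is_poor 判定初答质量,predict_skills 定位缺失技能,
# retrieve_examples 按技能取示例(对应 AdaptMI+ 的"缺什么补什么")。
def build_adaptive_prompt(question, slm_generate, is_poor,
                          predict_skills, retrieve_examples, k=5):
    draft = slm_generate(question)          # 第一阶段:先让小模型直接作答
    if not is_poor(draft, question):        # 简单题不注入示例,避免"认知过载"
        return question
    skills = predict_skills(question, draft)            # 定位回答中缺失的技能
    examples = retrieve_examples(skills, k)             # 仅取与缺失技能对应的示例
    shots = "\n\n".join(f"Q: {ex['q']}\nA: {ex['a']}" for ex in examples)
    return f"{shots}\n\nQ: {question}\nA:"  # 第二阶段:带技能示例重新提问
```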

[NLP-31] Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在推理长度与答案正确性之间的关系问题,具体表现为模型在简单问题上过度思考导致输出冗长,在复杂问题上则缺乏充分推理。解决方案的关键在于通过偏好优化算法对生成长度进行缩减,即使在不考虑答案正确性的前提下,也能显著减少生成长度并保持可接受的准确性。这一方法揭示了生成长度作为推理行为的重要信号,并推动了对LLMs在推理长度适应方面自我意识的进一步研究。

链接: https://arxiv.org/abs/2505.00127
作者: Jinyan Su,Jennifer Healey,Preslav Nakov,Claire Cardie
机构: Cornell University (康奈尔大学); Adobe Research (Adobe 研究院); MBZUAI (MBZUAI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly optimized for long reasoning, under the assumption that more reasoning leads to better performance. However, emerging evidence suggests that longer responses can sometimes degrade accuracy rather than improve it. In this paper, we conduct a systematic empirical study of the relationship between reasoning length and answer correctness. We find that LLMs tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones, failing to extend their reasoning when it is most needed. This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately. Furthermore, we investigate the effects of length reduction with a preference optimization algorithm when simply preferring the shorter responses regardless of answer correctness. Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy. Our findings highlight generation length as a meaningful signal for reasoning behavior and motivate further exploration into LLMs’ self-awareness in reasoning length adaptation.
zh
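
文中“无视正确性、仅偏好更短回答”的偏好优化设置,其数据构造环节可以写成如下示意代码(DPO 风格的 chosen/rejected 字段为常见约定,具体算法细节以论文为准):

```python
# 按长度构造偏好对的示意:同一问题采样多条回答,
# 最短者作为 chosen、最长者作为 rejected,完全不看答案是否正确。
import random

def build_length_preference_pairs(prompts, sample_fn, n_samples=4):
    pairs = []
    for p in prompts:
        responses = [sample_fn(p) for _ in range(n_samples)]
        responses.sort(key=len)                    # 仅按生成长度排序
        pairs.append({"prompt": p,
                      "chosen": responses[0],      # 偏好最短回答
                      "rejected": responses[-1]})  # 拒绝最长回答
    return pairs

# 演示用的假采样函数:随机生成不同长度的"推理"
fake_sample = lambda p: "reasoning step. " * random.randint(1, 8) + "Answer: 2"
pairs = build_length_preference_pairs(["1+1=?"], fake_sample)
print(len(pairs[0]["chosen"]), len(pairs[0]["rejected"]))
```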

[NLP-32] Fine-Tuning LLMs for Low-Resource Dialect Translation: The Case of Lebanese

【速读】: 该论文旨在解决低资源方言(即黎巴嫩方言)翻译中数据质量和文化真实性对模型性能的影响问题。其解决方案的关键在于采用具有文化感知的小规模本地化数据集(LW)进行微调,而非依赖大规模非本土数据集。研究对比了三种微调方法:基础微调、对比微调和语法提示微调,结果表明,结合对比微调与对比提示的方案在翻译性能上表现最佳,突显了引入错误示例对提升模型鲁棒性的价值。此外,为确保评估的真实性,作者提出了一个新的基准LebEval,基于本土内容,进一步验证了文化真实性在方言翻译中的重要性。

链接: https://arxiv.org/abs/2505.00114
作者: Silvana Yakhni,Ali Chehab
机构: American University of Beirut (贝鲁特美国大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper examines the effectiveness of Large Language Models (LLMs) in translating the low-resource Lebanese dialect, focusing on the impact of culturally authentic data versus larger translated datasets. We compare three fine-tuning approaches: Basic, contrastive, and grammar-hint tuning, using open-source Aya23 models. Experiments reveal that models fine-tuned on a smaller but culturally aware Lebanese dataset (LW) consistently outperform those trained on larger, non-native data. The best results were achieved through contrastive fine-tuning paired with contrastive prompting, which indicates the benefits of exposing translation models to bad examples. In addition, to ensure authentic evaluation, we introduce LebEval, a new benchmark derived from native Lebanese content, and compare it to the existing FLoRes benchmark. Our findings challenge the “More Data is Better” paradigm and emphasize the crucial role of cultural authenticity in dialectal translation. We made our datasets and code available on Github.
zh

[NLP-33] Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques

【速读】: 该论文旨在解决高维向量嵌入在外部知识库中存储时面临的显著内存挑战,特别是在检索增强生成(Retrieval-Augmented Generation)中使用浮点32位精度存储嵌入的问题。其解决方案的关键在于系统性地研究两种互补的优化策略:量化(包括标准格式如float16、int8、二进制以及低比特浮点类型float8)和降维(如PCA、核PCA、UMAP、随机投影和自编码器)。研究结果表明,float8量化在保持最小性能损失(0.3%)的情况下实现4倍存储压缩,且比int8更易实现;而PCA在降维中表现最优,结合适度PCA(如保留50%维度)与float8量化可实现8倍总压缩,性能影响小于单独使用int8。

链接: https://arxiv.org/abs/2505.00105
作者: Naamán Huerga-Pérez,Rubén Álvarez,Rubén Ferrero-Guillén,Alberto Martínez-Gutiérrez,Javier Díez-González
机构: Department of Mechanical, Computer and Aerospace Engineering, Universidad de León
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Databases (cs.DB)
备注: 13 pages, 9 figures, 1 table

点击查看摘要

Abstract:Retrieval-Augmented Generation enhances language models by retrieving relevant information from external knowledge bases, relying on high-dimensional vector embeddings typically stored in float32 precision. However, storing these embeddings at scale presents significant memory challenges. To address this issue, we systematically investigate on MTEB benchmark two complementary optimization strategies: quantization, evaluating standard formats (float16, int8, binary) and low-bit floating-point types (float8), and dimensionality reduction, assessing methods like PCA, Kernel PCA, UMAP, Random Projections and Autoencoders. Our results show that float8 quantization achieves a 4x storage reduction with minimal performance degradation (0.3%), significantly outperforming int8 quantization at the same compression level, being simpler to implement. PCA emerges as the most effective dimensionality reduction technique. Crucially, combining moderate PCA (e.g., retaining 50% dimensions) with float8 quantization offers an excellent trade-off, achieving 8x total compression with less performance impact than using int8 alone (which provides only 4x compression). To facilitate practical application, we propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration that maximizes performance within their specific memory constraints.
zh
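
按摘要描述的“适度 PCA + float8 量化”组合,可以用如下 Python 草图复现其基本流程(非论文官方代码;float8 类型借助 ml_dtypes 包的 float8_e4m3fn,保留 50% 维度为文中示例值):

```python
# PCA 降维 + float8 量化的组合压缩示意:约 8 倍总压缩(2x 降维 × 4x 量化)。
import numpy as np
from sklearn.decomposition import PCA
import ml_dtypes  # pip install ml_dtypes,提供 numpy 可用的 float8 类型

def compress_embeddings(emb: np.ndarray, keep_ratio: float = 0.5):
    """emb: (n, d) 的 float32 嵌入;先保留 keep_ratio 比例的主成分,再量化为 float8。"""
    pca = PCA(n_components=int(emb.shape[1] * keep_ratio)).fit(emb)
    reduced = pca.transform(emb)
    quantized = reduced.astype(ml_dtypes.float8_e4m3fn)   # 每个分量仅占 1 字节
    return pca, quantized

def decompress(pca: PCA, quantized: np.ndarray) -> np.ndarray:
    # 先反量化回 float32,再逆投影近似重建原始嵌入
    return pca.inverse_transform(quantized.astype(np.float32))

emb = np.random.randn(1000, 768).astype(np.float32)
pca, q = compress_embeddings(emb)
print(q.nbytes / emb.nbytes)   # ≈ 0.125,即 8 倍存储压缩
```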

[NLP-34] ConSens: Assessing context grounding in open-book question answering

【速读】: 该论文旨在解决开放书式问答(open-book QA)中模型回答依赖于自身参数知识而非提供上下文的问题,这一问题可能导致答案过时、不完整或错误。论文提出的解决方案的关键在于设计一种新的评估指标,该指标通过对比模型在有无上下文条件下的困惑度(perplexity)差异,量化模型回答对所提供上下文的依赖程度,从而有效评估模型是否基于给定上下文生成答案。该方法具有计算效率高、可解释性强和适应性广等优势。

链接: https://arxiv.org/abs/2505.00065
作者: Ivan Vankov,Matyo Ivanov,Adriana Correia,Victor Botev
机构: Iris.ai BG(iris.ai保加利亚); INB, BAS(INB,BAS)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated considerable success in open-book question answering (QA), where the task requires generating answers grounded in a provided external context. A critical challenge in open-book QA is to ensure that model responses are based on the provided context rather than its parametric knowledge, which can be outdated, incomplete, or incorrect. Existing evaluation methods, primarily based on the LLM-as-a-judge approach, face significant limitations, including biases, scalability issues, and dependence on costly external systems. To address these challenges, we propose a novel metric that contrasts the perplexity of the model response under two conditions: when the context is provided and when it is not. The resulting score quantifies the extent to which the model’s answer relies on the provided context. The validity of this metric is demonstrated through a series of experiments that show its effectiveness in identifying whether a given answer is grounded in the provided context. Unlike existing approaches, this metric is computationally efficient, interpretable, and adaptable to various use cases, offering a scalable and practical solution to assess context utilization in open-book QA systems.
zh
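
该指标的核心计算可以用 Hugging Face transformers 写成如下草图:分别在“有上下文”“无上下文”两种提示下计算回答部分的困惑度,再对两者之比取对数。打分公式与提示模板为本文假设(摘要未给出具体形式),模型用 gpt2 仅作演示:

```python
# 有/无上下文两种条件下的回答困惑度对比(示意实现,非论文官方代码)。
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def answer_perplexity(prompt: str, answer: str) -> float:
    """只在 answer 的 token 上计算困惑度,prompt 部分用 -100 屏蔽(近似处理分词边界)。"""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100
    loss = model(full_ids, labels=labels).loss
    return math.exp(loss.item())

def context_grounding_score(question, context, answer):
    ppl_with = answer_perplexity(f"{context}\nQuestion: {question}\nAnswer: ", answer)
    ppl_without = answer_perplexity(f"Question: {question}\nAnswer: ", answer)
    # 分数越大,说明上下文使回答的困惑度下降越多,回答越依赖所给上下文
    return math.log(ppl_without / ppl_with)
```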

[NLP-35] GDI-Bench: A Benchmark for General Document Intelligence with Vision and Reasoning Decoupling

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在文档领域应用中缺乏全面评估基准的问题,现有基准难以定位模型弱点或指导系统性优化。其解决方案的关键在于提出一个通用文档智能基准(General Document Intelligence Benchmark, GDI-Bench),该基准包含1.9k张图像,覆盖9个关键场景与19个文档特定任务,并通过解耦视觉复杂度与推理复杂度,构建分级任务以按难度评估模型性能,从而辅助模型弱点识别与优化指导。此外,为应对GDI-Bench中多样化任务与领域,作者还提出了一种GDI模型,采用保持智能的训练策略缓解监督微调过程中的灾难性遗忘问题。

链接: https://arxiv.org/abs/2505.00063
作者: Siqi Li,Yufan Shen,Xiangnan Chen,Jiayi Chen,Hengwei Ju,Haodong Duan,Song Mao,Hongbin Zhou,Bo Zhang,Pinlong Cai,Licheng Wen,Botian Shi,Yong Liu,Xinyu Cai,Yu Qiao
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Zhejiang University (浙江大学); School of Science and Engineering, The Chinese University of Hong Kong (香港中文大学科学与工程学院); Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of multimodal large language models (MLLMs) has profoundly impacted the document domain, creating a wide array of application scenarios. This progress highlights the need for a comprehensive benchmark to evaluate these models’ capabilities across various document-specific tasks. However, existing benchmarks often fail to locate specific model weaknesses or guide systematic improvements. To bridge this gap, we introduce a General Document Intelligence Benchmark (GDI-Bench), featuring 1.9k images across 9 key scenarios and 19 document-specific tasks. By decoupling visual complexity and reasoning complexity, the GDI-Bench structures graded tasks that allow performance assessment by difficulty, aiding in model weakness identification and optimization guidance. We evaluate the GDI-Bench on various open-source and closed-source models, conducting decoupled analyses in the visual and reasoning domains. For instance, the GPT-4o model excels in reasoning tasks but exhibits limitations in visual capabilities. To address the diverse tasks and domains in the GDI-Bench, we propose a GDI Model that mitigates the issue of catastrophic forgetting during the supervised fine-tuning (SFT) process through an intelligence-preserving training strategy. Our model achieves state-of-the-art performance on previous benchmarks and the GDI-Bench. Both our benchmark and model will be open source.
zh

[NLP-36] Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems

【速读】: 该论文试图解决基于Transformer的自动化短答案评分系统在医学教育中的漏洞问题,特别是这些系统如何通过对抗性游戏策略被操纵,从而导致误判(false positives)。解决方案的关键在于实施多种对抗训练方法以增强系统的鲁棒性,并结合集成技术如多数投票和岭回归来提升系统对复杂对抗输入的防御能力,同时利用大型语言模型(如GPT-4)与多样化的提示技术有效识别和评分游戏策略。

链接: https://arxiv.org/abs/2505.00061
作者: Sahar Yarmohammadtoosky,Yiyun Zhou,Victoria Yaneva,Peter Baldwin,Saed Rezayi,Brian Clauser,Polina Harikeo
机构: Kennesaw State University (肯尼索州立大学); National Board of Medical Examiners (美国医学考试委员会)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:This study examines vulnerabilities in transformer-based automated short-answer grading systems used in medical education, with a focus on how these systems can be manipulated through adversarial gaming strategies. Our research identifies three main types of gaming strategies that exploit the system’s weaknesses, potentially leading to false positives. To counteract these vulnerabilities, we implement several adversarial training methods designed to enhance the systems’ robustness. Our results indicate that these methods significantly reduce the susceptibility of grading systems to such manipulations, especially when combined with ensemble techniques like majority voting and ridge regression, which further improve the system’s defense against sophisticated adversarial inputs. Additionally, employing large language models such as GPT-4 with varied prompting techniques has shown promise in recognizing and scoring gaming strategies effectively. The findings underscore the importance of continuous improvements in AI-driven educational tools to ensure their reliability and fairness in high-stakes settings.
zh
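
文中提到的“多数投票 + 岭回归”集成防御,其骨架可以用 scikit-learn 写成如下最小示意(数据为随机示例,非原系统实现):

```python
# 集成防御示意:离散判定走多数投票,连续打分用岭回归学习融合权重。
import numpy as np
from collections import Counter
from sklearn.linear_model import Ridge

def majority_vote(predictions):
    """predictions: 外层按模型、内层按样本排列的 0/1 判定列表。"""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))  # -> [1, 0, 1]

# 岭回归融合:X 的每一列是一个基础评分模型的输出,y 为人工参考分
model_scores = np.array([[0.9, 0.8, 0.7], [0.2, 0.4, 0.1], [0.6, 0.5, 0.9]])
human_scores = np.array([1.0, 0.0, 1.0])
fuser = Ridge(alpha=1.0).fit(model_scores, human_scores)
print(fuser.predict(model_scores))  # 融合后的连续评分
```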

[NLP-37] Fact-Consistency Evaluation of Text-to-SQL Generation for Business Intelligence Using Exaone 3.5

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际业务智能(Business Intelligence, BI)场景中应用受限的问题,具体表现为语义幻觉、结构错误以及缺乏领域特定的评估框架。其解决方案的关键在于提出一种事实一致性评估框架(Fact-Consistency Evaluation Framework),用于评估LLM生成的SQL输出的语义准确性,并构建了一个包含219个自然语言业务问题的领域专用基准,覆盖五个SQL复杂度等级,基于LG Electronics内部BigQuery环境的实际销售数据。该基准包含标准答案和验证后的真值答案,通过回答准确率、执行成功率、语义错误率和非响应率等指标对模型性能进行评估。

链接: https://arxiv.org/abs/2505.00060
作者: Jeho Choi
机构: LG Electronics (LG电子)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 1 table

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promise in enabling natural language interfaces for structured data querying through text-to-SQL generation. However, their application in real-world Business Intelligence (BI) contexts remains limited due to semantic hallucinations, structural errors, and a lack of domain-specific evaluation frameworks. In this study, we propose a Fact-Consistency Evaluation Framework for assessing the semantic accuracy of LLM-generated SQL outputs using Exaone 3.5, an instruction-tuned, bilingual LLM optimized for enterprise tasks. We construct a domain-specific benchmark comprising 219 natural language business questions across five SQL complexity levels, derived from actual sales data in LG Electronics’ internal BigQuery environment. Each question is paired with a gold-standard SQL query and a validated ground-truth answer. We evaluate model performance using answer accuracy, execution success rate, semantic error rate, and non-response rate. Experimental results show that while Exaone 3.5 performs well on simple aggregation tasks (93% accuracy in L1), it exhibits substantial degradation in arithmetic reasoning (4% accuracy in H1) and grouped ranking tasks (31% in H4), with semantic errors and non-responses concentrated in complex cases. Qualitative error analysis further identifies common failure types such as misapplied arithmetic logic, incomplete filtering, and incorrect grouping operations. Our findings highlight the current limitations of LLMs in business-critical environments and underscore the need for fact-consistency validation layers and hybrid reasoning approaches. This work contributes a reproducible benchmark and evaluation methodology for advancing reliable natural language interfaces to structured enterprise data systems.
zh
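
事实一致性评估的核心步骤是“执行生成 SQL 与标准 SQL,比较结果是否一致”。下面用内存 SQLite 给出一个可运行的示意(论文实际运行在 LG 内部 BigQuery 环境,此处表结构与数据均为虚构):

```python
# 事实一致性检查示意:执行失败记为不一致,结果按无序集合比较。
import sqlite3

def results_match(db: sqlite3.Connection, generated_sql: str, gold_sql: str) -> bool:
    try:
        got = db.execute(generated_sql).fetchall()
    except sqlite3.Error:          # 执行失败:计入执行错误/不一致
        return False
    expected = db.execute(gold_sql).fetchall()
    return sorted(got) == sorted(expected)   # 无序比较两份结果集

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales(region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES(?, ?)", [("KR", 10.0), ("US", 25.0)])
print(results_match(db,
                    "SELECT region, SUM(amount) FROM sales GROUP BY region",
                    "SELECT region, SUM(amount) FROM sales GROUP BY region"))  # True
```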

[NLP-38] BERSting at the Screams: A Benchmark for Distanced Emotional and Shouted Speech Recognition MICRO

【速读】: 该论文旨在解决在复杂、真实场景下语音识别任务(如自动语音识别,ASR)性能下降的问题,特别是针对远距离语音识别(distanced ASR)的挑战。其解决方案的关键在于构建一个名为BERSt的数据集,该数据集包含来自98位演员的近4小时英语语音,涵盖了多种区域和非母语口音、7种情绪提示、呼喊和正常发音的语句,并在19种不同位置(包括遮挡和不同房间)录制,从而模拟多样化的实际声学环境,以评估ASR、呼喊检测和语音情感识别(SER)等任务的鲁棒性。

链接: https://arxiv.org/abs/2505.00059
作者: Paige Tuttösí,Mantaj Dhillon,Luna Sang,Shane Eastwood,Poorvi Bhatia,Quang Minh Dinh,Avni Kapoor,Yewon Jin,Angelica Lim
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Computer Speech and Language, Special issue: Multi-Speaker, Multi-Microphone, and Multi-Modal Distant Speech Recognition (September 2025)

点击查看摘要

Abstract:Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet, they continue to struggle in complex, real-world situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR, however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors’ homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including obstructions and being in a different room than the actor. This data is publicly available for use and can be used to evaluate a variety of speech recognition tasks, including: ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades both with an increase in distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks and continued work is needed to improve the robustness of such systems for more accurate real-world use.
zh

[NLP-39] A Report on LLMs Evaluating High School Questions

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在解答高中科学问题中的性能评估及其在教育领域的潜在应用问题。其解决方案的关键在于选取2019至2023年高考数学试题作为评估数据,并利用至少八种LLM API提供答案,通过准确性、响应时间、逻辑推理和创造力等指标进行综合评估,从而深入分析LLMs在处理高中科学问题中的优势与不足。

链接: https://arxiv.org/abs/2505.00057
作者: Zhu Jiawei,Chen Wei
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This report aims to evaluate the performance of large language models (LLMs) in solving high school science questions and to explore their potential applications in the educational field. With the rapid development of LLMs in the field of natural language processing, their application in education has attracted widespread attention. This study selected mathematics exam questions from the college entrance examinations (2019-2023) as evaluation data and utilized at least eight LLM APIs to provide answers. A comprehensive assessment was conducted based on metrics such as accuracy, response time, logical reasoning, and creativity. Through an in-depth analysis of the evaluation results, this report reveals the strengths and weaknesses of LLMs in handling high school science questions and discusses their implications for educational practice. The findings indicate that although LLMs perform excellently in certain aspects, there is still room for improvement in logical reasoning and creative problem-solving. This report provides an empirical foundation for further research and application of LLMs in the educational field and offers suggestions for improvement.
zh

[NLP-40] Clustering Internet Memes Through Template Matching and Multi-Dimensional Similarity

【速读】: 该论文试图解决互联网迷因(meme)聚类问题,该问题在毒性检测、病毒传播建模和分类中具有重要意义,但此前研究关注度较低。由于迷因的多模态性、文化背景和适应性,相似迷因的聚类面临挑战。现有方法依赖于预定义数据库,忽视语义信息,并难以处理多样化的相似性维度。论文提出了一种基于模板匹配的新方法,利用多维相似性特征,从而无需预定义数据库并支持自适应匹配。其关键在于通过形式、视觉内容、文本和身份等相似性类别中的局部与全局特征进行迷因聚类,实现了更一致和连贯的聚类效果。

链接: https://arxiv.org/abs/2505.00056
作者: Tygo Bloem,Filip Ilievski
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Meme clustering is critical for toxicity detection, virality modeling, and typing, but it has received little attention in previous research. Clustering similar Internet memes is challenging due to their multimodality, cultural context, and adaptability. Existing approaches rely on databases, overlook semantics, and struggle to handle diverse dimensions of similarity. This paper introduces a novel method that uses template-based matching with multi-dimensional similarity features, thus eliminating the need for predefined databases and supporting adaptive matching. Memes are clustered using local and global features across similarity categories such as form, visual content, text, and identity. Our combined approach outperforms existing clustering methods, producing more consistent and coherent clusters, while similarity-based feature sets enable adaptability and align with human intuition. We make all supporting code publicly available to support subsequent research. Code: this https URL
zh
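
按摘要描述,“形式、视觉内容、文本、身份”四类相似度可以加权合成一个整体相似度矩阵,转为距离后做层次聚类。下面是该流程的一个草图(权重为假设值;metric="precomputed" 需 scikit-learn 1.2 及以上):

```python
# 多维相似度聚类示意:加权合成相似度 -> 距离矩阵 -> 层次聚类(无需预定义数据库)。
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_memes(sim_matrices, weights, distance_threshold=0.5):
    """sim_matrices/weights 按维度名对应,每个矩阵为 (n, n) 且取值于 [0, 1]。"""
    total = sum(weights.values())
    combined = sum(weights[k] * sim_matrices[k] for k in sim_matrices) / total
    distance = 1.0 - combined          # 相似度转距离,对角线为 0
    model = AgglomerativeClustering(n_clusters=None, metric="precomputed",
                                    linkage="average",
                                    distance_threshold=distance_threshold)
    return model.fit_predict(distance)

rng = np.random.default_rng(0)
n = 6
sims = {}
for k in ["form", "visual", "text", "identity"]:
    S = rng.random((n, n))
    S = (S + S.T) / 2                  # 构造对称的示例相似度矩阵
    np.fill_diagonal(S, 1.0)
    sims[k] = S
print(cluster_memes(sims, {"form": 1.0, "visual": 2.0, "text": 2.0, "identity": 1.0}))
```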

[NLP-41] Emotional Analysis of Fashion Trends Using Social Media and AI: Sentiment Analysis on Twitter for Fashion Trend Forecasting

【速读】: 该论文试图解决如何利用社交媒体情感分析来预测时尚趋势的问题,具体而言是探究时尚相关话题在社交媒体上的情感模式是否能够作为新兴时尚趋势的预测指标。解决方案的关键在于通过自然语言处理和机器学习技术对Twitter数据进行计算分析,包括改进的情感归一化技术、时间序列分解、统计验证的因果关系建模、跨平台情感比较以及品牌特定情感分析,从而建立一个具有78.35%平衡准确率的预测模型,为正向、中性和负向情感类别下的趋势预测提供可靠基础。

链接: https://arxiv.org/abs/2505.00050
作者: Aayam Bansal,Agneya Tharun
机构: Delhi Public School, Ruby Park, India (德里公共学校,鲁比公园,印度); Poolesville High School, USA (普尔斯维尔高中,美国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:This study explores the intersection of fashion trends and social media sentiment through computational analysis of Twitter data using the T4SA (Twitter for Sentiment Analysis) dataset. By applying natural language processing and machine learning techniques, we examine how sentiment patterns in fashion-related social media conversations can serve as predictors for emerging fashion trends. Our analysis involves the identification and categorization of fashion-related content, sentiment classification with improved normalization techniques, time series decomposition, statistically validated causal relationship modeling, cross-platform sentiment comparison, and brand-specific sentiment analysis. Results indicate correlations between sentiment patterns and fashion theme popularity, with accessories and streetwear themes showing statistically significant rising trends. The Granger causality analysis establishes sustainability and streetwear as primary trend drivers, showing bidirectional relationships with several other themes. The findings demonstrate that social media sentiment analysis can serve as an effective early indicator of fashion trend trajectories when proper statistical validation is applied. Our improved predictive model achieved 78.35% balanced accuracy in sentiment classification, establishing a reliable foundation for trend prediction across positive, neutral, and negative sentiment categories.
zh
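
摘要中的 Granger 因果分析可用 statsmodels 复现其基本形式;下面用随机生成的序列演示接口用法(并非论文数据;按该函数的约定,检验的是第 2 列是否 Granger 引起第 1 列):

```python
# Granger 因果检验示意:检验"可持续性主题情感"是否领先于"街头服饰主题热度"。
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(42)
sustainability = rng.standard_normal(200).cumsum()            # 伪造的周度情感序列
streetwear = 0.6 * np.roll(sustainability, 2) + rng.standard_normal(200)

data = np.column_stack([streetwear, sustainability])  # [被解释变量, 潜在原因变量]
grangercausalitytests(data, maxlag=4)   # 打印各滞后阶的 F 检验统计量与 p 值
```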

[NLP-42] Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets and Human-Agent Applications

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在心理特质评估方面的系统性不足问题,特别是针对多样化心理测试、LLM专用心理数据集以及具有心理特质的LLM应用等方面缺乏系统讨论。其解决方案的关键在于系统性地回顾了六个关键维度:评估工具、LLM专用数据集、评估指标(一致性与稳定性)、实证发现、人格模拟方法以及基于LLM的行为模拟,旨在推动更可解释、稳健且泛化的心理评估框架的发展。

链接: https://arxiv.org/abs/2505.00049
作者: Wenhan Dong,Yuemeng Zhao,Zhen Sun,Yule Liu,Zifan Peng,Jingyi Zheng,Zongmin Zhang,Ziyi Zhang,Jun Wu,Ruiming Wang,Shengmin Xu,Xinyi Huang,Xinlei He
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州) ); South China Normal University (华南师范大学); Fujian Normal University (福建师范大学); Jinan University (济南大学)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 26 pages, 7 figures

点击查看摘要

Abstract:As large language models (LLMs) are increasingly used in human-centered tasks, assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment. While existing reviews have covered some aspects of related research, several important areas have not been systematically discussed, including detailed discussions of diverse psychological tests, LLM-specific psychological datasets, and the applications of LLMs with psychological traits. To address this gap, we systematically review six key dimensions of applying psychological theories to LLMs: (1) assessment tools; (2) LLM-specific datasets; (3) evaluation metrics (consistency and stability); (4) empirical findings; (5) personality simulation methods; and (6) LLM-based behavior simulation. Our analysis highlights both the strengths and limitations of current methods. While some LLMs exhibit reproducible personality patterns under specific prompting schemes, significant variability remains across tasks and settings. Recognizing methodological challenges such as mismatches between psychological tools and LLMs’ capabilities, as well as inconsistencies in evaluation practices, this study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.
zh

[NLP-43] Base Models Beat Aligned Models at Randomness and Creativity

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在对齐(alignment)过程中可能带来的性能下降问题,特别是当任务需要不可预测输出时,对齐模型的表现不如基础语言模型。解决方案的关键在于指出并非所有任务都适合应用对齐技术,并通过实验验证在随机数生成、混合策略游戏和创意写作等任务中,基础语言模型往往优于其对齐版本,表明在模型能力需求上存在有效的权衡。

链接: https://arxiv.org/abs/2505.00047
作者: Peter West,Christopher Potts
机构: Stanford University (斯坦福大学); University of British Columbia (不列颠哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Alignment has quickly become a default ingredient in LLM development, with techniques such as reinforcement learning from human feedback making models act safely, follow instructions, and perform ever-better on complex tasks. While these techniques are certainly useful, we propose that they should not be universally applied and demonstrate a range of tasks on which base language models consistently outperform their popular aligned forms. Particularly, we study tasks that require unpredictable outputs, such as random number generation, mixed strategy games (rock-paper-scissors and hide-and-seek), and creative writing. In each case, aligned models tend towards narrow behaviors that result in distinct disadvantages, for instance, preferring to generate “7” over other uniformly random numbers, becoming almost fully predictable in some game states, or prioritizing pleasant writing over creative originality. Across models tested, better performance on common benchmarks tends to correlate with worse performance on our tasks, suggesting an effective trade-off in the required capabilities.
zh

[NLP-44] Graph RAG for Legal Norms: A Hierarchical and Temporal Approach

【速读】: 该论文试图解决法律规范(Legal Norms)分析与理解中的复杂性与数据量庞大问题,这些问题源于法律文本的预定义层级结构、广泛的内部和外部引用网络以及多时间版本特性。解决方案的关键在于将结构化知识图谱与语境丰富的文本片段相结合,通过引入层级结构和时间演变的整合以及全面文本单元(Comprehensive Text Units)的概念,构建更丰富且相互关联的法律知识表示。

链接: https://arxiv.org/abs/2505.00039
作者: Hudson de Martim
机构: Federal Senate of Brazil (巴西联邦参议院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:This article proposes an adaptation of Graph Retrieval Augmented Generation (Graph RAG) specifically designed for the analysis and comprehension of legal norms, which are characterized by their predefined hierarchical structure, extensive network of internal and external references and multiple temporal versions. By combining structured knowledge graphs with contextually enriched text segments, Graph RAG offers a promising solution to address the inherent complexity and vast volume of legal data. The integration of hierarchical structure and temporal evolution into knowledge graphs - along with the concept of comprehensive Text Units - facilitates the construction of richer, interconnected representations of legal knowledge. Through a detailed analysis of Graph RAG and its application to legal norm datasets, this article aims to significantly advance the field of Artificial Intelligence applied to Law, creating opportunities for more effective systems in legal research, legislative analysis, and decision support.
zh
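
文中“层级结构 + 时间版本 + 引用网络”的知识图谱组织方式,可以用 networkx 画出一个结构骨架(节点编号与关系名均为本文假设,并非论文的正式法律本体):

```python
# 法律规范知识图谱的结构示意:层级边、时间版本边与引用边共存于一张多重有向图。
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("Law-8112/1990", type="norm")
G.add_node("Law-8112/1990/art5", type="article")
G.add_node("Law-8112/1990/art5@2017-01-01", type="version", valid_from="2017-01-01")

G.add_edge("Law-8112/1990", "Law-8112/1990/art5", relation="has_part")      # 层级结构
G.add_edge("Law-8112/1990/art5",
           "Law-8112/1990/art5@2017-01-01", relation="has_version")         # 时间版本
G.add_edge("Law-8112/1990/art5", "Law-9784/1999", relation="cites")         # 引用网络

# 检索示意:先按层级定位条文,再沿 has_version 边选取生效版本
versions = [v for _, v, d in G.out_edges("Law-8112/1990/art5", data=True)
            if d["relation"] == "has_version"]
print(versions)
```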

[NLP-45] HyPerAlign: Hypotheses-driven Personalized Alignment

【速读】: 该论文试图解决如何将大型语言模型(Large Language Models, LLMs)的输出个性化以适应个体用户的需求,而非仅对“平均用户”偏好进行微调的问题。其解决方案的关键在于提出一种新颖的可解释且样本高效的假设驱动个性化方法(Hypotheses-driven Personalization Approach, HyPerAlign),该方法通过少量由特定用户撰写的示例,推断出用户的沟通策略、个性和写作风格,并利用这些假设及用户特定属性来提示LLM生成定制化输出。

链接: https://arxiv.org/abs/2505.00038
作者: Cristina Garbacea,Chenhao Tan
机构: University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Alignment algorithms are widely used to align large language models (LLMs) to human users based on preference annotations that reflect their intended real-world use cases. Typically these (often divergent) preferences are aggregated over a diverse set of users, resulting in fine-tuned models that are aligned to the “average-user” preference. Nevertheless, current models are used by individual users in very specific contexts and situations, emphasizing the need for user-dependent preference control. In this work we address the problem of personalizing LLM outputs to their users, aiming to generate customized responses tailored to individual users, instead of generic outputs that emulate the collective voices of diverse populations. We propose a novel interpretable and sample-efficient hypotheses-driven personalization approach (HyPerAlign) where given few-shot examples written by a particular user, we first infer hypotheses about their communication strategies, personality and writing style, then prompt LLM models with these hypotheses and user specific attributes to generate customized outputs. We conduct experiments on two different personalization tasks, authorship attribution and deliberative alignment, with datasets from diverse domains (news articles, blog posts, emails, jailbreaking benchmarks), and demonstrate the superiority of the hypotheses-driven personalization approach when compared to preference-based fine-tuning methods. For deliberative alignment, the helpfulness of LLM models is improved by up to 70% on average. For authorship attribution, results indicate consistently high win-rates (commonly >90%) against state-of-the-art preference fine-tuning approaches for LLM personalization across diverse user profiles and LLM models. Overall, our approach represents an interpretable and sample-efficient strategy for the personalization of LLM models to individual users.
zh

[NLP-46] A Framework to Assess the Persuasion Risks Large Language Model Chatbots Pose to Democratic Societies

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在政治说服中的成本效益及其相对于传统政治竞选方法的有效性。研究的关键在于通过两场调查实验和一个现实世界模拟,评估LLM聊天机器人在说服选民过程中的表现,特别是在考虑“接收”和“接受”两个阶段的说服机制(Zaller 1992)基础上,比较其与传统竞选手段的成本效率。研究发现,尽管LLM在暴露于选民后具有与实际竞选广告相当的说服力,但现实世界中的政治说服依赖于信息暴露及其在暴露后的影响力,而LLM-based说服在成本上可能更具优势,但目前传统方法在可扩展性方面仍占优。

链接: https://arxiv.org/abs/2505.00036
作者: Zhongren Chen,Joshua Kalla,Quan Le,Shinpei Nakamura-Sakai,Jasjeet Sekhon,Ruixiao Wang
机构: Yale University (耶鲁大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:In recent years, significant concern has emerged regarding the potential threat that Large Language Models (LLMs) pose to democratic societies through their persuasive capabilities. We expand upon existing research by conducting two survey experiments and a real-world simulation exercise to determine whether it is more cost effective to persuade a large number of voters using LLM chatbots compared to standard political campaign practice, taking into account both the “receive” and “accept” steps in the persuasion process (Zaller 1992). These experiments improve upon previous work by assessing extended interactions between humans and LLMs (instead of using single-shot interactions) and by assessing both short- and long-run persuasive effects (rather than simply asking users to rate the persuasiveness of LLM-produced content). In two survey experiments (N = 10,417) across three distinct political domains, we find that while LLMs are about as persuasive as actual campaign ads once voters are exposed to them, political persuasion in the real-world depends on both exposure to a persuasive message and its impact conditional on exposure. Through simulations based on real-world parameters, we estimate that LLM-based persuasion costs between $48-$74 per persuaded voter compared to $100 for traditional campaign methods, when accounting for the costs of exposure. However, it is currently much easier to scale traditional campaign persuasion methods than LLM-based persuasion. While LLMs do not currently appear to have substantially greater potential for large-scale political persuasion than existing non-LLM methods, this may change as LLM capabilities continue to improve and it becomes easier to scalably encourage exposure to persuasive LLMs.
zh

[NLP-47] Linguistic Complexity and Socio-cultural Patterns in Hip-Hop Lyrics

【速读】: 该论文试图解决如何量化分析嘻哈歌词中的语言复杂性与社会文化趋势的问题,其核心在于建立一个计算框架以捕捉嘻哈音乐在语言特征和主题内容上的演变。解决方案的关键在于运用自然语言处理技术对大规模嘻哈歌曲数据集进行多维度分析,包括词汇多样性、押韵密度、主题建模和情感分析,并结合地理起源与时间周期进行多维关联研究,从而揭示嘻哈作为一种艺术形式及其与社会动态之间的定量关系。

链接: https://arxiv.org/abs/2505.00035
作者: Aayam Bansal,Raghav Agarwal,Kaashvi Jain
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:This paper presents a comprehensive computational framework for analyzing linguistic complexity and socio-cultural trends in hip-hop lyrics. Using a dataset of 3,814 songs from 146 influential artists spanning four decades (1980-2020), we employ natural language processing techniques to quantify multiple dimensions of lyrical complexity. Our analysis reveals a 23.7% increase in vocabulary diversity over the study period, with East Coast artists demonstrating 17.3% higher lexical variation than other regions. Rhyme density increased by 34.2% across all regions, with Midwest artists exhibiting the highest technical complexity (3.04 rhymes per line). Topic modeling identified significant shifts in thematic content, with social justice themes decreasing from 28.5% to 13.8% of content while introspective themes increased from 7.6% to 26.3%. Sentiment analysis demonstrated that lyrics became significantly more negative during sociopolitical crises, with polarity decreasing by 0.31 following major social unrest. Multi-dimensional analysis revealed four distinct stylistic approaches that correlate strongly with geographic origin (r=0.68, p<0.001) and time period (r=0.59, p<0.001). These findings establish quantitative evidence for the evolution of hip-hop as both an art form and a reflection of societal dynamics, providing insights into the interplay between linguistic innovation and cultural context in popular music.
zh

[NLP-48] Improving Phishing Email Detection Performance of Small Large Language Models

【速读】: 该论文试图解决小参数量生成式 AI (Generative AI) 在钓鱼邮件检测任务中表现不佳的问题,以及大语言模型(Large Language Models, LLMs)因参数量庞大导致计算资源消耗过高的问题。其解决方案的关键在于设计了一套方法,包括提示工程(Prompt Engineering)、解释增强微调(Explanation Augmented Fine-tuning)和模型集成(Model Ensemble),以提升小参数量LLMs在钓鱼邮件检测中的性能。通过实验验证,该方法显著提高了SpamAssassin数据集上的检测准确率,从基线模型如Qwen2.5-1.5B-Instruct的约0.5提升至0.976。

链接: https://arxiv.org/abs/2505.00034
作者: Zijie Lin,Zikang Liu,Hanbo Fan
机构: National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models(LLMs) have demonstrated remarkable performance on many natural language processing(NLP) tasks and have been employed in phishing email detection research. However, in current studies, well-performing LLMs typically contain billions or even tens of billions of parameters, requiring enormous computational resources. To reduce computational costs, we investigated the effectiveness of small-parameter LLMs for phishing email detection. These LLMs have around 3 billion parameters and can run on consumer-grade GPUs. However, small LLMs often perform poorly in phishing email detection task. To address these issues, we designed a set of methods including Prompt Engineering, Explanation Augmented Fine-tuning, and Model Ensemble to improve phishing email detection capabilities of small LLMs. We validated the effectiveness of our approach through experiments, significantly improving accuracy on the SpamAssassin dataset from around 0.5 for baseline models like Qwen2.5-1.5B-Instruct to 0.976.
zh

[NLP-49] From Attention to Atoms: Spectral Dictionary Learning for Fast Interpretable Language Models

【速读】: 该论文旨在解决传统Transformer架构中自注意力机制(self attention)计算复杂度高(二次复杂度)导致的效率瓶颈问题,同时保持自然语言处理任务中的模型性能。其解决方案的关键在于提出一种新的频谱生成建模框架,通过联合学习全局时变傅里叶字典(global time varying Fourier dictionary)和每个词的混合系数(per token mixing coefficients),替代传统的自注意力机制。该方法在时域(嵌入重构)和频域(通过短时傅里叶变换幅度匹配)同时施加重建损失,并结合高斯混合模型(GMM)先验对混合向量进行拟合,从而在保持语言建模质量的同时实现线性计算复杂度,显著提升推理效率并降低内存占用。

链接: https://arxiv.org/abs/2505.00033
作者: Andrew Kiruluta
机构: UC Berkeley, CA
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose a novel spectral generative modeling framework for natural language processing that jointly learns a global time varying Fourier dictionary and per token mixing coefficients, replacing the ubiquitous self attention mechanism in transformer architectures. By enforcing reconstruction losses in both the time domain (embedding reconstruction) and the frequency domain (via Short Time Fourier Transform magnitude matching) alongside a standard language modeling objective, and fitting a Gaussian Mixture Model (GMM) prior over the learned mixing vectors, our approach achieves competitive perplexity and generation quality on standard benchmarks such as WikiText2 and Penn Treebank. In contrast to the quadratic computation complexity of self attention, our method operates with linear complexity, delivering substantial efficiency gains. We demonstrate that spectral dictionary models can achieve competitive performance compared to transformer baselines while significantly reducing inference latency and memory footprint, offering a compelling alternative for scalable language modeling.
zh
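
摘要中“时域重构损失 + 频域 STFT 幅度匹配损失”的组合,可用 PyTorch 写成如下草图(把每条嵌入序列当作一维信号处理;n_fft、权重 lam 等超参数为本文假设,并非论文设定):

```python
# 双域重构损失示意:时域 MSE + STFT 幅度谱 MSE(非论文官方实现)。
import torch

def dual_domain_loss(recon: torch.Tensor, target: torch.Tensor,
                     n_fft: int = 32, lam: float = 1.0) -> torch.Tensor:
    """recon/target: (batch, seq_len),每行视为一条一维嵌入信号。"""
    time_loss = torch.mean((recon - target) ** 2)      # 时域:嵌入重构
    win = torch.hann_window(n_fft, device=recon.device)
    spec_r = torch.stft(recon, n_fft, window=win, return_complex=True).abs()
    spec_t = torch.stft(target, n_fft, window=win, return_complex=True).abs()
    freq_loss = torch.mean((spec_r - spec_t) ** 2)     # 频域:幅度谱匹配
    return time_loss + lam * freq_loss

x = torch.randn(4, 128)
print(dual_domain_loss(x + 0.1 * torch.randn_like(x), x))
```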

[NLP-50] MDD-LLM: Towards Accuracy Large Language Models for Major Depressive Disorder Diagnosis

【速读】: 该论文旨在解决重大抑郁症(Major Depressive Disorder, MDD)诊断中存在的医疗资源分布不均和诊断方法复杂性导致的注意力不足问题。其解决方案的关键在于提出一种名为MDD-LLM的高性能诊断工具,该工具基于微调的大规模语言模型(Large Language Models, LLMs)和广泛的真实世界样本,通过设计表格数据转换方法构建大规模语料库以提升诊断性能。实验结果表明,MDD-LLM在准确率和AUC指标上显著优于现有机器学习和深度学习框架。

链接: https://arxiv.org/abs/2505.00032
作者: Yuyang Sha,Hongxin Pan,Wei Xu,Weiyu Meng,Gang Luo,Xinyu Du,Xiaobing Zhai,Henry H. Y. Tong,Caijuan Shi,Kefeng Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Major depressive disorder (MDD) impacts more than 300 million people worldwide, highlighting a significant public health issue. However, the uneven distribution of medical resources and the complexity of diagnostic methods have resulted in inadequate attention to this disorder in numerous countries and regions. This paper introduces a high-performance MDD diagnosis tool named MDD-LLM, an AI-driven framework that utilizes fine-tuned large language models (LLMs) and extensive real-world samples to tackle challenges in MDD diagnosis. Specifically, we select 274,348 individual records from the UK Biobank cohort and design a tabular data transformation method to create a large corpus for training and evaluating the proposed approach. To illustrate the advantages of MDD-LLM, we perform comprehensive experiments and provide several comparative analyses against existing model-based solutions across multiple evaluation metrics. Experimental results show that MDD-LLM (70B) achieves an accuracy of 0.8378 and an AUC of 0.8919 (95% CI: 0.8799 - 0.9040), significantly outperforming existing machine learning and deep learning frameworks for MDD diagnosis. Given the limited exploration of LLMs in MDD diagnosis, we examine numerous factors that may influence the performance of our proposed method, such as tabular data transformation techniques and different fine-tuning strategies.
zh
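
论文提到的“表格数据转换方法”是把结构化字段改写成可供 LLM 训练的文本。下面是该步骤的极简示意(字段名与模板措辞均为虚构,并非 UK Biobank 的真实字段):

```python
# 表格记录 -> 监督训练文本的转换示意(字段与模板均为虚构)。
def record_to_example(row: dict, label: str) -> str:
    prompt = (
        f"患者信息:年龄 {row['age']} 岁,性别 {row['sex']},"
        f"平均睡眠 {row['sleep_hours']} 小时/天,近两周情绪低落频率:{row['mood_freq']}。\n"
        "问题:根据以上信息,该个体是否可能患有重度抑郁症(MDD)?"
    )
    return f"{prompt}\n回答:{label}"

row = {"age": 52, "sex": "女", "sleep_hours": 5, "mood_freq": "几乎每天"}
print(record_to_example(row, "是"))
```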

[NLP-51] Learning to Plan Before Answering: Self-Teaching LLM s to Learn Abstract Plans for Problem Solving

【速读】: 该论文旨在解决大型语言模型(LLM)后训练中,自生成数据所包含的关键信息不足的问题。现有方法仅生成分步问题解决方案,未能捕捉跨相似问题泛化所需的抽象元知识。其解决方案的关键在于引入一种新的自训练算法——LEarning to Plan before Answering (LEPA),该算法使LLM在处理具体问题前先制定预测性计划,作为问题求解的抽象元知识,从而不仅明确解题路径,还避免无关细节的干扰。通过在数据生成和模型优化过程中对计划进行自我反思与优化,LEPA显著提升了模型在自然语言推理任务中的表现。

链接: https://arxiv.org/abs/2505.00031
作者: Jin Zhang,Flood Sung,Zhilin Yang,Yang Gao,Chongjie Zhang
机构: Tsinghua University (清华大学); Moonshot AI; Washington University in St. Louis (圣路易斯华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the field of large language model (LLM) post-training, the effectiveness of utilizing synthetic data generated by the LLM itself has been well-presented. However, a key question remains unaddressed: what essential information should such self-generated data encapsulate? Existing approaches only produce step-by-step problem solutions, and fail to capture the abstract meta-knowledge necessary for generalization across similar problems. Drawing insights from cognitive science, where humans employ high-level abstraction to simplify complex problems before delving into specifics, we introduce a novel self-training algorithm: LEarning to Plan before Answering (LEPA). LEPA trains the LLM to formulate anticipatory plans, which serve as abstract meta-knowledge for problem-solving, before engaging with the intricacies of problems. This approach not only outlines the solution generation path but also shields the LLM from the distraction of irrelevant details. During data generation, LEPA first crafts an anticipatory plan based on the problem, and then generates a solution that aligns with both the plan and the problem. LEPA refines the plan through self-reflection, aiming to acquire plans that are instrumental in yielding correct solutions. During model optimization, the LLM is trained to predict both the refined plans and the corresponding solutions. By efficiently extracting and utilizing the anticipatory plans, LEPA demonstrates remarkable superiority over conventional algorithms on various challenging natural language reasoning benchmarks.
zh

[NLP-52] Can Language Models Represent the Past without Anachronism?

【速读】: 该论文试图解决语言模型在模拟历史语境时可能出现的时代错位(anachronism)问题。研究发现,仅通过提示当代模型使用时期散文示例无法生成符合时代风格的输出,而微调虽然能使模型输出在风格上足够逼真以欺骗自动评判系统,但人类评估者仍能区分微调后的模型输出与真实历史文本。论文的关键解决方案是认为,为了可靠地模拟历史视角进行社会科学研究,可能需要在时期散文上进行预训练。

链接: https://arxiv.org/abs/2505.00030
作者: Ted Underwood,Laura K. Nelson,Matthew Wilkens
机构: School of Information Sciences, University of Illinois(信息科学学院,伊利诺伊大学); Department of Sociology, University of British Columbia(社会学系,不列颠哥伦比亚大学); Department of Information Science, Cornell University(信息科学系,康奈尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Before researchers can use language models to simulate the past, they need to understand the risk of anachronism. We find that prompting a contemporary model with examples of period prose does not produce output consistent with period style. Fine-tuning produces results that are stylistically convincing enough to fool an automated judge, but human evaluators can still distinguish fine-tuned model outputs from authentic historical text. We tentatively conclude that pretraining on period prose may be required in order to reliably simulate historical perspectives for social research.
zh

[NLP-53] Keep the General Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting

【速读】: 该论文试图解决大型视觉语言模型在引入特定领域知识时面临的灾难性遗忘问题,即在注入领域专业知识的同时容易丧失基础的视觉-语言对齐能力。解决方案的关键在于提出结构化对话微调(Structured Dialogue Fine-Tuning, SDFT),该方法通过三阶段对话结构实现知识注入与基础能力保留的平衡:基础保持阶段通过图像描述任务强化预训练的视觉-语言对齐;对比消歧阶段引入精心设计的反事实示例以维持语义边界;知识专业化阶段则通过思维链推理嵌入专业信息。

链接: https://arxiv.org/abs/2505.00029
作者: Yijie Hong,Xiaofei Yin,Xinzhong Wang,Yi Tu,Ya Guo,Sufeng Duan,Weiqiang Wang,Lingyong Fang,Depeng Wang,Huijia Zhu
机构: Shanghai Jiao Tong University (上海交通大学); Ant Security Lab, Ant Group (蚂蚁安全实验室,蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:Large Vision Language Models have demonstrated impressive versatile capabilities through extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains beyond their training distribution. These models struggle with a fundamental dilemma: direct adaptation approaches that inject domain-specific knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We introduce Structured Dialogue Fine-Tuning (SDFT), an effective approach that effectively injects domain-specific knowledge while minimizing catastrophic forgetting. Drawing inspiration from supervised fine-tuning in LLMs and subject-driven personalization in text-to-image diffusion models, our method employs a three-phase dialogue structure: Foundation Preservation reinforces pre-trained visual-linguistic alignment through caption tasks; Contrastive Disambiguation introduces carefully designed counterfactual examples to maintain semantic boundaries; and Knowledge Specialization embeds specialized information through chain-of-thought reasoning. Experimental results across multiple domains confirm SDFT’s effectiveness in balancing specialized knowledge acquisition with general capability retention. Our key contributions include a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and comprehensive evaluation across diverse knowledge types.
zh

[NLP-54] Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation

【速读】: 该论文试图解决端到端语音到语音(end-to-end S2S)对话系统在整合外部知识方面的挑战,尤其是在输入语音与检索到的文本知识之间存在的模态差异问题。解决方案的关键在于提出一种新颖的端到端检索增强生成(RAG)框架,该框架可以直接从语音查询中检索相关文本知识,无需通过自动语音识别(ASR)等中间步骤进行语音到文本的转换,从而有效减少模态间隙带来的影响。

链接: https://arxiv.org/abs/2505.00028
作者: Pengchao Feng,Ziyang Ma,Wenxi Chen,Yao Li,Sheng Wang,Kai Yu,Xie Chen
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute (上海创新研究院); Shanghai Aviation Electric Co., Ltd (上海航空电器有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In recent years, end-to-end speech-to-speech (S2S) dialogue systems have garnered increasing research attention due to their advantages over traditional cascaded systems, including achieving lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these end-to-end systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries, eliminating the need for intermediate speech-to-text conversion via techniques like ASR. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. We will release the code and dataset to support reproducibility and promote further research in this area.
zh

[NLP-55] Extracting Abstraction Dimensions by Identifying Syntax Pattern from Texts

【速读】: 该论文试图解决从文本中自动发现主语维度(subject dimension)、动词维度(action dimension)、宾语维度(object dimension)和状语维度(adverbial dimension)的问题,以高效操作文本并支持自然语言查询。解决方案的关键在于构建抽象树(abstraction trees),这些树通过子类关系(subclass relations)对文本中的各类成分及其子类关系进行高质量、无冗余且具有表达性的表示,从而确保大多数句子可通过单棵树访问,其余句子至少可通过一棵树访问,进而支持基于树的自然语言查询机制。实验结果表明,该方法在平均精度、召回率和F1分数上均超过80%。

链接: https://arxiv.org/abs/2505.00027
作者: Jian Zhou,Jiazheng Li,Sirui Zhuge,Hai Zhuge
机构: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; King’s College London; Publicis Sapient; School of Computing and Information Technology, Great Bay University; Great Bay Institute for Advanced Study
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25pages, 3 figures, 8 tables

点击查看摘要

Abstract:This paper proposes an approach to automatically discovering subject dimension, action dimension, object dimension and adverbial dimension from texts to efficiently operate texts and support query in natural language. The high quality of trees guarantees that all subjects, actions, objects and adverbials and their subclass relations within texts can be represented. The independency of trees ensures that there is no redundant representation between trees. The expressiveness of trees ensures that the majority of sentences can be accessed from each tree and the rest of sentences can be accessed from at least one tree so that the tree-based search mechanism can support querying in natural language. Experiments show that the average precision, recall and F1-score of the abstraction trees constructed by the subclass relations of subject, action, object and adverbial are all greater than 80%. The application of the proposed approach to supporting query in natural language demonstrates that different types of question patterns for querying subject or object have high coverage of texts, and searching multiple trees on subject, action, object and adverbial according to the question pattern can quickly reduce search space to locate target sentences, which can support precise operation on texts.
zh

[NLP-56] Theory of Mind in Large Language Models: Assessment and Enhancement

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在理论心智(Theory of Mind, ToM)能力方面的不足,即如何提升其对人类心理状态的推断与理解能力。解决方案的关键在于通过分析现有的基于故事的评估基准以及针对提升ToM能力所设计的策略,深入探讨有效的改进方法,并结合最新基准和前沿技术提出未来的研究方向。

链接: https://arxiv.org/abs/2505.00026
作者: Ruirui Chen,Weifeng Jiang,Chengwei Qin,Cheston Tan
机构: Institute of High Performance Computing (高性能计算研究所); Agency for Science, Technology and Research (科技研究局); Singapore (新加坡); Centre for Frontier AI Research (前沿人工智能研究中心); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Theory of Mind (ToM)-the ability to infer and reason about others’ mental states-is fundamental to human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, it is crucial to assess and enhance their capacity to interpret and respond to human mental states. In this paper, we review LLMs’ ToM capabilities by examining both evaluation benchmarks and the strategies designed to improve them. We focus on widely adopted story-based benchmarks and provide an in-depth analysis of methods aimed at enhancing ToM in LLMs. Furthermore, we outline promising future research directions informed by recent benchmarks and state-of-the-art approaches. Our survey serves as a valuable resource for researchers interested in advancing LLMs’ ToM capabilities.
zh

[NLP-57] A Method for the Architecture of a Medical Vertical Large Language Model Based on Deepseek R1

【速读】: 该论文旨在解决基础模型在医疗场景中应用时面临的专业知识壁垒、计算资源需求高及部署环境受限等问题。其解决方案的关键在于从知识获取、模型压缩和计算优化三个维度系统性地提升医疗垂直大语言模型的轻量化水平,具体包括通过知识蒸馏与LoRA技术实现知识迁移、采用4-bit权重量化保持核心推理能力,以及引入Flash Attention加速和连续批处理等计算优化技术,从而在保证专业准确性的前提下显著降低内存消耗和推理延迟。

链接: https://arxiv.org/abs/2505.00025
作者: Mingda Zhang,Jianglong Qin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 1 figures

点击查看摘要

Abstract:In recent years, despite foundation models like DeepSeek-R1 and ChatGPT demonstrating significant capabilities in general tasks, professional knowledge barriers, computational resource requirements, and deployment environment limitations have severely hindered their application in actual medical scenarios. Addressing these challenges, this paper proposes an efficient lightweight medical vertical large language model architecture method, systematically solving the lightweight problem of medical large models from three dimensions: knowledge acquisition, model compression, and computational optimization. At the knowledge acquisition level, a knowledge transfer pipeline is designed from the fine-tuned DeepSeek-R1-Distill-70B teacher model to the DeepSeek-R1-Distill-7B student model, and Low-Rank Adaptation (LoRA) technology is adopted to precisely adjust key attention layers. At the model compression level, compression techniques including 4-bit weight quantization are implemented while preserving the core representation ability for medical reasoning. At the computational optimization level, inference optimization techniques such as Flash Attention acceleration and continuous batching are integrated, and a professional prompt template system is constructed to adapt to different types of medical problems. Experimental results on medical question-answering datasets show that the method proposed in this paper maintains professional accuracy while reducing memory consumption by 64.7% and inference latency by 12.4%, providing an effective solution for the application of medical large models in resource-constrained environments such as edge computing devices.
zh

[NLP-58] Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning

【速读】: 该论文旨在解决大语言模型在使用外部工具时推理能力不足及泛化能力受限的问题。传统方法通过监督微调或从强模型中蒸馏推理过程来增强工具使用能力,但这些方法要么完全忽略推理过程,要么生成模仿性推理,限制了模型的泛化性能。论文提出的解决方案关键在于采用一种轻量级的二元奖励机制,仅评估工具调用的结构有效性和功能正确性,而非依赖于从强模型中蒸馏出的中间推理轨迹,从而让模型自主内化推理策略,提升工具调用的准确性和泛化能力。

链接: https://arxiv.org/abs/2505.00024
作者: Shaokun Zhang,Yi Dong,Jieyu Zhang,Jan Kautz,Bryan Catanzaro,Andrew Tao,Qingyun Wu,Zhiding Yu,Guilin Liu
机构: NVIDIA(英伟达); Pennsylvania State University(宾夕法尼亚州立大学); University of Washington(华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 tables, 5 figures

点击查看摘要

Abstract:Enabling large language models with external tools has become a pivotal strategy for extending their functionality beyond text generation tasks. Prior work typically enhances tool-use abilities by either applying supervised fine-tuning (SFT) to enforce tool-call correctness or distilling reasoning traces from stronger models for SFT. However, both approaches fall short, either omitting reasoning entirely or producing imitative reasoning that limits generalization. Inspired by the success of DeepSeek-R1 in eliciting reasoning through rule-based reinforcement learning, we develop the Nemotron-Research-Tool-N1 series of tool-using language models using a similar training paradigm. Instead of restrictively supervising intermediate reasoning traces distilled from stronger models, Nemotron-Research-Tool-N1 is optimized with a binary reward that evaluates only the structural validity and functional correctness of tool invocations. This lightweight supervision allows the model to autonomously internalize reasoning strategies, without the need for annotated reasoning trajectories. Experiments on the BFCL and API-Bank benchmarks show that Nemotron-Research-Tool-N1-7B and Nemotron-Research-Tool-N1-14B, built on Qwen-2.5-7B/14B-Instruct, achieve state-of-the-art results, outperforming GPT-4o on both evaluations.
zh
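
文中的二元奖励只检查两件事:工具调用在结构上是否合法、在功能上(函数名与参数)是否正确。下面是一个可运行的草图,其中 JSON 调用格式与字段名取常见约定,并非 Nemotron 的官方格式:

```python
# 二元奖励示意:能解析成合法工具调用、且函数名与参数均匹配参考答案,才得 1 分。
import json

def tool_call_reward(model_output: str, reference: dict) -> float:
    try:
        call = json.loads(model_output)          # 结构有效性:必须是合法 JSON
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return 0.0                               # 结构有效性:必备字段齐全
    if call["name"] != reference["name"]:
        return 0.0                               # 功能正确性:函数名一致
    return 1.0 if call["arguments"] == reference["arguments"] else 0.0

ref = {"name": "get_weather", "arguments": {"city": "Beijing", "unit": "C"}}
print(tool_call_reward('{"name": "get_weather", "arguments": {"city": "Beijing", "unit": "C"}}', ref))  # 1.0
print(tool_call_reward("get_weather(Beijing)", ref))                                                    # 0.0
```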

[NLP-59] CORG: Generating Answers from Complex Interrelated Contexts NAACL2025

【速读】: 该论文试图解决在真实语料库中知识因命名歧义、过时信息或错误而导致的跨文档重复但存在不一致的问题,这些问题引发了复杂的上下文间关系。现有语言模型通常仅关注单一因素,难以有效处理这些复杂关系。解决方案的关键在于引入Context Organizer (CORG),该框架通过将多个上下文组织成独立处理的组,使模型能够高效地查找所有相关答案并确保消歧。CORG包含三个核心组件:图构造器、重排序器和聚合器,从而在性能与效率之间取得平衡,并优于现有的分组方法。

链接: https://arxiv.org/abs/2505.00023
作者: Hyunji Lee,Franck Dernoncourt,Trung Bui,Seunghyun Yoon
机构: KAIST AI (KAIST人工智能); Adobe Research (Adobe研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: published at Findings of NAACL 2025

点击查看摘要

Abstract:In a real-world corpus, knowledge frequently recurs across documents but often contains inconsistencies due to ambiguous naming, outdated information, or errors, leading to complex interrelationships between contexts. Previous research has shown that language models struggle with these complexities, typically focusing on single factors in isolation. We classify these relationships into four types: distracting, ambiguous, counterfactual, and duplicated. Our analysis reveals that no single approach effectively addresses all these interrelationships simultaneously. Therefore, we introduce Context Organizer (CORG), a framework that organizes multiple contexts into independently processed groups. This design allows the model to efficiently find all relevant answers while ensuring disambiguation. CORG consists of three key components: a graph constructor, a reranker, and an aggregator. Our results demonstrate that CORG balances performance and efficiency effectively, outperforming existing grouping methods and achieving comparable results to more computationally intensive, single-context approaches.
zh

[NLP-60] Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation

【速读】: 该论文试图解决大规模语言模型(Large Language Models, LLMs)在数据质量与训练效率之间的平衡问题,即如何通过提升数据质量来优化模型性能。其解决方案的关键在于构建一个结合启发式与模型驱动过滤技术以及合成数据生成的德语数据集构建流程,从而创建出高质量的预训练数据集Aleph-Alpha-GermanWeb,该数据集融合了Common Crawl网络数据、FineWeb2以及基于真实网络数据生成的合成数据。通过实验验证,该数据集在多个德语基准测试中表现出优于仅使用FineWeb2的数据集的性能优势。

链接: https://arxiv.org/abs/2505.00022
作者: Thomas F Burns,Letitia Parcalabescu,Stephan Wäldchen,Michael Barlow,Gregor Ziegltrum,Volker Stampa,Bastian Harren,Björn Deiseroth
机构: Aleph Alpha Research (Aleph Alpha 研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Scaling data quantity is essential for large language models (LLMs), yet recent findings show that data quality can significantly boost performance and training efficiency. We introduce a German-language dataset curation pipeline that combines heuristic and model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset which draws from: (1) Common Crawl web data, (2) FineWeb2, and (3) synthetically-generated data conditioned on actual, organic web data. We evaluate our dataset by pre-training both a 1B Llama-style model and an 8B tokenizer-free hierarchical autoregressive transformer (HAT). A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone. This advantage holds at the 8B scale even when FineWeb2 is enriched by human-curated high-quality data sources such as Wikipedia. Our findings support the growing body of evidence that model-based data curation and synthetic data generation can significantly enhance LLM pre-training datasets.
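“启发式 + 模型打分”两级过滤的流程骨架大致如下(阈值、停用词表与 quality_model 接口均为假设,仅作示意):

```python
def heuristic_ok(doc: str) -> bool:
    """启发式过滤:文档长度、德语常见功能词占比、重复行比例。"""
    words = doc.split()
    if not 50 <= len(words) <= 100_000:
        return False
    stop_ratio = sum(w.lower() in {"der", "die", "das", "und", "ist"} for w in words) / len(words)
    lines = doc.splitlines() or [doc]
    dup_ratio = 1 - len(set(lines)) / len(lines)
    return stop_ratio >= 0.01 and dup_ratio <= 0.3

def curate(docs, quality_model, threshold=0.5):
    """先走廉价的启发式,再用模型打分器精筛。"""
    return [d for d in docs if heuristic_ok(d) and quality_model(d) >= threshold]
```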
zh

[NLP-61] Ustnlp16 at SemEval-2025 Task 9: Improving Model Performance through Imbalance Handling and Focal Loss

【速读】: 该论文旨在解决食品危害检测中的分类任务因数据分布不平衡所带来的挑战,这些问题包括严重的类别不平衡、短文本和非结构化文本以及语义类别重叠。其解决方案的关键在于应用数据增强技术以提升分类性能,具体采用基于Transformer的模型(如BERT和RoBERTa)作为主干分类器,并探索了多种数据平衡策略,包括随机过采样、Easy Data Augmentation (EDA) 和焦点损失(focal loss)。实验结果表明,EDA在缓解类别不平衡方面效果显著,而将焦点损失与过采样及EDA结合则进一步增强了模型的鲁棒性,尤其是在处理难以分类的样本时。

链接: https://arxiv.org/abs/2505.00021
作者: Zhuoang Cai,Zhenghao Li,Yang Liu,Liyuan Guo,Yangqiu Song
机构: The Hong Kong University of Science and Technology (HKUST)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Classification tasks often suffer from imbalanced data distribution, which presents challenges in food hazard detection due to severe class imbalances, short and unstructured text, and overlapping semantic categories. In this paper, we present our system for SemEval-2025 Task 9: Food Hazard Detection, which addresses these issues by applying data augmentation techniques to improve classification performance. We utilize transformer-based models, BERT and RoBERTa, as backbone classifiers and explore various data balancing strategies, including random oversampling, Easy Data Augmentation (EDA), and focal loss. Our experiments show that EDA effectively mitigates class imbalance, leading to significant improvements in accuracy and F1 scores. Furthermore, combining focal loss with oversampling and EDA further enhances model robustness, particularly for hard-to-classify examples. These findings contribute to the development of more effective NLP-based classification models for food hazard detection.
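摘要中的焦点损失(focal loss)有标准的闭式定义,这里给出一个常见的 PyTorch 写法供参考(alpha、gamma 为通常取值,与论文具体超参无关):

```python
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """多分类焦点损失:(1 - p_t)^gamma 调制因子压低易分样本的梯度,
    缓解类别不平衡。logits: (N, C);targets: (N,) 类别下标。"""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                                  # 真实类别的预测概率
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()
```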
zh

[NLP-62] Beyond Public Access in LLM Pre-Training Data

【速读】: 该论文试图解决的问题是验证OpenAI的大规模语言模型是否在未经许可的情况下使用了受版权保护的内容进行训练。解决方案的关键在于应用DE-COP会员推断攻击方法,通过分析模型对付费和公开的O’Reilly Media书籍内容的识别能力,评估其是否可能包含未经授权的训练数据。研究结果显示,GPT-4o在识别付费内容方面表现出较高的准确率(AUROC = 82%),而GPT-3.5 Turbo则更擅长识别公开内容,GPT-4o Mini则无明显识别能力,这表明模型的训练数据可能存在差异,从而突显了提高企业预训练数据源透明度的紧迫性。

链接: https://arxiv.org/abs/2505.00020
作者: Sruly Rosenblat,Tim O’Reilly,Ilan Strauss
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Using a legally obtained dataset of 34 copyrighted O’Reilly Media books, we apply the DE-COP membership inference attack method to investigate whether OpenAI’s large language models were trained on copyrighted content without consent. Our AUROC scores show that GPT-4o, OpenAI’s more recent and capable model, demonstrates strong recognition of paywalled O’Reilly book content (AUROC = 82%), compared to OpenAI’s earlier model GPT-3.5 Turbo. In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O’Reilly book samples. GPT-4o Mini, as a much smaller model, shows no knowledge of public or non-public O’Reilly Media content when tested (AUROC ≈ 50%). Testing multiple models, with the same cutoff date, helps us account for potential language shifts over time that might bias our findings. These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training.
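AUROC 的计算本身很直接:把每本书的“成员/非成员”标签与模型的识别得分喂给 ROC 分析即可。下面是一个极简示意(得分与标签均为编造的演示数据,并非论文结果):

```python
from sklearn.metrics import roc_auc_score

# 假设 membership_scores 是 DE-COP 式多选题中模型“认出原文”的得分
membership_scores = [0.91, 0.34, 0.77, 0.12, 0.65, 0.28]
is_member         = [1,    0,    1,    0,    1,    0]    # 是否疑似在训练数据中

print(f"AUROC = {roc_auc_score(is_member, membership_scores):.2%}")
# 约 50% 等同随机猜测;显著高于 50% 则提示模型“见过”这些内容
```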
zh

[NLP-63] An Empirical Study on Prompt Compression for Large Language Models ICLR2025

【速读】: 该论文试图解决长提示(prompt)导致的计算复杂性和经济成本增加的问题,其解决方案的关键在于研究六种提示压缩方法,旨在减少提示长度的同时保持大型语言模型(Large Language Models, LLMs)响应质量。通过在13个数据集上的实验,验证了提示压缩对LLMs性能的影响,特别是在长上下文中的影响更为显著,适度的压缩甚至能够提升模型性能。

链接: https://arxiv.org/abs/2505.00019
作者: Zheng Zhang,Jinyi Li,Yihuai Lan,Xiang Wang,Hao Wang
机构: The Hong Kong University of Science and Technology (Guangzhou); South China University of Technology; University of Science and Technology of China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by Building Trust Workshop at ICLR 2025

点击查看摘要

Abstract:Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression methods for LLMs, aiming to reduce prompt length while maintaining LLM response quality. In this paper, we present a comprehensive analysis covering aspects such as generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate these methods across 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and VQA datasets. Our experiments reveal that prompt compression has a greater impact on LLM performance in long contexts compared to short ones. In the Longbench evaluation, moderate compression even enhances LLM performance. Our code and data are available at this https URL.
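提示压缩的一类常见做法是按重要性做提取式筛选。下面是一个与论文六种方法无关的最小示意(打分器 scores 为假设输入):

```python
def compress_prompt(sentences, scores, ratio=0.5):
    """保留重要性得分最高的 ratio 比例句子,并维持原有顺序。"""
    k = max(1, int(len(sentences) * ratio))
    keep = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
    return " ".join(sentences[i] for i in keep)

prompt = ["Background paragraph.", "Key constraint.", "Redundant chit-chat.", "The actual question?"]
print(compress_prompt(prompt, scores=[0.4, 0.9, 0.1, 1.0]))
# 输出: Key constraint. The actual question?
```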
zh

[NLP-64] ReCellTy: Domain-specific knowledge graph retrieval-augmented LLM s workflow for single-cell annotation

【速读】: 该论文旨在解决细胞类型注释过程中缺乏精确且完全自动化的技术问题,通过利用大语言模型(Large Language Models, LLMs)实现更高效的细胞类型标注。其解决方案的关键在于构建一个图结构特征标记数据库,用于检索与差异基因相关联的实体,以支持细胞重建,并设计了一个多任务工作流来优化注释过程。该方法在11种组织类型中相比通用大语言模型,提升了人类评估分数0.21,并在语义相似性上提高了6.1%,同时更贴近人工注释的认知逻辑。

链接: https://arxiv.org/abs/2505.00017
作者: Dezheng Han,Yibin Jia,Ruxiao Chen,Wenjie Han,Shuaishuai Guo,Jianbo Wang
机构: Shandong University (山东大学); Johns Hopkins University (约翰霍普金斯大学); Qilu Hospital (齐鲁医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To enable precise and fully automated cell type annotation with large language models (LLMs), we developed a graph-structured feature marker database to retrieve entities linked to differential genes for cell reconstruction. We further designed a multi-task workflow to optimize the annotation process. Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types, while more closely aligning with the cognitive logic of manual annotation.
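“从差异基因检索关联实体”这一步可以抽象成标记基因到细胞类型的倒排查询,示意如下(marker_graph 的内容为编造的演示数据):

```python
def retrieve_cell_types(diff_genes, marker_graph):
    """按命中差异基因的数量对候选细胞类型排序。"""
    scores = {}
    for gene in diff_genes:
        for cell_type in marker_graph.get(gene, ()):
            scores[cell_type] = scores.get(cell_type, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

marker_graph = {"CD3D": {"T cell"}, "CD19": {"B cell"}, "NKG7": {"NK cell", "T cell"}}
print(retrieve_cell_types(["CD3D", "NKG7"], marker_graph))
# [('T cell', 2), ('NK cell', 1)]
```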
zh

[NLP-65] Sparks of Tabular Reasoning via Text2SQL Reinforcement Learning

【速读】: 该论文试图解决如何使大型语言模型(Large Language Models, LLMs)在结构化数据上进行有效推理与操作的问题,而不仅仅是专注于查询生成。其解决方案的关键在于提出一个两阶段框架:首先,从真实SQL查询中合成详细的思维链(Chain-of-Thought, CoT)轨迹,提供逐步骤、条款级别的监督以指导模型进行表字段的遍历、过滤和聚合;其次,引入一种基于组相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习目标,通过鼓励超越任务特定语法的步骤来提升模型在不同数据集间的泛化推理能力。

链接: https://arxiv.org/abs/2505.00016
作者: Josefa Lia Stoisser,Marc Boubnovski Martell,Julien Fauqueur
机构: Novo Nordisk (诺和诺德)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work reframes the Text-to-SQL task as a pathway for teaching large language models (LLMs) to reason over and manipulate tabular data–moving beyond the traditional focus on query generation. We propose a two-stage framework that leverages SQL supervision to develop transferable table reasoning capabilities. First, we synthesize detailed chain-of-thought (CoT) traces from real-world SQL queries, providing step-by-step, clause-level supervision that teaches the model how to traverse, filter, and aggregate table fields. Second, we introduce a Group Relative Policy Optimization (GRPO) reinforcement learning objective that connects SQL execution accuracy to generalizable reasoning by encouraging steps that extend beyond task-specific syntax and transfer across datasets. Empirically, our approach improves performance on standard Text-to-SQL benchmarks and achieves substantial gains on reasoning-intensive datasets such as BIRD and CRT-QA, demonstrating enhanced generalization and interpretability. Specifically, the distilled-quantized LLaMA model achieved a 20% increase in accuracy when trained on Text-to-SQL tasks, while Qwen achieved a 5% increase. These results suggest that SQL can serve not only as a target formalism but also as an effective scaffold for learning robust, transferable reasoning over structured data.
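GRPO 的核心是“组内相对优势”:对同一问题采样一组回答,用组内奖励的均值和标准差做归一化,省去价值网络。结合 SQL 执行正确性奖励,可写成如下草图(execute_sql 为假设的执行接口):

```python
import statistics

def sql_reward(pred_sql, gold_rows, execute_sql):
    """执行结果与金标一致记 1,否则(含执行失败)记 0。"""
    try:
        return 1.0 if execute_sql(pred_sql) == gold_rows else 0.0
    except Exception:
        return 0.0

def grpo_advantages(rewards):
    """组内归一化得到每个采样回答的相对优势。"""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # 全组同分时避免除零
    return [(r - mu) / sigma for r in rewards]

print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))    # 正确的回答获得正优势
```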
zh

[NLP-66] Design and Application of Multimodal Large Language Model-Based System for End-to-End Automation of Accident Dataset Generation

【速读】: 该论文试图解决发展中国家如孟加拉国在道路交通事故数据收集方面存在的手动、碎片化和不可靠的问题,这些问题导致事故报告不足和记录不一致。解决方案的关键在于提出一个完全自动化的系统,利用大型语言模型(Large Language Models, LLMs)和网络爬虫技术,实现从在线来源自动收集、分类、提取和去重事故新闻的功能,从而提高数据的准确性和完整性。

链接: https://arxiv.org/abs/2505.00015
作者: MD Thamed Bin Zaman Chowdhury,Moazzem Hossain
机构: 未知
类目: Computation and Language (cs.CL)
备注: Shortened the abstract to fit within 1920 characters. This paper is currently under review in the Elsevier journal ‘Accident Analysis & Prevention’

点击查看摘要

Abstract:Road traffic accidents remain a major public safety and socio-economic issue in developing countries like Bangladesh. Existing accident data collection is largely manual, fragmented, and unreliable, resulting in underreporting and inconsistent records. This research proposes a fully automated system using Large Language Models (LLMs) and web scraping techniques to address these challenges. The pipeline consists of four components: automated web scraping code generation, news collection from online sources, accident news classification with structured data extraction, and duplicate removal. The system uses the multimodal generative LLM Gemini-2.0-Flash for seamless automation. The code generation module classifies webpages into pagination, dynamic, or infinite scrolling categories and generates suitable Python scripts for scraping. LLMs also classify and extract key accident information such as date, time, location, fatalities, injuries, road type, vehicle types, and pedestrian involvement. A deduplication algorithm ensures data integrity by removing duplicate reports. The system scraped 14 major Bangladeshi news sites over 111 days (Oct 1, 2024 - Jan 20, 2025), processing over 15,000 news articles and identifying 705 unique accidents. The code generation module achieved 91.3% calibration and 80% validation accuracy. Chittagong reported the highest number of accidents (80), fatalities (70), and injuries (115), followed by Dhaka, Faridpur, Gazipur, and Cox’s Bazar. Peak accident times were morning (8-9 AM), noon (12-1 PM), and evening (6-7 PM). A public repository was also developed with usage instructions. This study demonstrates the viability of an LLM-powered, scalable system for accurate, low-effort accident data collection, providing a foundation for data-driven road safety policymaking in Bangladesh.
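流水线末端的去重可以用“结构化字段相同 + 文本高度相似”的规则实现,示意如下(字段名为假设的 LLM 抽取结果;阈值为经验值):

```python
from difflib import SequenceMatcher

def dedup_accidents(reports, threshold=0.85):
    """同日期、同地点且描述高度相似的报道视为同一事故。"""
    unique = []
    for r in reports:
        is_dup = any(
            r["date"] == u["date"] and r["location"] == u["location"]
            and SequenceMatcher(None, r["text"], u["text"]).ratio() >= threshold
            for u in unique
        )
        if not is_dup:
            unique.append(r)
    return unique
```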
zh

[NLP-67] Manifold-Constrained Sentence Embeddings via Triplet Loss: Projecting Semantics onto Spheres Tori and Möbius Strips

【速读】: 该论文试图解决传统句子嵌入在非约束的欧几里得空间中可能无法有效捕捉语言复杂关系的问题,其核心在于通过引入流形约束来提升嵌入的语义结构表达能力。解决方案的关键是将句子嵌入限制在连续流形(如单位球面、环面和莫比乌斯带)上,并以三元组损失作为核心训练目标,从而在输出空间中施加微分几何约束,促使学习到的嵌入既具有判别性又具备拓扑结构。

链接: https://arxiv.org/abs/2505.00014
作者: Vinit K. Chavan
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 6 figures. Code available at this https URL

点击查看摘要

Abstract:Recent advances in representation learning have emphasized the role of embedding geometry in capturing semantic structure. Traditional sentence embeddings typically reside in unconstrained Euclidean spaces, which may limit their ability to reflect complex relationships in language. In this work, we introduce a novel framework that constrains sentence embeddings to lie on continuous manifolds – specifically the unit sphere, torus, and Möbius strip – using triplet loss as the core training objective. By enforcing differential geometric constraints on the output space, our approach encourages the learning of embeddings that are both discriminative and topologically structured. We evaluate our method on benchmark datasets (AG News and MBTI) and compare it to classical baselines including TF-IDF, Word2Vec, and unconstrained Keras-derived embeddings. Our results demonstrate that manifold-constrained embeddings, particularly those projected onto spheres and Möbius strips, significantly outperform traditional approaches in both clustering quality (Silhouette Score) and classification performance (Accuracy). These findings highlight the value of embedding in manifold space – where topological structure complements semantic separation – offering a new and mathematically grounded direction for geometric representation learning in NLP.
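以单位球面为例,“流形约束 + 三元组损失”的组合只需在损失前加一步投影,示意如下(margin 与距离度量为常见选法,并非论文的精确设定):

```python
import torch.nn.functional as F

def to_sphere(x):
    """L2 归一化即把嵌入投影到单位球面。"""
    return F.normalize(x, p=2, dim=-1)

def triplet_loss_on_sphere(anchor, positive, negative, margin=0.2):
    a, p, n = map(to_sphere, (anchor, positive, negative))
    d_ap = (a - p).pow(2).sum(-1)          # 球面上以弦距近似
    d_an = (a - n).pow(2).sum(-1)
    return F.relu(d_ap - d_an + margin).mean()
```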
zh

[NLP-68] Performance Evaluation of Emotion Classification in Japanese Using RoBERTa and DeBERTa

【速读】: 该论文旨在解决日本语句中八种Plutchik情绪的二分类情感检测问题,特别是在资源稀缺和类别不平衡背景下提升模型性能。其关键解决方案是基于WRIME语料库,将读者平均强度评分转换为二元标签,并对四种预训练语言模型(BERT、RoBERTa、DeBERTa-v3-base和DeBERTa-v3-large)进行微调,其中DeBERTa-v3-large在准确率(0.860)和F1分数(0.662)上表现最佳,展现出对高频和低频情绪的稳健分类能力。

链接: https://arxiv.org/abs/2505.00013
作者: Yoichi Takenaka
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 tables, 3 appendices. Submitted to New Generation Computing. Includes comparisons between fine-tuned PLMs and LLMs on Japanese emotion classification. Code available at this https URL

点击查看摘要

Abstract:Background Practical applications such as social media monitoring and customer-feedback analysis require accurate emotion detection for Japanese text, yet resource scarcity and class imbalance hinder model performance. Objective This study aims to build a high-accuracy model for predicting the presence or absence of eight Plutchik emotions in Japanese sentences. Methods Using the WRIME corpus, we transform reader-averaged intensity scores into binary labels and fine-tune four pre-trained language models (BERT, RoBERTa, DeBERTa-v3-base, DeBERTa-v3-large). For context, we also assess two large language models (TinySwallow-1.5B-Instruct and ChatGPT-4o). Accuracy and F1-score serve as evaluation metrics. Results DeBERTa-v3-large attains the best mean accuracy (0.860) and F1-score (0.662), outperforming all other models. It maintains robust F1 across both high-frequency emotions (e.g., Joy, Anticipation) and low-frequency emotions (e.g., Anger, Trust). The LLMs lag, with ChatGPT-4o and TinySwallow-1.5B-Instruct scoring 0.527 and 0.292 in mean F1, respectively. Conclusion The fine-tuned DeBERTa-v3-large model currently offers the most reliable solution for binary emotion classification in Japanese. We release this model as a pip-installable package (pip install deberta-emotion-predictor). Future work should augment data for rare emotions, reduce model size, and explore prompt engineering to improve LLM performance. This manuscript is under review for possible publication in New Generation Computing.
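八种情绪的“有/无”判断通常按多标签二分类来微调,下面是一个通用骨架(与作者发布的 pip 包接口无关;模型名取公开的 microsoft/deberta-v3-large,输出在微调前没有意义):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

EMOTIONS = ["Joy", "Sadness", "Anticipation", "Surprise",
            "Anger", "Fear", "Disgust", "Trust"]

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large",
    num_labels=len(EMOTIONS),
    problem_type="multi_label_classification",  # 每种情绪独立二分类
)

batch = tok(["子供の笑顔を見ると嬉しくなる。"], return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**batch).logits).squeeze(0)
print(dict(zip(EMOTIONS, probs.tolist())))
```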
zh

[NLP-69] The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?

【速读】: 该论文试图解决定性研究中劳动密集型流程难以扩展同时保持分析深度的问题。其解决方案的关键在于提出一种名为AI Co-Ethnographer (AICoE) 的端到端流程,该流程不仅超越了简单自动化编码的局限性,还提供了更集成的方法,涵盖开放式编码、编码整合、编码应用以及模式发现,从而实现对定性数据的全面分析。

链接: https://arxiv.org/abs/2505.00012
作者: Fabian Retkowski,Andreas Sudmann,Alexander Waibel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to NLP4DH 2025

点击查看摘要

Abstract:Qualitative research often involves labor-intensive processes that are difficult to scale while preserving analytical depth. This paper introduces The AI Co-Ethnographer (AICoE), a novel end-to-end pipeline developed for qualitative research and designed to move beyond the limitations of simply automating code assignments, offering a more integrated approach. AICoE organizes the entire process, encompassing open coding, code consolidation, code application, and even pattern discovery, leading to a comprehensive analysis of qualitative data.
zh

[NLP-70] Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在敏感领域(如教育)中因“越狱”行为导致的安全问题,即用户通过特定输入绕过伦理安全机制的行为。解决方案的关键在于利用与越狱行为高度相关的四种语言学变量对超过2,300个提示进行标注,并基于提取的特征训练多种预测模型,包括决策树、模糊逻辑分类器、提升方法和逻辑回归模型。研究结果表明,基于特征的预测模型在检测越狱行为方面优于提示工程方法,其中模糊决策树表现最佳,证明了基于语言特征的模型在越狱检测中的有效性与可解释性。

链接: https://arxiv.org/abs/2505.00010
作者: Tri Nguyen,Lohith Srikanth Pentapalli,Magnus Sieverding,Laurah Turner,Seth Overla,Weibing Zheng,Chris Zhou,David Furniss,Danielle Weber,Michael Gharib,Matt Kelleher,Michael Shukis,Cameron Pawlik,Kelly Cohen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Jailbreaking in Large Language Models (LLMs) threatens their safe use in sensitive domains like education by allowing users to bypass ethical safeguards. This study focuses on detecting jailbreaks in 2-Sigma, a clinical education platform that simulates patient interactions using LLMs. We annotated over 2,300 prompts across 158 conversations using four linguistic variables shown to correlate strongly with jailbreak behavior. The extracted features were used to train several predictive models, including Decision Trees, Fuzzy Logic-based classifiers, Boosting methods, and Logistic Regression. Results show that feature-based predictive models consistently outperformed Prompt Engineering, with the Fuzzy Decision Tree achieving the best overall performance. Our findings demonstrate that linguistic-feature-based models are effective and explainable alternatives for jailbreak detection. We suggest future work explore hybrid frameworks that integrate prompt-based flexibility with rule-based robustness for real-time, spectrum-based jailbreak monitoring in educational LLMs.
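基于特征的越狱检测训练流程非常直接:把若干语言学特征喂给可解释分类器即可。以下草图中的特征含义与数据完全是假设的演示:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

# 四列对应假设的语言学特征(如角色扮演标记、规避措辞强度等);y 为是否越狱
X = [[0.9, 1, 0.7, 3], [0.1, 0, 0.0, 0], [0.8, 1, 0.5, 2], [0.2, 0, 0.1, 0]] * 25
y = [1, 0, 1, 0] * 25

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```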
zh

[NLP-71] Efficient Knowledge Transfer in Multi-Task Learning through Task-Adaptive Low-Rank Representation

【速读】: 该论文试图解决预训练语言模型(Pre-trained Language Models, PLMs)在面对新兴任务时表现不佳的问题,尤其是在实际应用中无法为每个新任务单独训练模型的情况下。为了解决这一问题,研究者提出了任务自适应低秩表示(Task-Adaptive Low-Rank Representation, TA-LoRA),其关键在于利用低秩表示建模任务异质性,并引入快慢权重机制,其中慢速权重编码共享知识,快速权重捕捉任务特定细节,从而避免共享与任务特定知识的混合。此外,通过引入零初始化注意力机制,减少在预热阶段不成熟低秩组件对原始提示的干扰。

链接: https://arxiv.org/abs/2505.00009
作者: Xiao Zhang,Kangsheng Wang,Tianyu Hu,Huimin Ma
机构: University of Science and Technology Beijing (北京科技大学)
类目: Computation and Language (cs.CL)
备注: Accepted by IEEE International Conference on Multimedia & Expo 2025

点击查看摘要

Abstract:Pre-trained language models (PLMs) demonstrate remarkable intelligence but struggle with emerging tasks unseen during training in real-world applications. Training separate models for each new task is usually impractical. Multi-task learning (MTL) addresses this challenge by transferring shared knowledge from source tasks to target tasks. As a dominant parameter-efficient fine-tuning method, prompt tuning (PT) enhances MTL by introducing an adaptable vector that captures task-specific knowledge, which acts as a prefix to the original prompt that preserves shared knowledge, while keeping PLM parameters frozen. However, PT struggles to effectively capture the heterogeneity of task-specific knowledge due to its limited representational capacity. To address this challenge, we propose Task-Adaptive Low-Rank Representation (TA-LoRA), an MTL method built on PT, employing the low-rank representation to model task heterogeneity and a fast-slow weights mechanism where the slow weight encodes shared knowledge, while the fast weight captures task-specific nuances, avoiding the mixing of shared and task-specific knowledge caused by training low-rank representations from scratch. Moreover, a zero-initialized attention mechanism is introduced to minimize the disruption of immature low-rank components on original prompts during warm-up epochs. Experiments on 16 tasks demonstrate that TA-LoRA achieves state-of-the-art performance in full-data and few-shot settings while maintaining superior parameter efficiency.
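“慢权重共享、快权重低秩、零初始化”的组合可以浓缩成如下参数化草图(维度与初始化细节为假设,并非论文的完整结构):

```python
import torch
import torch.nn as nn

class FastSlowLowRankPrompt(nn.Module):
    def __init__(self, prompt_len=16, dim=768, rank=4, num_tasks=16):
        super().__init__()
        self.slow = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)    # 慢权重:共享知识
        self.A = nn.Parameter(torch.randn(num_tasks, rank, dim) * 0.02)  # 快权重低秩因子
        self.B = nn.Parameter(torch.zeros(num_tasks, prompt_len, rank))  # 零初始化:热身期不扰动原提示

    def forward(self, task_id: int):
        delta = self.B[task_id] @ self.A[task_id]   # (prompt_len, dim) 的任务特定偏移
        return self.slow + delta
```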
zh

[NLP-72] A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination

【速读】: 该论文试图解决医学领域中由自然语言处理(Natural Language Processing, NLP)技术所面临的不准确信息问题,包括错误、虚假信息和幻觉的检测、纠正与缓解。其解决方案的关键在于利用NLP技术对上述问题进行系统性分析,并通过统一这些概念的方法论基础,提升医疗领域的信息可靠性与透明度,从而保障患者安全和改善公共卫生传播。

链接: https://arxiv.org/abs/2505.00008
作者: Zhaoyi Sun,Wen-Wai Yim,Ozlem Uzuner,Fei Xia,Meliha Yetisgen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Objective: This review aims to explore the potential and challenges of using Natural Language Processing (NLP) to detect, correct, and mitigate medically inaccurate information, including errors, misinformation, and hallucination. By unifying these concepts, the review emphasizes their shared methodological foundations and their distinct implications for healthcare. Our goal is to advance patient safety, improve public health communication, and support the development of more reliable and transparent NLP applications in healthcare. Methods: A scoping review was conducted following PRISMA guidelines, analyzing studies from 2020 to 2024 across five databases. Studies were selected based on their use of NLP to address medically inaccurate information and were categorized by topic, tasks, document types, datasets, models, and evaluation metrics. Results: NLP has shown potential in addressing medically inaccurate information on the following tasks: (1) error detection (2) error correction (3) misinformation detection (4) misinformation correction (5) hallucination detection (6) hallucination mitigation. However, challenges remain with data privacy, context dependency, and evaluation standards. Conclusion: This review highlights the advancements in applying NLP to tackle medically inaccurate information while underscoring the need to address persistent challenges. Future efforts should focus on developing real-world datasets, refining contextual methods, and improving hallucination management to ensure reliable and transparent healthcare applications.
zh

[NLP-73] Toward a digital twin of U.S. Congress

【速读】: 该论文试图解决如何构建一个能够准确反映美国国会议员行为与语言特征的数字孪生体(Digital Twin)问题,其核心在于利用语言模型生成与真实议员发布的推文高度相似的内容,并进一步用于预测投票行为和分析党派忠诚度。解决方案的关键在于构建一个每日更新的数据集,包含所有美国国会议员在其任期内发布的每条推文,并利用该数据集训练具有议员特定子集的现代语言模型,从而生成难以与真实推文区分的虚拟内容。

链接: https://arxiv.org/abs/2505.00006
作者: Hayden Helm,Tianyi Chen,Harvey McGuinness,Paige Lee,Brandon Duderstadt,Carey E. Priebe
机构: Johns Hopkins University (约翰霍普金斯大学); Nomic AI (Nomic AI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:In this paper we provide evidence that a virtual model of U.S. congresspersons based on a collection of language models satisfies the definition of a digital twin. In particular, we introduce and provide high-level descriptions of a daily-updated dataset that contains every Tweet from every U.S. congressperson during their respective terms. We demonstrate that a modern language model equipped with congressperson-specific subsets of this data are capable of producing Tweets that are largely indistinguishable from actual Tweets posted by their physical counterparts. We illustrate how generated Tweets can be used to predict roll-call vote behaviors and to quantify the likelihood of congresspersons crossing party lines, thereby assisting stakeholders in allocating resources and potentially impacting real-world legislative dynamics. We conclude with a discussion of the limitations and important extensions of our analysis.
zh

[NLP-74] LangVAE and LangSpace: Building and Probing for Language Model VAEs

【速读】: 该论文试图解决如何在预训练大语言模型(Large Language Models, LLMs)基础上构建模块化的变分自编码器(Variational Autoencoders, VAEs),以获得更紧凑且语义解耦的文本表示问题。解决方案的关键在于提出LangVAE框架,该框架利用预训练模型的先验知识,通过可插拔的编码器和解码器结构实现对文本的高效编码与解码,并结合配套的LangSpace工具进行表示分析,从而提供了一种灵活、高效且可扩展的文本表示构建与分析方法。

链接: https://arxiv.org/abs/2505.00004
作者: Danilo S. Carvalho,Yingji Zhang,Harriet Unsworth,André Freitas
机构: National Biomarker Centre, CRUK-MI, University of Manchester, United Kingdom; Department of Computer Science, University of Manchester, United Kingdom; Idiap Research Institute, Switzerland
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present LangVAE, a novel framework for modular construction of variational autoencoders (VAEs) on top of pre-trained large language models (LLMs). Such language model VAEs can encode the knowledge of their pre-trained components into more compact and semantically disentangled representations. The representations obtained in this way can be analysed with the LangVAE companion framework: LangSpace, which implements a collection of probing methods, such as vector traversal and interpolation, disentanglement measures, and cluster visualisations. LangVAE and LangSpace offer a flexible, efficient and scalable way of building and analysing textual representations, with simple integration for models available on the HuggingFace Hub. Additionally, we conducted a set of experiments with different encoder and decoder combinations, as well as annotated inputs, revealing a wide range of interactions across architectural families and sizes w.r.t. generalisation and disentanglement. Our findings demonstrate a promising framework for systematising the experimentation and understanding of textual representations.
zh

[NLP-75] The Mind in the Machine: A Survey of Incorporating Psychological Theories in LLMs

【速读】: 该论文试图解决如何将心理学理论有效融入大规模语言模型(Large Language Models, LLMs)的开发过程中,以提升其人类认知、行为和交互的模拟能力。解决方案的关键在于系统性地整合认知、发展、行为、社会、人格心理学及心理语言学等领域的理论,以指导LLMs在数据、预训练、后训练、评估与应用等各个阶段的优化与改进。通过分析跨领域联系与矛盾点,论文旨在弥合学科间的分歧,推动心理学与自然语言处理(NLP)研究的深度融合。

链接: https://arxiv.org/abs/2505.00003
作者: Zizhou Liu,Ziwei Gong,Lin Ai,Zheng Hui,Run Chen,Colin Wayne Leach,Michelle R. Greene,Julia Hirschberg
机构: Columbia University (哥伦比亚大学); Barnard College (巴纳德学院); University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Psychological insights have long shaped pivotal NLP breakthroughs, including the cognitive underpinnings of attention mechanisms, formative reinforcement learning, and Theory of Mind-inspired social modeling. As Large Language Models (LLMs) continue to grow in scale and complexity, there is a rising consensus that psychology is essential for capturing human-like cognition, behavior, and interaction. This paper reviews how psychological theories can inform and enhance stages of LLM development, including data, pre-training, post-training, and evaluation and application. Our survey integrates insights from cognitive, developmental, behavioral, social, personality psychology, and psycholinguistics. Our analysis highlights current trends and gaps in how psychological theories are applied. By examining both cross-domain connections and points of tension, we aim to bridge disciplinary divides and promote more thoughtful integration of psychology into future NLP research.
zh

[NLP-76] Symbol grounding in computational systems: A paradox of intentions

【速读】: 该论文试图解决计算主义(computationalism)无法解释符号奠基(symbol grounding)的问题。其核心论点是,无论心智在进行计算时处理的是有意义的符号还是无意义的符号,计算主义都会导致语义先验主义(semantic nativism)。解决方案的关键在于揭示计算主义在符号奠基过程中的内在矛盾:若符号是有意义的,则系统中必须已存在语义,这暗示语义先验主义;若符号是无意义的,则在符号奠基之前缺乏意向性认知过程,从而无法实现符号奠基。因此,计算主义无论在哪种情况下都无法避免语义先验主义的结论。

链接: https://arxiv.org/abs/2505.00002
作者: Vincent C. Müller
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The paper presents a paradoxical feature of computational systems that suggests that computationalism cannot explain symbol grounding. If the mind is a digital computer, as computationalism claims, then it can be computing either over meaningful symbols or over meaningless symbols. If it is computing over meaningful symbols its functioning presupposes the existence of meaningful symbols in the system, i.e. it implies semantic nativism. If the mind is computing over meaningless symbols, no intentional cognitive processes are available prior to symbol grounding. In this case, no symbol grounding could take place since any grounding presupposes intentional cognitive processes. So, whether computing in the mind is over meaningless or over meaningful symbols, computationalism implies semantic nativism.
zh

[NLP-77] Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在低资源语言环境和需要深度逻辑推理的任务中表现不佳的问题。其解决方案的关键在于构建Rosetta-PL基准,通过将Lean中的逻辑命题数据集翻译成自定义逻辑语言,并以此对LLM进行微调,从而评估模型在受控环境下的逻辑推理与泛化能力。实验结果表明,保持翻译过程中的逻辑关系能够显著提升模型精度,且在约20,000个训练样本后准确率趋于稳定。

链接: https://arxiv.org/abs/2505.00001
作者: Shaun Baek,Shaun Esua-Mensah,Cyrus Tsui,Sejan Vigneswaralingam,Abdullah Alali,Michael Lu,Vasu Sharma,Kevin Zhu
机构: Emory University (埃默里大学); Algoverse AI Research (Algoverse AI Research)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs’ logical reasoning and generalization capabilities in a controlled environment. We construct Rosetta-PL by translating a dataset of logical propositions from Lean into a custom logical language, which is then used to fine-tune an LLM (e.g., GPT-4o). Our experiments analyze the impact of the size of the dataset and the translation methodology on the performance of the model. Our results indicate that preserving logical relationships in the translation process significantly boosts precision, with accuracy plateauing beyond roughly 20,000 training samples. These insights provide valuable guidelines for optimizing LLM training in formal reasoning tasks and improving performance in various low-resource language applications.
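“保持逻辑关系的翻译”本质上是对连接词做结构保持的符号替换,下面是一个极小示意(目标词表为假设,并非论文的自定义语言):

```python
SYMBOL_MAP = {"∧": "AND", "∨": "OR", "¬": "NOT", "→": "IMPLIES"}

def translate(prop: str) -> str:
    """逐符号映射,命题的括号结构与运算次序原样保留。"""
    for src, dst in SYMBOL_MAP.items():
        prop = prop.replace(src, f" {dst} ")
    return " ".join(prop.split())

print(translate("(p ∧ q) → ¬r"))   # (p AND q) IMPLIES NOT r
```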
zh

计算机视觉

[CV-0] Controllable Weather Synthesis and Removal with Video Diffusion Models

【速读】:该论文试图解决在视频中生成真实且可控的天气效果的问题,传统物理基础的天气模拟需要精确重建,难以扩展到真实场景的视频,而现有的视频编辑方法则缺乏真实性和控制能力。解决方案的关键是提出WeatherWeaver,这是一种视频扩散模型,能够直接将多种天气效果(如雨、雪、雾和云)合成到任意输入视频中,无需3D建模,并提供对天气效果强度的精确控制以及多种天气类型的混合,从而确保真实性和适应性。为了解决配对训练数据稀缺的问题,该方法采用了一种结合合成视频、生成式图像编辑和自动标注真实视频的新数据策略。

链接: https://arxiv.org/abs/2505.00704
作者: Chih-Hao Lin,Zian Wang,Ruofan Liang,Yuxuan Zhang,Sanja Fidler,Shenlong Wang,Zan Gojcic
机构: NVIDIA(英伟达); University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校); University of Toronto(多伦多大学); Vector Institute(矢量研究所)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating realistic and controllable weather effects in videos is valuable for many applications. Physics-based weather simulation requires precise reconstructions that are hard to scale to in-the-wild videos, while current video editing often lacks realism and control. In this work, we introduce WeatherWeaver, a video diffusion model that synthesizes diverse weather effects – including rain, snow, fog, and clouds – directly into any input video without the need for 3D modeling. Our model provides precise control over weather effect intensity and supports blending various weather types, ensuring both realism and adaptability. To overcome the scarcity of paired training data, we propose a novel data strategy combining synthetic videos, generative image editing, and auto-labeled real-world videos. Extensive evaluations show that our method outperforms state-of-the-art methods in weather simulation and removal, providing high-quality, physically plausible, and scene-identity-preserving results over various real-world videos.
zh

[CV-1] RayZer: A Self-supervised Large View Synthesis Model

【速读】:该论文试图解决在缺乏3D监督信息(如相机位姿和场景几何结构)的情况下,实现具有3D感知能力的多视角3D视觉建模问题。其解决方案的关键在于设计了一个自监督框架,通过解耦相机与场景表示实现输入图像的3D感知自编码,并采用基于Transformer的模型结构,仅依赖射线(ray)结构作为3D先验,同时连接相机、像素和场景,从而在无需真实相机标注的情况下,完成场景重建与新视角合成。

链接: https://arxiv.org/abs/2505.00702
作者: Hanwen Jiang,Hao Tan,Peng Wang,Haian Jin,Yue Zhao,Sai Bi,Kai Zhang,Fujun Luan,Kalyan Sunkavalli,Qixing Huang,Georgios Pavlakos
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Adobe Research (Adobe 研究院); Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present RayZer, a self-supervised multi-view 3D Vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable or even superior to “oracle” methods that rely on pose annotations in both training and testing. Project: this https URL
zh

[CV-2] Robotic Visual Instruction

【速读】:该论文旨在解决人机交互中自然语言在机器人控制上的空间精度不足问题,这一缺陷导致了指令的模糊性和冗长性。其解决方案的关键在于提出一种基于物体中心的、手绘符号表示的机器人视觉指令(Robotic Visual Instruction, RoVI),通过2D草图将时空信息编码为人类可理解的视觉指令,并结合视觉-语言模型(Vision-Language Models, VLMs)构建视觉指令具身工作流(Visual Instruction Embodied Workflow, VIEW),以解析RoVI输入并生成精确的3D动作序列。

链接: https://arxiv.org/abs/2505.00693
作者: Yanbang Li,Ziyang Gong,Haoyang Li,Haoyang Li,Xiaoqi Huang,Haolan Kang,Guangping Bai,Xianzheng Ma
机构: Imperial College London (帝国理工学院); Shanghai AI Laboratory (上海人工智能实验室); UC San Diego (加州大学圣地亚哥分校); VIVO; South China University of Technology (华南理工大学); Independent Researcher (独立研究员)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision for robotic control introduces challenges such as ambiguity and verbosity. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline formulated for RoVI-conditioned policies. This approach leverages Vision-Language Models (VLMs) to interpret RoVI inputs, decode spatial and temporal constraints from 2D pixel space via keypoint extraction, and then transform them into executable 3D action sequences. We additionally curate a specialized dataset of 15K instances to fine-tune small VLMs for edge deployment, enabling them to effectively learn RoVI capabilities. Our approach is rigorously validated across 11 novel tasks in both real and simulated environments, demonstrating significant generalization capability. Notably, VIEW achieves an 87.5% success rate in real-world scenarios involving unseen tasks that feature multi-step actions, with disturbances, and trajectory-following requirements. Code and Datasets in this paper will be released soon.
zh

[CV-3] Towards Autonomous Micromobility through Scalable Urban Simulation CVPR2025

【速读】:该论文旨在解决当前微移动(micromobility)系统依赖人工操作所带来的安全性和效率问题,特别是在充满不可预测障碍物和行人的城市环境中。其解决方案的关键在于构建一个可扩展的城市仿真平台URBAN-SIM,该平台包含三个核心模块:分层城市生成管道、交互动力学生成策略以及异步场景采样方案,以提升机器人学习的多样性、真实性和效率;同时提出URBAN-BENCH基准测试套件,用于评估AI代理在城市环境中的自主微移动能力。

链接: https://arxiv.org/abs/2505.00690
作者: Wayne Wu,Honglin He,Chaoyuan Zhang,Jack He,Seth Z. Zhao,Ran Gong,Quanyi Li,Bolei Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: CVPR 2025 Highlight. Project page: this https URL

点击查看摘要

Abstract:Micromobility, which utilizes lightweight mobile machines moving in urban public spaces, such as delivery robots and mobility scooters, emerges as a promising alternative to vehicular mobility. Current micromobility depends mostly on human manual operation (in-person or remote control), which raises safety and efficiency concerns when navigating busy urban environments full of unpredictable obstacles and pedestrians. Assisting humans with AI agents in maneuvering micromobility devices presents a viable solution for enhancing safety and efficiency. In this work, we present a scalable urban simulation solution to advance autonomous micromobility. First, we build URBAN-SIM - a high-performance robot learning platform for large-scale training of embodied agents in interactive urban scenes. URBAN-SIM contains three critical modules: Hierarchical Urban Generation pipeline, Interactive Dynamics Generation strategy, and Asynchronous Scene Sampling scheme, to improve the diversity, realism, and efficiency of robot learning in simulation. Then, we propose URBAN-BENCH - a suite of essential tasks and benchmarks to gauge various capabilities of the AI agents in achieving autonomous micromobility. URBAN-BENCH includes eight tasks based on three core skills of the agents: Urban Locomotion, Urban Navigation, and Urban Traverse. We evaluate four robots with heterogeneous embodiments, such as the wheeled and legged robots, across these tasks. Experiments on diverse terrains and urban structures reveal each robot’s strengths and limitations.
zh

[CV-4] Visual Test-time Scaling for GUI Agent Grounding

【速读】:该论文旨在解决视觉语言模型代理在理解网页时面临的挑战,即由于GUI图像的视觉复杂性和大量界面元素导致的动作选择困难问题。解决方案的关键在于提出RegionFocus方法,通过动态放大相关区域来减少背景干扰并提升定位准确性,同时引入图像作为地图(image-as-map)机制,以可视化关键地标并提供透明的动作记录,从而支持代理有效选择动作候选。

链接: https://arxiv.org/abs/2505.00684
作者: Tiange Luo,Lajanugen Logeswaran,Justin Johnson,Honglak Lee
机构: University of Michigan(密歇根大学); LG AI Research(LG人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. To support this process, we propose an image-as-map mechanism that visualizes key landmarks at each step, providing a transparent action record and enables the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 28+% on Screenspot-pro and 24+% on WebVoyager benchmarks on top of two state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL, highlighting the effectiveness of visual test-time scaling in interactive settings. We achieve a new state-of-the-art grounding performance of 61.6% on the ScreenSpot-Pro benchmark by applying RegionFocus to a Qwen2.5-VL-72B model. Our code will be released publicly at this https URL.
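RegionFocus 的“放大相关区域”一步本身很简单:围绕候选坐标裁剪并重采样截图,再交还给 VLM 复核。示意如下(缩放倍数与输出尺寸为假设):

```python
from PIL import Image

def region_focus(screenshot_path, center, zoom=3.0, out_size=(1024, 1024)):
    """围绕 center=(cx, cy) 像素坐标裁出 1/zoom 大小的窗口并放大。"""
    img = Image.open(screenshot_path)
    cx, cy = center
    w, h = img.width / zoom, img.height / zoom
    box = (int(max(0, cx - w / 2)), int(max(0, cy - h / 2)),
           int(min(img.width, cx + w / 2)), int(min(img.height, cy + h / 2)))
    return img.crop(box).resize(out_size)
```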
zh

[CV-5] MINERVA: Evaluating Complex Video Reasoning

【速读】:该论文旨在解决当前视频基准测试中缺乏中间推理步骤和可解释性的问题,这使得难以评估多模态大语言模型(Multimodal LLMs)是否真正能够结合感知和时间信息进行视频推理,而非仅通过偶然或语言偏见获得正确答案。其解决方案的关键是提出一个名为MINERVA的新视频推理数据集,该数据集包含带有详细人工构建推理轨迹的多答案选项,具备多模态特性、视频领域和长度的多样性,以及复杂的多步骤问题,从而为前沿开源和专有模型提供挑战。

链接: https://arxiv.org/abs/2505.00681
作者: Arsha Nagrani,Sachit Menon,Ahmet Iscen,Shyamal Buch,Ramin Mehran,Nilpa Jha,Anja Hauth,Yukun Zhu,Carl Vondrick,Mikhail Sirotenko,Cordelia Schmid,Tobias Weyand
机构: Google DeepMind(谷歌深度思维); Columbia University(哥伦比亚大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal LLMs are turning their focus to video benchmarks, however most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be publicly available under this https URL#minerva.
zh

[CV-6] Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments

【速读】:该论文试图解决城市空气质量恶化问题,特别是在像德里这样人口密集且交通繁忙的大都市中,传统静态空气净化设施因部署位置不当和适应性不足而难以有效改善空气质量。解决方案的关键在于提出一种基于深度强化学习(DRL)的框架,利用近端策略优化(PPO)算法,通过多维空间和环境因素(如人口密度、交通模式、工业影响和绿地约束)迭代学习并识别高影响区域,从而优化空气净化站的布局,实现AQI的显著提升与高效覆盖。

链接: https://arxiv.org/abs/2505.00668
作者: Kirtan Rajesh,Suvidha Rupesh Kumar
机构: Vellore Institute of Technology (维洛尔理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Urban air pollution remains a pressing global concern, particularly in densely populated and traffic-intensive metropolitan areas like Delhi, where exposure to harmful pollutants severely impacts public health. Delhi, being one of the most polluted cities globally, experiences chronic air quality issues due to vehicular emissions, industrial activities, and construction dust, which exacerbate its already fragile atmospheric conditions. Traditional pollution mitigation strategies, such as static air purifying installations, often fail to maximize their impact due to suboptimal placement and limited adaptability to dynamic urban environments. This study presents a novel deep reinforcement learning (DRL) framework to optimize the placement of air purification booths to improve the air quality index (AQI) in the city of Delhi. We employ Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm, to iteratively learn and identify high-impact locations based on multiple spatial and environmental factors, including population density, traffic patterns, industrial influence, and green space constraints. Our approach is benchmarked against conventional placement strategies, including random and greedy AQI-based methods, using multi-dimensional performance evaluation metrics such as AQI improvement, spatial coverage, population and traffic impact, and spatial entropy. Experimental results demonstrate that the RL-based approach outperforms baseline methods by achieving a balanced and effective distribution of air purification infrastructure. Notably, the DRL framework achieves an optimal trade-off between AQI reduction and high-coverage deployment, ensuring equitable environmental benefits across urban regions. The findings underscore the potential of AI-driven spatial optimization in advancing smart city initiatives and data-driven urban air quality management.
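多目标奖励通常写成加权和:AQI 改善为主项,人口与交通覆盖为辅,空间重叠给惩罚。示意如下(权重为假设值,并非论文设定):

```python
def placement_reward(aqi_drop, pop_covered, traffic_covered, overlap_penalty,
                     w=(1.0, 0.5, 0.3, 0.4)):
    """PPO 每步选址后的标量奖励。"""
    return (w[0] * aqi_drop + w[1] * pop_covered
            + w[2] * traffic_covered - w[3] * overlap_penalty)

print(placement_reward(aqi_drop=12.5, pop_covered=0.8,
                       traffic_covered=0.6, overlap_penalty=0.1))
```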
zh

[CV-7] Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook

【速读】:该论文旨在解决传统深度学习架构在遥感领域的局限性,特别是卷积神经网络(CNN)的有限感受野和视觉变压器(ViT)的二次计算复杂度问题,这些问题限制了其在高分辨率遥感数据中的可扩展性。论文提出的解决方案关键在于引入状态空间模型(SSM),尤其是Mamba架构,该架构结合了线性计算复杂度与全局上下文建模能力,从而为遥感分析提供了一种范式转变的框架。

链接: https://arxiv.org/abs/2505.00630
作者: Muyi Bao,Shuchang Lyu,Zhaoyang Xu,Huiyu Zhou,Jinchang Ren,Shiming Xiang,Xiangtai Li,Guangliang Cheng
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); Beihang University (北京航空航天大学); Cambridge University (剑桥大学); University of Leicester (莱斯特大学); Robert Gordon University (罗伯特·戈登大学); Chinese Academy of Sciences (中国科学院); Nanyang Technological University (南洋理工大学); University of Liverpool (利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has profoundly transformed remote sensing, yet prevailing architectures like Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) remain constrained by critical trade-offs: CNNs suffer from limited receptive fields, while ViTs grapple with quadratic computational complexity, hindering their scalability for high-resolution remote sensing data. State Space Models (SSMs), particularly the recently proposed Mamba architecture, have emerged as a paradigm-shifting solution, combining linear computational scaling with global context modeling. This survey presents a comprehensive review of Mamba-based methodologies in remote sensing, systematically analyzing about 120 studies to construct a holistic taxonomy of innovations and applications. Our contributions are structured across five dimensions: (i) foundational principles of vision Mamba architectures, (ii) micro-architectural advancements such as adaptive scan strategies and hybrid SSM formulations, (iii) macro-architectural integrations, including CNN-Transformer-Mamba hybrids and frequency-domain adaptations, (iv) rigorous benchmarking against state-of-the-art methods in multiple application tasks, such as object detection, semantic segmentation, change detection, etc. and (v) critical analysis of unresolved challenges with actionable future directions. By bridging the gap between SSM theory and remote sensing practice, this survey establishes Mamba as a transformative framework for remote sensing analysis. To our knowledge, this paper is the first systematic review of Mamba architectures in remote sensing. Our work provides a structured foundation for advancing research in remote sensing systems through SSM-based methods. We curate an open-source repository (this https URL) to foster community-driven advancements.
zh

[CV-8] Brain Foundation Models with Hypergraph Dynamic Adapter for Brain Disease Analysis

【速读】:该论文旨在解决当前脑部基础模型在任务和数据同质性、泛化能力受限以及对多样化临床任务适应效率低下等方面的问题。其关键解决方案是提出SAM-Brain3D,一个基于超过66,000个跨14种MRI子模态的脑图像-标签对训练的脑部专用基础模型,以及Hypergraph Dynamic Adapter (HyDA),一种轻量级适配器,通过超图融合多模态数据并动态生成患者特定的卷积核,实现多尺度特征融合与个性化患者适配,从而提升模型在多种脑疾病分割与分类任务中的性能。

链接: https://arxiv.org/abs/2505.00627
作者: Zhongying Deng,Haoyu Wang,Ziyan Huang,Lipei Zhang,Angelica I. Aviles-Rivero,Chaoyu Liu,Junjun He,Zoe Kourtzi,Carola-Bibiane Schönlieb
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 4 figures

点击查看摘要

Abstract:Brain diseases, such as Alzheimer’s disease and brain tumors, present profound challenges due to their complexity and societal impact. Recent advancements in brain foundation models have shown significant promise in addressing a range of brain-related tasks. However, current brain foundation models are limited by task and data homogeneity, restricted generalization beyond segmentation or classification, and inefficient adaptation to diverse clinical tasks. In this work, we propose SAM-Brain3D, a brain-specific foundation model trained on over 66,000 brain image-label pairs across 14 MRI sub-modalities, and Hypergraph Dynamic Adapter (HyDA), a lightweight adapter for efficient and effective downstream adaptation. SAM-Brain3D captures detailed brain-specific anatomical and modality priors for segmenting diverse brain targets and broader downstream tasks. HyDA leverages hypergraphs to fuse complementary multi-modal data and dynamically generate patient-specific convolutional kernels for multi-scale feature fusion and personalized patient-wise adaptation. Together, our framework excels across a broad spectrum of brain disease segmentation and classification tasks. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art approaches, offering a new paradigm for brain disease analysis through multi-modal, multi-scale, and dynamic foundation modeling.
zh

[CV-9] Diverse Semantics-Guided Feature Alignment and Decoupling for Visible-Infrared Person Re-Identification

【速读】:该论文旨在解决可见光-红外行人重识别(VI-ReID)中由于模态差异大导致的特征对齐困难以及风格噪声(如光照和色彩对比度)降低特征身份可区分性和模态不变性的问题。其解决方案的关键在于提出一种新颖的多样化语义引导特征对齐与解耦(DSFAD)网络,通过Diverse Semantics-guided Feature Alignment(DSFA)模块生成多样化的句子结构来引导跨模态特征对齐,并利用Semantic Margin-guided Feature Decoupling(SMFD)模块分解视觉特征为行人相关和风格相关部分,同时通过Semantic Consistency-guided Feature Restitution(SCFR)模块防止语义信息丢失,从而提升模型性能。

链接: https://arxiv.org/abs/2505.00619
作者: Neng Dong,Shuanglin Yan,Liyan Zhang,Jinhui Tang
机构: Nanjing University of Science and Technology (南京理工大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visible-Infrared Person Re-Identification (VI-ReID) is a challenging task due to the large modality discrepancy between visible and infrared images, which complicates the alignment of their features into a suitable common space. Moreover, style noise, such as illumination and color contrast, reduces the identity discriminability and modality invariance of features. To address these challenges, we propose a novel Diverse Semantics-guided Feature Alignment and Decoupling (DSFAD) network to align identity-relevant features from different modalities into a textual embedding space and disentangle identity-irrelevant features within each modality. Specifically, we develop a Diverse Semantics-guided Feature Alignment (DSFA) module, which generates pedestrian descriptions with diverse sentence structures to guide the cross-modality alignment of visual features. Furthermore, to filter out style information, we propose a Semantic Margin-guided Feature Decoupling (SMFD) module, which decomposes visual features into pedestrian-related and style-related components, and then constrains the similarity between the former and the textual embeddings to be at least a margin higher than that between the latter and the textual embeddings. Additionally, to prevent the loss of pedestrian semantics during feature decoupling, we design a Semantic Consistency-guided Feature Restitution (SCFR) module, which further excavates useful information for identification from the style-related features and restores it back into the pedestrian-related features, and then constrains the similarity between the features after restitution and the textual embeddings to be consistent with that between the features before decoupling and the textual embeddings. Extensive experiments on three VI-ReID datasets demonstrate the superiority of our DSFAD.
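SMFD 的边距约束可以直接写成铰链损失:行人相关特征与文本嵌入的相似度至少比风格相关特征高出 margin。示意如下(margin 取值为假设):

```python
import torch.nn.functional as F

def semantic_margin_loss(ped_feat, style_feat, text_emb, margin=0.2):
    sim_ped = F.cosine_similarity(ped_feat, text_emb, dim=-1)
    sim_style = F.cosine_similarity(style_feat, text_emb, dim=-1)
    # 违反 sim_ped >= sim_style + margin 时产生正的损失
    return F.relu(sim_style - sim_ped + margin).mean()
```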
zh

[CV-10] Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

Quick Read: This paper tackles 3D face reconstruction from a single RGB image. The key to its solution is Pixel3DMM, a set of highly generalized vision transformers that predict per-pixel geometric cues to constrain the optimization of a 3D morphable face model (3DMM). By exploiting the latent features of the DINO foundation model, introducing tailored surface-normal and UV-coordinate prediction heads, and training on three high-quality 3D face datasets registered against the FLAME mesh topology, the method ultimately achieves more accurate 3D face reconstruction.

Link: https://arxiv.org/abs/2505.00615
Authors: Simon Giebenhain, Tobias Kirschstein, Martin Rünz, Lourdes Agapito, Matthias Nießner
Affiliations: Technical University of Munich; Synthesia; University College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Website: this https URL ; Video: this https URL

Abstract:We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly-generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features high diversity in facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the most competitive baselines by over 15% in terms of geometric accuracy for posed facial expressions.

[CV-11] Dietary Intake Estimation via Continuous 3D Reconstruction of Food CVPR

Quick Read: This paper addresses the inaccuracy of traditional dietary-habit monitoring methods that rely on self-reported data, inaccuracies that can contribute to health risks such as obesity, diabetes, and cardiovascular disease. The key to its solution is building 3D food models from monocular 2D video: using COLMAP and pose-estimation algorithms, detailed 3D representations of food are generated to accurately monitor food-intake behaviour, and a new automated state-recognition method enables precise detection of state changes while maintaining model fidelity.

Link: https://arxiv.org/abs/2505.00606
Authors: Wallace Lee, YuHao Chen
Affiliations: University of Waterloo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 2025 CVPR MetaFood Workshop

Abstract:Monitoring dietary habits is crucial for preventing health risks associated with overeating and undereating, including obesity, diabetes, and cardiovascular diseases. Traditional methods for tracking food intake rely on self-reported data collected before or after eating, which are prone to inaccuracies. This study proposes an approach to accurately monitor ingestion behaviours by leveraging 3D food models constructed from monocular 2D video. Using COLMAP and pose estimation algorithms, we generate detailed 3D representations of food, allowing us to observe changes in food volume as it is consumed. Experiments with toy models and real food items demonstrate the approach's potential. Meanwhile, we propose a new methodology for the automated state recognition challenge, accurately detecting state changes and maintaining model fidelity. The 3D reconstruction approach shows promise in capturing comprehensive dietary behaviour insights, ultimately contributing to the development of automated and accurate dietary monitoring tools.

[CV-12] Visual Trajectory Prediction of Vessels for Inland Navigation

Quick Read: This paper addresses video-based vessel tracking and trajectory prediction for inland navigation, in particular the object misclassification that arises in complex environments. The key to its solution is the integration of advanced object-detection methods, Kalman filters, and spline-based interpolation; the Kalman filter proves especially robust in providing smoothed trajectories, improving the accuracy of vessel-motion prediction, which matters for collision avoidance and situational awareness.

Link: https://arxiv.org/abs/2505.00599
Authors: Alexander Puzicha, Konstantin Wüstefeld, Kathrin Wilms, Frank Weichert
Affiliations: TU Dortmund University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The future of inland navigation increasingly relies on autonomous systems and remote operations, emphasizing the need for accurate vessel trajectory prediction. This study addresses the challenges of video-based vessel tracking and prediction by integrating advanced object detection methods, Kalman filters, and spline-based interpolation. However, existing detection systems often misclassify objects in inland waterways due to complex surroundings. A comparative evaluation of tracking algorithms, including BoT-SORT, Deep OC-SORT, and ByteTrack, highlights the robustness of the Kalman filter in providing smoothed trajectories. Experimental results from diverse scenarios demonstrate improved accuracy in predicting vessel movements, which is essential for collision avoidance and situational awareness. The findings underline the necessity of customized datasets and models for inland navigation. Future work will expand the datasets and incorporate vessel classification to refine predictions, supporting both autonomous systems and human operators in complex environments.
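To make the smoothing step above concrete, here is a minimal constant-velocity Kalman filter over 2D vessel positions. It is a sketch under assumed state, process-noise, and measurement-noise models, not the authors' implementation:

```python
# Minimal constant-velocity Kalman filter for smoothing noisy 2D vessel
# positions. State is [x, y, vx, vy]; all matrices are illustrative
# assumptions, not the paper's exact design.
import numpy as np

def kalman_smooth(observations, dt=1.0, q=1e-2, r=1.0):
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)  # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)  # we only observe positions
    Q, R = q * np.eye(4), r * np.eye(2)        # process / measurement noise
    x = np.array([*observations[0], 0.0, 0.0]) # start at first observation
    P = np.eye(4)
    smoothed = []
    for z in observations:
        x, P = F @ x, F @ P @ F.T + Q                      # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)       # Kalman gain
        x = x + K @ (np.asarray(z, dtype=float) - H @ x)   # update
        P = (np.eye(4) - K @ H) @ P
        smoothed.append(x[:2].copy())
    return np.array(smoothed)

noisy_track = [(t, 0.5 * t + np.random.randn() * 0.3) for t in range(20)]
print(kalman_smooth(noisy_track)[:3])
```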

[CV-13] Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading

Quick Read: This paper addresses model bias in medical disease-image grading caused by domain shifts and data imbalance, which complicates clinical deployment. The key to its solution is an Uncertainty-aware Multi-experts Knowledge Distillation (UMKD) framework that transfers knowledge from multiple expert models into a single student model for more robust and reliable performance. Its core innovations are decoupling task-agnostic from task-specific features via shallow and compact alignment in the feature space, and an uncertainty-aware decoupled distillation mechanism in the output space that dynamically adjusts knowledge-transfer weights according to each expert model's uncertainty. The method also handles model-architecture heterogeneity and distribution discrepancies between source and target domains.

Link: https://arxiv.org/abs/2505.00592
Authors: Shuo Tong, Shangde Gao, Ke Liu, Zihang Huang, Hongxia Xu, Haochao Ying, Jian Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Automatic disease image grading is a significant application of artificial intelligence for healthcare, enabling faster and more accurate patient assessments. However, domain shifts, which are exacerbated by data imbalance, introduce bias into the model, posing deployment difficulties in clinical applications. To address the problem, we propose a novel Uncertainty-aware Multi-experts Knowledge Distillation (UMKD) framework to transfer knowledge from multiple expert models to a single student model. Specifically, to extract discriminative features, UMKD decouples task-agnostic and task-specific features with shallow and compact feature alignment in the feature space. At the output space, an uncertainty-aware decoupled distillation (UDD) mechanism dynamically adjusts knowledge transfer weights based on expert model uncertainties, ensuring robust and reliable distillation. Additionally, UMKD also tackles the problems of model architecture heterogeneity and distribution discrepancies between source and target domains, which are inadequately tackled by previous KD approaches. Extensive experiments on histology prostate grading (SICAPv2) and fundus image grading (APTOS) demonstrate that UMKD achieves a new state-of-the-art in both source-imbalanced and target-imbalanced scenarios, offering a robust and practical solution for real-world disease image grading.
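As one way to picture the uncertainty-aware weighting in UDD, the sketch below down-weights experts with high predictive entropy when averaging their distillation terms. The entropy proxy and softmax weighting are assumptions for illustration, not the paper's exact formulation:

```python
# Hypothetical uncertainty-aware weighting across expert models:
# experts with lower predictive entropy get larger distillation weights.
import torch
import torch.nn.functional as F

def uncertainty_weights(expert_logits):
    """expert_logits: list of [batch, classes] tensors, one per expert."""
    entropies = []
    for logits in expert_logits:
        p = logits.softmax(dim=-1)
        entropies.append(-(p * p.clamp_min(1e-8).log()).sum(dim=-1))  # [batch]
    ent = torch.stack(entropies, dim=0)   # [experts, batch]
    return (-ent).softmax(dim=0)          # low entropy -> high weight

def distill_loss(student_logits, expert_logits, T=2.0):
    w = uncertainty_weights(expert_logits)            # [experts, batch]
    log_p_s = (student_logits / T).log_softmax(dim=-1)
    loss = 0.0
    for i, logits in enumerate(expert_logits):
        p_t = (logits / T).softmax(dim=-1)
        kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1)  # [batch]
        loss = loss + (w[i] * kl).mean()
    return loss * T * T  # usual temperature scaling of the KD gradient
```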

[CV-14] Synthesizing and Identifying Noise Levels in Autonomous Vehicle Camera Radar Datasets

Quick Read: This paper addresses the insufficient robustness of object detection and tracking pipelines in autonomous driving under sensor failures. The key to its solution is a realistic synthetic data-augmentation pipeline for camera-radar autonomous vehicle (AV) datasets that accurately simulates sensor failures and the data deterioration caused by real-world interference.

Link: https://arxiv.org/abs/2505.00584
Authors: Mathis Morales, Golnaz Habibi
Affiliations: University of Oklahoma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments:

Abstract:Detecting and tracking objects is a crucial component of any autonomous navigation method. For the past decades, object detection has yielded promising results using neural networks on various datasets. While many methods focus on performance metrics, few projects focus on improving the robustness of these detection and tracking pipelines, notably to sensor failures. In this paper we attempt to address this issue by creating a realistic synthetic data augmentation pipeline for camera-radar Autonomous Vehicle (AV) datasets. Our goal is to accurately simulate sensor failures and data deterioration due to real-world interferences. We also present our results of a baseline lightweight Noise Recognition neural network trained and tested on our augmented dataset, reaching an overall recognition accuracy of 54.4% on 11 categories across 10086 images and 2145 radar point-clouds.
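For intuition, an augmentation pipeline of this kind might apply corruptions like the following to camera frames and radar point clouds; the noise types and magnitudes here are illustrative assumptions, not the authors' pipeline:

```python
# Illustrative sensor-failure corruptions for camera and radar data.
import numpy as np

def corrupt_image(img, mode="gaussian", severity=0.1, rng=None):
    rng = rng or np.random.default_rng()
    img = img.astype(np.float32) / 255.0
    if mode == "gaussian":                 # electronic noise
        img = img + rng.normal(0.0, severity, img.shape)
    elif mode == "dead_pixels":            # failed sensor cells
        mask = rng.random(img.shape[:2]) < severity
        img[mask] = 0.0
    elif mode == "blackout":               # total camera failure
        img[:] = 0.0
    return (np.clip(img, 0, 1) * 255).astype(np.uint8)

def corrupt_radar(points, drop_rate=0.3, jitter=0.05, rng=None):
    """points: [N, 3+] radar point cloud; randomly drop and jitter points."""
    rng = rng or np.random.default_rng()
    keep = rng.random(len(points)) > drop_rate
    pts = points[keep].copy()
    pts[:, :3] += rng.normal(0.0, jitter, (len(pts), 3))
    return pts
```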

[CV-15] AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis

Quick Read: This paper addresses two key problems in adapting pre-trained vision-language models such as CLIP to animal behavior recognition: effectively integrating motion information and devising an efficient temporal modeling scheme. The key to its solution is the AnimalMotionCLIP framework, which interleaves video frames with optical-flow information in the CLIP architecture to fuse spatial and temporal features, and compares several temporal modeling strategies (dense, semi-dense, and sparse aggregation of classifiers) so that fine-grained temporal actions are recognized accurately.

Link: https://arxiv.org/abs/2505.00569
Authors: Enmin Zhong, Carlos R. del-Blanco, Daniel Berjón, Fernando Jaureguizar, Narciso García
Affiliations: Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center, ETSI Telecomunicación, Universidad Politécnica de Madrid
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures, accepted for the poster session at the CV4Animals workshop (Computer Vision for Animal Behavior Tracking and Modeling), in conjunction with Computer Vision and Pattern Recognition 2024

Abstract:Recently, there has been a surge of interest in applying deep learning techniques to animal behavior recognition, particularly leveraging pre-trained visual language models, such as CLIP, due to their remarkable generalization capacity across various downstream tasks. However, adapting these models to the specific domain of animal behavior recognition presents two significant challenges: integrating motion information and devising an effective temporal modeling scheme. In this paper, we propose AnimalMotionCLIP to address these challenges by interleaving video frames and optical flow information in the CLIP framework. Additionally, several temporal modeling schemes using an aggregation of classifiers are proposed and compared: dense, semi-dense, and sparse. As a result, fine temporal actions can be correctly recognized, which is of vital importance in animal behavior analysis. Experiments on the Animal Kingdom dataset demonstrate that AnimalMotionCLIP achieves superior performance compared to state-of-the-art approaches.
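The three aggregation schemes can be pictured as different frame subsets whose per-frame classifier logits are averaged; the selection rules below are assumptions for illustration:

```python
# Sketch of dense / semi-dense / sparse temporal aggregation of
# per-frame classifier scores.
import numpy as np

def select_frames(num_frames, scheme):
    if scheme == "dense":
        return np.arange(num_frames)                          # every frame
    if scheme == "semi_dense":
        return np.arange(0, num_frames, 2)                    # every 2nd frame
    if scheme == "sparse":
        return np.linspace(0, num_frames - 1, 8).astype(int)  # 8 frames
    raise ValueError(scheme)

def aggregate_logits(per_frame_logits, scheme="dense"):
    idx = select_frames(len(per_frame_logits), scheme)
    return per_frame_logits[idx].mean(axis=0)  # average the classifiers

logits = np.random.randn(32, 140)  # 32 frames, 140 behavior classes
print(aggregate_logits(logits, "sparse").shape)
```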

[CV-16] Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities

Quick Read: This paper addresses the poor adaptability of multimodal MRI models in clinical practice when imaging modalities are missing. Traditional approaches assume that all modalities are available during both pre-training and fine-tuning, yet in practice modalities are often missing due to acquisition issues, specialist unavailability, or experimental design, forcing a separate model to be trained for every modality combination, which is resource-intensive and impractical in the clinic. The key to the proposed solution is BM-MAE, a masked image modeling pre-training strategy tailored to multimodal MRI: a single pre-trained model adapts to any combination of available modalities, extracting intra- and inter-modal information without architectural changes, performs strongly on downstream tasks, and can efficiently reconstruct missing modalities.

Link: https://arxiv.org/abs/2505.00568
Authors: Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal
Affiliations: Oncopole Claudius Régaud; IRT Saint Exupéry; INSERM Cancer Research Center of Toulouse
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal magnetic resonance imaging (MRI) constitutes the first line of investigation for clinicians in the care of brain tumors, providing crucial insights for surgery planning, treatment monitoring, and biomarker identification. Pre-training on large datasets has been shown to help models learn transferable representations and adapt with minimal labeled data. This behavior is especially valuable in medical imaging, where annotations are often scarce. However, applying this paradigm to multimodal medical data introduces a challenge: most existing approaches assume that all imaging modalities are available during both pre-training and fine-tuning. In practice, missing modalities often occur due to acquisition issues, specialist unavailability, or specific experimental designs on small in-house datasets. Consequently, a common approach involves training a separate model for each desired modality combination, making the process both resource-intensive and impractical for clinical use. Therefore, we introduce BM-MAE, a masked image modeling pre-training strategy tailored for multimodal MRI data. The same pre-trained model seamlessly adapts to any combination of available modalities, extracting rich representations that capture both intra- and inter-modal information. This allows fine-tuning on any subset of modalities without requiring architectural changes, while still benefiting from a model pre-trained on the full set of modalities. Extensive experiments show that the proposed pre-training strategy outperforms or remains competitive with baselines that require separate pre-training for each modality subset, while substantially surpassing training from scratch on several downstream tasks. Additionally, it can quickly and efficiently reconstruct missing modalities, highlighting its practical value. Code and trained models are available at: this https URL

[CV-17] X-ray illicit object detection using hybrid CNN-transformer neural network architectures

Quick Read: This paper addresses the difficulty of detecting heavily occluded or deliberately concealed objects in X-ray security inspection, where conventional methods fall short. The key to its solution is exploring hybrid models that combine convolutional neural networks (CNNs) with transformer architectures, capturing both local and long-range features for more robust detection. The study compares several hybrid CNN-transformer architectures against the YOLOv8 baseline on three challenging public X-ray inspection datasets and finds that hybrid architectures are more robust under domain distribution shift.

Link: https://arxiv.org/abs/2505.00564
Authors: Jorgen Cani, Christos Diou, Spyridon Evangelatos, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Affiliations: Harokopio University of Athens; Netcompany-Intrasoft S.A.; University of Western Macedonia; K3Y Ltd; Kingston University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the field of X-ray security applications, even the smallest details can significantly impact outcomes. Objects that are heavily occluded or intentionally concealed pose a great challenge for detection, whether by human observation or through advanced technological applications. While certain Deep Learning (DL) architectures demonstrate strong performance in processing local information, such as Convolutional Neural Networks (CNNs), others excel in handling distant information, e.g., transformers. In X-ray security imaging the literature has been dominated by the use of CNN-based methods, while the integration of the two aforementioned leading architectures has not been sufficiently explored. In this paper, various hybrid CNN-transformer architectures are evaluated against a common CNN object detection baseline, namely YOLOv8. In particular, a CNN (HGNetV2) and a hybrid CNN-transformer (Next-ViT-S) backbone are combined with different CNN/transformer detection heads (YOLOv8 and RT-DETR). The resulting architectures are comparatively evaluated on three challenging public X-ray inspection datasets, namely EDS, HiXray, and PIDray. Interestingly, while the YOLOv8 detector with its default backbone (CSP-DarkNet53) is generally shown to be advantageous on the HiXray and PIDray datasets, when a domain distribution shift is incorporated in the X-ray images (as happens in the EDS datasets), hybrid CNN-transformer architectures exhibit increased robustness. Detailed comparative evaluation results, including object-level detection performance and object-size error analysis, demonstrate the strengths and weaknesses of each architectural combination and suggest guidelines for future research. The source code and network weights of the models employed in this study are available at this https URL.

[CV-18] A Robust Deep Networks based Multi-Object MultiCamera Tracking System for City Scale Traffic

Quick Read: This paper addresses the challenges of tracking and matching vehicles across multiple cameras in city-scale traffic, including diverse vehicle attributes, occlusion, illumination variation, shadows, and differing video resolutions. The key to its solution is a deep learning-based Multi-Object Multi-Camera Tracking (MO-MCT) framework: Mask R-CNN performs object detection and Non-Maximum Suppression (NMS) selects target objects from overlapping detections; transfer learning enables vehicle re-identification, associating and generating tracklets across cameras; appropriate loss functions and distance measures handle occlusion, illumination, and shadow challenges; and a feature-extraction module combining ResNet-152 with Deep SORT performs the final vehicle tracking.

Link: https://arxiv.org/abs/2505.00534
Authors: Muhammad Imran Zaman, Usama Ijaz Bajwa, Gulshan Saleem, Rana Hammad Raza
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision sensors are becoming more important in Intelligent Transportation Systems (ITS) for traffic monitoring, management, and optimization as the number of network cameras continues to rise. However, manual object tracking and matching across multiple non-overlapping cameras pose significant challenges in city-scale urban traffic scenarios. These challenges include handling diverse vehicle attributes, occlusions, illumination variations, shadows, and varying video resolutions. To address these issues, we propose an efficient and cost-effective deep learning-based framework for Multi-Object Multi-Camera Tracking (MO-MCT). The proposed framework utilizes Mask R-CNN for object detection and employs Non-Maximum Suppression (NMS) to select target objects from overlapping detections. Transfer learning is employed for re-identification, enabling the association and generation of vehicle tracklets across multiple cameras. Moreover, we leverage appropriate loss functions and distance measures to handle occlusion, illumination, and shadow challenges. The final solution identification module performs feature extraction using ResNet-152 coupled with Deep SORT based vehicle tracking. The proposed framework is evaluated on the 5th AI City Challenge dataset (Track 3), comprising 46 camera feeds. Among these 46 camera streams, 40 are used for model training and validation, while the remaining six are utilized for model testing. The proposed framework achieves competitive performance with an IDF1 score of 0.8289, and precision and recall scores of 0.9026 and 0.8527 respectively, demonstrating its effectiveness in robust and accurate vehicle tracking.
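The NMS step used to select targets from overlapping Mask R-CNN detections is a standard component; a minimal IoU-based version looks like this:

```python
# Standard IoU-based non-maximum suppression over detection boxes.
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: [N, 4] as (x1, y1, x2, y2); returns indices of kept boxes."""
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]  # drop overlapping boxes
    return keep
```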

[CV-19] InterLoc: LiDAR-based Intersection Localization using Road Segmentation with Automated Evaluation Method

Quick Read: This paper addresses intersection detection in road networks, where existing methods either rely on scarce hand-labeled data or ignore the semantic information already computed onboard. The key to its solution is a LiDAR-based method that fuses semantic road segmentation with vehicle localization to detect intersection candidates in a bird's eye view (BEV) representation, then refines those candidates by analyzing branch topology with a least-squares formulation, improving accuracy and robustness.

Link: https://arxiv.org/abs/2505.00512
Authors: Nguyen Hoang Khoi Tran, Julie Stephany Berrio, Mao Shan, Zhenxing Ming, Stewart Worrall
Affiliations: Australian Centre for Robotics (ACFR)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Intersections are geometric and functional key points in every road network. They offer strong landmarks to correct GNSS dropouts and anchor new sensor data in up-to-date maps. Despite that importance, intersection detectors either ignore the rich semantic information already computed onboard or depend on scarce, hand-labeled intersection datasets. To close that gap, this paper presents a LiDAR-based method for intersection detection that (i) fuses semantic road segmentation with vehicle localization to detect intersection candidates in a bird’s eye view (BEV) representation and (ii) refines those candidates by analyzing branch topology with a least squares formulation. To evaluate our method, we introduce an automated benchmarking pipeline that pairs detections with OpenStreetMap (OSM) intersection nodes using precise GNSS/INS ground-truth poses. Tested on eight SemanticKITTI sequences, the approach achieves a mean localization error of 1.9 m, 89% precision, and 77% recall at a 5 m tolerance, outperforming the latest learning-based baseline. Moreover, the method is robust to segmentation errors higher than those of the benchmark model, demonstrating its applicability in the real world.
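One way to picture the least-squares branch analysis: fit a total-least-squares line to each branch's BEV road points, then intersect branch lines to refine the intersection center. The details below are assumptions, not the paper's exact formulation:

```python
# Sketch: line fitting per branch and pairwise line intersection in BEV.
import numpy as np

def fit_branch_line(points):
    """points: [N, 2] BEV road points of one branch -> (point, direction)."""
    centroid = points.mean(axis=0)
    # principal direction via SVD of centered points (total least squares)
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[0]

def intersect_two_lines(p1, d1, p2, d2):
    """Solve p1 + t1*d1 = p2 + t2*d2 for the crossing point."""
    A = np.stack([d1, -d2], axis=1)
    t = np.linalg.lstsq(A, p2 - p1, rcond=None)[0]
    return p1 + t[0] * d1

branch_a = np.array([[0, 0], [1, 0.1], [2, -0.1], [3, 0.0]], dtype=float)
branch_b = np.array([[1.5, -2], [1.5, -1], [1.6, 0], [1.4, 1]], dtype=float)
print(intersect_two_lines(*fit_branch_line(branch_a), *fit_branch_line(branch_b)))
```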

[CV-20] Inconsistency-based Active Learning for LiDAR Object Detection

Quick Read: This paper addresses the heavy dependence of deep learning object detectors for autonomous driving on large labeled datasets, aiming to reduce the cost of data acquisition and annotation. The key to its solution is extending active learning to the LiDAR domain with inconsistency-based sample-selection strategies; experiments show that with only 50% of the labeled data, the approach matches the mAP of random sampling.

Link: https://arxiv.org/abs/2505.00511
Authors: Esteban Rivera, Loic Stratil, Markus Lienkamp
Affiliations: Technical University of Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IV2025

Abstract:Deep learning models for object detection in autonomous driving have recently achieved impressive performance gains and are already being deployed in vehicles worldwide. However, current models require increasingly large datasets for training. Acquiring and labeling such data is costly, necessitating the development of new strategies to optimize this process. Active learning is a promising approach that has been extensively researched in the image domain. In our work, we extend this concept to the LiDAR domain by developing several inconsistency-based sample selection strategies and evaluate their effectiveness in various settings. Our results show that using a naive inconsistency approach based on the number of detected boxes, we achieve the same mAP as the random sampling strategy with 50% of the labeled data.
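The naive box-count inconsistency criterion from the abstract can be sketched as ranking unlabeled scans by the spread of detection counts across model variants (e.g., checkpoints or augmented passes); this is an illustrative sketch, not the authors' code:

```python
# Sketch: select the scans whose detection counts disagree the most.
import numpy as np

def select_by_box_count_inconsistency(detections_per_scan, budget):
    """detections_per_scan: dict scan_id -> list of box counts, one count
    per model variant; returns the `budget` most inconsistent scans."""
    scores = {sid: np.std(counts) for sid, counts in detections_per_scan.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]

counts = {"scan_001": [12, 18, 9], "scan_002": [5, 5, 6], "scan_003": [20, 21, 19]}
print(select_by_box_count_inconsistency(counts, budget=1))  # -> ['scan_001']
```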

[CV-21] HeAL3D: Heuristical-enhanced Active Learning for 3D Object Detection CVPR2025

Quick Read: This paper addresses sample selection for 3D object detection in autonomous driving, particularly how to pick the samples that contribute most to training in uncontrolled scenarios. Prior work focuses exclusively on the theoretical side of sample selection while neglecting the practical insights available from the extensive literature on and application of 3D detection models. The key to the proposed solution is HeAL (Heuristical-enhanced Active Learning), which combines heuristical features such as object distance and point quantity with localization and classification information to estimate uncertainty more accurately, making the selected samples more useful for training detection models.

Link: https://arxiv.org/abs/2505.00507
Authors: Esteban Rivera, Surya Prabhakaran, Markus Lienkamp
Affiliations: Technical University of Munich; Munich Institute of Robotics and Machine Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR2025

Abstract:Active Learning has proved to be a relevant approach to perform sample selection for training models for Autonomous Driving. Particularly, previous works on active learning for 3D object detection have shown that selection of samples in uncontrolled scenarios is challenging. Furthermore, current approaches focus exclusively on the theoretical aspects of the sample selection problem but neglect the practical insights that can be obtained from the extensive literature and application of 3D detection models. In this paper, we introduce HeAL (Heuristical-enhanced Active Learning for 3D Object Detection) which integrates those heuristical features together with Localization and Classification to deliver the most contributing samples to the model’s training. In contrast to previous works, our approach integrates heuristical features such as object distance and point-quantity to estimate the uncertainty, which enhance the usefulness of selected samples to train detection models. Our quantitative evaluation on KITTI shows that HeAL presents competitive mAP with respect to the State-of-the-Art, and achieves the same mAP as the full-supervised baseline with only 24% of the samples.

[CV-22] Towards Scalable Human-aligned Benchmark for Text-guided Image Editing CVPR2025

Quick Read: This paper addresses the lack of a widely accepted evaluation method for text-guided image editing models, a consequence of the task's subjective nature that forces researchers to rely on manual user studies. The key to its solution is HATIE (Human-Aligned benchmark for Text-guided Image Editing), a large-scale benchmark covering a wide range of editing tasks with a fully automated, omnidirectional evaluation pipeline. By combining multiple scores that measure different aspects of an edit to align with human perception, HATIE's evaluation is empirically verified to be human-aligned.

Link: https://arxiv.org/abs/2505.00502
Authors: Suho Ryu, Kihyun Kim, Eugene Baek, Dongsoo Shin, Joonseok Lee
Affiliations: Seoul National University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025 (highlight)

Abstract:A variety of text-guided image editing models have been proposed recently. However, there is no widely-accepted standard evaluation method mainly due to the subjective nature of the task, letting researchers rely on manual user study. To address this, we introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). Providing a large-scale benchmark set covering a wide range of editing tasks, it allows reliable evaluation, not limited to specific easy-to-evaluate cases. Also, HATIE provides a fully-automated and omnidirectional evaluation pipeline. Particularly, we combine multiple scores measuring various aspects of editing so as to align with human perception. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects, and provide benchmark results on several state-of-the-art models to provide deeper insights on their performance.

[CV-23] KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

Quick Read: This paper addresses key problems in lip synchronization, including temporal consistency, expression leakage from the input video, and facial occlusions, issues often neglected in existing work that severely affect applications such as automated dubbing. The key to its solution is KeySync, a two-stage framework with a carefully designed masking strategy that handles expression leakage and occlusions while preserving temporal consistency, achieving state-of-the-art results in lip reconstruction and cross-synchronization.

Link: https://arxiv.org/abs/2505.00497
Authors: Antoni Bigata, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Affiliations: Imperial College London; University of Wrocław
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges such as expression leakage from the input video and facial occlusions, which can severely impact real-world applications like automated dubbing, but are often neglected in existing works. To address these shortcomings, we present KeySync, a two-stage framework that succeeds in solving the issue of temporal consistency, while also incorporating solutions for leakage and occlusions using a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at this https URL.

[CV-24] JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Quick Read: This paper addresses joint modeling of RGB images and depth in multimodal generation: producing high-quality images while keeping the depth maps geometrically plausible and accurate. The key to its solution is two simple yet effective techniques: adaptive scheduling weights, which adjust training according to each modality's noise level, and an unbalanced timestep sampling strategy that trains each modality across all noise levels, allowing the model to naturally handle combinatorial generation tasks such as joint generation, depth estimation, and depth-conditioned image generation.

Link: https://arxiv.org/abs/2505.00482
Authors: Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh
Affiliations: POSTECH; Microsoft Research Asia; KAIST
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, i.e., adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation by simply controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a replaceable alternative to conditional generation. The project page is available at this https URL.
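One plausible reading of the unbalanced timestep sampling strategy, sketched with illustrative probabilities (not the paper's values): each modality draws its own timestep, and some batches pin one modality to the clean state so the conditional tasks are covered during training:

```python
# Hypothetical per-modality timestep sampling for an RGB-depth diffusion
# model; the mixing probabilities are assumptions for illustration.
import torch

def sample_timesteps(batch, T=1000, p_depth_cond=0.25):
    t_rgb = torch.randint(0, T, (batch,))
    t_depth = torch.randint(0, T, (batch,))
    u = torch.rand(batch)
    # depth estimation branch: clean RGB (t=0), noisy depth
    t_rgb[u < p_depth_cond] = 0
    # depth-conditioned image generation: clean depth, noisy RGB
    t_depth[(u >= p_depth_cond) & (u < 2 * p_depth_cond)] = 0
    # remaining batches keep independent timesteps (joint generation)
    return t_rgb, t_depth

print(sample_timesteps(8))
```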

[CV-25] ClearLines - Camera Calibration from Straight Lines

Quick Read: This paper addresses calibration from straight lines, a fundamental problem in geometric computer vision whose theory is well established but whose practical applicability remains limited, especially in real-world outdoor scenes. Diverse and cluttered scenes, interrupted reprojections of 3D straight lines, and varying lighting conditions make the task notoriously difficult, and the field lacks a dedicated dataset to drive the development of corresponding detection algorithms. To this end, the paper presents a small dataset named "ClearLines" and, by detailing its creation process, provides practical insights that can guide the development and refinement of straight 3D line detection algorithms.

Link: https://arxiv.org/abs/2505.00452
Authors: Gregory Schroeder, Mohamed Sabry, Cristina Olaverri-Monreal
Affiliations: Johannes Kepler University Linz, Austria, Department of Intelligent Transport Systems; IAV GmbH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The problem of calibration from straight lines is fundamental in geometric computer vision, with well-established theoretical foundations. However, its practical applicability remains limited, particularly in real-world outdoor scenarios. These environments pose significant challenges due to diverse and cluttered scenes, interrupted reprojections of straight 3D lines, and varying lighting conditions, making the task notoriously difficult. Furthermore, the field lacks a dedicated dataset encouraging the development of respective detection algorithms. In this study, we present a small dataset named “ClearLines”, and by detailing its creation process, provide practical insights that can serve as a guide for developing and refining straight 3D line detection algorithms.

[CV-26] Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly IJCAI-2025

Quick Read: This paper addresses 3D part assembly: understanding part relationships and predicting their 6-DoF poses to construct realistic 3D shapes for autonomous assembly. Existing methods mainly train supervised neural networks to estimate each part's transformation, which requires large amounts of manually labeled data; the cost of data collection and the variability of real-world shapes and parts make such approaches impractical at scale. The key to the proposed solution is a zero-shot part assembly method that uses pre-trained point-cloud diffusion models as discriminators in the assembly process, guiding part manipulation toward realistic shapes. The paper theoretically shows that using a diffusion model for zero-shot assembly can be transformed into an Iterative Closest Point (ICP) process, and introduces a novel pushing-away strategy to resolve overlapping parts and further improve robustness.

Link: https://arxiv.org/abs/2505.00426
Authors: Ruiyuan Zhang, Qi Wang, Jiaxiang Liu, Yu Zhang, Yuchi Huo, Chao Wu
Affiliations: Zhejiang University; North China Electric Power University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 12 figures, Accepted by IJCAI-2025

Abstract:3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we first propose a zero-shot part assembly method that utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to address the overlapping parts, thereby further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons to several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses the supervised learning method. The code has been released on this https URL.
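Since the paper shows that the diffusion-guided assembly reduces to an ICP process, a compact point-to-point ICP loop is a useful reference; this is the textbook Kabsch-based version, not the paper's variant:

```python
# Point-to-point ICP with brute-force nearest neighbors (for clarity).
import numpy as np

def icp(source, target, iters=20):
    """source: [N,3], target: [M,3]; returns R, t aligning source to target."""
    src = source.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    for _ in range(iters):
        # nearest target point for each source point
        d = ((src[:, None] - target[None]) ** 2).sum(-1)
        corr = target[d.argmin(axis=1)]
        # closed-form rigid alignment of current correspondences (Kabsch)
        mu_s, mu_t = src.mean(0), corr.mean(0)
        H = (src - mu_s).T @ (corr - mu_t)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:  # fix an improper rotation (reflection)
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_t - R @ mu_s
        src = src @ R.T + t
        # accumulate the incremental transform
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```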

[CV-27] Real-Time Animatable 2DGS-Avatars with Detail Enhancement from Monocular Videos

Quick Read: This paper addresses the challenge of reconstructing high-quality, animatable 3D human avatars from monocular video, particularly capturing fine geometric detail and keeping animation stable under dynamic or complex poses. The key to its solution is a real-time framework based on 2D Gaussian Splatting (2DGS) combined with global SMPL pose parameters, which aligns positional and rotational discrepancies and enables robust, natural pose-driven animation of the reconstructed avatars. A Rotation Compensation Network (RCN) learns rotation residuals by integrating local geometric features with global pose parameters, significantly improving the handling of non-rigid deformations and ensuring smooth, artifact-free pose transitions during animation.

Link: https://arxiv.org/abs/2505.00421
Authors: Xia Yuan, Hai Yuan, Wenyi Ge, Ying Fu, Xi Wu, Guanyu Xing
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-quality, animatable 3D human avatar reconstruction from monocular videos offers significant potential for reducing reliance on complex hardware, making it highly practical for applications in game development, augmented reality, and social media. However, existing methods still face substantial challenges in capturing fine geometric details and maintaining animation stability, particularly under dynamic or complex poses. To address these issues, we propose a novel real-time framework for animatable human avatar reconstruction based on 2D Gaussian Splatting (2DGS). By leveraging 2DGS and global SMPL pose parameters, our framework not only aligns positional and rotational discrepancies but also enables robust and natural pose-driven animation of the reconstructed avatars. Furthermore, we introduce a Rotation Compensation Network (RCN) that learns rotation residuals by integrating local geometric features with global pose parameters. This network significantly improves the handling of non-rigid deformations and ensures smooth, artifact-free pose transitions during animation. Experimental results demonstrate that our method successfully reconstructs realistic and highly animatable human avatars from monocular videos, effectively preserving fine-grained details while ensuring stable and natural pose variation. Our approach surpasses current state-of-the-art methods in both reconstruction quality and animation robustness on public benchmarks.

[CV-28] SOTA: Spike-Navigated Optimal TrAnsport Saliency Region Detection in Composite-bias Videos IJCAI2025

Quick Read: This paper addresses two problems: existing saliency detection methods perform poorly in real-world scenes because of motion blur and occlusion, and the composite noise inherent to spike-camera imaging causes discontinuities in saliency detection and biased model predictions. The key to its solution is the Spike-navigated Optimal TrAnsport Saliency Region Detection (SOTA) framework, which introduces Spike-based Micro-debias (SM) to capture subtle frame-to-frame variations and preserve critical details, and Spike-based Global-debias (SG) to reduce prediction inconsistencies across diverse conditions, improving the accuracy and robustness of saliency detection.

Link: https://arxiv.org/abs/2505.00394
Authors: Wenxuan Liu, Yao Deng, Kang Chen, Xian Zhong, Zhaofei Yu, Tiejun Huang
Affiliations: Peking University; Wuhan University of Technology; Institute for Artificial Intelligence, Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IJCAI 2025

Abstract:Existing saliency detection methods struggle in real-world scenarios due to motion blur and occlusions. In contrast, spike cameras, with their high temporal resolution, significantly enhance visual saliency maps. However, the composite noise inherent to spike camera imaging introduces discontinuities in saliency detection. Low-quality samples further distort model predictions, leading to saliency bias. To address these challenges, we propose Spike-navigated Optimal TrAnsport Saliency Region Detection (SOTA), a framework that leverages the strengths of spike cameras while mitigating biases in both spatial and temporal dimensions. Our method introduces Spike-based Micro-debias (SM) to capture subtle frame-to-frame variations and preserve critical details, even under minimal scene or lighting changes. Additionally, Spike-based Global-debias (SG) refines predictions by reducing inconsistencies across diverse conditions. Extensive experiments on real and synthetic datasets demonstrate that SOTA outperforms existing methods by eliminating composite noise bias. Our code and dataset will be released at this https URL.

[CV-29] he Invisible Threat: Evaluating the Vulnerability of Cross-Spectral Face Recognition to Presentation Attacks

Quick Read: This paper investigates the vulnerability of near-infrared (NIR) to visible (VIS) cross-spectral face recognition systems to presentation attacks. The key contribution is a comprehensive evaluation of these systems' susceptibility to specific attacks, revealing security risks in practical deployments and underlining the need for further research into improving their attack resistance.

Link: https://arxiv.org/abs/2505.00380
Authors: Anjith George, Sebastien Marcel
Affiliations: Idiap Research Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Cross-spectral face recognition systems are designed to enhance the performance of facial recognition systems by enabling cross-modal matching under challenging operational conditions. A particularly relevant application is the matching of near-infrared (NIR) images to visible-spectrum (VIS) images, enabling the verification of individuals by comparing NIR facial captures acquired with VIS reference images. The use of NIR imaging offers several advantages, including greater robustness to illumination variations, better visibility through glasses and glare, and greater resistance to presentation attacks. Despite these claimed benefits, the robustness of NIR-based systems against presentation attacks has not been systematically studied in the literature. In this work, we conduct a comprehensive evaluation into the vulnerability of NIR-VIS cross-spectral face recognition systems to presentation attacks. Our empirical findings indicate that, although these systems exhibit a certain degree of reliability, they remain vulnerable to specific attacks, emphasizing the need for further research in this area.

[CV-30] Cues3D: Unleashing the Power of Sole NeRF for Consistent and Unique Instances in Open-Vocabulary 3D Panoptic Segmentation

Quick Read: This paper addresses the limited cross-view consistency of open-vocabulary 3D panoptic segmentation when high-fidelity 3D point clouds are unavailable; existing methods combine 2D segmentation with geometry-aware 3D primitives but degrade without high-quality point clouds. The key to its solution is Cues3D, which relies solely on NeRF (Neural Radiance Fields) to obtain a globally consistent 3D geometry, effectively distinguishing objects without explicit cross-view supervision. Through a three-phase training framework and an instance disambiguation method, Cues3D achieves highly consistent, unique 3D instance IDs across views.

Link: https://arxiv.org/abs/2505.00378
Authors: Feng Xue, Wenzhuang Xu, Guofeng Zhong, Anlong Ming, Nicu Sebe
Affiliations: University of Trento; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Information Fusion

Abstract:Open-vocabulary 3D panoptic segmentation has recently emerged as a significant trend. Top-performing methods currently integrate 2D segmentation with geometry-aware 3D primitives. However, the advantage would be lost without high-fidelity 3D point clouds, such as methods based on Neural Radiance Field (NeRF). These methods are limited by the insufficient capacity to maintain consistency across partial observations. To address this, recent works have utilized contrastive loss or cross-view association pre-processing for view consensus. In contrast to them, we present Cues3D, a compact approach that relies solely on NeRF instead of pre-associations. The core idea is that NeRF's implicit 3D field inherently establishes a globally consistent geometry, enabling effective object distinction without explicit cross-view supervision. We propose a three-phase training framework for NeRF, initialization-disambiguation-refinement, whereby the instance IDs are corrected using the initially-learned knowledge. Additionally, an instance disambiguation method is proposed to match NeRF-rendered 3D masks and ensure globally unique 3D instance identities. With the aid of Cues3D, we obtain highly consistent and unique 3D instance IDs for each object across views with a balanced version of NeRF. Our experiments are conducted on ScanNet v2, ScanNet200, ScanNet++, and Replica datasets for 3D instance, panoptic, and semantic segmentation tasks. Cues3D outperforms other 2D image-based methods and competes with the latest 2D-3D merging based methods, while even surpassing them when using additional 3D point clouds. The code link can be found in the appendix and will be released on github (this https URL).

[CV-31] Automated segmentation of pediatric neuroblastoma on multi-modal MRI: Results of the SPPIN challenge at MICCAI 2023

Quick Read: This paper addresses the time-consuming and user-dependent manual creation of MRI-based anatomical 3D models for neuroblastoma surgery planning. The key to its solution is organizing the Surgical Planning in Pediatric Neuroblastoma (SPPIN) challenge, which stimulates the development of fully automatic segmentation of neuroblastoma on multi-modal MRI and sets a benchmark. The highest-ranking team used a large pre-trained network called STU-Net and achieved strong results (median Dice score 0.82, median HD95 7.69 mm, volumetric similarity (VS) 0.91), suggesting that pre-trained models hold promise on small, heterogeneous datasets, although room for improvement remains, particularly for segmenting small, pre-treated tumors.

Link: https://arxiv.org/abs/2505.00369
Authors: M.A.D. Buser, D.C. Simons, M. Fitski, M.H.W.A. Wijnen, A.S. Littooij, A.H. ter Brugge, I.N. Vos, M.H.A. Janse, M. de Boer, R. ter Maat, J. Sato, S. Kido, S. Kondo, S. Kasai, M. Wodzinski, H. Muller, J. Ye, J. He, Y. Kirchhoff, M.R. Rokkus, G. Haokai, S. Zitong, M. Fernández-Patón, D. Veiga-Canuto, D.G. Ellis, M.R. Aizenberg, B.H.M. van der Velden, H. Kuijf, A. De Luca, A.F.W. van der Steeg
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 6 figures

Abstract:Surgery plays an important role in the treatment of neuroblastoma, a common pediatric cancer. This requires careful planning, often via magnetic resonance imaging (MRI)-based anatomical 3D models. However, creating these models is often time-consuming and user dependent. We organized the Surgical Planning in Pediatric Neuroblastoma (SPPIN) challenge to stimulate developments on this topic and set a benchmark for fully automatic segmentation of neuroblastoma on multi-modal MRI. The challenge started with a training phase, where teams received 78 sets of MRI scans from 34 patients, consisting of both diagnostic and post-chemotherapy MRI scans. The final test phase, consisting of 18 MRI sets from 9 patients, determined the ranking of the teams. Ranking was based on the Dice similarity coefficient (Dice score), the 95th percentile of the Hausdorff distance (HD95) and the volumetric similarity (VS). The SPPIN challenge was hosted at MICCAI 2023. The final leaderboard consisted of 9 teams. The highest-ranking team achieved a median Dice score of 0.82, a median HD95 of 7.69 mm and a VS of 0.91, utilizing a large, pretrained network called STU-Net. A significant difference in segmentation results between diagnostic and post-chemotherapy MRI scans was observed (Dice = 0.89 vs Dice = 0.59, P = 0.01) for the highest-ranking team. SPPIN is the first medical segmentation challenge in extracranial pediatric oncology. The highest-ranking team used a large pre-trained network, suggesting that pretraining can be of use in small, heterogeneous datasets. Although the results of the highest-ranking team were high for most patients, segmentation, especially of small, pre-treated tumors, was insufficient. Therefore, more reliable segmentation methods are needed to create clinically applicable models to aid surgical planning in pediatric neuroblastoma.

[CV-32] Efficient Neural Video Representation with Temporally Coherent Modulation ECCV2024

Quick Read: This paper addresses the low parameter efficiency and high bitrate of implicit neural representations (INR) for video: grid-type parametric encodings ignore video dynamics and therefore waste trainable parameters. The key to its solution is Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that captures a video's dynamic characteristics by decomposing the spatio-temporal 3D video data into 2D grids with flow information, learning the representation rapidly with efficient parameter use. Processing temporally corresponding pixels at once yields a markedly faster encoding speed at reasonable video quality, and the method outperforms existing grid-type approaches on PSNR/LPIPS.

Link: https://arxiv.org/abs/2505.00335
Authors: Seungjun Shin, Suji Kim, Dokwan Oh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ECCV 2024

Abstract:Implicit neural representations (INR) have found successful applications across diverse domains. To employ INR in real life, it is important to speed up training. In the field of INR for video applications, the state-of-the-art approach employs grid-type parametric encoding and successfully achieves a faster encoding speed in comparison to its predecessors. However, the grid usage, which does not consider the video's dynamic nature, leads to redundant use of trainable parameters. As a result, it has significantly lower parameter efficiency and higher bitrate compared to NeRV-style methods that do not use a parametric encoding. To address the problem, we propose Neural Video representation with Temporally coherent Modulation (NVTM), a novel framework that can capture the dynamic characteristics of video. By decomposing the spatio-temporal 3D video data into a set of 2D grids with flow information, NVTM enables learning video representations rapidly and uses parameters efficiently. Our framework enables processing temporally corresponding pixels at once, resulting in the fastest encoding speed for a reasonable video quality, especially when compared to the NeRV-style method, with a speed increase of over 3 times. Also, it achieves average improvements of 1.54 dB / 0.019 in PSNR/LPIPS on UVG (Dynamic) (even with 10% fewer parameters) and 1.84 dB / 0.013 in PSNR/LPIPS on MCL-JCV (Dynamic), compared to previous grid-type works. By expanding this to compression tasks, we demonstrate comparable performance to video compression standards (H.264, HEVC) and recent INR approaches for video compression. Additionally, we perform extensive experiments demonstrating the superior performance of our algorithm across diverse tasks, encompassing super resolution, frame interpolation and video inpainting. The project page is this https URL.

[CV-33] Quaternion Wavelet-Conditioned Diffusion Models for Image Super-Resolution IJCNN2025

Quick Read: This paper addresses image super-resolution (ISR), in particular generating high-quality images with fine detail and realistic texture at high upscaling factors; existing methods struggle to balance perceptual quality with structural fidelity, and diffusion-based approaches, though promising, remain limited. The key to the proposed solution is the ResQu framework, which integrates a quaternion wavelet preprocessing mechanism with latent diffusion models and introduces a new quaternion wavelet- and time-aware encoder. By dynamically integrating quaternion wavelet embeddings at different stages of denoising, it strengthens the conditioning process while also exploiting the generative priors of foundation models such as Stable Diffusion, improving the quality of reconstructed images.

Link: https://arxiv.org/abs/2505.00334
Authors: Luigi Sigillo, Christian Bianchi, Danilo Comminiello
Affiliations: Sapienza University of Rome
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for presentation at IJCNN 2025

Abstract:Image Super-Resolution is a fundamental problem in computer vision with broad applications spanning from medical imaging to satellite analysis. The ability to reconstruct high-resolution images from low-resolution inputs is crucial for enhancing downstream tasks such as object detection and segmentation. While deep learning has significantly advanced SR, achieving high-quality reconstructions with fine-grained details and realistic textures remains challenging, particularly at high upscaling factors. Recent approaches leveraging diffusion models have demonstrated promising results, yet they often struggle to balance perceptual quality with structural fidelity. In this work, we introduce ResQu, a novel SR framework that integrates a quaternion wavelet preprocessing framework with latent diffusion models, incorporating a new quaternion wavelet- and time-aware encoder. Unlike prior methods that simply apply wavelet transforms within diffusion models, our approach enhances the conditioning process by exploiting quaternion wavelet embeddings, which are dynamically integrated at different stages of denoising. Furthermore, we also leverage the generative priors of foundation models such as Stable Diffusion. Extensive experiments on domain-specific datasets demonstrate that our method achieves outstanding SR results, in many cases outperforming existing approaches in perceptual quality and standard evaluation metrics. The code will be available after the revision process.

[CV-34] AWARE-NET: Adaptive Weighted Averag ing for Robust Ensemble Network in Deepfake Detection

Quick Read: This paper addresses the inconsistent performance of deepfake detection models across datasets and manipulation types. The key to its solution is a two-tier deep learning ensemble framework that hierarchically combines multiple instances of three state-of-the-art architectures (Xception, Res2Net101, and EfficientNet-B7): each architecture is instantiated three times with different initializations to increase model diversity, and a learnable weighting mechanism dynamically fuses their predictions. The first tier averages predictions within each architecture family to reduce model variance, while the second tier learns optimal contribution weights through backpropagation, automatically adjusting each architecture's influence and improving detection reliability and generalization.

Link: https://arxiv.org/abs/2505.00312
Authors: Muhammad Salman, Iqra Tariq, Mishal Zulfiqar, Muqadas Jalal, Sami Aujla, Sumbal Fatima
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deepfake detection has become increasingly important due to the rise of synthetic media, which poses significant risks to digital identity and cyber presence for security and trust. While multiple approaches have improved detection accuracy, challenges remain in achieving consistent performance across diverse datasets and manipulation types. In response, we propose a novel two-tier ensemble framework for deepfake detection based on deep learning that hierarchically combines multiple instances of three state-of-the-art architectures: Xception, Res2Net101, and EfficientNet-B7. Our framework employs a unique approach where each architecture is instantiated three times with different initializations to enhance model diversity, followed by a learnable weighting mechanism that dynamically combines their predictions. Unlike traditional fixed-weight ensembles, our first-tier averages predictions within each architecture family to reduce model variance, while the second tier learns optimal contribution weights through backpropagation, automatically adjusting each architecture’s influence based on their detection reliability. Our experiments achieved state-of-the-art intra-dataset performance with AUC scores of 99.22% (FF++) and 100.00% (CelebDF-v2), and F1 scores of 98.06% (FF++) and 99.94% (CelebDF-v2) without augmentation. With augmentation, we achieve AUC scores of 99.47% (FF++) and 100.00% (CelebDF-v2), and F1 scores of 98.43% (FF++) and 99.95% (CelebDF-v2). The framework demonstrates robust cross-dataset generalization, achieving AUC scores of 88.20% and 72.52%, and F1 scores of 93.16% and 80.62% in cross-dataset evaluations.
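A minimal sketch of the two-tier design: tier one averages the three differently initialized instances within each backbone family, and tier two mixes the families with softmax-normalized weights learned by backpropagation (module layout assumed for illustration):

```python
# Hypothetical two-tier ensemble with learnable family weights.
import torch
import torch.nn as nn

class TwoTierEnsemble(nn.Module):
    def __init__(self, families):
        """families: dict name -> list of 3 models, each returning logits."""
        super().__init__()
        self.families = nn.ModuleDict(
            {name: nn.ModuleList(models) for name, models in families.items()})
        self.weights = nn.Parameter(torch.zeros(len(families)))  # learnable

    def forward(self, x):
        family_preds = []
        for models in self.families.values():
            preds = torch.stack([m(x) for m in models], dim=0)
            family_preds.append(preds.mean(dim=0))  # tier 1: within-family mean
        stacked = torch.stack(family_preds, dim=0)  # [families, batch, classes]
        w = self.weights.softmax(dim=0).view(-1, 1, 1)
        return (w * stacked).sum(dim=0)             # tier 2: learned mixture
```

Because `self.weights` is an `nn.Parameter`, the contribution of each architecture family is updated by the same backward pass that trains (or fine-tunes) the members, matching the dynamic weighting idea described above.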

[CV-35] AI-Assisted Decision-Making for Clinical Assessment of Auto-Segmented Contour Quality

Quick Read: This paper addresses quality assessment of auto-generated contours (auto-contours) in radiotherapy, in particular how to evaluate auto-contour quality efficiently and reliably for Online Adaptive Radiotherapy (OART) without ground-truth contours or extensive manual labeling. The key to its solution is a deep learning-based Bayesian Ordinal Classification (BOC) approach combined with calibrated uncertainty thresholds, enabling confident quality predictions without relying on ground truth or large-scale annotation.

Link: https://arxiv.org/abs/2505.00308
Authors: Biling Wang, Austen Maniscalco, Ti Bai, Siqiu Wang, Michael Dohopolski, Mu-Han Lin, Chenyang Shen, Dan Nguyen, Junzhou Huang, Steve Jiang, Xinlei Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Abstract:Purpose: This study presents a Deep Learning (DL)-based quality assessment (QA) approach for evaluating auto-generated contours (auto-contours) in radiotherapy, with emphasis on Online Adaptive Radiotherapy (OART). Leveraging Bayesian Ordinal Classification (BOC) and calibrated uncertainty thresholds, the method enables confident QA predictions without relying on ground truth contours or extensive manual labeling. Methods: We developed a BOC model to classify auto-contour quality and quantify prediction uncertainty. A calibration step was used to optimize uncertainty thresholds that meet clinical accuracy needs. The method was validated under three data scenarios: no manual labels, limited labels, and extensive labels. For rectum contours in prostate cancer, we applied geometric surrogate labels when manual labels were absent, transfer learning when limited, and direct supervision when ample labels were available. Results: The BOC model delivered robust performance across all scenarios. Fine-tuning with just 30 manual labels and calibrating with 34 subjects yielded over 90% accuracy on test data. Using the calibrated threshold, over 93% of the auto-contours’ qualities were accurately predicted in over 98% of cases, reducing unnecessary manual reviews and highlighting cases needing correction. Conclusion: The proposed QA model enhances contouring efficiency in OART by reducing manual workload and enabling fast, informed clinical decisions. Through uncertainty quantification, it ensures safer, more reliable radiotherapy workflows.
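The calibration step can be sketched as a search for the largest uncertainty threshold whose auto-accepted predictions still meet a target accuracy on a held-out calibration set; the function name and quantile grid below are illustrative assumptions:

```python
# Sketch: pick the most permissive uncertainty threshold that still
# satisfies the clinical accuracy requirement on calibration data.
import numpy as np

def calibrate_threshold(uncertainty, correct, target_acc=0.95):
    """uncertainty: [N] predictive uncertainties; correct: [N] bool array,
    True where the quality prediction matched the reference label."""
    best = None
    for tau in np.quantile(uncertainty, np.linspace(0.05, 1.0, 50)):
        accepted = uncertainty <= tau
        if accepted.sum() == 0:
            continue
        if correct[accepted].mean() >= target_acc:
            best = tau  # remember the largest qualifying threshold
    return best

rng = np.random.default_rng(0)
unc = rng.random(500)
acc = rng.random(500) > unc * 0.5  # more uncertain -> more often wrong
print(calibrate_threshold(unc, acc))
```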

[CV-36] Fine-grained spatial-temporal perception for gas leak segmentation ICIP2025

Quick Read: This paper addresses efficient and accurate detection and segmentation of gas leaks, which conventional methods struggle with because of their concealed appearance and random shapes. The key to its solution is a Fine-grained Spatial-Temporal Perception (FGSTP) algorithm that constructs a correlation volume to capture motion information between consecutive frames and combines it with progressively refined object features in an end-to-end network, improving boundary-segmentation accuracy for non-rigid targets such as gas leaks.

Link: https://arxiv.org/abs/2505.00295
Authors: Xinlong Zhao, Shan Du
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, ICIP 2025 Conference

Abstract:Gas leaks pose significant risks to human health and the environment. Despite long-standing concerns, there are limited methods that can efficiently and accurately detect and segment leaks due to their concealed appearance and random shapes. In this paper, we propose a Fine-grained Spatial-Temporal Perception (FGSTP) algorithm for gas leak segmentation. FGSTP captures critical motion clues across frames and integrates them with refined object features in an end-to-end network. Specifically, we first construct a correlation volume to capture motion information between consecutive frames. Then, the fine-grained perception progressively refines the object-level features using previous outputs. Finally, a decoder is employed to optimize boundary segmentation. Because there is no highly precise labeled dataset for gas leak segmentation, we manually label a gas leak video dataset, GasVid. Experimental results on GasVid demonstrate that our model excels in segmenting non-rigid objects such as gas leaks, generating the most accurate mask compared to other state-of-the-art (SOTA) models.

[CV-37] AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care

Quick Read: This paper addresses the difficulty of monitoring medication adherence in patients with chronic diseases, specifically assessing whether patients take their medication correctly by analyzing patient videos. The key to its solution is AdCare-VLM, a Video-LLaVA-based multimodal large vision-language model (LVLM) specialized for visual question answering (VQA) on adherence: it identifies adherence-related visual features and aligns them with medical concepts, improving multimodal interaction. The study also builds the LLM-TB-VQA dataset, covering positive, negative, and ambiguous adherence cases, to support training and validation.

Link: https://arxiv.org/abs/2505.00275
Authors: Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized Video-LLaVA-based multimodal large vision language model (LVLM) aimed at visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient’s face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.

[CV-38] Pack-PTQ: Advancing Post-training Quantization of Neural Networks by Pack-wise Reconstruction

Quick Read: This paper addresses a limitation of block-wise reconstruction in post-training quantization (PTQ): it ignores cross-block dependency and suffers a notable accuracy drop in low-bit settings. The key to the proposed method, Pack-PTQ, is a Hessian-guided adaptive packing mechanism that partitions blocks into non-overlapping packs used as the base unit for reconstruction, preserving cross-block dependency and enabling more accurate estimation of quantization parameters, combined with a pack-configuration-based mixed-precision scheme that assigns different bit-widths to packs according to their distinct sensitivities, further improving performance.

Link: https://arxiv.org/abs/2505.00259
Authors: Changjun Li, Runqing Jiang, Zhuo Song, Pengpeng Yu, Ye Zhang, Yulan Guo
Affiliations: Shenzhen Campus, Sun Yat-sen University; Aviation University of Air Force
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Post-training quantization (PTQ) has evolved as a prominent solution for compressing complex models, which advocates a small calibration dataset and avoids end-to-end retraining. However, most existing PTQ methods employ block-wise reconstruction, which neglects cross-block dependency and exhibits a notable accuracy drop in low-bit cases. To address these limitations, this paper presents a novel PTQ method, dubbed Pack-PTQ. First, we design a Hessian-guided adaptive packing mechanism to partition blocks into non-overlapping packs, which serve as the base unit for reconstruction, thereby preserving the cross-block dependency and enabling accurate quantization parameters estimation. Second, based on the pack configuration, we propose a mixed-precision quantization approach to assign varied bit-widths to packs according to their distinct sensitivities, thereby further enhancing performance. Extensive experiments on 2D image and 3D point cloud classification tasks, using various network architectures, demonstrate the superiority of our method over the state-of-the-art PTQ methods.
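A hedged sketch of sensitivity-driven mixed-precision assignment over packs: greedily widen the bit-widths of the most sensitive packs while respecting an average-bit budget. The sensitivity measure and bit menu are assumptions, not Pack-PTQ's exact rule:

```python
# Sketch: allocate bit-widths to packs by descending sensitivity under
# an average-bit budget.
import numpy as np

def assign_bitwidths(sensitivities, bit_menu=(2, 4, 8), avg_budget=4.0):
    """sensitivities: [num_packs]; returns per-pack bit-widths whose mean
    stays within the budget, favoring the most sensitive packs."""
    order = np.argsort(sensitivities)[::-1]      # most sensitive first
    bits = np.full(len(sensitivities), min(bit_menu))
    for idx in order:
        for b in sorted(bit_menu, reverse=True):  # try widest bits first
            trial = bits.copy()
            trial[idx] = b
            if trial.mean() <= avg_budget:
                bits[idx] = b
                break
    return bits

print(assign_bitwidths(np.array([0.9, 0.1, 0.5, 0.2])))  # e.g. [8 2 4 2]
```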

[CV-39] Empowering Agentic Video Analytics Systems with Video Language Models

Quick Read: This paper addresses the limited adaptability of existing AI-driven video analytics systems in open-ended analytical scenarios, especially the performance bottleneck caused by context-window limits when processing ultra-long video content. The key to its solution is AVA, a system powered by Video-Language Models (VLMs) with two core innovations: near real-time construction of Event Knowledge Graphs (EKGs) to efficiently index long or continuous video streams, and an EKG-based agentic retrieval-generation mechanism to handle complex and diverse queries.

Link: https://arxiv.org/abs/2505.00254
Authors: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu
Affiliations: Microsoft Research; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.
zh
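
作为理解上文"事件知识图谱索引长视频"思路的极简草图(非 AVA 官方实现),以下用 networkx 把带时间戳的事件节点按时间顺序连成图,并以关键词检索相关时间段;事件描述假定由上游 VLM 生成。

```python
import networkx as nx

def build_ekg(events):
    """events: list of (t_start, t_end, description) produced upstream by a
    VLM (assumed). Nodes = events; edges = temporal order."""
    g = nx.DiGraph()
    for i, (t0, t1, desc) in enumerate(events):
        g.add_node(i, t0=t0, t1=t1, desc=desc)
        if i > 0:
            g.add_edge(i - 1, i, relation="next")
    return g

def retrieve(g, keyword):
    """Naive keyword lookup over event descriptions; a real agentic system
    would instead plan multi-hop queries over the graph."""
    return [(n, d["t0"], d["t1"], d["desc"])
            for n, d in g.nodes(data=True) if keyword in d["desc"]]

g = build_ekg([(0, 12, "car enters parking lot"),
               (12, 30, "driver exits car"),
               (30, 41, "driver enters building")])
print(retrieve(g, "driver"))
```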

[CV-40] ReXGradient-160K: A Large-Scale Publicly Available Dataset of Chest Radiographs with Free-text Reports

【速读】:该论文旨在解决医疗影像领域中缺乏大规模、高质量标注数据的问题,从而限制了生成式 AI (Generative AI) 在医学影像分析和自动化报告生成中的发展。其解决方案的关键在于构建并公开发布 ReXGradient-160K 数据集,该数据集包含来自3个美国医疗系统共计109,487名患者的160,000例胸部X光检查及其对应的放射学报告,具有多图像和详细报告的特点,为AI系统的开发与评估提供了宝贵资源。

链接: https://arxiv.org/abs/2505.00228
作者: Xiaoman Zhang,Julián N. Acosta,Josh Miller,Ouwen Huang,Pranav Rajpurkar
机构: Harvard Medical School (哈佛医学院); Gradient Health (梯度健康); Duke University (杜克大学); Laplace Institute (拉普拉斯研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present ReXGradient-160K, representing the largest publicly available chest X-ray dataset to date in terms of the number of patients. This dataset contains 160,000 chest X-ray studies with paired radiological reports from 109,487 unique patients across 3 U.S. health systems (79 medical sites). This comprehensive dataset includes multiple images per study and detailed radiology reports, making it particularly valuable for the development and evaluation of AI systems for medical imaging and automated report generation models. The dataset is divided into training (140,000 studies), validation (10,000 studies), and public test (10,000 studies) sets, with an additional private test set (10,000 studies) reserved for model evaluation on the ReXrank benchmark. By providing this extensive dataset, we aim to accelerate research in medical imaging AI and advance the state-of-the-art in automated radiological analysis. Our dataset will be open-sourced at this https URL.
zh

[CV-41] Towards Robust and Generalizable Gerchberg Saxton based Physics Inspired Neural Networks for Computer Generated Holography: A Sensitivity Analysis Framework

【速读】:该论文旨在解决计算机全息(Computer-generated holography, CGH)中相位恢复的逆问题,即从强度测量中重建相位信息。其解决方案的关键在于通过基于Gerchberg-Saxton的物理启发神经网络(GS-PINNs)提升相位恢复能力,同时通过系统敏感性分析框架量化前向模型及其超参数(FMHs)对GS-PINN性能的影响,从而优化模型选择与硬件实现。

链接: https://arxiv.org/abs/2505.00220
作者: Ankit Amrutkar,Björn Kampa,Volkmar Schulz,Johannes Stegmaier,Markus Rothermel,Dorit Merhof
机构: RWTH Aachen University(亚琛工业大学); Otto-von-Guericke-University(奥托·冯·格里克大学); Regensburg University(雷根斯堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Computer-generated holography (CGH) enables applications in holographic augmented reality (AR), 3D displays, systems neuroscience, and optical trapping. The fundamental challenge in CGH is solving the inverse problem of phase retrieval from intensity measurements. Physics-inspired neural networks (PINNs), especially Gerchberg-Saxton-based PINNs (GS-PINNs), have advanced phase retrieval capabilities. However, their performance strongly depends on forward models (FMs) and their hyperparameters (FMHs), limiting generalization, complicating benchmarking, and hindering hardware optimization. We present a systematic sensitivity analysis framework based on Saltelli’s extension of Sobol’s method to quantify FMH impacts on GS-PINN performance. Our analysis demonstrates that SLM pixel-resolution is the primary factor affecting neural network sensitivity, followed by pixel-pitch, propagation distance, and wavelength. Free space propagation forward models demonstrate superior neural network performance compared to Fourier holography, providing enhanced parameterization and generalization. We introduce a composite evaluation metric combining performance consistency, generalization capability, and hyperparameter perturbation resilience, establishing a unified benchmarking standard across CGH configurations. Our research connects physics-inspired deep learning theory with practical CGH implementations through concrete guidelines for forward model selection, neural network architecture, and performance evaluation. Our contributions advance the development of robust, interpretable, and generalizable neural networks for diverse holographic applications, supporting evidence-based decisions in CGH research and implementation.
zh
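
摘要提到的敏感性分析基于 Saltelli 对 Sobol 方法的扩展,这类分析可直接用 SALib 库完成。下面给出一个最小示例,示意如何量化前向模型超参数(FMH)对某性能指标的总阶敏感度;其中 evaluate_gs_pinn 为占位函数,变量取值范围亦仅作演示假设。

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 4,
    "names": ["pixel_resolution", "pixel_pitch", "prop_distance", "wavelength"],
    "bounds": [[256, 2048], [3.6e-6, 8e-6], [0.05, 0.5], [450e-9, 650e-9]],
}

def evaluate_gs_pinn(x):
    # placeholder for "train/evaluate a GS-PINN under these FMHs";
    # an arbitrary smooth response is used purely for demonstration
    return np.sin(x[:, 0] / 500) + 1e5 * x[:, 1] + x[:, 2] - 1e6 * x[:, 3]

X = saltelli.sample(problem, 256)   # (N*(2D+2), D) Saltelli samples
Y = evaluate_gs_pinn(X)
Si = sobol.analyze(problem, Y)
print(dict(zip(problem["names"], Si["ST"])))  # total-order Sobol indices
```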

[CV-42] Direct Motion Models for Assessing Generated Videos

【速读】:该论文试图解决当前生成式视频模型在运动质量评估上的不足,即生成的视频虽然帧看起来合理,但运动效果较差,而现有的评估方法如FVD等无法有效捕捉这一问题。解决方案的关键在于开发一种基于点轨迹自动编码的新度量方法,该方法能够更准确地衡量物体间的合理互动和运动特性,不仅可用于比较视频分布(包括单个生成视频与真实视频或两个数据集之间的对比),还可用于评估单个视频的运动质量,同时具备对生成视频中时空不一致性的定位能力,从而提升评估的敏感性和可解释性。

链接: https://arxiv.org/abs/2505.00209
作者: Kelsey Allen,Carl Doersch,Guangyao Zhou,Mohammed Suhail,Danny Driess,Ignacio Rocco,Yulia Rubanova,Thomas Kipf,Mehdi S. M. Sajjadi,Kevin Murphy,Joao Carreira,Sjoerd van Steenkiste
机构: Google DeepMind(谷歌深度思维); Google Research(谷歌研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this http URL

点击查看摘要

Abstract:A current limitation of generative video models is that they generate plausible looking frames, but poor motion – an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric which better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to not only compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), but also for evaluating motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and can predict human evaluations of temporal consistency and realism in generated videos obtained from open-source models better than a wide range of alternatives. We also show that by using a point track representation, we can spatiotemporally localize generative video inconsistencies, providing extra interpretability of generated video errors relative to prior work. An overview of the results and link to the code can be found on the project page: this http URL.
zh
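
该指标的"分布比较"环节在形式上与 FVD 同型:在(点轨迹自编码器给出的)运动特征上,对两组视频分别拟合高斯并计算 Fréchet 距离。下面是该距离的标准实现示意,特征此处用随机向量代替(假设)。

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets:
    d^2 = |mu_a - mu_b|^2 + Tr(Ca + Cb - 2 (Ca Cb)^{1/2})."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    ca = np.cov(feats_a, rowvar=False)
    cb = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(ca @ cb)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(ca + cb - 2 * covmean))

# stand-ins for point-track autoencoder features of real vs generated videos
real = np.random.randn(500, 64)
gen = np.random.randn(500, 64) + 0.1
print(frechet_distance(real, gen))
```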

[CV-43] Neuroevolution of Self-Attention Over Proto-Objects GECCO

【速读】:该论文试图解决传统基于矩形图像块的注意力机制在神经网络中表示复杂度过高的问题,以及由此带来的计算效率低下和语义信息处理不足的问题。其解决方案的关键在于利用图像分割生成更高层次的语义区域——原型对象(proto-objects),通过在这些原型对象上操作而非固定图像块,显著降低了表示复杂度,并提升了自注意力模块的效率与语义丰富性。

链接: https://arxiv.org/abs/2505.00186
作者: Rafael C. Pinto,Anderson R. Tavares
机构: Federal Institute of Education, Science and Technology of Rio Grande do Sul (IFRS); Federal University of Rio Grande do Sul (UFRGS)
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 16 figures, GECCO

点击查看摘要

Abstract:Proto-objects - image regions that share common visual properties - offer a promising alternative to traditional attention mechanisms based on rectangular-shaped image patches in neural networks. Although previous work demonstrated that evolving a patch-based hard-attention module alongside a controller network could achieve state-of-the-art performance in visual reinforcement learning tasks, our approach leverages image segmentation to work with higher-level features. By operating on proto-objects rather than fixed patches, we significantly reduce the representational complexity: each image decomposes into fewer proto-objects than regular patches, and each proto-object can be efficiently encoded as a compact feature vector. This enables a substantially smaller self-attention module that processes richer semantic information. Our experiments demonstrate that this proto-object-based approach matches or exceeds the state-of-the-art performance of patch-based implementations with 62% fewer parameters and 2.6 times less training time.
zh
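
以下草图示意"分割出 proto-object → 每个区域编码为紧凑特征向量 → 小型自注意力"的数据流:用 scikit-image 的 SLIC 充当分割器、以区域均值颜色加质心作为 5 维特征,均为演示性假设(论文实际模块经神经进化训练,细节以原文为准)。

```python
import numpy as np
import torch
from skimage.segmentation import slic

img = np.random.rand(64, 64, 3).astype(np.float64)  # stand-in frame
labels = slic(img, n_segments=16, start_label=0)     # proto-object masks

# one compact descriptor per proto-object: mean color + normalized centroid
feats = []
for s in np.unique(labels):
    ys, xs = np.nonzero(labels == s)
    feats.append(np.concatenate([img[ys, xs].mean(0),
                                 [ys.mean() / 64, xs.mean() / 64]]))
tokens = torch.tensor(np.stack(feats), dtype=torch.float32).unsqueeze(0)

# a deliberately tiny self-attention over ~16 tokens instead of 100s of patches
attn = torch.nn.MultiheadAttention(embed_dim=5, num_heads=1, batch_first=True)
out, weights = attn(tokens, tokens, tokens)
print(out.shape, weights.shape)  # (1, n_proto, 5), (1, n_proto, n_proto)
```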

[CV-44] V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving

【速读】:该论文旨在解决大型视觉语言模型(Large Vision Language Models, LVLMs)在自动驾驶场景中对三维环境理解能力有限的问题,这一局限性限制了其在动态周围环境中的全面和安全理解。解决方案的关键在于引入V3LMA,该方法通过将大型语言模型(Large Language Models, LLMs)与LVLMs相结合,利用从目标检测和视频输入生成的文本描述,显著提升了性能,而无需进行微调。此外,通过专门的预处理流程提取三维物体数据,提高了复杂交通场景中的情境意识和决策能力。

链接: https://arxiv.org/abs/2505.00156
作者: Jannik Lübberstedt,Esteban Rivera,Nico Uhlemann,Markus Lienkamp
机构: Technical University of Munich (TUM); Munich Institute of Robotics and Machine Intelligence (MIRMI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) have shown strong capabilities in understanding and analyzing visual scenes across various domains. However, in the context of autonomous driving, their limited comprehension of 3D environments restricts their effectiveness in achieving a complete and safe understanding of dynamic surroundings. To address this, we introduce V3LMA, a novel approach that enhances 3D scene understanding by integrating Large Language Models (LLMs) with LVLMs. V3LMA leverages textual descriptions generated from object detections and video inputs, significantly boosting performance without requiring fine-tuning. Through a dedicated preprocessing pipeline that extracts 3D object data, our method improves situational awareness and decision-making in complex traffic scenarios, achieving a score of 0.56 on the LingoQA benchmark. We further explore different fusion strategies and token combinations with the goal of advancing the interpretation of traffic scenes, ultimately enabling safer autonomous driving systems.
zh

[CV-45] Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis

【速读】:该论文试图解决在缺乏充足3D视频数据的情况下,生成高质量立体3D视频的问题(stereoscopic 3D video generation)。其解决方案的关键在于将文本到视频生成器转换为视频到立体生成器,通过直接合成新视角的视频帧,避免了传统方法中需要估计视差或深度、进行图像变形以及修复不可见区域的多阶段流程。该方法利用预训练视频模型对几何结构、物体材质、光学特性及语义的先验知识,无需依赖外部几何模型或手动解耦几何信息,从而在复杂真实场景中实现了更优的3D效果。

链接: https://arxiv.org/abs/2505.00135
作者: Michal Geyer,Omer Tov,Linyi Jin,Richard Tucker,Inbar Mosseri,Tali Dekel,Noah Snavely
机构: Google DeepMind(谷歌深度思维); Weizmann Institute of Science(魏茨曼科学研究所); University of Michigan(密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rising popularity of immersive visual experiences has increased interest in stereoscopic 3D video generation. Despite significant advances in video synthesis, creating 3D videos remains challenging due to the relative scarcity of 3D video data. We propose a simple approach for transforming a text-to-video generator into a video-to-stereo generator. Given an input video, our framework automatically produces the video frames from a shifted viewpoint, enabling a compelling 3D effect. Prior and concurrent approaches for this task typically operate in multiple phases, first estimating video disparity or depth, then warping the video accordingly to produce a second view, and finally inpainting the disoccluded regions. This approach inherently fails when the scene involves specular surfaces or transparent objects. In such cases, single-layer disparity estimation is insufficient, resulting in artifacts and incorrect pixel shifts during warping. Our work bypasses these restrictions by directly synthesizing the new viewpoint, avoiding any intermediate steps. This is achieved by leveraging a pre-trained video model’s priors on geometry, object materials, optics, and semantics, without relying on external geometry models or manually disentangling geometry from the synthesis process. We demonstrate the advantages of our approach in complex, real-world scenarios featuring diverse object materials and compositions. See videos on this https URL
zh

[CV-46] Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design

【速读】:该论文试图解决生成式 AI (Generative AI) 在计算病理学中对大规模临床数据、任务设计和提示工程的敏感性问题,特别是如何提升其在组织病理图像分析中的诊断准确性。解决方案的关键在于通过系统化的消融实验,开发一个全面的提示工程框架,该框架通过调整领域特异性、解剖精度、指令框架和输出约束来优化模型性能,其中精确的解剖参考被证明对模型表现具有显著影响。

链接: https://arxiv.org/abs/2505.00134
作者: Vasudev Sharma,Ahmed Alagha,Abdelhakim Khellaf,Vincent Quoc-Huy Trinh,Mahdi S. Hosseini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have gained significant attention in computational pathology due to their multimodal learning capabilities that enhance big-data analytics of giga-pixel whole slide image (WSI). However, their sensitivity to large-scale clinical data, task formulations, and prompt design remains an open question, particularly in terms of diagnostic accuracy. In this paper, we present a systematic investigation and analysis of three state of the art VLMs for histopathology, namely Quilt-Net, Quilt-LLAVA, and CONCH, on an in-house digestive pathology dataset comprising 3,507 WSIs, each in giga-pixel form, across distinct tissue types. Through a structured ablative study on cancer invasiveness and dysplasia status, we develop a comprehensive prompt engineering framework that systematically varies domain specificity, anatomical precision, instructional framing, and output constraints. Our findings demonstrate that prompt engineering significantly impacts model performance, with the CONCH model achieving the highest accuracy when provided with precise anatomical references. Additionally, we identify the critical importance of anatomical context in histopathological image analysis, as performance consistently degraded when reducing anatomical precision. We also show that model complexity alone does not guarantee superior performance, as effective domain alignment and domain-specific training are critical. These results establish foundational guidelines for prompt engineering in computational pathology and highlight the potential of VLMs to enhance diagnostic accuracy when properly instructed with domain-appropriate prompts.
zh
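
文中的提示工程框架沿领域特异性、解剖精度、指令框架与输出约束四个维度系统变换提示。下面用 itertools 展示如何枚举这类提示组合;各维度的具体取值为示意性假设,并非论文原始提示。

```python
from itertools import product

domain      = ["", "You are an expert gastrointestinal pathologist. "]
anatomy     = ["tissue", "colonic mucosa tissue"]
instruction = ["Classify the lesion.", "Is the lesion invasive or dysplastic?"]
output_fmt  = ["", " Answer with a single word."]

# full factorial grid over the four prompt-design dimensions
prompts = [f"{d}This H&E slide shows {a}. {i}{o}"
           for d, a, i, o in product(domain, anatomy, instruction, output_fmt)]
for p in prompts[:4]:
    print(p)
print(f"... {len(prompts)} prompt variants in total")
```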

[CV-47] Learning to Borrow Features for Improved Detection of Small Objects in Single-Shot Detectors

【速读】:该论文旨在解决单次检测器中小目标检测的难题,这一问题源于卷积特征图中空间分辨率与语义丰富性之间的固有权衡。其解决方案的关键在于提出一种新颖框架,使小目标表示能够“借用”同一类别中较大且语义更丰富的实例的判别特征。该框架包含三个核心组件:用于跨层识别语义相似描述符的特征匹配块(Feature Matching Block, FMB),通过加权聚合生成增强浅层特征的特征表示块(Feature Representing Block, FRB),以及通过整合原始特征、借用特征和上下文信息来优化特征图的特征融合块(Feature Fusion Block, FFB)。

链接: https://arxiv.org/abs/2505.00044
作者: Richard Schmit
机构: Zewail City of Science & Technology (泽维尔科学与技术城)
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Detecting small objects remains a significant challenge in single-shot object detectors due to the inherent trade-off between spatial resolution and semantic richness in convolutional feature maps. To address this issue, we propose a novel framework that enables small object representations to “borrow” discriminative features from larger, semantically richer instances within the same class. Our architecture introduces three key components: the Feature Matching Block (FMB) to identify semantically similar descriptors across layers, the Feature Representing Block (FRB) to generate enhanced shallow features through weighted aggregation, and the Feature Fusion Block (FFB) to refine feature maps by integrating original, borrowed, and context information. Built upon the SSD framework, our method improves the descriptive capacity of shallow layers while maintaining real-time detection performance. Experimental results demonstrate that our approach significantly boosts small object detection accuracy over baseline methods, offering a promising direction for robust object detection in complex visual environments.
zh

[CV-48] Recursive KL Divergence Optimization: A Dynamic Framework for Representation Learning

【速读】:该论文试图解决传统表示学习目标在优化过程中缺乏对递归结构的充分建模的问题,这限制了模型的稳定性和局部适应能力。其解决方案的关键在于引入递归KL散度优化(Recursive KL Divergence Optimization, RKDO),将表示学习建模为数据邻域间KL散度的动态演化过程,从而捕捉对比聚类和降维方法作为静态切片,并提供一种更高效的优化路径。实验表明,RKDO在降低损失值和减少计算资源消耗方面均表现出显著优势。

链接: https://arxiv.org/abs/2504.21707
作者: Anthony D Martin
机构: Cadenzai, Inc. (Cadenzai公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:We propose a generalization of modern representation learning objectives by reframing them as recursive divergence alignment processes over localized conditional distributions. While recent frameworks like Information Contrastive Learning (I-Con) unify multiple learning paradigms through KL divergence between fixed neighborhood conditionals, we argue this view underplays a crucial recursive structure inherent in the learning process. We introduce Recursive KL Divergence Optimization (RKDO), a dynamic formalism where representation learning is framed as the evolution of KL divergences across data neighborhoods. This formulation captures contrastive clustering and dimensionality reduction methods as static slices while offering a new path to model stability and local adaptation. Our experiments demonstrate that RKDO offers dual efficiency advantages: approximately 30 percent lower loss values compared to static approaches across three different datasets, and a 60 to 80 percent reduction in computational resources needed to achieve comparable results. This suggests that RKDO's recursive updating mechanism provides a fundamentally more efficient optimization landscape for representation learning, with significant implications for resource-constrained applications.
zh
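
依据摘要的表述(表示学习被建模为数据邻域间 KL 散度的递归演化),下面给出一个高度假设性的玩具实现:邻域条件分布取嵌入相似度的 softmax,目标分布以指数滑动平均的形式随表示一同递归更新;具体的递归形式请以论文为准。

```python
import torch
import torch.nn.functional as F

def neighborhood_dist(z, tau=0.5, eps=1e-8):
    """Conditional distribution p(j|i): softmax over pairwise similarities,
    lightly smoothed so all entries stay strictly positive."""
    sim = (z @ z.t()) / tau
    sim = sim.masked_fill(torch.eye(len(z), dtype=torch.bool), float("-inf"))
    p = F.softmax(sim, dim=1)
    return (p + eps) / (1.0 + eps * p.shape[1])

def kl(q, p):
    """KL(q || p), one row per data neighborhood."""
    return (q * (q.log() - p.log())).sum(dim=1).mean()

z = torch.randn(32, 16, requires_grad=True)     # toy learnable embeddings
opt = torch.optim.Adam([z], lr=1e-2)
target = neighborhood_dist(z.detach())          # initial neighborhood target

for step in range(50):
    loss = kl(target, neighborhood_dist(z))
    opt.zero_grad(); loss.backward(); opt.step()
    # the recursive part: the target distribution itself evolves with the
    # representation (the EMA form here is an assumption)
    target = 0.9 * target + 0.1 * neighborhood_dist(z.detach())

print(float(loss))
```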

[CV-49] Deep Learning for automated multi-scale functional field boundaries extraction using multi-date Sentinel-2 and PlanetScope imagery: Case Study of Netherlands and Pakistan

【速读】:该论文旨在解决在不同地理和多尺度农业系统中,如何通过深度学习语义分割架构更准确地划定功能田块边界的问题。其解决方案的关键在于利用多时相卫星遥感影像(包括PlanetScope和Sentinel-2数据)与归一化植被指数(NDVI)堆栈,以提供额外的时间维度信息,从而反映作物在不同生长阶段的变化。研究还强调了多尺度地面信息的重要性,并通过迁移学习和结合荷兰与巴基斯坦数据训练模型,提升了模型的泛化能力和适用性。

链接: https://arxiv.org/abs/2411.15923
作者: Saba Zahid,Sajid Ghuffar,Obaid-ur-Rehman,Syed Roshaan Ali Shah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 09 pages, To be published

点击查看摘要

Abstract:This study explores the effectiveness of multi-temporal satellite imagery for better functional field boundary delineation using deep learning semantic segmentation architecture on two distinct geographical and multi-scale farming systems of Netherlands and Pakistan. Multidate images of April, August and October 2022 were acquired for PlanetScope and Sentinel-2 in sub regions of Netherlands and November 2022, February and March 2023 for selected area of Dunyapur in Pakistan. For Netherlands, Basic registration crop parcels (BRP) vector layer was used as labeled training data, while self-crafted field boundary vector data were utilized for Pakistan. Four deep learning models with UNET architecture were evaluated using different combinations of multi-date images and NDVI stacks in the Netherlands subregions. A comparative analysis of IoU scores assessed the effectiveness of the proposed multi-date NDVI stack approach. These findings were then applied for transfer learning, using pre-trained models from the Netherlands on the selected area in Pakistan. Additionally, separate models were trained using self-crafted field boundary data for Pakistan, and combined models were developed using data from both the Netherlands and Pakistan. Results indicate that multi-date NDVI stacks provide additional temporal context, reflecting crop growth over different times of the season. The study underscores the critical role of multi-scale ground information from diverse geographical areas in developing robust and universally applicable models for field boundary delineation. The results also highlight the importance of fine spatial resolution for extraction of field boundaries in regions with small-scale farming. The findings can be extended to multi-scale implementations for improved automatic field boundary delineation in heterogeneous agricultural environments.
zh
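
多时相 NDVI 堆栈本身的构造很直接:对每期影像按 NDVI = (NIR − Red)/(NIR + Red) 计算植被指数,再沿通道维堆叠作为分割网络(文中为 UNET)的输入。以下为示意代码,影像数据与日期数量均为占位假设。

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index per pixel."""
    return (nir - red) / (nir + red + eps)

# three acquisition dates, each with (red, nir) bands of shape (H, W)
dates = [(np.random.rand(256, 256), np.random.rand(256, 256)) for _ in range(3)]

# stack one NDVI channel per date -> (3, H, W) input for the segmentation net
ndvi_stack = np.stack([ndvi(nir, red) for red, nir in dates], axis=0)
print(ndvi_stack.shape, ndvi_stack.min(), ndvi_stack.max())
```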

[CV-50] GuideSR: Rethinking Guidance for One-Step High-Fidelity Diffusion-Based Super-Resolution

【速读】:该论文旨在解决基于扩散模型的图像超分辨率(SR)方法在保持图像结构保真度方面的不足。现有方法通过在VAE下采样表示上添加额外条件来适应生成模型,但这一过程往往导致结构保真度下降。其解决方案的关键在于提出一种双分支架构——引导分支(Guidance Branch)和扩散分支(Diffusion Branch),其中引导分支通过结合全分辨率块(Full Resolution Blocks, FRBs)与通道注意力机制以及图像引导网络(Image Guidance Network, IGN)与引导注意力机制,保留原始分辨率退化输入中的高保真结构,从而有效提升图像恢复的视觉一致性和清晰度。

链接: https://arxiv.org/abs/2505.00687
作者: Aditya Arora,Zhengzhong Tu,Yufei Wang,Ruizheng Bai,Jian Wang,Sizhuo Ma
机构: TU Darmstadt; Texas A&M University; Snap Inc.
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we propose GuideSR, a novel single-step diffusion-based image super-resolution (SR) model specifically designed to enhance image fidelity. Existing diffusion-based SR approaches typically adapt pre-trained generative models to image restoration tasks by adding extra conditioning on a VAE-downsampled representation of the degraded input, which often compromises structural fidelity. GuideSR addresses this limitation by introducing a dual-branch architecture comprising: (1) a Guidance Branch that preserves high-fidelity structures from the original-resolution degraded input, and (2) a Diffusion Branch, which leverages a pre-trained latent diffusion model to enhance perceptual quality. Unlike conventional conditioning mechanisms, our Guidance Branch features a tailored structure for image restoration tasks, combining Full Resolution Blocks (FRBs) with channel attention and an Image Guidance Network (IGN) with guided attention. By embedding detailed structural information directly into the restoration pipeline, GuideSR produces sharper and more visually consistent results. Extensive experiments on benchmark datasets demonstrate that GuideSR achieves state-of-the-art performance while maintaining the low computational cost of single-step approaches, with up to 1.39dB PSNR gain on challenging real-world datasets. Our approach consistently outperforms existing methods across various reference-based metrics including PSNR, SSIM, LPIPS, DISTS and FID, further representing a practical advancement for real-world image restoration.
zh

[CV-51] Deep Learning Assisted Outer Volume Removal for Highly-Accelerated Real-Time Dynamic MRI

【速读】:该论文旨在解决实时动态磁共振成像(RT dynamic MRI)中,特别是在实时电影磁共振成像(RT cine MRI)应用中,由于高采样不足因子导致的来自非心脏区域的混叠伪影问题。解决方案的关键在于提出一种新颖的外体积去除(OVR)方法,通过后处理框架消除非心脏区域的混叠贡献。该方法利用时间交错采样模式生成的复合时间图像估计每个时间帧的外体积信号,并采用深度学习(DL)模型识别和去除伪周期性鬼影伪影,最终通过物理驱动的深度学习(PD-DL)方法结合OVR专用损失函数进行重建,从而在不改变采集方式的情况下有效降低伪影并保持诊断质量。

链接: https://arxiv.org/abs/2505.00643
作者: Merve Gülle,Sebastian Weingärtner,Mehmet Akçakaya
机构: University of Minnesota(明尼苏达大学); Delft University of Technology(代尔夫特理工大学); HollandPTC(荷兰PTC)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

Abstract:Real-time (RT) dynamic MRI plays a vital role in capturing rapid physiological processes, offering unique insights into organ motion and function. Among these applications, RT cine MRI is particularly important for functional assessment of the heart with high temporal resolution. RT imaging enables free-breathing, ungated imaging of cardiac motion, making it a crucial alternative for patients who cannot tolerate conventional breath-hold, ECG-gated acquisitions. However, achieving high acceleration rates in RT cine MRI is challenging due to aliasing artifacts from extra-cardiac tissues, particularly at high undersampling factors. In this study, we propose a novel outer volume removal (OVR) method to address this challenge by eliminating aliasing contributions from non-cardiac regions in a post-processing framework. Our approach estimates the outer volume signal for each timeframe using composite temporal images from time-interleaved undersampling patterns, which inherently contain pseudo-periodic ghosting artifacts. A deep learning (DL) model is trained to identify and remove these artifacts, producing a clean outer volume estimate that is subsequently subtracted from the corresponding k-space data. The final reconstruction is performed with a physics-driven DL (PD-DL) method trained using an OVR-specific loss function to restore high spatio-temporal resolution images. Experimental results show that the proposed method at high accelerations achieves image quality that is visually comparable to clinical baseline images, while outperforming conventional reconstruction techniques, both qualitatively and quantitatively. The proposed approach provides a practical and effective solution for artifact reduction in RT cine MRI without requiring acquisition modifications, offering a pathway to higher acceleration rates while preserving diagnostic quality.
zh
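
摘要中"将外体积估计从 k 空间数据中减去"一步,在数值上即是把(深度模型输出的)外体积图像变换到 k 空间后逐点相减。下面用 numpy 给出该步骤在单线圈二维玩具设定下的示意,外体积估计以占位数组代替。

```python
import numpy as np

def remove_outer_volume(kspace, ov_image):
    """Subtract the k-space of the estimated outer-volume image from the
    acquired k-space (2D, single-coil toy case, centered k-space convention)."""
    ov_kspace = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(ov_image)))
    return kspace - ov_kspace

# toy acquired k-space and a placeholder DL outer-volume estimate
full = np.random.randn(128, 128) + 1j * np.random.randn(128, 128)
ov_estimate = np.zeros((128, 128), dtype=complex)
ov_estimate[:32, :] = 0.5   # pretend extra-cardiac tissue lives here

cleaned = remove_outer_volume(full, ov_estimate)
print(cleaned.shape, np.abs(cleaned - full).max() > 0)
```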

[CV-52] A Methodological and Structural Review of Parkinsons Disease Detection Across Diverse Data Modalities

【速读】:该论文试图解决帕金森病(Parkinson’s Disease, PD)诊断中现有研究在数据模态单一、未能充分利用多模态方法潜力的问题。其解决方案的关键在于通过全面综述多种数据模态(包括磁共振成像、步态姿态分析、步态传感数据、书写分析、语音测试数据、脑电图以及多模态融合技术)下的PD识别系统,分析数据采集方法、特征表示及系统性能,从而为下一代PD识别系统的开发提供指导。

链接: https://arxiv.org/abs/2505.00525
作者: Abu Saleh Musa Miah,Taro Suzuki,Jungpil Shin
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Parkinson's Disease (PD) is a progressive neurological disorder that primarily affects motor functions and can lead to mild cognitive impairment (MCI) and dementia in its advanced stages. With approximately 10 million people diagnosed globally (1 to 1.8 per 1,000 individuals, according to reports by the Japan Times and the Parkinson Foundation), early and accurate diagnosis of PD is crucial for improving patient outcomes. While numerous studies have utilized machine learning (ML) and deep learning (DL) techniques for PD recognition, existing surveys are limited in scope, often focusing on single data modalities and failing to capture the potential of multimodal approaches. To address these gaps, this study presents a comprehensive review of PD recognition systems across diverse data modalities, including Magnetic Resonance Imaging (MRI), gait-based pose analysis, gait sensory data, handwriting analysis, speech test data, Electroencephalography (EEG), and multimodal fusion techniques. Based on over 347 articles from leading scientific databases, this review examines key aspects such as data collection methods, settings, feature representations, and system performance, with a focus on recognition accuracy and robustness. This survey aims to serve as a comprehensive resource for researchers, providing actionable guidance for the development of next-generation PD recognition systems. By leveraging diverse data modalities and cutting-edge machine learning paradigms, this work contributes to advancing the state of PD diagnostics and improving patient care through innovative, multimodal approaches.
zh

[CV-53] CORSTITCH - A free open source software for stitching and georeferencing underwater coral reef videos

【速读】:该论文旨在解决从自动快速珊瑚礁评估系统(Automated Rapid Reef Assessment System)获取的视频轨迹中自动生成准确地理参考的珊瑚礁马赛克图像的问题。其解决方案的关键在于采用基于傅里叶变换的图像相关算法,将顺序视频帧进行拼接,并利用同步的全球导航卫星系统(GNSS)时间戳进行对齐,从而生成适用于地理信息系统(GIS)的压缩Keyhole Markup Language文件。

链接: https://arxiv.org/abs/2505.00462
作者: Julian Christopher L. Maya,Johnenn R. Manalang,Maricor N. Soriano
机构: National Institute of Physics(国家物理研究所); University of the Philippines Diliman(菲律宾大学迪里曼分校); Philippines(菲律宾)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:CorStitch is an open-source software developed to automate the creation of accurate georeferenced reef mosaics from video transects obtained through Automated Rapid Reef Assessment System surveys. We utilized a Fourier-based image correlation algorithm to stitch sequential video frames, aligning them with synchronized GNSS timestamps. The resulting compressed Keyhole Markup Language files, compatible with geographic information systems such as Google Earth, enable detailed spatial analysis. Validation through comparative analysis of mosaics from two temporally distinct surveys of the same reef demonstrated the software’s consistent and reliable performance.
zh
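
CorStitch 所述"基于傅里叶的图像相关"通常指相位相关法:对两帧做 FFT、取归一化互功率谱、反变换后峰值位置即两帧间的平移量。以下为该算法的标准实现(CorStitch 的具体实现细节可能不同):

```python
import numpy as np

def phase_correlation(a, b):
    """Estimate the integer (dy, dx) translation of frame b relative to
    frame a from the peak of the normalized cross-power spectrum."""
    fa, fb = np.fft.fft2(a), np.fft.fft2(b)
    cross_power = np.conj(fa) * fb
    cross_power /= np.abs(cross_power) + 1e-12
    corr = np.abs(np.fft.ifft2(cross_power))
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # wrap shifts larger than half the frame back to negative values
    if dy > a.shape[0] // 2:
        dy -= a.shape[0]
    if dx > a.shape[1] // 2:
        dx -= a.shape[1]
    return dy, dx

frame1 = np.random.rand(256, 256)
frame2 = np.roll(frame1, shift=(17, -9), axis=(0, 1))  # known ground-truth shift
print(phase_correlation(frame1, frame2))  # -> (17, -9)
```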

[CV-54] Towards Lightweight Hyperspectral Image Super-Resolution with Depthwise Separable Dilated Convolutional Network

【速读】:该论文旨在解决高光谱图像超分辨率(hyperspectral super-resolution)问题,该问题由于数据的高光谱维度和可用训练样本的稀缺性而具有病态性质。现有方法通常依赖于参数量大的模型或需要融合全色或RGB图像,这在实际应用中往往不可行。论文提出的解决方案的关键是引入一种轻量级的深度可分离扩张卷积网络(DSDCN),该网络借鉴了MobileNet架构,利用多层深度可分离卷积,并结合扩张卷积融合块以增强空间与光谱特征的提取能力,同时设计了一种包含均方误差、L2范数正则化约束和基于光谱角的损失函数的自定义损失函数,以确保光谱和空间细节的保真度。

链接: https://arxiv.org/abs/2505.00374
作者: Usman Muhammad,Jorma Laaksonen,Lyudmila Mihaylova
机构: Aalto University (阿尔托大学); University of Sheffield (谢菲尔德大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks have demonstrated highly competitive performance in super-resolution (SR) for natural images by learning mappings from low-resolution (LR) to high-resolution (HR) images. However, hyperspectral super-resolution remains an ill-posed problem due to the high spectral dimensionality of the data and the scarcity of available training samples. Moreover, existing methods often rely on large models with a high number of parameters or require the fusion with panchromatic or RGB images, both of which are often impractical in real-world scenarios. Inspired by the MobileNet architecture, we introduce a lightweight depthwise separable dilated convolutional network (DSDCN) to address the aforementioned challenges. Specifically, our model leverages multiple depthwise separable convolutions, similar to the MobileNet architecture, and further incorporates a dilated convolution fusion block to make the model more flexible for the extraction of both spatial and spectral features. In addition, we propose a custom loss function that combines mean squared error (MSE), an L2 norm regularization-based constraint, and a spectral angle-based loss, ensuring the preservation of both spectral and spatial details. The proposed model achieves very competitive performance on two publicly available hyperspectral datasets, making it well-suited for hyperspectral image super-resolution tasks. The source codes are publicly available at: this https URL.
zh
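
摘要中的自定义损失由 MSE、L2 正则约束与光谱角损失三项组成。以下用 PyTorch 给出一个示意性组合;各项权重与光谱角的具体计算方式为假设,以论文为准。

```python
import torch
import torch.nn.functional as F

def spectral_angle_loss(pred, target, eps=1e-8):
    """Mean spectral angle (radians) between per-pixel spectra.
    pred/target: (B, C, H, W) with C spectral bands."""
    p = pred.flatten(2).transpose(1, 2)    # (B, HW, C)
    t = target.flatten(2).transpose(1, 2)
    cos = F.cosine_similarity(p, t, dim=-1).clamp(-1 + eps, 1 - eps)
    return torch.acos(cos).mean()

def dsdcn_loss(pred, target, model, w_sam=0.1, w_l2=1e-4):
    """MSE + spectral-angle term + L2 weight regularization (assumed weights)."""
    mse = F.mse_loss(pred, target)
    sam = spectral_angle_loss(pred, target)
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return mse + w_sam * sam + w_l2 * l2

# toy usage with a stand-in network and 31-band hyperspectral patches
model = torch.nn.Conv2d(31, 31, 3, padding=1)
x = torch.randn(2, 31, 16, 16)
print(float(dsdcn_loss(model(x), torch.randn(2, 31, 16, 16), model)))
```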

[CV-55] Efficient and robust 3D blind harmonization for large domain gaps

【速读】:该论文旨在解决医学影像(MR图像)在不同设备或扫描参数下存在的域间差异问题,即通过无监督的盲谐波化(Blind Harmonization)技术实现跨域图像的一致性,从而获得尺度不变的表示。现有方法在处理三维图像时存在切片间异质性、图像质量中等以及大域差距下的性能受限等问题。解决方案的关键在于提出一种新颖的盲式三维谐波化框架——BlindHarmonyDiff,其核心是利用专门针对谐波化的边缘到图像模型,通过在目标域图像上训练的三维校正流来从边缘图重建原始图像,并进一步从源域边缘生成谐波化图像,同时引入多步幅块训练和细化模块以提升训练效率与推理鲁棒性。

链接: https://arxiv.org/abs/2505.00133
作者: Hwihun Jeong,Hayeon Lee,Se Young Chun,Jongho Lee
机构: Seoul National University (首尔大学); Harvard Medical School (哈佛医学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Blind harmonization has emerged as a promising technique for MR image harmonization to achieve scale-invariant representations, requiring only target domain data (i.e., no source domain data necessary). However, existing methods face limitations such as inter-slice heterogeneity in 3D, moderate image quality, and limited performance for a large domain gap. To address these challenges, we introduce BlindHarmonyDiff, a novel blind 3D harmonization framework that leverages an edge-to-image model tailored specifically to harmonization. Our framework employs a 3D rectified flow trained on target domain images to reconstruct the original image from an edge map, then yielding a harmonized image from the edge of a source domain image. We propose multi-stride patch training for efficient 3D training and a refinement module for robust inference by suppressing hallucination. Extensive experiments demonstrate that BlindHarmonyDiff outperforms prior arts by harmonizing diverse source domain images to the target domain, achieving higher correspondence to the target domain characteristics. Downstream task-based quality assessments such as tissue segmentation and age prediction on diverse MR scanners further confirm the effectiveness of our approach and demonstrate the capability of our robust and generalizable blind harmonization.
zh

[CV-56] Rootlets-based registration to the spinal cord PAM50 template

【速读】:该论文旨在解决脊髓功能磁共振成像(fMRI)中由于个体间解剖结构差异导致的脊髓水平定位不准确问题,从而影响 voxelwise 群体分析的可靠性。传统基于椎间盘的配准方法因个体间椎体与脊髓水平的显著解剖变异而存在局限性。该研究提出的解决方案关键在于利用脊髓后颈神经根(dorsal cervical rootlets)进行非线性配准,以提高个体间的对齐精度和可重复性。通过在PAM50脊髓模板上分割并对齐这些根丝,该方法在多受试者、多站点及不同颈部位置的数据集中均表现出更优的对齐效果,并在任务相关fMRI分析中提升了激活区域的显著性和空间一致性。

链接: https://arxiv.org/abs/2505.00115
作者: Sandrine Bédard,Jan Valošek,Valeria Oliva,Kenneth A. Weber II,Julien Cohen-Adad
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spinal cord functional MRI studies require precise localization of spinal levels for reliable voxelwise group analyses. Traditional template-based registration of the spinal cord uses intervertebral discs for alignment. However, substantial anatomical variability across individuals exists between vertebral and spinal levels. This study proposes a novel registration approach that leverages spinal nerve rootlets to improve alignment accuracy and reproducibility across individuals. We developed a registration method leveraging dorsal cervical rootlets segmentation and aligning them non-linearly with the PAM50 spinal cord template. Validation was performed on a multi-subject, multi-site dataset (n=267, 44 sites) and a multi-subject dataset with various neck positions (n=10, 3 sessions). We further validated the method on task-based functional MRI (n=23) to compare group-level activation maps using rootlet-based registration to traditional disc-based methods. Rootlet-based registration showed superior alignment across individuals compared to the traditional disc-based method. Notably, rootlet positions were more stable across neck positions. Group-level analysis of task-based functional MRI using rootlet-based registration increased Z scores and activation cluster size compared to disc-based registration (number of active voxels from 3292 to 7978). Rootlet-based registration enhances both inter- and intra-subject anatomical alignment and yields better spatial normalization for group-level fMRI analyses. Our findings highlight the potential of rootlet-based registration to improve the precision and reliability of spinal cord neuroimaging group analysis.
zh

[CV-57] SR-NeRV: Improving Embedding Efficiency of Neural Video Representation via Super-Resolution

【速读】:该论文试图解决传统基于隐式神经表示(Implicit Neural Representations, INRs)的视频压缩方法在模型尺寸受限条件下难以重建高频细节的问题。其解决方案的关键在于引入一个通用的超分辨率(Super-Resolution, SR)网络,利用高频率成分在帧间冗余较低的特点,将细节重建任务交由SR网络完成,从而在保持模型规模相近的前提下提升重建质量。

链接: https://arxiv.org/abs/2505.00046
作者: Taiga Hayami,Kakeru Koizumi,Hiroshi Watanabe
机构: Waseda University (早稻田大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Implicit Neural Representations (INRs) have garnered significant attention for their ability to model complex signals across a variety of domains. Recently, INR-based approaches have emerged as promising frameworks for neural video compression. While conventional methods primarily focus on embedding video content into compact neural networks for efficient representation, they often struggle to reconstruct high-frequency details under stringent model size constraints, which are critical in practical compression scenarios. To address this limitation, we propose an INR-based video representation method that integrates a general-purpose super-resolution (SR) network. Motivated by the observation that high-frequency components exhibit low temporal redundancy across frames, our method entrusts the reconstruction of fine details to the SR network. Experimental results demonstrate that the proposed method outperforms conventional INR-based baselines in terms of reconstruction quality, while maintaining comparable model sizes.
zh

人工智能

[AI-0] Wasserstein Policy Optimization ICML2025

【速读】:该论文试图解决连续动作空间中强化学习的策略优化问题,特别是如何在保持算法通用性和简洁性的同时,有效结合确定性策略梯度和经典策略梯度方法的优点。解决方案的关键在于引入Wasserstein策略优化(Wasserstein Policy Optimization, WPO),该方法通过将策略空间上的Wasserstein梯度流投影到有限维参数空间(如神经网络权重)中,得到一种简单且完全通用的闭式更新规则,从而实现了对任意动作分布的随机策略的有效优化。

链接: https://arxiv.org/abs/2505.00663
作者: David Pfau,Ian Davies,Diana Borsa,Joao G. M. Araujo,Brendan Tracey,Hado van Hasselt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2025

点击查看摘要

Abstract:We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions – without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.
zh

[AI-1] Open-Source LLM-Driven Federated Transformer for Predictive IoV Management

【速读】:该论文旨在解决车联网(IoV)生态系统中可扩展性、实时性和隐私保护在交通管理中的关键挑战,现有集中式IoV解决方案因高延迟、可扩展性受限和依赖专有人工智能(AI)模型而难以广泛部署,尤其是在动态和隐私敏感的环境中。同时,大型语言模型(LLMs)在车载系统中的集成仍处于探索阶段,特别是在提示优化和联邦环境中的有效利用方面。论文提出的解决方案是联邦提示优化交通变换器(FPoTT),其关键在于引入动态提示优化机制,通过迭代优化文本提示以提升轨迹预测性能,并采用双层联邦学习架构,结合轻量级边缘模型与云端LLMs,实现实时推理与全局智能的协同,从而提升IoV管理的安全性、适应性和可扩展性。

链接: https://arxiv.org/abs/2505.00651
作者: Yazan Otoum,Arghavan Asad,Ishtiaq Ahmad
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Preprint version; submitted for academic peer review

点击查看摘要

Abstract:The proliferation of connected vehicles within the Internet of Vehicles (IoV) ecosystem presents critical challenges in ensuring scalable, real-time, and privacy-preserving traffic management. Existing centralized IoV solutions often suffer from high latency, limited scalability, and reliance on proprietary Artificial Intelligence (AI) models, creating significant barriers to widespread deployment, particularly in dynamic and privacy-sensitive environments. Meanwhile, integrating Large Language Models (LLMs) in vehicular systems remains underexplored, especially concerning prompt optimization and effective utilization in federated contexts. To address these challenges, we propose the Federated Prompt-Optimized Traffic Transformer (FPoTT), a novel framework that leverages open-source LLMs for predictive IoV management. FPoTT introduces a dynamic prompt optimization mechanism that iteratively refines textual prompts to enhance trajectory prediction. The architecture employs a dual-layer federated learning paradigm, combining lightweight edge models for real-time inference with cloud-based LLMs to retain global intelligence. A Transformer-driven synthetic data generator is incorporated to augment training with diverse, high-fidelity traffic scenarios in the Next Generation Simulation (NGSIM) format. Extensive evaluations demonstrate that FPoTT, utilizing EleutherAI Pythia-1B, achieves 99.86% prediction accuracy on real-world data while maintaining high performance on synthetic datasets. These results underscore the potential of open-source LLMs in enabling secure, adaptive, and scalable IoV management, offering a promising alternative to proprietary solutions in smart mobility ecosystems.
zh

[AI-2] OmicsCL: Unsupervised Contrastive Learning for Cancer Subtype Discovery and Survival Stratification

【速读】:该论文旨在解决从多组学数据中无监督学习疾病亚型的问题,以推动个性化医学的发展。其解决方案的关键在于提出OmicsCL,一个模块化的对比学习框架,该框架将异质组学模态(如基因表达、DNA甲基化和miRNA表达)联合嵌入到统一的潜在空间中,并引入了生存感知的对比损失函数,使模型能够学习与生存相关模式对齐的表示,而无需依赖标签结果。

链接: https://arxiv.org/abs/2505.00650
作者: Atahan Karagoz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
备注: Code available at: this https URL

点击查看摘要

Abstract:Unsupervised learning of disease subtypes from multi-omics data presents a significant opportunity for advancing personalized medicine. We introduce OmicsCL, a modular contrastive learning framework that jointly embeds heterogeneous omics modalities-such as gene expression, DNA methylation, and miRNA expression-into a unified latent space. Our method incorporates a survival-aware contrastive loss that encourages the model to learn representations aligned with survival-related patterns, without relying on labeled outcomes. Evaluated on the TCGA BRCA dataset, OmicsCL uncovers clinically meaningful clusters and achieves strong unsupervised concordance with patient survival. The framework demonstrates robustness across hyperparameter configurations and can be tuned to prioritize either subtype coherence or survival stratification. Ablation studies confirm that integrating survival-aware loss significantly enhances the predictive power of learned embeddings. These results highlight the promise of contrastive objectives for biological insight discovery in high-dimensional, heterogeneous omics data.
zh
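
作为理解"生存感知对比损失"思路的极简草图(非 OmicsCL 官方实现):跨模态同一患者的嵌入互为正样本(InfoNCE),再附加一项把生存时间相近的患者在潜空间中拉近;生存项的具体形式与权重均为假设。

```python
import torch
import torch.nn.functional as F

def infonce(za, zb, tau=0.1):
    """Cross-modal InfoNCE: row i of za matches row i of zb (same patient)."""
    logits = F.normalize(za, dim=1) @ F.normalize(zb, dim=1).t() / tau
    labels = torch.arange(len(za))
    return F.cross_entropy(logits, labels)

def survival_aware_term(z, surv_time, sigma=1.0):
    """Pull together patients with similar survival times (assumed form)."""
    d_surv = (surv_time[:, None] - surv_time[None, :]).abs()
    w = torch.exp(-d_surv / sigma)       # similar survival -> high weight
    d_emb = torch.cdist(z, z)
    return (w * d_emb).mean()

# toy embeddings for two omics modalities of the same 8 patients
z_expr, z_meth = torch.randn(8, 32), torch.randn(8, 32)
surv = torch.rand(8) * 10
loss = infonce(z_expr, z_meth) + 0.1 * survival_aware_term(z_expr, surv)
print(float(loss))
```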

[AI-3] Neural Network Verification for Gliding Drone Control: A Case Study

【速读】:该论文试图解决在自主系统中部署神经网络控制器时的验证问题,特别是针对厘米级仿生滑翔无人机的轨迹跟踪任务。其解决方案的关键在于提出一种用于回归网络鲁棒训练的新方法,并在Vehicle和CORA工具中对案例进行形式化验证,以评估神经网络控制器在复杂系统中的性能与鲁棒性。然而,研究结果表明,当前验证工具的局限性以及系统复杂性限制了验证的规模和效果。

链接: https://arxiv.org/abs/2505.00622
作者: Colin Kessler,Ekaterina Komendantskaya,Marco Casadio,Ignazio Maria Viola,Thomas Flinkow,Albaraa Ammar Othman,Alistair Malhotra,Robbie McPherson
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 18 page pre print, submitted to SAIV 2025 (conference)

点击查看摘要

Abstract:As machine learning is increasingly deployed in autonomous systems, verification of neural network controllers is becoming an active research domain. Existing tools and annual verification competitions suggest that soon this technology will become effective for real-world applications. Our application comes from the emerging field of microflyers that are passively transported by the wind, which may have various uses in weather or pollution monitoring. Specifically, we investigate centimetre-scale bio-inspired gliding drones that resemble Alsomitra macrocarpa diaspores. In this paper, we propose a new case study on verifying Alsomitra-inspired drones with neural network controllers, with the aim of adhering closely to a target trajectory. We show that our system differs substantially from existing VNN and ARCH competition benchmarks, and show that a combination of tools holds promise for verifying such systems in the future, if certain shortcomings can be overcome. We propose a novel method for robust training of regression networks, and investigate formalisations of this case study in Vehicle and CORA. Our verification results suggest that the investigated training methods do improve performance and robustness of neural network controllers in this application, but are limited in scope and usefulness. This is due to systematic limitations of both Vehicle and CORA, and the complexity of our system reducing the scale of reachability, which we investigate in detail. If these limitations can be overcome, it will enable engineers to develop safe and robust technologies that improve people’s lives and reduce our impact on the environment.
zh

[AI-4] Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

【速读】:该论文指出,生成式 AI(Generative AI)的实证评估正面临危机,因为传统机器学习的评估和基准测试策略无法满足现代 GenAI 模型和系统的评估需求。其关键问题包括模型具有几乎无限的输入输出空间、缺乏明确的真实标签以及依赖上下文的强反馈循环和预测依赖性。论文进一步强调,数据泄露(leakage)和数据污染(contamination)是 GenAI 评估中最重要且最难解决的问题。文章提出,人工智能竞赛(AI Competitions)领域已发展出有效的措施来应对泄露问题,因此应将 AI 竞赛视为 GenAI 评估中的黄金标准,并充分挖掘其成果的价值。

链接: https://arxiv.org/abs/2505.00612
作者: D. Sculley,Will Cukierski,Phil Culliton,Sohier Dane,Maggie Demkin,Ryan Holbrook,Addison Howard,Paul Mooney,Walter Reade,Megan Risdal,Nate Keating
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this position paper, we observe that empirical evaluation in Generative AI is at a crisis point since traditional ML evaluation and benchmarking strategies are insufficient to meet the needs of evaluating modern GenAI models and systems. There are many reasons for this, including the fact that these models typically have nearly unbounded input and output spaces, typically do not have a well defined ground truth target, and typically exhibit strong feedback loops and prediction dependence based on context of previous model outputs. On top of these critical issues, we argue that the problems of leakage and contamination are in fact the most important and difficult issues to address for GenAI evaluations. Interestingly, the field of AI Competitions has developed effective measures and practices to combat leakage for the purpose of counteracting cheating by bad actors within a competition setting. This makes AI Competitions an especially valuable (but underutilized) resource. Now is the time for the field to view AI Competitions as the gold standard for empirical rigor in GenAI evaluation, and to harness and harvest their results with according value.
zh

[AI-5] Combining LLMs with Logic-Based Framework to Explain MCTS AAMAS-25

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)在序列规划中缺乏可信度的问题,特别是针对蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)算法的可解释性不足。其解决方案的关键在于设计一种基于计算树逻辑(Computational Tree Logic)的大型语言模型(Large Language Model, LLM)自然语言解释框架,该框架能够将用户查询转化为逻辑与变量陈述,从而确保从搜索树中获得的证据在事实一致性上与环境动态及实际随机控制过程中的约束保持一致。

链接: https://arxiv.org/abs/2505.00610
作者: Ziyan An,Xia Wang,Hendrik Baier,Zirong Chen,Abhishek Dubey,Taylor T. Johnson,Jonathan Sprinkle,Ayan Mukhopadhyay,Meiyi Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted by AAMAS-25 as an extended abstract

点击查看摘要

Abstract:In response to the lack of trust in Artificial Intelligence (AI) for sequential planning, we design a Computational Tree Logic-guided large language model (LLM)-based natural language explanation framework designed for the Monte Carlo Tree Search (MCTS) algorithm. MCTS is often considered challenging to interpret due to the complexity of its search trees, but our framework is flexible enough to handle a wide range of free-form post-hoc queries and knowledge-based inquiries centered around MCTS and the Markov Decision Process (MDP) of the application domain. By transforming user queries into logic and variable statements, our framework ensures that the evidence obtained from the search tree remains factually consistent with the underlying environmental dynamics and any constraints in the actual stochastic control process. We evaluate the framework rigorously through quantitative assessments, where it demonstrates strong performance in terms of accuracy and factual consistency.
zh

[AI-6] Can LLMs Help Improve Analogical Reasoning For Strategic Decisions? Experimental Evidence from Humans and GPT-4

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在战略决策情境中的类比推理能力是否能够与人类相媲美这一问题。研究通过一种新颖的实验设计,评估了GPT4在源到目标匹配中的表现,发现其虽然在召回率上表现出色,能够检索出所有可能的类比,但在精确度上存在不足,常因表面相似性而应用不恰当的类比;相比之下,人类参与者在精确度上占优,但召回率较低,选择的类比更少但因果关系更契合。解决方案的关键在于识别出类比推理中的“匹配”阶段作为一个独立步骤,该步骤需要超越简单检索的准确因果映射,表明当前LLMs在生成候选类比方面具有优势,而人类在识别跨领域深层结构相似性方面仍具比较优势。

链接: https://arxiv.org/abs/2505.00603
作者: Phanish Puranam,Prothit Sen,Maciej Workiewicz
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study investigates whether large language models, specifically GPT4, can match human capabilities in analogical reasoning within strategic decision making contexts. Using a novel experimental design involving source to target matching, we find that GPT4 achieves high recall by retrieving all plausible analogies but suffers from low precision, frequently applying incorrect analogies based on superficial similarities. In contrast, human participants exhibit high precision but low recall, selecting fewer analogies yet with stronger causal alignment. These findings advance theory by identifying matching, the evaluative phase of analogical reasoning, as a distinct step that requires accurate causal mapping beyond simple retrieval. While current LLMs are proficient in generating candidate analogies, humans maintain a comparative advantage in recognizing deep structural similarities across domains. Error analysis reveals that AI errors arise from surface level matching, whereas human errors stem from misinterpretations of causal structure. Taken together, the results suggest a productive division of labor in AI assisted organizational decision making where LLMs may serve as broad analogy generators, while humans act as critical evaluators, applying the most contextually appropriate analogies to strategic problems.
zh

[AI-7] Fast and Low-Cost Genomic Foundation Models via Outlier Removal ICML

【速读】:该论文试图解决基因组基础模型(Genomic Foundation Models, GFMs)在面对对抗攻击时的脆弱性评估问题,其解决方案的关键在于提出首个统一的对抗攻击基准框架GERM,该框架能够系统地评估GFMs对对抗攻击的敏感性,并通过多种攻击算法和防御策略对其进行全面分析,同时关注模型架构、量化方案及训练数据集对模型脆弱性的影响。

链接: https://arxiv.org/abs/2505.00598
作者: Haozheng Luo,Chenghao Qiu,Maojiang Su,Zhihan Zhou,Zoe Mehta,Guo Ye,Jerry Yao-Chieh Hu,Han Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: International Conference on Machine Learning (ICML) 2025

点击查看摘要

Abstract:We propose the first unified adversarial attack benchmark for Genomic Foundation Models (GFMs), named GERM. Unlike existing GFM benchmarks, GERM offers the first comprehensive evaluation framework to systematically assess the vulnerability of GFMs to adversarial attacks. Methodologically, we evaluate the adversarial robustness of five state-of-the-art GFMs using four widely adopted attack algorithms and three defense strategies. Importantly, our benchmark provides an accessible and comprehensive framework to analyze GFM vulnerabilities with respect to model architecture, quantization schemes, and training datasets. Empirically, transformer-based models exhibit greater robustness to adversarial perturbations compared to HyenaDNA, highlighting the impact of architectural design on vulnerability. Moreover, adversarial attacks frequently target biologically significant genomic regions, suggesting that these models effectively capture meaningful sequence features.
zh

[AI-8] A Finite-State Controller Based Offline Solver for Deterministic POMDPs IJCAI2025

【速读】:该论文旨在解决确定性部分可观测马尔可夫决策过程(DetPOMDPs)中的规划问题,这类问题的特点是智能体对环境状态存在不确定性,但能够进行确定性行动和观测。论文提出的解决方案是DetMCVI,它是蒙特卡洛值迭代(MCVI)算法在DetPOMDPs上的改进版本,其关键在于构建有限状态控制器(FSCs)形式的策略,从而高效求解大规模问题,并在成功率和性能上优于现有基线方法。

链接: https://arxiv.org/abs/2505.00596
作者: Alex Schutz,Yang You,Matias Mattamala,Ipek Caliskanelli,Bruno Lacerda,Nick Hawes
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 6 figures. Appendix attached. To be published in Proceedings of IJCAI 2025. For code see this http URL

点击查看摘要

Abstract:Deterministic partially observable Markov decision processes (DetPOMDPs) often arise in planning problems where the agent is uncertain about its environmental state but can act and observe deterministically. In this paper, we propose DetMCVI, an adaptation of the Monte Carlo Value Iteration (MCVI) algorithm for DetPOMDPs, which builds policies in the form of finite-state controllers (FSCs). DetMCVI solves large problems with a high success rate, outperforming existing baselines for DetPOMDPs. We also verify the performance of the algorithm in a real-world mobile robot forest mapping scenario.
zh

[AI-9] Voice Cloning: Comprehensive Survey

【速读】:该论文试图解决语音克隆(Voice Cloning)领域术语不统一以及技术变体复杂的问题,旨在建立标准化的术语体系并系统梳理相关技术。其解决方案的关键在于从说话人适应(Speaker Adaptation)这一基础概念出发,深入探讨少样本(Few-shot)、零样本(Zero-shot)和多语言文本转语音(Multilingual TTS)等关键技术,并总结常用的评估指标与数据集,以促进语音克隆技术的生成与检测研究,从而限制其潜在的滥用风险。

链接: https://arxiv.org/abs/2505.00579
作者: Hussam Azzuni,Abdulmotaleb El Saddik
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 26 pages, 7 figures

点击查看摘要

Abstract:Voice Cloning has rapidly advanced in today’s digital world, with many researchers and corporations working to improve these algorithms for various applications. This article aims to establish a standardized terminology for voice cloning and explore its different variations. It will cover speaker adaptation as the fundamental concept and then delve deeper into topics such as few-shot, zero-shot, and multilingual TTS within that context. Finally, we will explore the evaluation metrics commonly used in voice cloning research and related datasets. This survey compiles the available voice cloning algorithms to encourage research toward its generation and detection to limit its misuse.
zh

[AI-10] TeLoGraF: Temporal Logic Planning via Graph-encoded Flow Matching ICML2025

【速读】:该论文试图解决在信号时序逻辑(STL)规范下学习解决复杂任务的问题,现有方法由于缺乏多样化的STL数据集和有效的时序逻辑信息提取编码器,通常仅考虑固定或参数化的STL规范。解决方案的关键在于提出TeLoGraF,一种结合图神经网络(GNN)编码器和流匹配的框架,以学习通用STL规范的解决方案,并通过收集200K个配对演示的STL规范进行验证。

链接: https://arxiv.org/abs/2505.00562
作者: Yue Meng,Chuchu Fan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注: Accepted to ICML2025

点击查看摘要

Abstract:Learning to solve complex tasks with signal temporal logic (STL) specifications is crucial to many real-world applications. However, most previous works only consider fixed or parametrized STL specifications due to the lack of a diverse STL dataset and encoders to effectively extract temporal logic information for downstream tasks. In this paper, we propose TeLoGraF, Temporal Logic Graph-encoded Flow, which utilizes Graph Neural Networks (GNN) encoder and flow-matching to learn solutions for general STL specifications. We identify four commonly used STL templates and collect a total of 200K specifications with paired demonstrations. We conduct extensive experiments in five simulation environments ranging from simple dynamical models in the 2D space to high-dimensional 7DoF Franka Panda robot arm and Ant quadruped navigation. Results show that our method outperforms other baselines in the STL satisfaction rate. Compared to classical STL planning algorithms, our approach is 10-100X faster in inference and can work on any system dynamics. Besides, we show our graph-encoding method’s capability to solve complex STLs and robustness to out-distribution STL specifications. Code is available at this https URL
zh
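
下面给出"图编码条件 + 流匹配"的训练损失骨架:以(假设已由 GNN 编码器给出的)STL 规范嵌入为条件,学习从噪声到演示轨迹的速度场,回归目标为线性插值路径的恒定速度 x1 − x0;网络结构为占位实现,非 TeLoGraF 官方代码。

```python
import torch
import torch.nn as nn

class CondVelocityNet(nn.Module):
    """Placeholder velocity field v(x_t, t, c) conditioned on a graph embedding c."""
    def __init__(self, traj_dim=32, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(traj_dim + cond_dim + 1, 128), nn.SiLU(),
            nn.Linear(128, traj_dim))
    def forward(self, x_t, t, c):
        return self.net(torch.cat([x_t, c, t], dim=1))

model = CondVelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x1 = torch.randn(64, 32)      # demonstration trajectories (flattened, toy)
    c = torch.randn(64, 16)       # GNN encoding of the STL spec (assumed given)
    x0 = torch.randn_like(x1)     # noise source distribution
    t = torch.rand(64, 1)
    x_t = (1 - t) * x0 + t * x1   # linear interpolation path
    v_target = x1 - x0            # constant target velocity along that path
    loss = ((model(x_t, t, c) - v_target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```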

[AI-11] st-time Correlation Alignment ICML2025

【速读】:该论文旨在解决深度神经网络在训练数据与测试数据分布发生变化时性能下降的问题,特别是在隐私限制导致无法访问训练数据的现实场景中,传统领域自适应方法受限,因此研究者关注于测试时间自适应(TTA)方法。当前TTA方法面临三个主要挑战:过度关注实例对齐而忽视相关性对齐、模型更新过程中复杂的反向传播操作导致计算开销大以及领域遗忘问题。该论文的关键解决方案是提出测试时间相关性对齐(TCA),通过理论分析证明高置信度实例与测试实例之间的相关性对齐可以提升测试性能,并基于此提出两种简单有效的算法:LinearTCA和LinearTCA+,其中LinearTCA通过简单的线性变换实现实例和相关性对齐,无需额外模型更新,而LinearTCA+则作为即插即用模块提升现有TTA方法的性能。

链接: https://arxiv.org/abs/2505.00533
作者: Linjing You,Jiabao Lu,Xiayuan Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML2025

点击查看摘要

Abstract:Deep neural networks often experience performance drops due to distribution shifts between training and test data. Although domain adaptation offers a solution, privacy concerns restrict access to training data in many real-world scenarios. This restriction has spurred interest in Test-Time Adaptation (TTA), which adapts models using only unlabeled test data. However, current TTA methods still face practical challenges: (1) a primary focus on instance-wise alignment, overlooking CORrelation ALignment (CORAL) due to missing source correlations; (2) complex backpropagation operations for model updating, resulting in overhead computation and (3) domain forgetting. To address these challenges, we provide a theoretical analysis to investigate the feasibility of Test-time Correlation Alignment (TCA), demonstrating that correlation alignment between high-certainty instances and test instances can enhance test performances with a theoretical guarantee. Based on this, we propose two simple yet effective algorithms: LinearTCA and LinearTCA+. LinearTCA applies a simple linear transformation to achieve both instance and correlation alignment without additional model updates, while LinearTCA+ serves as a plug-and-play module that can easily boost existing TTA methods. Extensive experiments validate our theoretical insights and show that TCA methods significantly outperform baselines across various tasks, benchmarks and backbones. Notably, LinearTCA improves adaptation accuracy by 5.88% on OfficeHome dataset, while using only 4% maximum GPU memory usage and 0.6% computation time compared to the best baseline TTA method.
zh
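
【代码示意】根据摘要描述,LinearTCA 通过一次线性变换同时完成实例对齐与相关性对齐,思路与经典 CORAL 相近。以下为该思路的最小示意(非论文官方实现,高置信度锚点的筛选方式与各参数均为假设):

```python
import numpy as np

def linear_correlation_align(test_feats, anchor_feats, eps=1e-5):
    """把测试特征的均值与协方差对齐到高置信度锚点特征(CORAL 风格)。

    test_feats:   (n, d) 测试实例特征
    anchor_feats: (m, d) 高置信度实例特征(充当伪源域)
    返回对齐后的测试特征,全程不更新模型参数。
    """
    def cov(x):
        xc = x - x.mean(axis=0, keepdims=True)
        return xc.T @ xc / (len(x) - 1) + eps * np.eye(x.shape[1])

    def mat_pow(c, p):  # 对称正定矩阵的幂(特征分解实现)
        w, v = np.linalg.eigh(c)
        return (v * np.clip(w, eps, None) ** p) @ v.T

    # 先白化测试协方差,再重新着色为锚点协方差
    transform = mat_pow(cov(test_feats), -0.5) @ mat_pow(cov(anchor_feats), 0.5)
    centered = test_feats - test_feats.mean(axis=0)
    return centered @ transform + anchor_feats.mean(axis=0)

rng = np.random.default_rng(0)
test = rng.normal(size=(256, 64))
anchors = rng.normal(size=(32, 64)) * 1.5 + 0.3  # 假设已按 softmax 置信度筛出
print(linear_correlation_align(test, anchors).shape)  # (256, 64)
```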

[AI-12] Safety-Critical Traffic Simulation with Guided Latent Diffusion Model

【速读】:该论文旨在解决现有安全关键交通场景模拟方法在物理合理性不足和生成效率低下方面的问题。其解决方案的关键在于提出一种基于图的变分自编码器(Variational Autoencoder, VAE)与引导潜在扩散模型(Guided Latent Diffusion Model, LDM)相结合的方法,通过学习紧凑的潜在空间来捕捉多智能体复杂交互,并利用扩散模型进行去噪以生成真实轨迹,同时引入新的引导目标以实现可控且具有对抗性的驾驶行为生成,从而提升场景的物理合理性和生成效率。

链接: https://arxiv.org/abs/2505.00515
作者: Mingxing Peng,Ruoyu Yao,Xusen Guo,Yuting Xie,Xianda Chen,Jun Ma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 7 pages, 3 figures

点击查看摘要

Abstract:Safety-critical traffic simulation plays a crucial role in evaluating autonomous driving systems under rare and challenging scenarios. However, existing approaches often generate unrealistic scenarios due to insufficient consideration of physical plausibility and suffer from low generation efficiency. To address these limitations, we propose a guided latent diffusion model (LDM) capable of generating physically realistic and adversarial safety-critical traffic scenarios. Specifically, our model employs a graph-based variational autoencoder (VAE) to learn a compact latent space that captures complex multi-agent interactions while improving computational efficiency. Within this latent space, the diffusion model performs the denoising process to produce realistic trajectories. To enable controllable and adversarial scenario generation, we introduce novel guidance objectives that drive the diffusion process toward producing adversarial and behaviorally realistic driving behaviors. Furthermore, we develop a sample selection module based on physical feasibility checks to further enhance the physical plausibility of the generated scenarios. Extensive experiments on the nuScenes dataset demonstrate that our method achieves superior adversarial effectiveness and generation efficiency compared to existing baselines while maintaining a high level of realism. Our work provides an effective tool for realistic safety-critical scenario simulation, paving the way for more robust evaluation of autonomous driving systems.
zh

[AI-13] Variational OOD State Correction for Offline Reinforcement Learning

【速读】:该论文旨在解决离线强化学习中状态分布偏移(state distributional shift)带来的性能下降问题,其核心挑战在于如何有效纠正分布外(out-of-distribution, OOD)状态以提升策略的稳定性与泛化能力。论文提出的解决方案关键在于设计一种名为密度感知安全感知(Density-Aware Safety Perception, DASP)的方法,该方法通过鼓励智能体优先选择能够导向高数据密度区域的动作,从而引导其在或返回到分布内(in-distribution)的安全区域进行决策。该方法在变分框架下优化目标函数,同时考虑决策的潜在结果及其密度信息,为安全决策提供关键的上下文依据。

链接: https://arxiv.org/abs/2505.00503
作者: Ke Jiang,Wen Jiang,Xiaoyang Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The performance of Offline reinforcement learning is significantly impacted by the issue of state distributional shift, and out-of-distribution (OOD) state correction is a popular approach to address this problem. In this paper, we propose a novel method named Density-Aware Safety Perception (DASP) for OOD state correction. Specifically, our method encourages the agent to prioritize actions that lead to outcomes with higher data density, thereby promoting its operation within or the return to in-distribution (safe) regions. To achieve this, we optimize the objective within a variational framework that concurrently considers both the potential outcomes of decision-making and their density, thus providing crucial contextual information for safe decision-making. Finally, we validate the effectiveness and feasibility of our proposed method through extensive experimental evaluations on the offline MuJoCo and AntMaze suites.
zh

[AI-14] Optimal Interactive Learning on the Job via Facility Location Planning

【速读】:该论文旨在解决协作机器人在持续多任务协作中需要不断适应新任务和用户偏好,同时避免给用户带来过多负担的问题。其解决方案的关键在于提出COIL(Cost-Optimal Interactive Learning)——一种多任务交互规划方法,通过战略性地选择三种查询类型(技能、偏好和帮助)来最小化人类努力。当用户偏好已知时,COIL被建模为无容量限制设施选址(Uncapacitated Facility Location, UFL)问题,从而在多项式时间内使用现成的近似算法实现有界次优规划;在用户偏好存在不确定性的情况下,通过引入一步信念空间规划来扩展该模型,保持多项式时间性能。

链接: https://arxiv.org/abs/2505.00490
作者: Shivam Vats,Michelle Zhao,Patrick Callaghan,Mingxi Jia,Maxim Likhachev,Oliver Kroemer,George Konidaris
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to Robotics: Science and Systems (RSS) 2025

点击查看摘要

Abstract:Collaborative robots must continually adapt to novel tasks and user preferences without overburdening the user. While prior interactive robot learning methods aim to reduce human effort, they are typically limited to single-task scenarios and are not well-suited for sustained, multi-task collaboration. We propose COIL (Cost-Optimal Interactive Learning) – a multi-task interaction planner that minimizes human effort across a sequence of tasks by strategically selecting among three query types (skill, preference, and help). When user preferences are known, we formulate COIL as an uncapacitated facility location (UFL) problem, which enables bounded-suboptimal planning in polynomial time using off-the-shelf approximation algorithms. We extend our formulation to handle uncertainty in user preferences by incorporating one-step belief space planning, which uses these approximation algorithms as subroutines to maintain polynomial-time performance. Simulated and physical experiments on manipulation tasks show that our framework significantly reduces the amount of work allocated to the human while maintaining successful task completion.
zh
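
【代码示意】COIL 在偏好已知时把查询选择化为无容量限制设施选址(UFL)问题,可用经典贪心近似在多项式时间内求解。以下用一个玩具实例演示此类贪心算法的骨架(设施、客户与成本数值均为虚构,非论文原实现):

```python
def greedy_ufl(open_costs, serve_costs):
    """无容量限制设施选址的贪心近似(按单位客户摊销成本选批)。

    open_costs:  {facility: 开设成本} —— 如某类查询(技能/偏好/帮助)的人力代价
    serve_costs: {(facility, client): 服务成本} —— 用该查询完成某任务的代价
    返回开设的设施集合与每个客户的指派。
    """
    clients = {c for _, c in serve_costs}
    opened, assign = set(), {}
    while len(assign) < len(clients):
        best = None
        unserved = [c for c in clients if c not in assign]
        for f in open_costs:
            # 按服务成本升序,尝试让 f 一次接管 k 个未服务客户
            cand = sorted(unserved, key=lambda c: serve_costs[(f, c)])
            for k in range(1, len(cand) + 1):
                batch = cand[:k]
                cost = (0 if f in opened else open_costs[f]) + sum(
                    serve_costs[(f, c)] for c in batch)
                ratio = cost / k  # 单位客户的摊销成本
                if best is None or ratio < best[0]:
                    best = (ratio, f, batch)
        _, f, batch = best
        opened.add(f)
        for c in batch:
            assign[c] = f
    return opened, assign

open_costs = {"skill": 4.0, "pref": 2.0, "help": 1.0}
serve_costs = {(f, t): c for f, costs in {
    "skill": [1, 1, 5], "pref": [2, 3, 2], "help": [6, 6, 1]}.items()
    for t, c in enumerate(costs)}
print(greedy_ufl(open_costs, serve_costs))
```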

[AI-15] MULE: Multi-terrain and Unknown Load Adaptation for Effective Quadrupedal Locomotion

【速读】:该论文旨在解决四足机器人在负载变化和复杂地形下适应性不足的问题。传统基于模型预测控制(Model Predictive Control, MPC)的方法依赖于预定义的步态计划或轨迹生成器,限制了其在非结构化环境中的灵活性。论文提出的解决方案是采用一种自适应强化学习(Adaptive Reinforcement Learning)框架,其关键在于包含一个基准策略用于基础运动控制,以及一个自适应策略用于学习修正动作,以在负载变化情况下保持稳定性和提高指令跟踪性能。

链接: https://arxiv.org/abs/2505.00488
作者: Vamshi Kumar Kurva,Shishir Kolathaya
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Preprint under review

点击查看摘要

Abstract:Quadrupedal robots are increasingly deployed for load-carrying tasks across diverse terrains. While Model Predictive Control (MPC)-based methods can account for payload variations, they often depend on predefined gait schedules or trajectory generators, limiting their adaptability in unstructured environments. To address these limitations, we propose an Adaptive Reinforcement Learning (RL) framework that enables quadrupedal robots to dynamically adapt to both varying payloads and diverse terrains. The framework consists of a nominal policy responsible for baseline locomotion and an adaptive policy that learns corrective actions to preserve stability and improve command tracking under payload variations. We validate the proposed approach through large-scale simulation experiments in Isaac Gym and real-world hardware deployment on a Unitree Go1 quadruped. The controller was tested on flat ground, slopes, and stairs under both static and dynamic payload changes. Across all settings, our adaptive controller consistently outperformed the nominal controller in tracking body height and velocity commands, demonstrating enhanced robustness and adaptability without requiring explicit gait design or manual tuning.
zh

[AI-16] Analysis of the vulnerability of machine learning regression models to adversarial attacks using data from 5G wireless networks

【速读】:该论文试图解决对抗性攻击对回归机器学习模型性能的影响问题,以及如何有效检测带有对抗性异常的数据。解决方案的关键在于利用DeepMIMO模拟器生成脚本并进行数据分析,通过FGSM(Fast Gradient Sign Method)方法实施对抗性攻击以最大化梯度,并采用LightGBM二分类器对异常数据进行高精度识别,从而实现对网络流量和传输数据中恶意活动的快速检测。

链接: https://arxiv.org/abs/2505.00487
作者: Leonid Legashev,Artur Zhigalov,Denis Parfenov
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This article describes the process of creating a script and conducting an analytical study of a dataset using the DeepMIMO emulator. An adversarial attack was carried out using the FGSM method to maximize the gradient. A comparison is made of the effectiveness of binary classifiers in the task of detecting distorted data. The dynamics of changes in the quality indicators of the regression model were analyzed in conditions without adversarial attacks, during an adversarial attack, and when the distorted data was isolated. It is shown that an adversarial FGSM attack with gradient maximization leads to an increase in the value of the MSE metric by 33% and a decrease in the R2 indicator by 10% on average. The LightGBM binary classifier effectively identifies data with adversarial anomalies with 98% accuracy. Regression machine learning models are susceptible to adversarial attacks, but rapid analysis of network traffic and data transmitted over the network makes it possible to identify malicious activity.
zh
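
【代码示意】摘要中的攻击是对回归模型应用 FGSM:沿 MSE 损失对输入的梯度符号方向加扰动以最大化误差。以下为 PyTorch 最小示意(模型结构与数据均为占位假设,并非原论文的 5G 信道数据管线):

```python
import torch
import torch.nn as nn

def fgsm_regression(model, x, y, epsilon=0.05):
    """对回归模型的 FGSM 攻击:沿 MSE 损失梯度符号方向扰动输入。"""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.mse_loss(model(x_adv), y)
    loss.backward()
    # 最大化损失:向梯度符号方向移动 epsilon
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

# 占位回归模型,仅作演示(实际任务为 5G 信道特征 -> 回归目标)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(64, 16), torch.randn(64, 1)
x_adv = fgsm_regression(model, x, y)
mse_clean = nn.functional.mse_loss(model(x), y).item()
mse_adv = nn.functional.mse_loss(model(x_adv), y).item()
print(f"clean MSE={mse_clean:.3f}, adversarial MSE={mse_adv:.3f}")
```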

[AI-17] Rule-based Classifier Models

【速读】:该论文试图解决法律领域分类器模型仅依赖案件事实而忽视法律规则(特别是判例中的裁判要旨,ratio decidendi)的问题。解决方案的关键在于将规则集纳入分类器框架,通过结合事实与法律规则来增强分类器的推理能力,从而更准确地推导新案件的判决结果。此方法基于Canavotto等(2023)提出的基于规则的先例约束推理模型,并进一步扩展了其在因素层次结构中的应用。

链接: https://arxiv.org/abs/2505.00474
作者: Cecilia Di Florio,Huimin Dong,Antonino Rotolo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 1 figure. Extended version of a short paper accepted to ICAIL 2025. This is the authors’ version of the work. It is posted here for your personal use

点击查看摘要

Abstract:We extend the formal framework of classifier models used in the legal domain. While the existing classifier framework characterises cases solely through the facts involved, legal reasoning fundamentally relies on both facts and rules, particularly the ratio decidendi. This paper presents an initial approach to incorporating sets of rules within a classifier. Our work builds on that of Canavotto et al. (2023), who developed the rule-based reason model of precedential constraint within a hierarchy of factors. We demonstrate how decisions for new cases can be inferred using this enriched rule-based classifier framework. Additionally, we provide an example of how the time element and the hierarchy of courts can be used in the new classifier framework.
zh

[AI-18] UserCentrix: An Agent ic Memory-augmented AI Framework for Smart Spaces

【速读】:该论文旨在解决智能环境中AI系统在动态适应用户需求、优化资源分配及提升决策效率方面的挑战。其解决方案的关键在于提出UserCentrix框架,该框架结合了生成式AI与多智能体系统,通过个性化大语言模型代理、记忆管理、混合分层控制系统以及协作式代理协商策略,实现上下文感知的主动决策和资源高效交互。

链接: https://arxiv.org/abs/2505.00472
作者: Alaa Saleh,Sasu Tarkoma,Praveen Kumar Donta,Naser Hossein Motlagh,Schahram Dustdar,Susanna Pirttikangas,Lauri Lovén
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Agentic AI, with its autonomous and proactive decision-making, has transformed smart environments. By integrating Generative AI (GenAI) and multi-agent systems, modern AI frameworks can dynamically adapt to user preferences, optimize data management, and improve resource allocation. This paper introduces UserCentrix, an agentic memory-augmented AI framework designed to enhance smart spaces through dynamic, context-aware decision-making. This framework integrates personalized Large Language Model (LLM) agents that leverage user preferences and LLM memory management to deliver proactive and adaptive assistance. Furthermore, it incorporates a hybrid hierarchical control system, balancing centralized and distributed processing to optimize real-time responsiveness while maintaining global situational awareness. UserCentrix achieves resource-efficient AI interactions by embedding memory-augmented reasoning, cooperative agent negotiation, and adaptive orchestration strategies. Our key contributions include (i) a self-organizing framework with proactive scaling based on task urgency, (ii) a Value of Information (VoI)-driven decision-making process, (iii) a meta-reasoning personal LLM agent, and (iv) an intelligent multi-agent coordination system for seamless environment adaptation. Experimental results across various models confirm the effectiveness of our approach in enhancing response accuracy, system efficiency, and computational resource management in real-world application.
zh

[AI-19] Data Therapist: Eliciting Domain Knowledge from Subject Matter Experts Using Large Language Models IEEE-VIS2025

【速读】:该论文试图解决数据可视化中因缺乏领域特定上下文而带来的挑战,特别是数据来源、质量及使用意图等隐性知识难以显式表达的问题。解决方案的关键在于提出Data Therapist,这是一个基于大型语言模型的网页工具,通过混合主动性流程结合迭代问答与交互式标注,帮助领域专家将隐性知识外化,从而构建结构化的知识库以支持人工和自动化的可视化设计。

链接: https://arxiv.org/abs/2505.00455
作者: Sungbok Shin,Hyeon Jeon,Sanghyun Hong,Niklas Elmqvist
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE VIS2025

点击查看摘要

Abstract:Effective data visualization requires not only technical proficiency but also a deep understanding of the domain-specific context in which data exists. This context often includes tacit knowledge about data provenance, quality, and intended use, which is rarely explicit in the dataset itself. We present the Data Therapist, a web-based tool that helps domain experts externalize this implicit knowledge through a mixed-initiative process combining iterative QA with interactive annotation. Powered by a large language model, the system analyzes user-supplied datasets, prompts users with targeted questions, and allows annotation at varying levels of granularity. The resulting structured knowledge base can inform both human and automated visualization design. We evaluated the tool in a qualitative study involving expert pairs from Molecular Biology, Accounting, Political Science, and Usable Security. The study revealed recurring patterns in how experts reason about their data and highlights areas where AI support can improve visualization design.
zh

[AI-20] Per-Domain Generalizing Policies: On Validation Instances and Scaling Behavior

【速读】:该论文试图解决在特定领域内泛化动作策略的可扩展性问题,即如何使模型在从少量训练实例到大量测试实例的规模变化中保持性能。解决方案的关键在于动态生成验证集,而非使用固定的验证集,通过在运行时逐步增加实例规模,只要信息量充足,从而提升模型的泛化能力。此外,论文还引入了改进的评估方法,系统地生成测试实例以确保每个实例规模下的覆盖性能具有给定置信度。

链接: https://arxiv.org/abs/2505.00439
作者: Timo P. Gros,Nicola J. Müller,Daniel Fiser,Isabel Valera,Verena Wolf,Jörg Hoffmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 tables, 3 figures, 3 algorithms

点击查看摘要

Abstract:Recent work has shown that successful per-domain generalizing action policies can be learned. Scaling behavior, from small training instances to large test instances, is the key objective; and the use of validation instances larger than training instances is one key to achieving it. Prior work has used fixed validation sets. Here, we introduce a method generating the validation set dynamically, on the fly, increasing instance size so long as doing so remains informative. We also introduce refined methodology for evaluating scaling behavior, generating test instances systematically to guarantee a given confidence in coverage performance for each instance size. In experiments, dynamic validation improves scaling behavior of GNN policies in all 9 domains used.
zh
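
【代码示意】"动态验证"的流程可以概括为:从小规模实例开始验证,只要该规模仍有区分度(信息量)就继续增大。以下为该流程骨架的示意(停止判据、增长因子与生成器接口均为假设,非论文原实现):

```python
def dynamic_validation(policy, generate_instances, coverage,
                       start_size=5, max_size=200, low=0.05):
    """逐步增大验证实例规模,返回 (规模, 覆盖率) 曲线(示意)。

    generate_instances(size) -> 该规模的实例列表
    coverage(policy, instances) -> 策略在实例上的求解成功率 [0, 1]
    覆盖率贴近 0 时说明实例已大到几乎全部失败,不再有信息量。
    """
    curve, size = [], start_size
    while size <= max_size:
        cov = coverage(policy, generate_instances(size))
        curve.append((size, cov))
        if cov < low:  # 该规模已无区分度,停止增长
            break
        size = int(size * 1.5) + 1
    return curve

# 玩具用法:假设覆盖率随实例规模线性衰减的策略
demo = dynamic_validation(
    policy=None,
    generate_instances=lambda s: [s] * 20,
    coverage=lambda p, insts: max(0.0, 1.0 - insts[0] / 120))
print(demo)
```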

[AI-21] ScaleTrack: Scaling and back-tracking Automated GUI Agents

【速读】:该论文旨在解决自动化GUI代理在训练过程中面临的数据不足和历史行为回溯缺失的问题,具体而言,现有方法在GUI定位(GUI grounding)阶段缺乏足够的训练数据,且在GUI规划(GUI planning)阶段忽视了对历史行为的回溯分析。其解决方案的关键在于提出ScaleTrack框架,通过扩展GUI定位能力和引入历史行为回溯的规划策略,提升自动化GUI代理的性能。该框架通过收集多样化的GUI样本并统一模板进行训练,同时设计了一种新的训练策略,能够从当前GUI图像预测下一步操作,并回溯导致该图像的历史操作,从而有效描述GUI环境的演变规则。

链接: https://arxiv.org/abs/2505.00416
作者: Jing Huang,Zhixiong Zeng,Wenkang Han,Yufeng Zhong,Liming Zheng,Shuai Fu,Jingyuan Chen,Lin Ma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automated GUI agents aim to facilitate user interaction by automatically performing complex tasks in digital environments, such as web, mobile, and desktop devices. They receive textual task instructions and GUI descriptions to generate executable actions (e.g., click) and operation boxes step by step. Training a GUI agent mainly involves grounding and planning stages, in which the GUI grounding focuses on finding the execution coordinates according to the task, while the planning stage aims to predict the next action based on historical actions. However, previous work suffers from the limitations of insufficient training data for GUI grounding, as well as the ignorance of backtracking historical behaviors for GUI planning. To handle the above challenges, we propose ScaleTrack, a training framework that scales grounding and backtracking planning for automated GUI agents. We carefully collected GUI samples of different synthesis criteria from a wide range of sources, and unified them into the same template for training GUI grounding models. Moreover, we design a novel training strategy that predicts the next action from the current GUI image, while also backtracking the historical actions that led to the GUI image. In this way, ScaleTrack explains the correspondence between GUI images and actions, which effectively describes the evolution rules of the GUI environment. Extensive experimental results demonstrate the effectiveness of ScaleTrack. Data and code will be available at url.
zh

[AI-22] DeepSTA: A Spatial-Temporal Attention Network for Logistics Delivery Timely Rate Prediction in Anomaly Conditions CIKM2023

【速读】:该论文旨在解决物流行业中快递员配送及时率预测的问题,特别是在异常情况(如疫情爆发)下,由于配送及时率显著下降且波动较大,传统方法难以有效应对。现有研究对物流场景关注不足,且多数预测方法未能显式建模异常事件,导致信息丢失。此外,由于某些异常事件发生频率较低,传统数据驱动方法在这些场景下的表现较差。论文提出的解决方案是构建一种深度时空注意力模型(DeepSTA),其关键在于设计异常时空学习模块以避免信息丢失,并利用Node2vec、图神经网络和长短期记忆网络捕捉快递员的空间-时间依赖性;同时引入异常模式注意力模块,通过注意力机制存储快递员的异常特征模式,以应对异常情况下训练数据不足的问题。

链接: https://arxiv.org/abs/2505.00402
作者: Jinhui Yi,Huan Yan,Haotian Wang,Jian Yuan,Yong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by CIKM 2023

点击查看摘要

Abstract:Prediction of couriers' delivery timely rates in advance is essential to the logistics industry, enabling companies to take preemptive measures to ensure the normal operation of delivery services. This becomes even more critical during anomaly conditions like an epidemic outbreak, during which couriers' delivery timely rates decline markedly and fluctuate significantly. Existing studies pay less attention to the logistics scenario. Moreover, many works focusing on prediction tasks in anomaly scenarios fail to explicitly model abnormal events, e.g., treating external factors equally with other features, resulting in great information loss. Further, since some anomalous events occur infrequently, traditional data-driven methods perform poorly in these scenarios. To deal with these issues, we propose a deep spatial-temporal attention model, named DeepSTA. To be specific, to avoid information loss, we design an anomaly spatio-temporal learning module that employs a recurrent neural network to model incident information. Additionally, we utilize Node2vec to model correlations between road districts, and adopt graph neural networks and long short-term memory to capture the spatial-temporal dependencies of couriers. To tackle the issue of insufficient training data in abnormal circumstances, we propose an anomaly pattern attention module that adopts a memory network to store couriers' anomaly feature patterns via attention mechanisms. The experiments on real-world logistics datasets during the COVID-19 outbreak in 2022 show the model outperforms the best baselines by 12.11% in MAE and 13.71% in MSE, demonstrating its superior performance over multiple competitive baselines.
zh
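
【代码示意】DeepSTA 的"异常模式注意力模块"本质上是一个记忆网络:维护若干可学习的异常模式槽,用注意力读取后拼接到当前特征上,以弥补异常场景训练样本不足。以下为该机制的最小示意(特征维度与槽数均为假设,非论文原实现):

```python
import torch
import torch.nn as nn

class AnomalyPatternMemory(nn.Module):
    """记忆网络示意:从 n_slots 个可学习异常模式槽中按注意力读取信息,
    并与当前异常特征拼接,供下游预测头使用。"""
    def __init__(self, d=64, n_slots=16):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_slots, d) * 0.02)
        self.scale = d ** -0.5

    def forward(self, query):            # query: (batch, d) 快递员当前异常特征
        attn = torch.softmax(query @ self.memory.T * self.scale, dim=-1)
        read = attn @ self.memory        # (batch, d) 读取到的模式组合
        return torch.cat([query, read], dim=-1)

mem = AnomalyPatternMemory()
print(mem(torch.randn(8, 64)).shape)  # torch.Size([8, 128])
```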

[AI-23] Learning to Estimate Package Delivery Time in Mixed Imbalanced Delivery and Pickup Logistics Services

【速读】:该论文旨在解决混合物流场景下包裹配送时间准确估计的问题,特别是在快递员同时处理大量配送和少量取件任务时,现有研究未能充分考虑取件对快递员决策行为的更大影响,因其具有更严格的时限约束。解决方案的关键在于提出一种基于Transformer的多任务包裹配送时间预测模型(TransPDT),通过Transformer编码器捕捉快递员历史行程与待派送包裹的空间-时间依赖关系,并利用注意力机制在不平衡数据集中学习取件模式,同时将路径预测作为配送时间预测的辅助任务,结合快递员的空间移动规律进行优化。

链接: https://arxiv.org/abs/2505.00375
作者: Jinhui Yi,Huan Yan,Haotian Wang,Jian Yuan,Yong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ACM SIGSPATIAL 2024

点击查看摘要

Abstract:Accurately estimating package delivery time is essential to the logistics industry, which enables reasonable work allocation and on-time service guarantee. This becomes even more necessary in mixed logistics scenarios where couriers handle a high volume of delivery and a smaller number of pickup simultaneously. However, most of the related works treat the pickup and delivery patterns on couriers’ decision behavior equally, neglecting that the pickup has a greater impact on couriers’ decision-making compared to the delivery due to its tighter time constraints. In such context, we have three main challenges: 1) multiple spatiotemporal factors are intricately interconnected, significantly affecting couriers’ delivery behavior; 2) pickups have stricter time requirements but are limited in number, making it challenging to model their effects on couriers’ delivery process; 3) couriers’ spatial mobility patterns are critical determinants of their delivery behavior, but have been insufficiently explored. To deal with these, we propose TransPDT, a Transformer-based multi-task package delivery time prediction model. We first employ the Transformer encoder architecture to capture the spatio-temporal dependencies of couriers’ historical travel routes and pending package sets. Then we design the pattern memory to learn the patterns of pickup in the imbalanced dataset via attention mechanism. We also set the route prediction as an auxiliary task of delivery time prediction, and incorporate the prior courier spatial movement regularities in prediction. Extensive experiments on real industry-scale datasets demonstrate the superiority of our method. A system based on TransPDT is deployed internally in JD Logistics to track more than 2000 couriers handling hundreds of thousands of packages per day in Beijing.
zh

[AI-24] Urban Air Mobility as a System of Systems: An LLM -Enhanced Holonic Approach

【速读】:该论文试图解决城市空中交通(Urban Air Mobility, UAM)在系统架构、规划、任务管理和执行方面面临的挑战,特别是传统架构在动态复杂环境中难以实现可扩展性、适应性和无缝资源集成的问题。解决方案的关键在于提出一种基于智能全息架构(holonic architecture)的方法,该架构整合了大型语言模型(Large Language Model, LLM),使各单元能够半自主运行,并实现实时协调。LLM通过处理自然语言输入、生成适应性计划以及管理如天气变化或空域限制等干扰因素,提升了系统的动态资源分配与自主适应能力,从而构建出更具弹性和效率的城市交通网络。

链接: https://arxiv.org/abs/2505.00368
作者: Ahmed R. Sadik,Muhammad Ashfaq,Niko Mäkitalo,Tommi Mikkonen
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Urban Air Mobility (UAM) is an emerging System of Systems (SoS) that faces challenges in system architecture, planning, task management, and execution. Traditional architectural approaches struggle with scalability, adaptability, and seamless resource integration within dynamic and complex environments. This paper presents an intelligent holonic architecture that incorporates a Large Language Model (LLM) to manage the complexities of UAM. Holons function semi-autonomously, allowing for real-time coordination among air taxis, ground transport, and vertiports. LLMs process natural language inputs, generate adaptive plans, and manage disruptions such as weather changes or airspace restrictions. Through a case study of multimodal transportation with electric scooters and air taxis, we demonstrate how this architecture enables dynamic resource allocation, real-time replanning, and autonomous adaptation without centralized control, creating more resilient and efficient urban transportation networks. By advancing decentralized control and AI-driven adaptability, this work lays the groundwork for resilient, human-centric UAM ecosystems, with future efforts targeting hybrid AI integration and real-world validation.
zh

[AI-25] SacFL: Self-Adaptive Federated Continual Learning for Resource-Constrained End Devices

【速读】:该论文旨在解决在联邦持续学习(Federated Continual Learning, FCL)场景下,端设备面临的存储资源有限、任务迁移检测自主性差以及应对新型对抗性任务困难等问题。其解决方案的关键在于提出一种名为SacFL的新框架,该框架采用编码器-解码器结构,分离任务鲁棒性和任务敏感性组件,从而通过保留轻量级任务敏感组件显著降低存储需求;同时利用对比学习引入自主的数据漂移检测机制,使设备能够自主判断新任务是否出现及其性质,进而触发持续学习或攻击防御策略,提升了端设备的实用性和适应性。

链接: https://arxiv.org/abs/2505.00365
作者: Zhengyi Zhong,Weidong Bao,Ji Wang,Jianguo Chen,Lingjuan Lyu,Wei Yang Bryan Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by TNNLS 2025

点击查看摘要

Abstract:The proliferation of end devices has led to a distributed computing paradigm, wherein on-device machine learning models continuously process diverse data generated by these devices. The dynamic nature of this data, characterized by continuous changes or data drift, poses significant challenges for on-device models. To address this issue, continual learning (CL) is proposed, enabling machine learning models to incrementally update their knowledge and mitigate catastrophic forgetting. However, the traditional centralized approach to CL is unsuitable for end devices due to privacy and data volume concerns. In this context, federated continual learning (FCL) emerges as a promising solution, preserving user data locally while enhancing models through collaborative updates. Aiming at the challenges of limited storage resources for CL, poor autonomy in task shift detection, and difficulty in coping with new adversarial tasks in the FCL scenario, we propose a novel FCL framework named SacFL. SacFL employs an Encoder-Decoder architecture to separate task-robust and task-sensitive components, significantly reducing storage demands by retaining lightweight task-sensitive components for resource-constrained end devices. Moreover, SacFL leverages contrastive learning to introduce an autonomous data shift detection mechanism, enabling it to discern whether a new task has emerged and whether it is a benign task. This capability ultimately allows the device to autonomously trigger CL or an attack defense strategy without additional information, which is more practical for end devices. Comprehensive experiments conducted on multiple text and image datasets, such as Cifar100 and THUCNews, have validated the effectiveness of SacFL in both class-incremental and domain-incremental scenarios. Furthermore, a demo system has been developed to verify its practicality.
zh

[AI-26] NStream: Applying Tightest Neighbors to Micro-Clusters to Define Multi-Density Clusters in Streaming Data

【速读】:该论文试图解决数据流聚类中现有算法难以同时处理任意形状、多密度、高维数据并保持强异常值鲁棒性的问题,尤其是在数据密度复杂变化时聚类质量显著下降的问题。解决方案的关键在于提出基于紧邻点(Tightest Neighbors)的聚类算法,并构建基于骨架集(Skeleton Set)的数据流聚类理论,从而实现对多密度数据流的微聚类演化总结,并通过局部敏感哈希(Locality-Sensitive Hashing, LSH)提升高维情况下的效率。

链接: https://arxiv.org/abs/2505.00359
作者: Qifen Zeng,Haomin Bao,Yuanzhuo Hu,Zirui Zhang,Yuheng Zheng,Luosheng Wen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 21 pages, 9 figures, 8 tables, under review at Expert Systems with Applications (ESWA)

点击查看摘要

Abstract:In data stream clustering, systematic theory of stream clustering algorithms remains relatively scarce. Recently, density-based methods have gained attention. However, existing algorithms struggle to simultaneously handle arbitrarily shaped, multi-density, high-dimensional data while maintaining strong outlier resistance. Clustering quality significantly deteriorates when data density varies complexly. This paper proposes a clustering algorithm based on the novel concept of Tightest Neighbors and introduces a data stream clustering theory based on the Skeleton Set. Based on these theories, this paper develops a new method, TNStream, a fully online algorithm. The algorithm adaptively determines the clustering radius based on local similarity, summarizing the evolution of multi-density data streams in micro-clusters. It then applies a Tightest Neighbors-based clustering algorithm to form final clusters. To improve efficiency in high-dimensional cases, Locality-Sensitive Hashing (LSH) is employed to structure micro-clusters, addressing the challenge of storing k-nearest neighbors. TNStream is evaluated on various synthetic and real-world datasets using different clustering metrics. Experimental results demonstrate its effectiveness in improving clustering quality for multi-density data and validate the proposed data stream clustering theory.
zh
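
【代码示意】TNStream 在高维场景用局部敏感哈希(LSH)组织微簇,避免存储精确的 k 近邻。以下为随机超平面 LSH 建桶与查询的最小示意(哈希位数等超参数为假设,非论文原实现):

```python
import numpy as np

class LSHIndex:
    """随机超平面 LSH 示意:为微簇中心建桶,近似最近邻查询,
    从而替代高维下昂贵的精确 k 近邻维护。"""
    def __init__(self, dim, n_bits=12, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))
        self.buckets = {}

    def _key(self, x):  # 由 n_bits 个符号位组成的桶键
        return tuple((self.planes @ x > 0).astype(int))

    def add(self, idx, x):
        self.buckets.setdefault(self._key(x), []).append(idx)

    def query(self, x):  # 返回同桶候选,再在候选内做精确距离比较即可
        return self.buckets.get(self._key(x), [])

dim = 32
index = LSHIndex(dim)
centers = np.random.default_rng(1).normal(size=(100, dim))  # 假设的微簇中心
for i, c in enumerate(centers):
    index.add(i, c)
print(index.query(centers[3]))  # 同桶候选,含 3 本身
```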

[AI-27] Optimizing Deep Neural Networks using Safety-Guided Self Compression

【速读】:该论文旨在解决在资源受限设备上部署深度神经网络时,如何有效平衡模型规模缩减与性能保持的问题。其解决方案的关键在于提出了一种基于安全性的量化框架,该框架通过保留集系统地进行神经网络权重的剪枝与量化,从而在不牺牲准确率的前提下优化模型复杂度。

链接: https://arxiv.org/abs/2505.00350
作者: Mohammad Zbeeb,Mariam Salman,Mohammad Bazzi,Ammar Mohanna
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: A Preprint

点击查看摘要

Abstract:The deployment of deep neural networks on resource-constrained devices necessitates effective model compression strategies that judiciously balance the reduction of model size with the preservation of performance. This study introduces a novel safety-driven quantization framework that leverages preservation sets to systematically prune and quantize neural network weights, thereby optimizing model complexity without compromising accuracy. The proposed methodology is rigorously evaluated on both a convolutional neural network (CNN) and an attention-based language model, demonstrating its applicability across diverse architectural paradigms. Experimental results reveal that our framework achieves up to a 2.5% enhancement in test accuracy relative to the original unquantized models while maintaining 60% of the initial model size. In comparison to conventional quantization techniques, our approach not only augments generalization by eliminating parameter noise and retaining essential weights but also reduces variance, thereby ensuring the retention of critical model features. These findings underscore the efficacy of safety-driven quantization as a robust and reliable strategy for the efficient optimization of deep learning models. The implementation and comprehensive experimental evaluations of our framework are publicly accessible at GitHub.
zh
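
【代码示意】"保留集"驱动的安全压缩可以理解为:每提高一档剪枝率,就在保留集上复核精度,超出容忍度即回退。以下为该控制逻辑的骨架示意(容忍度、剪枝档位与评估函数均为假设,非论文原实现):

```python
import copy
import torch
import torch.nn as nn

def safety_guided_prune(model, preservation_loader, evaluate, tol=0.005,
                        ratios=(0.2, 0.4, 0.6)):
    """逐档提高幅值剪枝率;保留集精度下降超过 tol 即停在上一档(示意)。"""
    base_acc = evaluate(model, preservation_loader)
    best = copy.deepcopy(model)
    for r in ratios:
        pruned = copy.deepcopy(best)
        for m in pruned.modules():
            if isinstance(m, nn.Linear):
                w = m.weight.data
                k = max(1, int(r * w.numel()))
                thresh = w.abs().flatten().kthvalue(k).values
                w[w.abs() < thresh] = 0.0  # 幅值剪枝
        if base_acc - evaluate(pruned, preservation_loader) <= tol:
            best = pruned  # 安全:接受更高压缩率
        else:
            break          # 不安全:停止并保留上一档
    return best

# 玩具用法:虚构网络、保留集与精度函数
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
dummy = [(torch.randn(16, 8), torch.randint(0, 2, (16,)))]
def acc(m, loader):
    x, y = loader[0]
    return (m(x).argmax(1) == y).float().mean().item()
print(safety_guided_prune(net, dummy, acc))
```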

[AI-28] Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics

【速读】:该论文试图解决大规模模型训练和微调过程中因优化器状态信息占用大量内存而导致的高昂计算成本问题(stateful optimizers)。解决方案的关键在于提出一种新型优化器SOLO,通过超低精度量化技术实现极轻量级的状态负载,使得优化器能够在3位甚至2位精度下运行。这一突破通过解决无符号量化中的信号淹没问题和有符号量化中梯度方差增大的问题得以实现,具体方法包括针对前者设计定制化的对数量化方案,以及为后者设置特定精度的动量值。

链接: https://arxiv.org/abs/2505.00347
作者: Cong Xu,Wenbin Liang,Mo Yu,Anan Liu,Ke-Yue Zhang,Lizhuang Ma,Jianyong Wang,Jun Wang,Wei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages

点击查看摘要

Abstract:The explosion in model sizes leads to continued growth in prohibitive training/fine-tuning costs, particularly for stateful optimizers, which maintain auxiliary information of even 2x the model size to achieve optimal convergence. We therefore present in this work a novel type of optimizer that carries extremely lightweight state overhead, achieved through ultra-low-precision quantization. While previous efforts have achieved certain success with 8-bit or 4-bit quantization, our approach enables optimizers to operate at precision as low as 3 bits, or even 2 bits per state element. This is accomplished by identifying and addressing two critical challenges: the signal swamping problem in unsigned quantization that results in unchanged state dynamics, and the rapidly increased gradient variance in signed quantization that leads to incorrect descent directions. The theoretical analysis suggests a tailored logarithmic quantization for the former and a precision-specific momentum value for the latter. Consequently, the proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss. We hope that SOLO can contribute to overcoming the bottleneck in computational resources, thereby promoting greater accessibility in fundamental research.
zh
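
【代码示意】摘要指出:无符号状态(如二阶动量)做线性超低位量化时会出现"信号淹没",小更新被大状态吞没;对数域量化则在小数值区间分配更多量化级别。以下为 3 位对数量化/反量化的最小示意(量化区间为假设):

```python
import numpy as np

def log_quantize(state, bits=3, lo=1e-8, hi=1.0):
    """对非负优化器状态做对数域均匀量化(示意)。

    对数量化让 [lo, hi] 内每个数量级获得相近数量的级别,
    缓解线性量化下小数值被统一归零的信号淹没问题。
    """
    levels = 2 ** bits
    log_lo, log_hi = np.log(lo), np.log(hi)
    t = (np.log(np.clip(state, lo, hi)) - log_lo) / (log_hi - log_lo)
    return np.round(t * (levels - 1)).astype(np.uint8)  # 每元素仅需 3 位

def log_dequantize(code, bits=3, lo=1e-8, hi=1.0):
    levels = 2 ** bits
    t = code.astype(np.float64) / (levels - 1)
    return np.exp(np.log(lo) + t * (np.log(hi) - np.log(lo)))

v = np.array([1e-7, 1e-5, 1e-3, 0.1])
code = log_quantize(v)
print(code, log_dequantize(code))  # 小数值仍可区分;线性量化会将其全部压成 0
```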

[AI-29] CognitionNet: A Collaborative Neural Network for Play Style Discovery in Online Skill Gaming Platform

【速读】:该论文旨在解决从在线游戏平台的遥测数据中自动识别玩家心理状态和游戏策略的问题,以及为玩家参与度预测提供相关诊断解释的问题。其解决方案的关键在于提出一种两阶段深度神经网络架构CognitionNet,该架构通过在潜在空间中挖掘游戏行为作为聚类表示,并利用监督分类目标聚合这些微模式以发现玩家的持续游戏风格,从而实现对玩家心理驱动决策和策略的揭示。

链接: https://arxiv.org/abs/2505.00325
作者: Rukma Talwadker,Surajit Chakrabarty,Aditya Pareek,Tridib Mukherjee,Deepak Saini
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Games are one of the safest sources of realizing self-esteem and relaxation at the same time. An online gaming platform typically has massive data coming in, e.g., in-game actions, player moves, clickstreams, transactions etc. It is rather interesting, as something as simple as data on gaming moves can help create a psychological imprint of the user at that moment, based on her impulsive reactions and response to a situation in the game. Mining this knowledge can: (a) immediately help better explain observed and predicted player behavior; and (b) consequently propel deeper understanding towards players' experience, growth and protection. To this effect, we focus on discovery of the "game behaviours" as micro-patterns formed by continuous sequences of games and the persistent "play styles" of the players as a sequence of such sequences on an online skill gaming platform for Rummy. We propose a two stage deep neural network, CognitionNet. The first stage focuses on mining game behaviours as cluster representations in a latent space while the second aggregates over these micro patterns to discover play styles via a supervised classification objective around player engagement. The dual objective allows CognitionNet to reveal several player psychology inspired decision making and tactics. To our knowledge, this is the first and one-of-its-kind research to fully automate the discovery of: (i) player psychology and game tactics from telemetry data; and (ii) relevant diagnostic explanations to players' engagement predictions. The collaborative training of the two networks with differential input dimensions is enabled using a novel formulation of "bridge loss". The network plays a pivotal role in obtaining homogeneous and consistent play style definitions and significantly outperforms the SOTA baselines wherever applicable.
zh

[AI-30] AI2-Active Safety: AI-enabled Interaction-aware Active Safety Analysis with Vehicle Dynamics

【速读】:该论文旨在解决复杂交通环境中车辆群体交互对主动安全分析的影响问题,特别是在高速公路上的多智能体协同行为与不确定性带来的安全评估挑战。解决方案的关键在于构建一个融合车辆动力学建模与概率轨迹预测的框架,其中采用改进的自行车模型(bicycle model)考虑道路坡度以精确描述车辆动态,并结合基于超图的AI模型预测周围交通的随机轨迹,最终通过求解随机常微分方程在三维路面上推导出车辆间距,从而生成高保真代理安全指标如时间到碰撞(time-to-collision, TTC)。

链接: https://arxiv.org/abs/2505.00322
作者: Keshu Wu,Zihao Li,Sixu Li,Xinyue Ye,Dominique Lord,Yang Zhou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces an AI-enabled, interaction-aware active safety analysis framework that accounts for groupwise vehicle interactions. Specifically, the framework employs a bicycle model, augmented with road gradient considerations, to accurately capture vehicle dynamics. In parallel, a hypergraph-based AI model is developed to predict probabilistic trajectories of ambient traffic. By integrating these two components, the framework derives vehicle intra-spacing over a 3D road surface as the solution of a stochastic ordinary differential equation, yielding high-fidelity surrogate safety measures such as time-to-collision (TTC). To demonstrate its effectiveness, the framework is analyzed using stochastic numerical methods comprising 4th-order Runge-Kutta integration and AI inference, generating probability-weighted high-fidelity TTC (HF-TTC) distributions that reflect complex multi-agent maneuvers and behavioral uncertainties. Evaluated with HF-TTC against traditional constant-velocity TTC and non-interaction-aware approaches on highway datasets, the proposed framework offers a systematic methodology for active safety analysis with enhanced potential for improving safety perception in complex traffic environments.
zh
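
【代码示意】框架把车间距写成随机 ODE,并用 4 阶 Runge-Kutta 加蒙特卡洛采样得到 TTC 分布。以下为一维相对运动简化下的示意(动力学与噪声分布均为假设,远简于论文的 3D 路面设定):

```python
import numpy as np

def rk4_step(f, s, dt):
    """经典 4 阶 Runge-Kutta 单步积分。"""
    k1 = f(s)
    k2 = f(s + 0.5 * dt * k1)
    k3 = f(s + 0.5 * dt * k2)
    k4 = f(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def sample_ttc(gap0=20.0, v_rel0=-2.0, dt=0.1, horizon=30.0, n_samples=500):
    """蒙特卡洛:对相对加速度加高斯扰动,积分间距,统计 TTC 分布(示意)。"""
    rng = np.random.default_rng(0)
    ttcs = []
    for _ in range(n_samples):
        a_noise = rng.normal(0.0, 0.3)  # 行为不确定性(假设)

        def dyn(s):  # s = [gap, v_rel]
            return np.array([s[1], a_noise])

        s, t = np.array([gap0, v_rel0]), 0.0
        while t < horizon and s[0] > 0:
            s = rk4_step(dyn, s, dt)
            t += dt
        if s[0] <= 0:  # 间距归零视为碰撞,记录碰撞时刻
            ttcs.append(t)
    return np.array(ttcs)

ttc = sample_ttc()
if len(ttc):
    print(f"碰撞样本比例 {len(ttc) / 500:.2f}, HF-TTC 均值 {ttc.mean():.2f}s")
```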

[AI-31] Surrogate modeling of Cellular-Potts Agent -Based Models as a segmentation task using the U-Net neural network architecture

【速读】:该论文试图解决 Cellular-Potts 模型(CPM)在模拟复杂多细胞生物系统时计算成本过高的问题,其关键解决方案是开发一种基于 U-Net 架构的卷积神经网络(CNN)代理模型,该模型能够预测100个蒙特卡洛步(MCS)的模拟结果,从而将仿真评估速度提升590倍。该代理模型有效捕捉了原始CPM模型中出现的血管生成等涌现行为,展示了深度学习作为高效代理模型在加速CPM仿真的潜力。

链接: https://arxiv.org/abs/2505.00316
作者: Tien Comlekoglu,J. Quetzalcóatl Toledo-Marín,Tina Comlekoglu,Douglas W. DeSimone,Shayn M. Peirce,Geoffrey Fox,James A. Glazier
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:The Cellular-Potts model is a powerful and ubiquitous framework for developing computational models for simulating complex multicellular biological systems. Cellular-Potts models (CPMs) are often computationally expensive due to the explicit modeling of interactions among large numbers of individual model agents and diffusive fields described by partial differential equations (PDEs). In this work, we develop a convolutional neural network (CNN) surrogate model using a U-Net architecture that accounts for periodic boundary conditions. We use this model to accelerate the evaluation of a mechanistic CPM previously used to investigate in vitro vasculogenesis. The surrogate model was trained to predict 100 computational steps ahead (Monte-Carlo steps, MCS), accelerating simulation evaluations by a factor of 590 times compared to CPM code execution. Over multiple recursive evaluations, our model effectively captures the emergent behaviors demonstrated by the original Cellular-Potts model, such as vessel sprouting, extension and anastomosis, and contraction of vascular lacunae. This approach demonstrates the potential for deep learning to serve as efficient surrogate models for CPM simulations, enabling faster evaluation of computationally expensive CPMs of biological processes at greater spatial and temporal scales.
zh
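
【代码示意】该代理模型的一个实现要点是卷积需满足周期性边界条件,在 PyTorch 中可用 padding_mode="circular" 实现。以下为带周期填充的极简 U-Net 骨架(通道数与深度均为假设,远小于论文模型):

```python
import torch
import torch.nn as nn

class PeriodicConvBlock(nn.Module):
    """circular padding 的卷积块:特征图在边界处周期延拓,
    与 Cellular-Potts 模拟的周期性边界条件保持一致。"""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1, padding_mode="circular"),
            nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1, padding_mode="circular"),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.block(x)

class TinyUNet(nn.Module):
    """极简 U-Net:一次下采样 + 跳连,输入当前格点状态,预测 100 MCS 后的状态。"""
    def __init__(self, c=1):
        super().__init__()
        self.enc = PeriodicConvBlock(c, 16)
        self.down = nn.MaxPool2d(2)
        self.mid = PeriodicConvBlock(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = PeriodicConvBlock(32, 16)
        self.head = nn.Conv2d(16, c, 1)

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        u = self.up(m)
        return self.head(self.dec(torch.cat([e, u], dim=1)))

net = TinyUNet()
print(net(torch.randn(2, 1, 64, 64)).shape)  # torch.Size([2, 1, 64, 64])
```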

[AI-32] Multi-Hierarchical Fine-Grained Feature Mapping Driven by Feature Contribution for Molecular Odor Prediction

【速读】:该论文旨在解决分子气味预测中因传统方法依赖基础描述符或手工设计的指纹而导致的表达能力不足以及严重类别不平衡问题。其解决方案的关键在于提出一种基于特征贡献驱动的分层多特征映射网络(HMFNet),该网络包含细粒度的局部多层级特征提取模块(LMFE)和全局多层级特征提取模块(GMFE),分别用于捕捉原子级别的细节特征和分子图拓扑的全局特征,并通过化学信息损失(CIL)缓解类别不平衡问题,从而提升模型在气味预测任务中的性能。

链接: https://arxiv.org/abs/2505.00290
作者: Hong Xin Xie,Jian De Sun,Fan Fu Xue,Zi Fei Han,Shan Shan Feng,Qi Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Molecular odor prediction is the process of using a molecule’s structure to predict its smell. While accurate prediction remains challenging, AI models can suggest potential odors. Existing methods, however, often rely on basic descriptors or handcrafted fingerprints, which lack expressive power and hinder effective learning. Furthermore, these methods suffer from severe class imbalance, limiting the training effectiveness of AI models. To address these challenges, we propose a Feature Contribution-driven Hierarchical Multi-Feature Mapping Network (HMFNet). Specifically, we introduce a fine-grained, Local Multi-Hierarchy Feature Extraction module (LMFE) that performs deep feature extraction at the atomic level, capturing detailed features crucial for odor prediction. To enhance the extraction of discriminative atomic features, we integrate a Harmonic Modulated Feature Mapping (HMFM). This module dynamically learns feature importance and frequency modulation, improving the model’s capability to capture relevant patterns. Additionally, a Global Multi-Hierarchy Feature Extraction module (GMFE) is designed to learn global features from the molecular graph topology, enabling the model to fully leverage global information and enhance its discriminative power for odor prediction. To further mitigate the issue of class imbalance, we propose a Chemically-Informed Loss (CIL). Experimental results demonstrate that our approach significantly improves performance across various deep learning models, highlighting its potential to advance molecular structure representation and accelerate the development of AI-driven technologies.
zh

[AI-33] LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving

【速读】:该论文旨在解决如何充分利用生成式 AI (Generative AI) 在端到端自动驾驶中的潜力,以实现安全可靠的车辆控制这一开放性研究挑战。其解决方案的关键在于提出 LightEMMA,一个轻量级的端到端多模态模型,用于自动驾驶任务,该模型提供了一个统一的、基于 Vision-Language Models (VLMs) 的框架,无需特定定制即可方便地集成和评估不断演进的最新商业和开源模型。

链接: https://arxiv.org/abs/2505.00284
作者: Zhijie Qiao,Haowei Li,Zhong Cao,Henry X. Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated significant potential for end-to-end autonomous driving. However, fully exploiting their capabilities for safe and reliable vehicle control remains an open research challenge. To systematically examine advances and limitations of VLMs in driving tasks, we introduce LightEMMA, a Lightweight End-to-End Multimodal Model for Autonomous driving. LightEMMA provides a unified, VLM-based autonomous driving framework without ad hoc customizations, enabling easy integration and evaluation of evolving state-of-the-art commercial and open-source models. We construct twelve autonomous driving agents using various VLMs and evaluate their performance on the nuScenes prediction task, comprehensively assessing metrics such as inference time, computational cost, and predictive accuracy. Illustrative examples highlight that, despite their strong scenario interpretation capabilities, VLMs’ practical performance in autonomous driving tasks remains concerning, emphasizing the need for further improvements. The code is available at this https URL.
zh

[AI-34] DeCo: Defect-Aware Modeling with Contrasting Matching for Optimizing Task Assignment in Online IC Testing

【速读】:该论文旨在解决集成电路(Integrated Circuit, IC)测试中缺陷识别与任务分配效率低下的问题,特别是在缺陷数据稀缺的情况下,如何有效利用缺陷特征、历史故障信息及工程师经验来优化任务分配。其解决方案的关键在于提出DeCo方法,通过构建缺陷感知图(defect-aware graph)捕捉缺陷间的共现关系,并结合局部与全局结构建模生成工程师和任务的缺陷感知表示,最终通过对比机制实现基于技能水平和工作负载的任务分配,从而提升任务处理的成功率与工作分配的均衡性。

链接: https://arxiv.org/abs/2505.00278
作者: Lo Pang-Yun Ting,Yu-Hao Chiang,Yi-Tung Tsai,Hsu-Chao Lai,Kun-Ta Chuang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the semiconductor industry, integrated circuit (IC) processes play a vital role, as the rising complexity and market expectations necessitate improvements in yield. Identifying IC defects and assigning IC testing tasks to the right engineers improves efficiency and reduces losses. While current studies emphasize fault localization or defect classification, they overlook the integration of defect characteristics, historical failures, and the insights from engineer expertise, which restrains their effectiveness in improving IC handling. To leverage AI for these challenges, we propose DeCo, an innovative approach for optimizing task assignment in IC testing. DeCo constructs a novel defect-aware graph from IC testing reports, capturing co-failure relationships to enhance defect differentiation, even with scarce defect data. Additionally, it formulates defect-aware representations for engineers and tasks, reinforced by local and global structure modeling on the defect-aware graph. Finally, a contrasting-based assignment mechanism pairs testing tasks with QA engineers by considering their skill level and current workload, thus promoting an equitable and efficient job dispatch. Experiments on a real-world dataset demonstrate that DeCo achieves the highest task-handling success rates in different scenarios, exceeding 80%, while also maintaining balanced workloads on both scarce or expanded defect data. Moreover, case studies reveal that DeCo can assign tasks to potentially capable engineers, even for their unfamiliar defects, highlighting its potential as an AI-driven solution for the real-world IC failure analysis and task handling.
zh

[AI-35] LLM -Based Threat Detection and Prevention Framework for IoT Ecosystems

【速读】:该论文旨在解决物联网(IoT)日益复杂和规模扩大所带来的安全问题。其解决方案的关键在于提出一种基于大型语言模型(Large Language Model, LLM)的框架,通过在物联网特定数据集(如IoT-23和TON_IoT)上微调轻量级LLM,实现对异常行为的实时检测以及针对资源受限设备优化的上下文感知自动化缓解策略。该框架采用模块化Docker部署方式,确保了在不同网络条件下的可扩展性和可重复性。实验结果表明,该方法在检测准确性、响应延迟和资源效率方面均优于传统安全方法。

链接: https://arxiv.org/abs/2505.00240
作者: Yazan Otoum,Arghavan Asad,Amiya Nayak
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Preprint version; submitted for academic peer review

点击查看摘要

Abstract:The increasing complexity and scale of the Internet of Things (IoT) have made security a critical concern. This paper presents a novel Large Language Model (LLM)-based framework for comprehensive threat detection and prevention in IoT environments. The system integrates lightweight LLMs fine-tuned on IoT-specific datasets (IoT-23, TON_IoT) for real-time anomaly detection and automated, context-aware mitigation strategies optimized for resource-constrained devices. A modular Docker-based deployment enables scalable and reproducible evaluation across diverse network conditions. Experimental results in simulated IoT environments demonstrate significant improvements in detection accuracy, response latency, and resource efficiency over traditional security methods. The proposed framework highlights the potential of LLM-driven, autonomous security solutions for future IoT ecosystems.
zh

[AI-36] Scaling On-Device GPU Inference for Large Generative Models CVPR2025

【速读】:该论文旨在解决在设备端(on-device)执行生成式 AI (Generative AI) 工作负载的挑战,特别是针对现有设备端生成式 AI 模型参数量有限的问题。其解决方案的关键在于提出 ML Drift——一个优化的框架,通过扩展最先进的 GPU 加速推理引擎的能力,实现参数量比现有模型多 10 到 100 倍的生成式 AI 工作负载的设备端执行,并解决了跨 GPU API 开发中的复杂工程问题,确保了在移动和桌面/笔记本平台上的广泛兼容性。

链接: https://arxiv.org/abs/2505.00232
作者: Jiuqiang Tang,Raman Sarokin,Ekaterina Ignasheva,Grant Jensen,Lin Chen,Juhyun Lee,Andrei Kulik,Matthias Grundmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: to be published in CVPR 2025 Workshop on Efficient and On-Device Generation (EDGE)

点击查看摘要

Abstract:Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, we present ML Drift–an optimized framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines. ML Drift enables on-device execution of generative AI workloads which contain 10 to 100x more parameters than existing on-device generative AI models. ML Drift addresses intricate engineering challenges associated with cross-GPU API development, and ensures broad compatibility across mobile and desktop/laptop platforms, thereby facilitating the deployment of significantly more complex models on resource-constrained devices. Our GPU-accelerated ML/AI inference engine achieves an order-of-magnitude performance improvement relative to existing open-source GPU inference engines.
zh

[AI-37] Predicting Estimated Times of Restoration for Electrical Outages Using Longitudinal Tabular Transformers

【速读】:该论文旨在解决在自然灾害期间,电力供应商难以提供精确的恢复时间预估(Estimated Time of Restoration, ETR)的问题。传统方法依赖人工评估或统计方法,难以满足可靠和可操作预测的需求。论文提出的解决方案是采用纵向表格变压器(Longitudinal Tabular Transformer, LTT)模型,通过利用历史停电事件数据及事件的序列更新来提升ETR预测的准确性。该模型的关键在于结合时序信息以增强预测性能,并通过客户导向的回归指标与可解释性技术,确保模型结果符合实际客户需求并提高透明度。

链接: https://arxiv.org/abs/2505.00225
作者: Bogireddy Sai Prasanna Teja,Valliappan Muthukaruppan,Carls Benjamin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As climate variability increases, the ability of utility providers to deliver precise Estimated Times of Restoration (ETR) during natural disasters has become increasingly critical. Accurate and timely ETRs are essential for enabling customer preparedness during extended power outages, where informed decision-making can be crucial, particularly in severe weather conditions. Nonetheless, prevailing utility practices predominantly depend on manual assessments or traditional statistical methods, which often fail to achieve the level of precision required for reliable and actionable predictions. To address these limitations, we propose a Longitudinal Tabular Transformer (LTT) model that leverages historical outage event data along with sequential updates of these events to improve the accuracy of ETR predictions. The model's performance was evaluated over 34,000 storm-related outage events from three major utility companies, collectively serving over 3 million customers over a 2-year period. Results demonstrate that the LTT model improves the Customer Satisfaction Impact (CSI) metric by an average of 19.08% (p < 0.001) compared to existing methods. Additionally, we introduce customer-informed regression metrics that align model evaluation with real-world satisfaction, ensuring the outcomes resonate with customer expectations. Furthermore, we employ interpretability techniques to analyze the temporal significance of incorporating sequential updates in modeling outage events and to identify the contributions of predictive features to a given ETR. This comprehensive approach not only improves predictive accuracy but also enhances transparency, fostering greater trust in the model's capabilities.
zh

[AI-38] AI-Enhanced Automatic Design of Efficient Underwater Gliders

【速读】:该论文旨在解决自主水下滑翔机设计中形状多样性受限的问题,这一限制主要源于传统设计工具对人工试错的依赖。其解决方案的关键在于提出一种基于人工智能的自动化计算框架,该框架通过联合优化形状与控制信号,结合降阶几何表示和可微分神经网络流体代理模型,实现水下机器人的高效设计与性能评估,从而生成具有复杂船体结构的滑翔机。

链接: https://arxiv.org/abs/2505.00222
作者: Peter Yichen Chen,Pingchuan Ma,Niklas Hagemann,John Romanishin,Wei Wang,Daniela Rus,Wojciech Matusik
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:The development of novel autonomous underwater gliders has been hindered by limited shape diversity, primarily due to the reliance on traditional design tools that depend heavily on manual trial and error. Building an automated design framework is challenging due to the complexities of representing glider shapes and the high computational costs associated with modeling complex solid-fluid interactions. In this work, we introduce an AI-enhanced automated computational framework designed to overcome these limitations by enabling the creation of underwater robots with non-trivial hull shapes. Our approach involves an algorithm that co-optimizes both shape and control signals, utilizing a reduced-order geometry representation and a differentiable neural-network-based fluid surrogate model. This end-to-end design workflow facilitates rapid iteration and evaluation of hydrodynamic performance, leading to the discovery of optimal and complex hull shapes across various control settings. We validate our method through wind tunnel experiments and swimming pool gliding tests, demonstrating that our computationally designed gliders surpass manually designed counterparts in terms of energy efficiency. By addressing challenges in efficient shape representation and neural fluid surrogate models, our work paves the way for the development of highly efficient underwater gliders, with implications for long-range ocean exploration and environmental monitoring.
zh

[AI-39] Online Federation For Mixtures of Proprietary Agents with Black-Box Encoders

【速读】:该论文试图解决在构建混合专家型集成模型时,由于生成式 AI (Generative AI) 和特征编码器的专有性导致的内部参数和架构不可见问题,这限制了用户对每个AI进行优化的能力。解决方案的关键在于将问题建模为一种非竞争性的博弈论框架,其中每个专有AI作为竞争代理,而用户则作为中央规划者以协调这些竞争AI的集成。研究证明了在线设置下唯一纳什均衡的存在性,并通过引入反馈机制计算出闭式解。此外,提出了一种去中心化的联邦学习算法,使每个代理在其本地设备上优化自身结构,而无需向其他代理泄露任何内部结构。

链接: https://arxiv.org/abs/2505.00216
作者: Xuwei Yang,Fatemeh Tavakoli,David B. Emerson,Anastasis Kratsios
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 47 pages, 16 figures, 7 tables

点击查看摘要

Abstract:Most industry-standard generative AIs and feature encoders are proprietary, offering only black-box access: their outputs are observable, but their internal parameters and architectures remain hidden from the end-user. This black-box access is especially limiting when constructing mixture-of-expert type ensemble models, since the user cannot optimize each proprietary AI's internal parameters. Our problem naturally lends itself to a non-competitive game-theoretic lens, where each proprietary AI (agent) is inherently competing against the other AI agents, with this competition arising naturally due to the obliviousness of the AIs to each other's internal structure. In contrast, the user acts as a central planner trying to synchronize the ensemble of competing AIs. We show the existence of the unique Nash equilibrium in the online setting, which we even compute in closed form by eliciting a feedback mechanism between any given time series and the sequence generated by each (proprietary) AI agent. Our solution is implemented as a decentralized, federated-learning algorithm in which each agent optimizes their structure locally on their machine without ever releasing any internal structure to the others. We obtain refined expressions for pre-trained models such as transformers, random feature models, and echo-state networks. Our "proprietary federated learning" algorithm is implemented on a range of real-world and synthetic time-series benchmarks. It achieves orders-of-magnitude improvements in predictive accuracy over natural benchmarks, of which there are surprisingly few, as this natural problem remains largely unexplored.
zh

[AI-40] RAIL in the Wild: Operationalizing Responsible AI Evaluation Using Anthropics Value Dataset

【速读】:该论文试图解决如何在实际应用中评估大型语言模型(Large Language Models, LLMs)的伦理行为问题,现有AI伦理框架虽强调公平性、透明性和问责性,但缺乏可操作的评估方法。解决方案的关键在于引入负责任的人工智能实验室(Responsible AI Labs, RAIL)框架,该框架包含八个可量化的维度,用于系统性评估LLMs的规范行为,并通过将价值观映射到RAIL维度并计算合成评分,提供对LLMs伦理行为的深入洞察。

链接: https://arxiv.org/abs/2505.00204
作者: Sumit Verma,Pritam Prasun,Arpit Jaiswal,Pritish Kumar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As AI systems become embedded in real-world applications, ensuring they meet ethical standards is crucial. While existing AI ethics frameworks emphasize fairness, transparency, and accountability, they often lack actionable evaluation methods. This paper introduces a systematic approach using the Responsible AI Labs (RAIL) framework, which includes eight measurable dimensions to assess the normative behavior of large language models (LLMs). We apply this framework to Anthropic’s “Values in the Wild” dataset, containing over 308,000 anonymized conversations with Claude and more than 3,000 annotated value expressions. Our study maps these values to RAIL dimensions, computes synthetic scores, and provides insights into the ethical behavior of LLMs in real-world use.
zh

[AI-41] Empirical Evaluation of Progressive Coding for Sparse Autoencoders

【速读】:该论文旨在解决稀疏自编码器(Sparse Autoencoder, SAE)在计算成本上的高消耗问题,尤其是在需要多个不同规模的SAEs时。其解决方案的关键在于引入马特罗什卡稀疏自编码器(Matryoshka SAEs),通过联合训练嵌套的SAEs结构,相较于基于子集剪枝的渐进编码方法,在语言建模任务中表现出更低的重建损失、更小的语言建模损失以及更高的表征相似性。然而,剪枝后的原始SAEs在可解释性方面更具优势。

链接: https://arxiv.org/abs/2505.00190
作者: Hans Peter,Anders Søgaard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse autoencoders (SAEs) (Bricken et al., 2023; Gao et al., 2024) rely on dictionary learning to extract interpretable features from neural networks at scale in an unsupervised manner, with applications to representation engineering and information retrieval. SAEs are, however, computationally expensive (Lieberum et al., 2024), especially when multiple SAEs of different sizes are needed. We show that dictionary importance in vanilla SAEs follows a power law. We compare progressive coding based on subset pruning of SAEs to jointly training nested SAEs, or so-called Matryoshka SAEs (Bussmann et al., 2024; Nabeshima, 2024), on a language modeling task. We show Matryoshka SAEs exhibit lower reconstruction loss and recaptured language modeling loss, as well as higher representational similarity. Pruned vanilla SAEs are more interpretable, however. We discuss the origins and implications of this trade-off.
zh
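
【代码示意】Matryoshka SAE 的联合训练思想是:同一批激活要能被字典的多个嵌套前缀分别重建,对各前缀的重建损失求和。以下为该损失的最小示意(前缀划分与稀疏惩罚系数为假设,非论文原实现):

```python
import torch
import torch.nn as nn

class MatryoshkaSAE(nn.Module):
    """嵌套稀疏自编码器示意:对字典的多个前缀同时计算重建 + L1 稀疏损失。"""
    def __init__(self, d_model=64, d_dict=512, prefixes=(64, 128, 256, 512)):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model, bias=False)
        self.prefixes = prefixes

    def forward(self, x, l1=1e-3):
        z = torch.relu(self.enc(x))  # 稀疏编码
        loss = 0.0
        for k in self.prefixes:
            z_k = torch.zeros_like(z)
            z_k[:, :k] = z[:, :k]    # 只用前 k 个字典原子重建
            x_hat = self.dec(z_k)
            loss = loss + nn.functional.mse_loss(x_hat, x) + l1 * z_k.abs().mean()
        return loss / len(self.prefixes)

sae = MatryoshkaSAE()
acts = torch.randn(32, 64)  # 假设为某层 LLM 激活
print(sae(acts).item())
```

训练后,取前 k 列编码器权重与对应解码器行即得到一个规模为 k 的 SAE,这正是"渐进编码"的来源。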

[AI-42] Real-World Gaps in AI Governance Research

【速读】:该论文试图解决当前生成式 AI (Generative AI) 研究中企业与学术机构在安全性和可靠性研究上的分布不均及部署阶段问题关注不足的问题。其解决方案的关键在于通过扩大外部研究人员对部署数据的访问权限以及建立系统化的市场中 AI 行为可观测性,以弥补高风险部署领域中的研究缺口,并缓解因企业研究集中化导致的知识匮乏。

链接: https://arxiv.org/abs/2505.00174
作者: Ilan Strauss,Isobel Moure,Tim O’Reilly,Sruly Rosenblat
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Drawing on 1,178 safety and reliability papers from 9,439 generative AI papers (January 2020 - March 2025), we compare research outputs of leading AI companies (Anthropic, Google DeepMind, Meta, Microsoft, and OpenAI) and AI universities (CMU, MIT, NYU, Stanford, UC Berkeley, and University of Washington). We find that corporate AI research increasingly concentrates on pre-deployment areas – model alignment and testing and evaluation – while attention to deployment-stage issues such as model bias has waned. Significant research gaps exist in high-risk deployment domains, including healthcare, finance, misinformation, persuasive and addictive features, hallucinations, and copyright. Without improved observability into deployed AI, growing corporate concentration could deepen knowledge deficits. We recommend expanding external researcher access to deployment data and systematic observability of in-market AI behaviors.
zh

[AI-43] First Order Logic with Fuzzy Semantics for Describing and Recognizing Nerves in Medical Images

【速读】:该论文试图解决医学图像中纤维束(尤其是神经)的描述与识别问题,其核心在于基于解剖学描述的纤维轨迹进行建模。解决方案的关键在于提出一种逻辑形式化方法,将解剖学知识转化为可计算的形式,结合模糊语义与一阶逻辑来处理解剖描述中固有的不精确性。通过定义一个表示空间实体、实体间关系及量词的语言,并利用模糊表示和关系满足度赋予语义,最终提出一种空间推理算法,用于从解剖和扩散磁共振图像中分割和识别神经,从而辅助外科手术规划。

链接: https://arxiv.org/abs/2505.00173
作者: Isabelle Bloch,Enzo Bonnot,Pietro Gori,Giammarco La Barbera,Sabine Sarnacki
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Logic (math.LO)
备注: Accepted for presentation at the FUZZ-IEEE 2025 conference

点击查看摘要

Abstract:This article deals with the description and recognition of fiber bundles, in particular nerves, in medical images, based on the anatomical description of the fiber trajectories. To this end, we propose a logical formalization of this anatomical knowledge. The intrinsically imprecise description of nerves, as found in anatomical textbooks, leads us to propose fuzzy semantics combined with first-order logic. We define a language representing spatial entities, relations between these entities and quantifiers. A formula in this language is then a formalization of the natural language description. The semantics are given by fuzzy representations in a concrete domain and satisfaction degrees of relations. Based on this formalization, a spatial reasoning algorithm is proposed for segmentation and recognition of nerves from anatomical and diffusion magnetic resonance images, which is illustrated on pelvic nerves in pediatric imaging, enabling surgeons to plan surgery.
zh
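
为便于理解文中“模糊语义 + 一阶逻辑”的形式化,下面给出一个极简的假设性示例:用 sigmoid 隶属度函数表示某个空间关系的满足程度,并用 Gödel t-范数表示合取。坐标轴约定与参数均为演示用假设,并非论文定义。

```python
import numpy as np

def mu_anterior(a_centroid, b_centroid, slope=2.0):
    # Satisfaction degree of "A is anterior to B": sigmoid membership on
    # the signed offset along the (assumed) anterior axis.
    dy = b_centroid[1] - a_centroid[1]
    return 1.0 / (1.0 + np.exp(-slope * dy))

def degree_and(mu1, mu2):
    # Goedel t-norm as the fuzzy conjunction of two satisfaction degrees.
    return min(mu1, mu2)

print(degree_and(mu_anterior((0.0, 0.0), (0.0, 1.0)), 0.8))
```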

[AI-44] Attention-enabled Explainable AI for Bladder Cancer Recurrence Prediction

【速读】:该论文试图解决非肌层浸润性膀胱癌(Non-muscle-invasive bladder cancer, NMIBC)复发预测中存在的准确性不足和个性化分析缺失的问题。现有临床预测工具存在根本性缺陷,常高估复发风险且无法为患者管理提供个性化见解。其解决方案的关键在于提出一种可解释的深度学习框架,该框架结合向量嵌入(vector embeddings)和注意力机制(attention mechanisms),通过捕捉患者特征与复发风险之间的复杂关系,提升预测性能,并为临床医生提供基于特征注意力的患者级解释,同时识别出如手术时长和住院时间等此前未被考虑的重要影响因素。

链接: https://arxiv.org/abs/2505.00171
作者: Saram Abbas,Naeem Soomro,Rishad Shafik,Rakesh Heer,Kabita Adhikari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures, Accepted to be presented at the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2025)

点击查看摘要

Abstract:Non-muscle-invasive bladder cancer (NMIBC) is a relentless challenge in oncology, with recurrence rates soaring as high as 70-80%. Each recurrence triggers a cascade of invasive procedures, lifelong surveillance, and escalating healthcare costs - affecting 460,000 individuals worldwide. However, existing clinical prediction tools remain fundamentally flawed, often overestimating recurrence risk and failing to provide personalized insights for patient management. In this work, we propose an interpretable deep learning framework that integrates vector embeddings and attention mechanisms to improve NMIBC recurrence prediction performance. We incorporate vector embeddings for categorical variables such as smoking status and intravesical treatments, allowing the model to capture complex relationships between patient attributes and recurrence risk. These embeddings provide a richer representation of the data, enabling improved feature interactions and enhancing prediction performance. Our approach not only enhances performance but also provides clinicians with patient-specific insights by highlighting the most influential features contributing to recurrence risk for each patient. Our model achieves accuracy of 70% with tabular data, outperforming conventional statistical methods while providing clinician-friendly patient-level explanations through feature attention. Unlike previous studies, our approach identifies new important factors influencing recurrence, such as surgical duration and hospital stay, which had not been considered in existing NMIBC prediction models.
zh
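
下面是对摘要所述“类别变量嵌入 + 特征注意力”设计的一个假设性 PyTorch 草图:类别字段使用可学习嵌入,数值字段投影到同一空间,softmax 注意力同时给出风险评分与逐患者的特征重要性。字段基数与维度均为演示假设,并非论文模型。

```python
import torch
import torch.nn as nn

class TabularAttention(nn.Module):
    def __init__(self, cat_cardinalities, n_numeric, d=16):
        super().__init__()
        self.embeds = nn.ModuleList(nn.Embedding(c, d) for c in cat_cardinalities)
        self.n_numeric = n_numeric
        self.num_proj = nn.Linear(1, d)    # shared projection per numeric field
        self.attn = nn.Linear(d, 1)
        self.out = nn.Linear(d, 1)

    def forward(self, x_cat, x_num):
        tokens = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        tokens += [self.num_proj(x_num[:, j:j + 1]) for j in range(self.n_numeric)]
        T = torch.stack(tokens, dim=1)                   # (batch, fields, d)
        w = torch.softmax(self.attn(T).squeeze(-1), 1)   # per-feature attention
        pooled = (w.unsqueeze(-1) * T).sum(dim=1)
        return torch.sigmoid(self.out(pooled)), w        # risk score, importances

# e.g. model = TabularAttention(cat_cardinalities=[3, 4], n_numeric=2)
```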

[AI-45] GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation

【速读】:该论文试图解决当前深度生成模型在3D分子结构生成评估中的关键问题,包括错误的价态定义、键级计算中的缺陷以及对与参考数据不一致的力场依赖。解决方案的关键在于重新审视GEOM-Drugs数据集,修正数据预处理中的问题,构建化学准确的价态表,并引入基于GFN2-xTB的几何和能量基准,从而提供更可靠的评估框架。

链接: https://arxiv.org/abs/2505.00169
作者: Filipp Nikitin,Ian Dunn,David Ryan Koes,Olexandr Isayev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep generative models have shown significant promise in generating valid 3D molecular structures, with the GEOM-Drugs dataset serving as a key benchmark. However, current evaluation protocols suffer from critical flaws, including incorrect valency definitions, bugs in bond order calculations, and reliance on force fields inconsistent with the reference data. In this work, we revisit GEOM-Drugs and propose a corrected evaluation framework: we identify and fix issues in data preprocessing, construct chemically accurate valency tables, and introduce a GFN2-xTB-based geometry and energy benchmark. We retrain and re-evaluate several leading models under this framework, providing updated performance metrics and practical recommendations for future benchmarking. Our results underscore the need for chemically rigorous evaluation practices in 3D molecular generation. Our recommended evaluation methods and GEOM-Drugs processing scripts are available at this https URL.
zh

[AI-46] GPRat: Gaussian Process Regression with Asynchronous Tasks

【速读】:该论文试图解决在基于Python的AI应用中,由于低层后端并行化导致的性能和扩展性下降问题。其解决方案的关键在于将基于异步运行时模型HPX的任务化C++代码通过pybind11绑定到高层Python API,从而结合了Python库的易用性与异步运行时系统的性能和可扩展性。

链接: https://arxiv.org/abs/2505.00136
作者: Maksim Helmann,Alexander Strack,Dirk Pflüger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:Python is the de-facto language for software development in artificial intelligence (AI). Commonly used libraries, such as PyTorch and TensorFlow, rely on parallelization built into their BLAS backends to achieve speedup on CPUs. However, only applying parallelization in a low-level backend can lead to performance and scaling degradation. In this work, we present a novel way of binding task-based C++ code built on the asynchronous runtime model HPX to a high-level Python API using pybind11. We develop a parallel Gaussian process (GP) library as an application. The resulting Python library GPRat combines the ease of use of commonly available GP libraries with the performance and scalability of asynchronous runtime systems. We evaluate the performance on a mass-spring-damper system, a standard benchmark from control theory, for varying numbers of regressors (features). The results show almost no binding overhead when binding the asynchronous HPX code using pybind11. Compared to GPyTorch and GPflow, GPRat shows superior scaling on up to 64 cores on an AMD EPYC 7742 CPU for training. Furthermore, our library achieves a prediction speedup of 7.63 over GPyTorch and 25.25 over GPflow. If we increase the number of features from eight to 128, we observe speedups of 29.62 and 21.19, respectively. These results showcase the potential of using asynchronous tasks within Python-based AI applications.
zh

[AI-47] Evaluating the AI-Lab Intervention: Impact on Student Perception and Use of Generative AI in Early Undergraduate Computer Science Courses

【速读】:该论文试图解决生成式 AI (Generative AI, GenAI) 在计算机科学教育中的应用效果问题,特别是其对学生学习、技能发展和认知的影响尚不明确,同时存在对过度依赖的担忧与缺乏结构化指导工具使用的研究空白。解决方案的关键在于设计并实施一个名为“AI-Lab”的干预措施,该措施强调引导性支架(guided scaffolding)和有意识的参与(mindful engagement),旨在促进学生更深入、反思性地将 AI 工具融入课程学习中,从而提升其在概念理解、调试和作业问题上的自信与开放性,并改善调试使用模式。

链接: https://arxiv.org/abs/2505.00100
作者: Ethan Dickey,Andres Bejarano,Rhianna Kuperus,Bárbara Fagundes
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 18 pages, 5 figures, 17 tables, submitted for publication

点击查看摘要

Abstract:Generative AI (GenAI) is rapidly entering computer science education, yet its effects on student learning, skill development, and perceptions remain underexplored. Concerns about overreliance coexist with a gap in research on structured scaffolding to guide tool use in formal courses. This study examines the impact of a dedicated “AI-Lab” intervention – emphasizing guided scaffolding and mindful engagement – on undergraduate students in Data Structures and Algorithms, Competitive Programming, and first-year engineering courses at Purdue University. Over three semesters, we integrated AI-Lab modules into four mandatory and elective courses, yielding 831 matched pre- and post-intervention survey responses, alongside focus group discussions. Employing a mixed-methods approach, we analyzed quantitative shifts in usage patterns and attitudes as well as qualitative narratives of student experiences. While the overall frequency of GenAI usage for homework or programming projects remained largely stable, we observed large effect sizes in comfort and openness across conceptual, debugging, and homework problems. Notably, usage patterns for debugging also shifted statistically significantly, reflecting students’ more mindful and deliberate approach. Focus group discussions corroborated these results, suggesting that the intervention “bridged the gap” between naive GenAI usage and more nuanced, reflective integration of AI tools into coursework, ultimately heightening students’ awareness of their own skill development. These findings suggest that structured, scaffolded interventions can enable students to harness GenAI’s benefits without undermining essential competencies. We offer evidence-based recommendations for educators seeking to integrate GenAI responsibly into computing curricula and identify avenues for future research on GenAI-supported pedagogy.
zh

[AI-48] CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios ITSC2025

【速读】:该论文旨在解决异构无人机蜂群在城市复杂环境中执行任务时面临的系统设计挑战,包括高效的语义理解、灵活的任务规划以及根据环境变化和任务需求动态调整协调策略的能力。其解决方案的关键在于提出一种基于协调场的智能体系统(Coordination field agentic system),该系统利用大型语言模型(LLMs)将高层人类指令转化为可执行命令,并通过协调场机制引导无人机运动与任务选择,实现去中心化和自适应的应急任务分配。

链接: https://arxiv.org/abs/2505.00091
作者: Tengchao Zhang,Yonglin Tian,Fei Lin,Jun Huang,Rui Qin,Fei-Yue Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Submitted ITSC 2025

点击查看摘要

Abstract:With the increasing demand for heterogeneous Unmanned Aerial Vehicle (UAV) swarms to perform complex tasks in urban environments, system design now faces major challenges, including efficient semantic understanding, flexible task planning, and the ability to dynamically adjust coordination strategies in response to evolving environmental conditions and continuously changing task requirements. To address the limitations of existing approaches, this paper proposes a coordination-field agentic system for coordinating heterogeneous UAV swarms in complex urban scenarios. In this system, large language models (LLMs) are responsible for interpreting high-level human instructions and converting them into executable commands for the UAV swarms, such as patrol and target tracking. Subsequently, a coordination field mechanism is proposed to guide UAV motion and task selection, enabling decentralized and adaptive allocation of emergent tasks. A total of 50 rounds of comparative testing were conducted across different models in a 2D simulation space to evaluate their performance. Experimental results demonstrate that the proposed system achieves superior performance in terms of task coverage, response time, and adaptability to dynamic changes.
zh

[AI-49] Position Paper: Towards Open Complex Human-AI Agents Collaboration System for Problem-Solving and Knowledge Management

【速读】:该论文试图解决当前人类与AI代理协作研究中缺乏统一理论框架的问题,这一框架难以将多样化的研究工作系统性地整合,尤其是在处理开放性和复杂性任务时。解决方案的关键在于提出一种新的概念架构——分层探索-利用网络(Hierarchical Exploration-Exploitation Net),该架构通过系统性地连接多智能体协调、知识管理、控制反馈回路及高层控制机制,将现有的符号AI技术、基于连接主义的大型语言模型代理以及混合组织实践映射到该框架中,从而促进对传统方法的改进并激发融合定性和定量范式的新兴研究。

链接: https://arxiv.org/abs/2505.00018
作者: Ju Wu,Calvin K.L. Or
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This position paper critically surveys a broad spectrum of recent empirical developments on human-AI agents collaboration, highlighting both their technical achievements and persistent gaps. We observe a lack of a unifying theoretical framework that can coherently integrate these varied studies, especially when tackling open-ended, complex tasks. To address this, we propose a novel conceptual architecture: one that systematically interlinks the technical details of multi-agent coordination, knowledge management, cybernetic feedback loops, and higher-level control mechanisms. By mapping existing contributions, from symbolic AI techniques and connectionist LLM-based agents to hybrid organizational practices, onto this proposed framework (Hierarchical Exploration-Exploitation Net), our approach facilitates revision of legacy methods and inspires new work that fuses qualitative and quantitative paradigms. The paper’s structure allows it to be read from any section, serving equally as a critical review of technical implementations and as a forward-looking reference for designing or extending human-AI symbioses. Together, these insights offer a stepping stone toward deeper co-evolution of human cognition and AI capability.
zh

[AI-50] Learning to Learn with Quantum Optimization via Quantum Neural Networks

【速读】:该论文试图解决在当前噪声中等规模量子(NISQ)设备上,量子近似优化算法(QAOA)因能量景观崎岖和硬件噪声导致的参数优化困难及可扩展性不足的问题。解决方案的关键在于引入一种结合量子神经网络的量子元学习框架,具体采用量子长短期记忆(QLSTM)架构作为优化器,通过在较小图实例上训练QLSTM,使其能够快速泛化到更大、更复杂的问题,从而显著减少收敛所需的迭代次数。

链接: https://arxiv.org/abs/2505.00561
作者: Kuan-Cheng Chen,Hiromichi Matsuyama,Wei-Hao Huang
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Quantum Approximate Optimization Algorithms (QAOA) promise efficient solutions to classically intractable combinatorial optimization problems by harnessing shallow-depth quantum circuits. Yet, their performance and scalability often hinge on effective parameter optimization, which remains nontrivial due to rugged energy landscapes and hardware noise. In this work, we introduce a quantum meta-learning framework that combines quantum neural networks, specifically Quantum Long Short-Term Memory (QLSTM) architectures, with QAOA. By training the QLSTM optimizer on smaller graph instances, our approach rapidly generalizes to larger, more complex problems, substantially reducing the number of iterations required for convergence. Through comprehensive benchmarks on Max-Cut and Sherrington-Kirkpatrick model instances, we demonstrate that QLSTM-based optimizers converge faster and achieve higher approximation ratios compared to classical baselines, thereby offering a robust pathway toward scalable quantum optimization in the NISQ era.
zh
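
以下用一个经典 LSTM 充当 QLSTM 的替代(stand-in),示意元学习优化器的接口:读取 QAOA 参数与能量的历史轨迹,提出下一组角度。网络规模、输入格式以及以经典 LSTM 代替量子 LSTM 均为演示性假设。

```python
import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    # Classical stand-in for the QLSTM optimizer: each input step holds the
    # 2p QAOA angles plus the measured energy; the head proposes new angles.
    def __init__(self, p, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * p + 1, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, 2 * p)

    def forward(self, history):          # history: (batch, steps, 2p + 1)
        h, _ = self.lstm(history)
        return self.head(h[:, -1])       # next (gamma, beta) proposal
```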

[AI-51] On the Mechanistic Interpretability of Neural Networks for Causality in Bio-statistics

【速读】:该论文试图解决在生物统计学中,如何提高神经网络(Neural Networks, NNs)在因果推断中的可解释性问题,尤其是在高风险医疗应用中对其“黑箱”特性的信任与验证难题。解决方案的关键在于应用机制可解释性(Mechanistic Interpretability, MI)技术,通过揭示神经网络内部学习的计算过程,实现对模型内部表示的探测与验证、不同输入类型处理路径的发现与可视化,以及跨模型(统计模型、机器学习模型和神经网络模型)学习机制与提取见解的比较,从而增强对因果生物统计分析中各模型优势与局限的理解。

链接: https://arxiv.org/abs/2505.00555
作者: Jean-Baptiste A. Conan
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interpretable insights from predictive models remain critical in bio-statistics, particularly when assessing causality, where classical statistical and machine learning methods often provide inherent clarity. While Neural Networks (NNs) offer powerful capabilities for modeling complex biological data, their traditional “black-box” nature presents challenges for validation and trust in high-stakes health applications. Recent advances in Mechanistic Interpretability (MI) aim to decipher the internal computations learned by these networks. This work investigates the application of MI techniques to NNs within the context of causal inference for bio-statistics. We demonstrate that MI tools can be leveraged to: (1) probe and validate the internal representations learned by NNs, such as those estimating nuisance functions in frameworks like Targeted Minimum Loss-based Estimation (TMLE); (2) discover and visualize the distinct computational pathways employed by the network to process different types of inputs, potentially revealing how confounders and treatments are handled; and (3) provide methodologies for comparing the learned mechanisms and extracted insights across statistical, machine learning, and NN models, fostering a deeper understanding of their respective strengths and weaknesses for causal bio-statistical analysis.
zh
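
文中“探测网络内部表示”的思路可用前向钩子加线性探针来示意:截取一个玩具妨扰函数网络的隐藏表示,检验混杂变量是否可被线性解码。网络结构、数据与探针设置均为演示假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

acts = {}
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
net[1].register_forward_hook(lambda m, i, o: acts.update(h=o.detach()))

X = torch.randn(256, 10)
confounder = (X[:, 0] > 0).float()   # toy binary confounder, assumed known
net(X)                               # fills acts["h"] with hidden activations

probe = nn.Linear(64, 1)             # linear probe: is the confounder decodable?
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    loss = F.binary_cross_entropy_with_logits(
        probe(acts["h"]).squeeze(-1), confounder)
    opt.zero_grad(); loss.backward(); opt.step()
print("probe loss:", loss.item())    # low loss => confounder linearly encoded
```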

[AI-52] Perceptual Implications of Automatic Anonymization in Pathological Speech

【速读】:该论文旨在解决病理语音数据在伦理共享过程中面临的隐私保护与语音可理解性之间的平衡问题,特别是自动匿名化技术对语音感知质量的影响尚未得到充分研究。其解决方案的关键在于通过结构化的感知协议,对使用先进自动方法匿名化的病理语音进行系统评估,分析不同语言背景的听者在零样本和少量样本条件下的辨识准确率和质量评分,揭示匿名化对语音质量的影响及其与病理类型的相关性,从而为设计更有效的、考虑听众反馈和病理特异性的匿名化策略提供依据。

链接: https://arxiv.org/abs/2505.00409
作者: Soroosh Tayebi Arasteh,Saba Afza,Tri-Thien Nguyen,Lukas Buess,Maryam Parvin,Tomas Arias-Vergara,Paula Andrea Perez-Toro,Hiu Ching Hung,Mahshad Lotfinia,Thomas Gorges,Elmar Noeth,Maria Schuster,Seung Hee Yang,Andreas Maier
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automatic anonymization techniques are essential for ethical sharing of pathological speech data, yet their perceptual consequences remain understudied. This study presents the first comprehensive human-centered analysis of anonymized pathological speech, using a structured perceptual protocol involving ten native and non-native German listeners with diverse linguistic, clinical, and technical backgrounds. Listeners evaluated anonymized-original utterance pairs from 180 speakers spanning Cleft Lip and Palate, Dysarthria, Dysglossia, Dysphonia, and age-matched healthy controls. Speech was anonymized using state-of-the-art automatic methods (equal error rates in the range of 30-40%). Listeners completed Turing-style discrimination and quality rating tasks under zero-shot (single-exposure) and few-shot (repeated-exposure) conditions. Discrimination accuracy was high overall (91% zero-shot; 93% few-shot), but varied by disorder (repeated-measures ANOVA: p=0.007), ranging from 96% (Dysarthria) to 86% (Dysphonia). Anonymization consistently reduced perceived quality (from 83% to 59%, p<0.001), with pathology-specific degradation patterns (one-way ANOVA: p=0.005). Native listeners rated original speech slightly higher than non-native listeners (Delta=4%, p=0.199), but this difference nearly disappeared after anonymization (Delta=1%, p=0.724). No significant gender-based bias was observed. Critically, human perceptual outcomes did not correlate with automatic privacy or clinical utility metrics. These results underscore the need for listener-informed, disorder- and context-specific anonymization strategies that preserve privacy while maintaining interpretability, communicative functions, and diagnostic utility, especially for vulnerable populations such as children.
zh

[AI-53] Convolutional Autoencoders for Data Compression and Anomaly Detection in Small Satellite Technologies

【速读】:该论文旨在解决小卫星在地球观测任务中数据传输效率低和异常检测能力不足的问题,其解决方案的关键在于采用卷积自编码器(convolutional autoencoder),该方法能够在卫星载荷上实现数据压缩以提高传输效率,并同时进行源端异常检测,从而优化卫星数据采集过程。

链接: https://arxiv.org/abs/2505.00040
作者: Dishanand Jayeprokash,Julia Gonski
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures

点击查看摘要

Abstract:Small satellite technologies have enhanced the potential and feasibility of geodesic missions, through simplification of design and decreased costs allowing for more frequent launches. On-satellite data acquisition systems can benefit from the implementation of machine learning (ML), for better performance and greater efficiency on tasks such as image processing or feature extraction. This work presents convolutional autoencoders for implementation on the payload of small satellites, designed to achieve dual functionality of data compression for more efficient off-satellite transmission, and at-source anomaly detection to inform satellite data-taking. This capability is demonstrated for a use case of disaster monitoring using aerial image datasets of the African continent, offering avenues both for novel ML-based approaches in small satellite applications and for the expansion of space technology and artificial intelligence in Africa.
zh
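
下面是对上述“压缩 + 异常检测”双重功能的最小 PyTorch 草图:编码器瓶颈即下传的压缩表示,逐图像重建误差充当在轨异常评分。通道数与层数为演示假设,并非论文架构。

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 4, 3, stride=2, padding=1))
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(4, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1))

    def forward(self, x):
        z = self.enc(x)                       # compressed payload for downlink
        x_hat = self.dec(z)
        score = ((x - x_hat) ** 2).mean(dim=(1, 2, 3))  # at-source anomaly score
        return z, x_hat, score
```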

机器学习

[LG-0] On the Importance of Gaussianizing Representations ICML2025

链接: https://arxiv.org/abs/2505.00685
作者: Daniel Eftekhari,Vardan Papyan
类目: Machine Learning (cs.LG)
*备注: ICML 2025 Proceedings

点击查看摘要

Abstract:The normal distribution plays a central role in information theory - it is at the same time the best-case signal and worst-case noise distribution, has the greatest representational capacity of any distribution, and offers an equivalence between uncorrelatedness and independence for joint distributions. Accounting for the mean and variance of activations throughout the layers of deep neural networks has had a significant effect on facilitating their effective training, but seldom has a prescription for precisely what distribution these activations should take, and how this might be achieved, been offered. Motivated by the information-theoretic properties of the normal distribution, we address this question and concurrently present normality normalization: a novel normalization layer which encourages normality in the feature representations of neural networks using the power transform and employs additive Gaussian noise during training. Our experiments comprehensively demonstrate the effectiveness of normality normalization: its generalization performance on an array of widely used model and dataset combinations, its strong performance across various common factors of variation such as model width, depth, and training minibatch size, its suitability for usage wherever existing normalization layers are conventionally used, and its value as a means of improving model robustness to random perturbations.
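
以下草图粗略示意摘要点名的两个要素:促使特征趋于正态的幂变换,以及训练期的加性高斯噪声。其中 Yeo-Johnson 式参数化与标准化的位置均为假设,论文中层的实现细节可能不同。

```python
import torch
import torch.nn as nn

class NormalityNorm(nn.Module):
    def __init__(self, num_features, noise_std=0.1):
        super().__init__()
        self.lmbda = nn.Parameter(torch.ones(num_features))  # learnable exponent
        self.noise_std = noise_std

    def forward(self, x):                       # x: (batch, num_features)
        l = self.lmbda.clamp(0.1, 1.9)          # keep both branches well-defined
        pos = ((x.clamp(min=0) + 1).pow(l) - 1) / l
        neg = -(((-x).clamp(min=0) + 1).pow(2 - l) - 1) / (2 - l)
        xt = torch.where(x >= 0, pos, neg)      # Yeo-Johnson-style power transform
        xt = (xt - xt.mean(0)) / (xt.std(0) + 1e-5)
        if self.training:                       # additive Gaussian noise, train only
            xt = xt + self.noise_std * torch.randn_like(xt)
        return xt
```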

[LG-1] Explainable AI in Spatial Analysis

链接: https://arxiv.org/abs/2505.00591
作者: Ziqi Li
类目: Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:This chapter discusses the opportunities of eXplainable Artificial Intelligence (XAI) within the realm of spatial analysis. A key objective in spatial analysis is to model spatial relationships and infer spatial processes to generate knowledge from spatial data, which has been largely based on spatial statistical methods. More recently, machine learning offers scalable and flexible approaches that complement traditional methods and has been increasingly applied in spatial data science. Despite its advantages, machine learning is often criticized for being a black box, which limits our understanding of model behavior and output. Recognizing this limitation, XAI has emerged as a pivotal field in AI that provides methods to explain the output of machine learning models to enhance transparency and understanding. These methods are crucial for model diagnosis, bias detection, and ensuring the reliability of results obtained from machine learning models. This chapter introduces key concepts and methods in XAI with a focus on Shapley value-based approaches, which is arguably the most popular XAI method, and their integration with spatial analysis. An empirical example of county-level voting behaviors in the 2020 Presidential election is presented to demonstrate the use of Shapley values and spatial analysis with a comparison to multi-scale geographically weighted regression. The chapter concludes with a discussion on the challenges and limitations of current XAI techniques and proposes new directions.
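
若想动手体验文中的 Shapley 值流程,可参考下面在合成表格数据上的最小示例(特征仅为县级协变量的占位,并非本章的选举数据):先拟合树模型,再用 shap 包计算逐样本归因。

```python
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                    # placeholder covariates
y = 2 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 0.1, 500)

model = xgb.XGBRegressor(n_estimators=200).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)  # (n_obs, n_features)
print(np.abs(shap_values).mean(axis=0))          # mean |contribution| per feature
```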

[LG-2] Unlocking the Potential of Linear Networks for Irregular Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2505.00590
作者: Chengsen Wang,Qi Qi,Jingyu Wang,Haifeng Sun,Zirui Zhuang,Jianxin Liao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting holds significant importance across various industries, including finance, transportation, energy, healthcare, and climate. Despite the widespread use of linear networks due to their low computational cost and effectiveness in modeling temporal dependencies, most existing research has concentrated on regularly sampled and fully observed multivariate time series. However, in practice, we frequently encounter irregular multivariate time series characterized by variable sampling intervals and missing values. The inherent intra-series inconsistency and inter-series asynchrony in such data hinder effective modeling and forecasting with traditional linear networks relying on static weights. To tackle these challenges, this paper introduces a novel model named AiT. AiT utilizes an adaptive linear network capable of dynamically adjusting weights according to observation time points to address intra-series inconsistency, thereby enhancing the accuracy of temporal dependencies modeling. Furthermore, by incorporating the Transformer module on variable semantics embeddings, AiT efficiently captures variable correlations, avoiding the challenge of inter-series asynchrony. Comprehensive experiments across four benchmark datasets demonstrate the superiority of AiT, improving prediction accuracy by 11% and decreasing runtime by 52% compared to existing state-of-the-art methods.

[LG-3] ParkDiffusion: Heterogeneous Multi-Agent Multi-Modal Trajectory Prediction for Automated Parking using Diffusion Models

链接: https://arxiv.org/abs/2505.00586
作者: Jiarong Wei,Niclas Vödisch,Anna Rehr,Christian Feist,Abhinav Valada
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automated parking is a critical feature of Advanced Driver Assistance Systems (ADAS), where accurate trajectory prediction is essential to bridge perception and planning modules. Despite its significance, research in this domain remains relatively limited, with most existing studies concentrating on single-modal trajectory prediction of vehicles. In this work, we propose ParkDiffusion, a novel approach that predicts the trajectories of both vehicles and pedestrians in automated parking scenarios. ParkDiffusion employs diffusion models to capture the inherent uncertainty and multi-modality of future trajectories, incorporating several key innovations. First, we propose a dual map encoder that processes soft semantic cues and hard geometric constraints using a two-step cross-attention mechanism. Second, we introduce an adaptive agent type embedding module, which dynamically conditions the prediction process on the distinct characteristics of vehicles and pedestrians. Third, to ensure kinematic feasibility, our model outputs control signals that are subsequently used within a kinematic framework to generate physically feasible trajectories. We evaluate ParkDiffusion on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Our work establishes a new baseline for heterogeneous trajectory prediction in parking scenarios, outperforming existing methods by a considerable margin.

[LG-4] Parameter-Efficient Fine-Tuning with Circulant and Diagonal Vectors IJCAI-2025

链接: https://arxiv.org/abs/2505.00580
作者: Xinyu Ding,Lexuan Chen,Siyu Liao,Zhongfeng Wang
类目: Machine Learning (cs.LG)
*备注: to appear in Proceedings of the 2025 International Joint Conference on Artificial Intelligence (IJCAI-2025)

点击查看摘要

Abstract:Foundation models have achieved tremendous success in different domains. However, their huge computation and storage complexity make these models difficult to fine-tune and also less applicable in practice. A recent study shows that training in the Fourier domain can be an effective fine-tuning method in terms of both model performance and the number of training parameters. In this work, we propose to further reduce the complexity by the factorization through the product of interleaved circulant and diagonal matrices. In addition, we address the case of non-square fine-tuning weights by partitioning the circulant matrix into blocks. Our method avoids the construction of weight change matrix and utilizes 1D fast Fourier transform (FFT) instead of 2D FFT. Experimental results show that our method achieves similar or better performance across various tasks with far fewer floating-point operations (FLOPs) and trainable parameters.
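
该分解天然适合用 FFT 做矩阵-向量乘法。下面的草图在“方形块”假设下,仅用一维 FFT 实现交错的对角-循环乘积,每个因子的代价为 O(n log n):

```python
import torch

def circulant_matvec(c, x):
    # y = C x for a circulant C with first column c, using the
    # diagonalization C = F^{-1} diag(fft(c)) F (1D FFTs only).
    return torch.fft.ifft(torch.fft.fft(c) * torch.fft.fft(x)).real

def cd_product(x, diags, circs):
    # Interleaved product D_k C_k ... D_1 C_1 x; shapes are illustrative
    # (square blocks assumed; the paper also handles non-square weights).
    for d, c in zip(diags, circs):
        x = d * circulant_matvec(c, x)
    return x

# e.g. cd_product(torch.randn(256), [torch.randn(256)] * 2, [torch.randn(256)] * 2)
```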

[LG-5] Graph Spectral Filtering with Chebyshev Interpolation for Recommendation SIGIR2025

链接: https://arxiv.org/abs/2505.00552
作者: Chanwoo Kim,Jinkyu Sung,Yebonn Han,Joonseok Lee
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted by SIGIR 2025; 11 pages, 9 figures, 5 tables

点击查看摘要

Abstract:Graph convolutional networks have recently gained prominence in collaborative filtering (CF) for recommendations. However, we identify potential bottlenecks in two foundational components. First, the embedding layer leads to a latent space with limited capacity, overlooking locally observed but potentially valuable preference patterns. Also, the widely-used neighborhood aggregation is limited in its ability to leverage diverse preference patterns in a fine-grained manner. Building on spectral graph theory, we reveal that these limitations stem from graph filtering with a cut-off in the frequency spectrum and a restricted linear form. To address these issues, we introduce ChebyCF, a CF framework based on graph spectral filtering. Instead of a learned embedding, it takes a user’s raw interaction history to utilize the full spectrum of signals contained in it. Also, it adopts Chebyshev interpolation to effectively approximate a flexible non-linear graph filter, and further enhances it by using an additional ideal pass filter and degree-based normalization. Through extensive experiments, we verify that ChebyCF overcomes the aforementioned bottlenecks and achieves state-of-the-art performance across multiple benchmarks and reasonably fast inference. Our code is available at this https URL.
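
ChebyCF 核心的 Chebyshev 插值可用标准三项递推来示意;这里假设 L_tilde 是已缩放到 [-1, 1] 的图拉普拉斯,coeffs 为学得的插值系数(论文中的理想通滤波与基于度的归一化在此省略):

```python
import torch

def chebyshev_filter(L_tilde, x, coeffs):
    # Evaluate sum_k c_k T_k(L~) x with the three-term recurrence
    # T_0 = I, T_1 = L~, T_k = 2 L~ T_{k-1} - T_{k-2};
    # assumes len(coeffs) >= 2 and L~ rescaled to spectrum [-1, 1].
    t_prev, t_curr = x, L_tilde @ x
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_prev, t_curr = t_curr, 2 * (L_tilde @ t_curr) - t_prev
        out = out + c * t_curr
    return out
```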

[LG-6] Directly Forecasting Belief for Reinforcement Learning with Delays

链接: https://arxiv.org/abs/2505.00546
作者: Qingyuan Wu,Yuhui Wang,Simon Sinong Zhan,Yixuan Wang,Chung-Wei Lin,Chen Lv,Qi Zhu,Jürgen Schmidhuber,Chao Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning (RL) with delays is challenging as sensory perceptions lag behind the actual events: the RL agent needs to estimate the real state of its environment based on past observations. State-of-the-art (SOTA) methods typically employ recursive, step-by-step forecasting of states. This can cause the accumulation of compounding errors. To tackle this problem, our novel belief estimation method, named Directly Forecasting Belief Transformer (DFBT), directly forecasts states from observations without incrementally estimating intermediate states step-by-step. We theoretically demonstrate that DFBT greatly reduces compounding errors of existing recursively forecasting methods, yielding stronger performance guarantees. In experiments with D4RL offline datasets, DFBT reduces compounding errors with remarkable prediction accuracy. DFBT’s capability to forecast state sequences also facilitates multi-step bootstrapping, thus greatly improving learning efficiency. On the MuJoCo benchmark, our DFBT-based method substantially outperforms SOTA baselines.
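
上文“递推式 vs. 直接式”信念估计的对比可用如下示意代码概括;其中 model.step 与 dfbt 的接口均为演示假设:

```python
def recursive_forecast(model, obs_history, delay):
    # Step-by-step rollout: each predicted state is fed back in,
    # so errors compound over the delay window.
    s = obs_history[-1]
    for _ in range(delay):
        s = model.step(s)
    return s

def direct_forecast(dfbt, obs_history, delay):
    # DFBT-style: map the observation history straight to the delayed
    # state in one shot, with no intermediate estimates to compound.
    return dfbt(obs_history, delay)
```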

[LG-7] KnowEEG: Explainable Knowledge Driven EEG Classification

链接: https://arxiv.org/abs/2505.00541
作者: Amarpal Sahota,Navid Mohammadi Foumani,Raul Santos-Rodriguez,Zahraa S. Abdallah
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) is a method of recording brain activity that shows significant promise in applications ranging from disease classification to emotion detection and brain-computer interfaces. Recent advances in deep learning have improved EEG classification performance, yet model explainability remains an issue. To address this key limitation of explainability we introduce KnowEEG: a novel explainable machine learning approach for EEG classification. KnowEEG extracts a comprehensive set of per-electrode features, filters them using statistical tests, and integrates between-electrode connectivity statistics. These features are then input to our modified Random Forest model (Fusion Forest) that balances per-electrode statistics with between-electrode connectivity features in growing the trees of the forest. By incorporating knowledge from both the generalized time-series and EEG-specific domains, KnowEEG achieves performance comparable to or exceeding state-of-the-art deep learning models across five different classification tasks: emotion detection, mental workload classification, eyes open/closed detection, abnormal EEG classification, and event detection. In addition to high performance, KnowEEG provides inherent explainability through feature importance scores for understandable features. We demonstrate by example on the eyes closed/open classification task that this explainability can be used to discover knowledge about the classes. The knowledge discovered for eyes open/closed classification was confirmed by current neuroscience literature. Therefore, the impact of KnowEEG will be significant for domains where EEG explainability is critical, such as healthcare.

[LG-8] Emergence of Roles in Robotic Teams with Model Sharing and Limited Communication

链接: https://arxiv.org/abs/2505.00540
作者: Ian O’Flynn,Harun Šiljak
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: Accepted for 2025 8th International Balkan Conference on Communications and Networking (Balkancom)

点击查看摘要

Abstract:We present a reinforcement learning strategy for use in multi-agent foraging systems in which the learning is centralised to a single agent and its model is periodically disseminated among the population of non-learning agents. In a domain where multi-agent reinforcement learning (MARL) is the common approach, this approach aims to significantly reduce the computational and energy demands compared to approaches such as MARL and centralised learning models. By developing high performing foraging agents, these approaches can be translated into real-world applications such as logistics, environmental monitoring, and autonomous exploration. A reward function was incorporated into this approach that promotes role development among agents, without explicit directives. This led to the differentiation of behaviours among the agents. The implicit encouragement of role differentiation allows for dynamic actions in which agents can alter roles dependent on their interactions with the environment without the need for explicit communication between agents.

[LG-9] Leveraging Partial SMILES Validation Scheme for Enhanced Drug Design in Reinforcement Learning Frameworks ICML2025

链接: https://arxiv.org/abs/2505.00530
作者: Xinyu Wang,Jinbo Bi,Minghu Song
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Biomolecules (q-bio.BM)
*备注: 17 pages, 5 main figures, 2 appendix figures. Submitted to ICML 2025

点击查看摘要

Abstract:SMILES-based molecule generation has emerged as a powerful approach in drug discovery. Deep reinforcement learning (RL) using large language models (LLMs) has been incorporated into the molecule generation process to achieve high matching scores in terms of the likelihood of desired molecule candidates. However, a critical challenge in this approach is catastrophic forgetting during the RL phase, where knowledge such as molecule validity, which often exceeds 99% during pretraining, significantly deteriorates. Current RL algorithms applied in drug discovery, such as REINVENT, use prior models as anchors to retain pretraining knowledge, but these methods lack robust exploration mechanisms. To address these issues, we propose Partial SMILES Validation-PPO (PSV-PPO), a novel RL algorithm that incorporates real-time partial SMILES validation to prevent catastrophic forgetting while encouraging exploration. Unlike traditional RL approaches that validate molecule structures only after generating entire sequences, PSV-PPO performs stepwise validation at each auto-regressive step, evaluating not only the selected token candidate but also all potential branches stemming from the prior partial sequence. This enables early detection of invalid partial SMILES across all potential paths. As a result, PSV-PPO maintains high validity rates even during aggressive exploration of the vast chemical space. Our experiments on the PMO and GuacaMol benchmark datasets demonstrate that PSV-PPO significantly reduces the number of invalid generated structures while maintaining competitive exploration and optimization performance. While our work primarily focuses on maintaining validity, the framework of PSV-PPO can be extended in future research to incorporate additional forms of valuable domain knowledge, further enhancing reinforcement learning applications in drug discovery.
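
下面给出部分 SMILES 逐步校验的一个高度简化的句法替代:一旦前缀不可能再被补全即提前拒绝(此处只检查未匹配的分支括号;真实校验还需借助化学信息学工具处理环闭合与价态):

```python
def partial_smiles_plausible(s: str) -> bool:
    # Reject a SMILES prefix as soon as it cannot be completed:
    # an unmatched ')' can never be repaired by future tokens.
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
    return True

# e.g. used inside the sampler to mask token candidates whose extension fails
```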

[LG-10] Self-Ablating Transformers: More Interpretability Less Sparsity ICLR2025

链接: https://arxiv.org/abs/2505.00509
作者: Jeremias Ferrao,Luhan Mikaelson,Keenan Pepper,Natalia Perez-Campanero Antolin
类目: Machine Learning (cs.LG)
*备注: Poster Presentation at Building Trust Workshop at ICLR 2025

点击查看摘要

Abstract:A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach dynamically enforces a k-winner-takes-all constraint, forcing the model to demonstrate selective activation across neuron and attention units. Unlike post-hoc methods that analyze already-trained models, our approach integrates interpretability directly into model training, promoting feature localization from inception. Training small models on the TinyStories dataset and employing interpretability tests, we find that self-ablation leads to more localized circuits, concentrated feature representations, and increased neuron specialization without compromising language modelling performance. Surprisingly, our method also decreased overall sparsity, indicating that self-ablation promotes specialization rather than widespread inactivity. This reveals a complex interplay between sparsity and interpretability, where decreased global sparsity can coexist with increased local specialization, leading to enhanced interpretability. To facilitate reproducibility, we make our code available at this https URL.
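
k-winner-takes-all 约束可以示意为对单元激活的硬 top-k 掩码;直通式(straight-through)处理与每层的 k 取值均为演示假设:

```python
import torch

def k_winner_take_all(x, k):
    # Hard mask keeping the k largest activations per example;
    # gradients flow only through the surviving units.
    topk = torch.topk(x, k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, topk.indices, 1.0)
    return x * mask
```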

[LG-11] Implicit Neural-Representation Learning for Elastic Deformable-Object Manipulations

链接: https://arxiv.org/abs/2505.00500
作者: Minseok Song,JeongHo Ha,Bonggyeong Park,Daehyung Park
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We aim to solve the problem of manipulating deformable objects, particularly elastic bands, in real-world scenarios. However, deformable object manipulation (DOM) requires a policy that works on a large state space due to the unlimited degrees of freedom (DoF) of deformable objects. Further, their dense but partial observations (e.g., images or point clouds) may increase the sampling complexity and uncertainty in policy learning. To address this, we propose a novel implicit neural-representation (INR) learning method for elastic DOMs, called INR-DOM. Our method learns consistent state representations associated with partially observable elastic objects by reconstructing a complete, implicit surface represented as a signed distance function. Furthermore, we perform exploratory representation fine-tuning through reinforcement learning (RL) that enables RL algorithms to effectively learn exploitable representations while efficiently obtaining a DOM policy. We perform quantitative and qualitative analyses building three simulated environments and real-world manipulation studies with a Franka Emika Panda arm. Videos are available at this http URL.

[LG-12] Enhancing Tropical Cyclone Path Forecasting with an Improved Transformer Network

链接: https://arxiv.org/abs/2505.00495
作者: Nguyen Van Thanh,Nguyen Dang Huynh,Nguyen Ngoc Tan,Nguyen Thai Minh,Nguyen Nam Hoang
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注:

点击查看摘要

Abstract:A storm is a type of extreme weather. Therefore, forecasting the path of a storm is extremely important for protecting human life and property. However, storm forecasting is very challenging because storm trajectories frequently change. In this study, we propose an improved deep learning method using a Transformer network to predict the movement trajectory of a storm over the next 6 hours. The storm data used to train the model was obtained from the National Oceanic and Atmospheric Administration (NOAA) [1]. Simulation results show that the proposed method is more accurate than traditional methods. Moreover, the proposed method is faster and more cost-effective.

[LG-13] Interpretable Spatial-Temporal Fusion Transformers: Multi-Output Prediction for Parametric Dynamical Systems with Time-Varying Inputs

链接: https://arxiv.org/abs/2505.00473
作者: Shuwen Sun,Lihong Feng,Peter Benner
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:We explore the promising performance of a transformer model in predicting outputs of parametric dynamical systems with external time-varying input signals. The outputs of such systems vary not only with physical parameters but also with external time-varying input signals. Accurately capturing the dynamics of such systems is challenging. We have adapted and extended an existing transformer model for single-output prediction to a multiple-output transformer that is able to predict multiple output responses of these systems. The multiple-output transformer generalizes the interpretability of the original transformer. The generalized interpretable attention weight matrix explores not only the temporal correlations in the sequence, but also the interactions between the multiple outputs, providing explanation for the spatial correlation in the output domain. This multiple-output transformer accurately predicts the sequence of multiple outputs, regardless of the nonlinearity of the system and the dimensionality of the parameter space.

[LG-14] A Generalised Framework for Property-Driven Machine Learning

链接: https://arxiv.org/abs/2505.00466
作者: Thomas Flinkow,Marco Casadio,Colin Kessler,Rosemary Monahan,Ekaterina Komendantskaya
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 22 pages, 4 tables, 4 figures. Submitted to AI Verification 2025

点击查看摘要

Abstract:Neural networks have been shown to frequently fail to satisfy critical safety and correctness properties after training, highlighting the pressing need for training methods that incorporate such properties directly. While adversarial training can be used to improve robustness to small perturbations within ε-cubes, domains other than computer vision – such as control systems and natural language processing – may require more flexible input region specifications via generalised hyper-rectangles. Meanwhile, differentiable logics offer a way to encode arbitrary logical constraints as additional loss terms that guide the learning process towards satisfying these constraints. In this paper, we investigate how these two complementary approaches can be unified within a single framework for property-driven machine learning. We show that well-known properties from the literature are subcases of this general approach, and we demonstrate its practical effectiveness on a case study involving a neural network controller for a drone system. Our framework is publicly available at this https URL.
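
下面是在广义超矩形上构造可微逻辑损失项的最小草图:对输入区域采样,并惩罚对示例性质(输出落在 [-1, 1] 内)的违反。性质本身、采样方式与权重均为演示假设:

```python
import torch

def property_loss(model, x_low, x_high, n_samples=64):
    # Sample the hyper-rectangle [x_low, x_high] and penalize outputs
    # that leave [-1, 1]; the penalty is zero iff the property holds
    # on the sampled points.
    u = torch.rand(n_samples, x_low.shape[-1])
    x = x_low + u * (x_high - x_low)
    violation = torch.relu(model(x).abs() - 1.0)
    return violation.mean()

# total_loss = task_loss + lambda_prop * property_loss(model, lo, hi)
```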

[LG-15] Subspace-Distance-Enabled Active Learning for Efficient Data-Driven Model Reduction of Parametric Dynamical Systems

链接: https://arxiv.org/abs/2505.00460
作者: Harshit Kapadia,Peter Benner,Lihong Feng
类目: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph)
*备注: 31 pages, 10 figures, 4 tables

点击查看摘要

Abstract:In situations where the solution of a high-fidelity dynamical system needs to be evaluated repeatedly, over a vast pool of parametric configurations and in the absence of access to the underlying governing equations, data-driven model reduction techniques are preferable. We propose a novel active learning approach to build a parametric data-driven reduced-order model (ROM) by greedily picking the most important parameter samples from the parameter domain. As a result, during the ROM construction phase, the number of high-fidelity solutions grows dynamically in a principled fashion. The high-fidelity solution snapshots are expressed in several parameter-specific linear subspaces, with the help of proper orthogonal decomposition (POD), and the relative distance between these subspaces is used as a guiding mechanism to perform active learning. To achieve this, we provide a distance measure to evaluate the similarity between pairs of linear subspaces with different dimensions, and also show that this distance measure is a metric. The usability of the proposed subspace-distance-enabled active learning (SDE-AL) framework is demonstrated by augmenting two existing non-intrusive reduced-order modeling approaches, and providing their active-learning-driven (ActLearn) extensions, namely, SDE-ActLearn-POD-KSNN, and SDE-ActLearn-POD-NN. Furthermore, we report positive results for two parametric physical models, highlighting the efficiency of the proposed SDE-AL approach.
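
子空间比较可用主角(principal angles)来示意;SciPy 的 subspace_angles 允许两组基的列数不同,但论文为不等维子空间定义并证明了自己的度量,故下面基于 sin 的 Grassmann 式距离只是替代示意:

```python
import numpy as np
from scipy.linalg import subspace_angles, svd

def pod_basis(snapshots, r):
    # Left singular vectors of the snapshot matrix give the POD basis.
    U, _, _ = svd(snapshots, full_matrices=False)
    return U[:, :r]

def subspace_distance(U1, U2):
    # Grassmann-style distance built from principal angles; U1 and U2
    # may have different numbers of columns.
    return float(np.linalg.norm(np.sin(subspace_angles(U1, U2))))
```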

[LG-16] CICADA: Cross-Domain Interpretable Coding for Anomaly Detection and Adaptation in Multivariate Time Series

链接: https://arxiv.org/abs/2505.00415
作者: Tian Lan,Yifei Gao,Yimeng Lu,Chen Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unsupervised time series anomaly detection plays a crucial role in applications across industries. However, existing methods face significant challenges due to data distributional shifts across different domains, which are exacerbated by the non-stationarity of time series over time. Existing models fail to generalize under multiple heterogeneous source domains and emerging, unseen target domains. To fill this research gap, we introduce CICADA (Cross-domain Interpretable Coding for Anomaly Detection and Adaptation), with four key innovations: (1) a mixture-of-experts (MOE) framework that captures domain-agnostic anomaly features with high flexibility and interpretability; (2) a novel selective meta-learning mechanism to prevent negative transfer between dissimilar domains; (3) an adaptive expansion algorithm for emerging heterogeneous domain expansion; and (4) a hierarchical attention structure that quantifies expert contributions during fusion to enhance interpretability. Extensive experiments on synthetic and real-world industrial datasets demonstrate that CICADA outperforms state-of-the-art methods in both cross-domain detection performance and interpretability.

[LG-17] Machine Learning Meets Transparency in Osteoporosis Risk Assessment: A Comparative Study of ML and Explainability Analysis

链接: https://arxiv.org/abs/2505.00410
作者: Farhana Elias,Md Shihab Reza,Muhammad Zawad Mahmud,Samiha Islam
类目: Machine Learning (cs.LG)
*备注: Submitted in an international conference

点击查看摘要

Abstract:The present research tackles the difficulty of predicting osteoporosis risk via machine learning (ML) approaches, emphasizing the use of explainable artificial intelligence (XAI) to improve model transparency. Osteoporosis is a significant public health concern, sometimes remaining untreated owing to its asymptomatic characteristics, and early identification is essential to avert fractures. The research assesses six machine learning classifiers (Random Forest, Logistic Regression, XGBoost, AdaBoost, LightGBM, and Gradient Boosting) and utilizes a dataset based on clinical, demographic, and lifestyle variables. The models are refined using GridSearchCV to calibrate hyperparameters, with the objective of enhancing predictive efficacy. XGBoost had the greatest accuracy (91%) among the evaluated models, surpassing others in precision (0.92), recall (0.91), and F1-score (0.90). The research further integrates XAI approaches, such as SHAP, LIME, and Permutation Feature Importance, to elucidate the decision-making process of the optimal model. The study indicates that age is the primary determinant in forecasting osteoporosis risk, followed by hormonal alterations and familial history. These results corroborate clinical knowledge and affirm the models’ therapeutic significance. The research underscores the significance of explainability in machine learning models for healthcare applications, guaranteeing that physicians can rely on the system’s predictions. The report ultimately proposes directions for further research, such as validation across varied populations and the integration of supplementary biomarkers for enhanced predictive accuracy.

[LG-18] Safety in the Face of Adversity: Achieving Zero Constraint Violation in Online Learning with Slowly Changing Constraints AISTATS

链接: https://arxiv.org/abs/2505.00398
作者: Bassel Hamoud,Ilnura Usmanova,Kfir Y. Levy
类目: Machine Learning (cs.LG)
*备注: Accepted to AISTATS, 2025

点击查看摘要

Abstract:We present the first theoretical guarantees for zero constraint violation in Online Convex Optimization (OCO) across all rounds, addressing dynamic constraint changes. Unlike existing approaches in constrained OCO, which allow for occasional safety breaches, we provide the first approach for maintaining strict safety under the assumption of gradually evolving constraints, namely, that the constraints change by at most a small amount between consecutive rounds. This is achieved through a primal-dual approach and Online Gradient Ascent in the dual space. We show that employing a dichotomous learning rate ensures both safety, via zero constraint violation, and sublinear regret. Our framework marks a departure from previous work by providing the first provable guarantees for maintaining absolute safety in the face of changing constraints in OCO.
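
上述原始-对偶更新可示意如下,对偶变量采用在线梯度上升;具体步长选取(文中的 dichotomous learning rate)与安全性分析是论文贡献,此处不再现:

```python
def primal_dual_step(x, lam, grad_f, g, grad_g, eta_x, eta_lam, project):
    # Gradient descent on the Lagrangian in the primal variable,
    # Online Gradient Ascent on the dual variable (kept nonnegative).
    x_new = project(x - eta_x * (grad_f(x) + lam * grad_g(x)))
    lam_new = max(0.0, lam + eta_lam * g(x_new))
    return x_new, lam_new
```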

[LG-19] Approximation to Deep Q-Network by Stochastic Delay Differential Equations

链接: https://arxiv.org/abs/2505.00382
作者: Jianya Lu,Yingjun Mo
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:Despite the significant breakthroughs that the Deep Q-Network (DQN) has brought to reinforcement learning, its theoretical analysis remains limited. In this paper, we construct a stochastic differential delay equation (SDDE) based on the DQN algorithm and estimate the Wasserstein-1 distance between them. We provide an upper bound for the distance and prove that the distance between the two converges to zero as the step size approaches zero. This result allows us to understand DQN’s two key techniques, experience replay and the target network, from the perspective of continuous systems. Specifically, the delay term in the equation, corresponding to the target network, contributes to the stability of the system. Our approach leverages a refined Lindeberg principle and an operator comparison to establish these results.

[LG-20] From GNNs to Trees: Multi-Granular Interpretability for Graph Neural Networks ICLR2025

链接: https://arxiv.org/abs/2505.00364
作者: Jie Yang,Yuwen Wang,Kaixuan Chen,Tongya Zheng,Yihe Zhou,Zhenbang Xiao,Ji Cao,Mingli Song,Shunyu Liu
类目: Machine Learning (cs.LG)
*备注: Accepted by ICLR 2025

点击查看摘要

Abstract:Interpretable Graph Neural Networks (GNNs) aim to reveal the underlying reasoning behind model predictions, attributing their decisions to specific subgraphs that are informative. However, existing subgraph-based interpretable methods suffer from an overemphasis on local structure, potentially overlooking long-range dependencies within the entire graphs. Although recent efforts that rely on graph coarsening have proven beneficial for global interpretability, they inevitably reduce the graphs to a fixed granularity. Such an inflexible way can only capture graph connectivity at a specific level, whereas real-world graph tasks often exhibit relationships at varying granularities (e.g., relevant interactions in proteins span from functional groups, to amino acids, and up to protein domains). In this paper, we introduce a novel Tree-like Interpretable Framework (TIF) for graph classification, where plain GNNs are transformed into hierarchical trees, with each level featuring coarsened graphs of different granularity as tree nodes. Specifically, TIF iteratively adopts a graph coarsening module to compress original graphs (i.e., root nodes of trees) into increasingly coarser ones (i.e., child nodes of trees), while preserving diversity among tree nodes within different branches through a dedicated graph perturbation module. Finally, we propose an adaptive routing module to identify the most informative root-to-leaf paths, providing not only the final prediction but also the multi-granular interpretability for the decision-making process. Extensive experiments on the graph classification benchmarks with both synthetic and real-world datasets demonstrate the superiority of TIF in interpretability, while also delivering a competitive prediction performance akin to the state-of-the-art counterparts.

[LG-21] Validation of a 24-hour-ahead Prediction model for a Residential Electrical Load under diverse climate

链接: https://arxiv.org/abs/2505.00348
作者: Ehtisham Asghar,Martin Hill,Ibrahim Sengor,Conor Lynch,Phan Quang An
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate household electrical energy demand prediction is essential for effectively managing sustainable Energy Communities. Integrated with the Energy Management System, these communities aim to optimise operational costs. However, most existing forecasting models are region-specific and depend on large datasets, limiting their applicability across different climates and geographical areas. These models often lack flexibility and may not perform well in regions with limited historical data, leading to inaccurate predictions. This paper proposes a global model for 24-hour-ahead hourly electrical energy demand prediction that is designed to perform effectively across diverse climate conditions and datasets. The model’s efficiency is demonstrated using data from two distinct regions: Ireland, with a maritime climate, and Vietnam, with a tropical climate. Remarkably, the model achieves high accuracy even with a limited dataset spanning only nine months. Its robustness is further validated across different seasons in Ireland (summer and winter) and Vietnam (dry and wet). The proposed model is evaluated against state-of-the-art machine learning and deep learning methods. Simulation results indicate that the model consistently outperforms benchmark models, showcasing its capability to provide reliable forecasts globally, regardless of varying climatic conditions and data availability. This research underscores the model’s potential to enhance the efficiency and sustainability of Energy Communities worldwide. The proposed model achieves a Mean Absolute Percentage Error of 8.0% and 4.0% on the full Irish and Vietnamese datasets.

[LG-22] Communication-Efficient Wireless Federated Fine-Tuning for Large-Scale AI Models

链接: https://arxiv.org/abs/2505.00333
作者: Bumjun Kim,Wan Choi
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Transformer-based large language models (LLMs) have achieved remarkable success across various tasks. Yet, fine-tuning such massive models in federated learning (FL) settings poses significant challenges due to resource constraints and communication overhead. Low-Rank Adaptation (LoRA) addresses these issues by training compact, low-rank matrices instead of fully fine-tuning large models. This paper introduces a wireless federated LoRA fine-tuning framework that optimizes both learning performance and communication efficiency. We provide a novel convergence analysis, revealing how LoRA rank and covariance effects influence FL training dynamics. Leveraging these insights, we propose Sparsified Orthogonal Fine-Tuning (SOFT), an adaptive sparsification method that streamlines parameter updates without expensive matrix multiplications and singular value decomposition (SVD) operations. Additionally, we present a Two-Stage Federated Algorithm (TSFA) that pre-determines key parameters offline and dynamically adjusts bandwidth and sparsification online, ensuring efficient training under latency constraints. Experiments on benchmark datasets show that our approach achieves accuracy comparable to ideal scenario models while significantly reducing communication overhead. Our framework thus enables scalable, resource-efficient deployment of large models in real-world wireless FL scenarios.
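
To make the communication argument tangible, here is a sketch of a LoRA forward pass, whose low-rank factors are all a client would need to upload per round, plus a magnitude-based sparsification of an update. The top-k rule is an assumption; the abstract does not spell out SOFT's selection criterion.

```python
import numpy as np

rng = np.random.default_rng(8)
d_out, d_in, r, alpha = 64, 64, 4, 8

W0 = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable low-rank factor
B = np.zeros((d_out, r))                  # B starts at zero, so dW starts at 0

def lora_forward(x):
    """LoRA: only A and B are trained and communicated in FL, shrinking the
    per-round uplink from d_out*d_in to r*(d_in + d_out) parameters."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

def sparsify(update, keep_ratio=0.1):
    """Magnitude-based top-k sparsification of a LoRA update before uplink
    (an assumed stand-in for SOFT's criterion; note it needs no SVD)."""
    flat = np.abs(update).ravel()
    k = int((1 - keep_ratio) * flat.size)
    thresh = np.partition(flat, k)[k]     # threshold keeping ~keep_ratio
    return np.where(np.abs(update) >= thresh, update, 0.0)

x = rng.normal(size=d_in)
print(lora_forward(x).shape)
print(np.count_nonzero(sparsify(rng.normal(size=(r, d_in)))))  # ~10% kept
```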

[LG-23] Optimal Vector Compressed Sensing Using James Stein Shrinkage

链接: https://arxiv.org/abs/2505.00326
作者: Apratim Dey,David Donoho
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Computation (stat.CO); Methodology (stat.ME)
*备注: 69 pages

点击查看摘要

Abstract:The trend in modern science and technology is to take vector measurements rather than scalars, ruthlessly scaling to ever higher dimensional vectors. For about two decades now, traditional scalar Compressed Sensing has been synonymous with a Convex Optimization based procedure called Basis Pursuit. In the vector recovery case, the natural tendency is to return to a straightforward vector extension of Basis Pursuit, also based on Convex Optimization. However, Convex Optimization is provably suboptimal, particularly when the vector dimension B is large. In this paper, we propose SteinSense, a lightweight iterative algorithm, which is provably optimal when B is large. It does not have any tuning parameter, does not need any training data, requires zero knowledge of sparsity, is embarrassingly simple to implement, and all of this makes it easily scalable to high vector dimensions. We conduct a massive volume of both real and synthetic experiments that confirm the efficacy of SteinSense, and also provide theoretical justification based on ideas from Approximate Message Passing. Fascinatingly, we discover that SteinSense is quite robust, delivering the same quality of performance on real data, and even under substantial departures from conditions under which existing theory holds.
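
For intuition about the core denoiser, the sketch below applies positive-part James-Stein shrinkage to a single noisy block; the AMP-style iterative loop that SteinSense wraps around such a denoiser is omitted.

```python
import numpy as np

def james_stein(y, sigma2):
    """Positive-part James-Stein shrinkage of y ~ N(theta, sigma2 * I).
    A sketch of the kind of shrinkage denoiser SteinSense builds on."""
    d = y.size
    shrink = max(0.0, 1.0 - (d - 2) * sigma2 / np.dot(y, y))
    return shrink * y

rng = np.random.default_rng(2)
theta = np.zeros(64)
theta[:4] = 5.0                       # mostly-zero signal with a few spikes
y = theta + rng.normal(0, 1.0, size=64)
est = james_stein(y, sigma2=1.0)
# Shrinkage usually reduces the error relative to the raw observation:
print(np.linalg.norm(est - theta), np.linalg.norm(y - theta))
```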

[LG-24] Edge Large AI Models: Revolutionizing 6G Networks

链接: https://arxiv.org/abs/2505.00321
作者: Zixin Wang,Yuanming Shi,Yong Zhou,Jingyang Zhu,Khaled B. Letaief
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Large artificial intelligence models (LAMs) possess human-like abilities to solve a wide range of real-world problems, exemplifying the potential of experts in various domains and modalities. By leveraging the communication and computation capabilities of geographically dispersed edge devices, edge LAM emerges as an enabling technology to empower the delivery of various real-time intelligent services in 6G. Unlike traditional edge artificial intelligence (AI) that primarily supports a single task using small models, edge LAM is characterized by the need for decomposition and distributed deployment of large models, and the ability to support highly generalized and diverse tasks. However, due to limited communication, computation, and storage resources over wireless networks, the vast number of trainable neurons and the substantial communication overhead pose a formidable hurdle to the practical deployment of edge LAMs. In this paper, we investigate the opportunities and challenges of edge LAMs from the perspectives of model decomposition and resource management. Specifically, we propose collaborative fine-tuning and full-parameter training frameworks, alongside a microservice-assisted inference architecture, to enhance the deployment of edge LAM over wireless networks. Additionally, we investigate the application of edge LAM in air-interface designs, focusing on channel prediction and beamforming. These innovative frameworks and applications offer valuable insights and solutions for advancing 6G technology.

[LG-25] FedEMA: Federated Exponential Moving Averaging with Negative Entropy Regularizer in Autonomous Driving

链接: https://arxiv.org/abs/2505.00318
作者: Wei-Bin Kou,Guangxu Zhu,Bingyang Cheng,Shuai Wang,Ming Tang,Yik-Chung Wu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:Street Scene Semantic Understanding (denoted as S3U) is a crucial but complex task for autonomous driving (AD) vehicles. Their inference models typically face poor generalization due to domain-shift. Federated Learning (FL) has emerged as a promising paradigm for enhancing the generalization of AD models through privacy-preserving distributed learning. However, these FL AD models face significant temporal catastrophic forgetting when deployed in dynamically evolving environments, where continuous adaptation causes abrupt erosion of historical knowledge. This paper proposes Federated Exponential Moving Average (FedEMA), a novel framework that addresses this challenge through two integral innovations: (I) server-side preservation of the model’s historical fitting capability, by fusing the current FL round’s aggregated model with an exponential moving average (EMA) model from previous FL rounds; (II) vehicle-side negative entropy regularization, to prevent FL models from overfitting to EMA-introduced temporal patterns. Together, these two strategies give FedEMA a dual-objective optimization that balances model generalization and adaptability. In addition, we conduct theoretical convergence analysis for the proposed FedEMA. Extensive experiments on both the Cityscapes and CamVid datasets demonstrate FedEMA’s superiority over existing approaches, showing a 7.12% higher mean Intersection-over-Union (mIoU).
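
A minimal sketch of the two innovations named above; the fusion coefficient and the exact combination rule are assumptions inferred from the abstract, not the paper's calibrated choices.

```python
import numpy as np

def server_update(theta_agg, theta_ema, beta=0.9):
    """Server side: fuse this FL round's aggregated model with the EMA of
    previous rounds to preserve historical fitting capability
    (beta = 0.9 is an illustrative assumption)."""
    return beta * theta_ema + (1.0 - beta) * theta_agg

def negative_entropy(logits):
    """Vehicle side: negative entropy of the softmax predictions; adding it
    to the task loss penalizes over-confident fits to EMA-induced patterns."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.mean(np.sum(p * np.log(p + 1e-12), axis=1))   # = -H(p) <= 0

rng = np.random.default_rng(9)
theta_ema = np.zeros(4)
for rnd in range(3):                        # three toy FL rounds
    theta_agg = np.ones(4) * (rnd + 1)      # stand-in for the FedAvg aggregate
    theta_ema = server_update(theta_agg, theta_ema)
    print(rnd, theta_ema)
print("neg-entropy:", negative_entropy(rng.normal(size=(8, 5))))
```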

[LG-26] Gateformer: Advancing Multivariate Time Series Forecasting through Temporal and Variate-Wise Attention with Gated Representations

链接: https://arxiv.org/abs/2505.00307
作者: Yu-Hsiang Lan,Anton Alyakin,Eric K. Oermann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There has been a recent surge of interest in time series modeling using the Transformer architecture. However, forecasting multivariate time series with Transformer presents a unique challenge as it requires modeling both temporal (cross-time) and variate (cross-variate) dependencies. While Transformer-based models have gained popularity for their flexibility in capturing both sequential and cross-variate relationships, it is unclear how to best integrate these two sources of information in the context of the Transformer architecture while optimizing for both performance and efficiency. We re-purpose the Transformer architecture to effectively model both cross-time and cross-variate dependencies. Our approach begins by embedding each variate independently into a variate-wise representation that captures its cross-time dynamics, and then models cross-variate dependencies through attention mechanisms on these learned embeddings. Gating operations in both cross-time and cross-variate modeling phases regulate information flow, allowing the model to focus on the most relevant features for accurate predictions. Our method achieves state-of-the-art performance across 13 real-world datasets and can be seamlessly integrated into other Transformer-based and LLM-based forecasters, delivering performance improvements up to 20.7% over original models. Code is available at this repository: this https URL.
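
The gating idea can be read as a learned, element-wise convex combination of the cross-time and cross-variate representations; the sketch below shows one generic formulation, which may differ from Gateformer's exact gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(h_time, h_var, W, b):
    """Gate regulating information flow between a cross-time representation
    and a cross-variate representation (both of shape (n_vars, d))."""
    g = sigmoid(np.concatenate([h_time, h_var], axis=-1) @ W + b)  # (n_vars, d)
    return g * h_time + (1.0 - g) * h_var

rng = np.random.default_rng(3)
n_vars, d = 7, 16
h_time = rng.normal(size=(n_vars, d))  # per-variate embedding of its history
h_var = rng.normal(size=(n_vars, d))   # output of cross-variate attention
W, b = rng.normal(size=(2 * d, d)) * 0.1, np.zeros(d)
print(gated_fusion(h_time, h_var, W, b).shape)   # (7, 16)
```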

[LG-27] Temporal Attention Evolutional Graph Convolutional Network for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2505.00302
作者: Xinlong Zhao,Liying Zhang,Tianbo Zou,Yan Zhang
类目: Machine Learning (cs.LG)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:Multivariate time series forecasting enables the prediction of future states by leveraging historical data, thereby facilitating decision-making processes. Each data node in a multivariate time series encompasses a sequence of multiple dimensions. These nodes exhibit interdependent relationships, forming a graph structure. While existing prediction methods often assume a fixed graph structure, many real-world scenarios involve dynamic graph structures. Moreover, interactions among time series observed at different time scales vary significantly. To enhance prediction accuracy by capturing precise temporal and spatial features, this paper introduces the Temporal Attention Evolutional Graph Convolutional Network (TAEGCN). This novel method not only integrates causal temporal convolution and a multi-head self-attention mechanism to learn temporal features of nodes, but also constructs a dynamic graph structure from these temporal features, keeping changes in spatial features consistent with the temporal series. TAEGCN adeptly captures temporal causal relationships and hidden spatial dependencies within the data. Furthermore, TAEGCN incorporates a unified neural network that seamlessly integrates these components to generate final predictions. Experimental results conducted on two public transportation network datasets, METR-LA and PEMS-BAY, demonstrate the superior performance of the proposed model.

[LG-28] Intelligent Task Scheduling for Microservices via A3C-Based Reinforcement Learning

链接: https://arxiv.org/abs/2505.00299
作者: Yang Wang,Tengda Tang,Zhou Fang,Yingnan Deng,Yifei Duan
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To address the challenges of high resource dynamism and intensive task concurrency in microservice systems, this paper proposes an adaptive resource scheduling method based on the A3C reinforcement learning algorithm. The scheduling problem is modeled as a Markov Decision Process, where policy and value networks are jointly optimized to enable fine-grained resource allocation under varying load conditions. The method incorporates an asynchronous multi-threaded learning mechanism, allowing multiple agents to perform parallel sampling and synchronize updates to the global network parameters. This design improves both policy convergence efficiency and model stability. In the experimental section, a real-world dataset is used to construct a scheduling scenario. The proposed method is compared with several typical approaches across multiple evaluation metrics, including task delay, scheduling success rate, resource utilization, and convergence speed. The results show that the proposed method delivers high scheduling performance and system stability in multi-task concurrent environments. It effectively alleviates the resource allocation bottlenecks faced by traditional methods under heavy load, demonstrating its practical value for intelligent scheduling in microservice systems.

[LG-29] Repetition Makes Perfect: Recurrent Sum-GNNs Match Message Passing Limit

链接: https://arxiv.org/abs/2505.00291
作者: Eran Rosenbluth,Martin Grohe
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We provide first tight bounds for the expressivity of Recurrent Graph Neural Networks (recurrent GNNs) with finite-precision parameters. We prove that recurrent GNNs, with sum aggregation and ReLU activation, can emulate any graph algorithm that respects the natural message-passing invariance induced by the color refinement (or Weisfeiler-Leman) algorithm. While it is well known that the expressive power of GNNs is limited by this invariance [Morris et al., AAAI 2019; Xu et al., ICLR 2019], we establish that recurrent GNNs can actually reach this limit. This is in contrast to non-recurrent GNNs, which have the power of Weisfeiler-Leman only in a very weak, “non-uniform”, sense where every graph size requires a different GNN model to compute with. The emulation we construct introduces only a polynomial overhead in both time and space. Furthermore, we show that by incorporating random initialization, recurrent GNNs can emulate all graph algorithms, implying in particular that any graph algorithm with polynomial-time complexity can be emulated by a recurrent GNN with random initialization, running in polynomial time.

[LG-30] Policies of Multiple Skill Levels for Better Strength Estimation in Games

链接: https://arxiv.org/abs/2505.00279
作者: Kyota Kuboki,Tatsuyoshi Ogawa,Chu-Hsuan Hsueh,Shi-Jim Yen,Kokolo Ikeda
类目: Machine Learning (cs.LG)
*备注: 25 pages, 15 figures

点击查看摘要

Abstract:Accurately estimating human skill levels is crucial for designing effective human-AI interactions so that AI can provide appropriate challenges or guidance. In games where AI players have beaten top human professionals, strength estimation plays a key role in adapting AI behavior to match human skill levels. In a previous state-of-the-art study, researchers have proposed a strength estimator trained using human players’ match data. Given some matches, the strength estimator computes strength scores and uses them to estimate player ranks (skill levels). In this paper, we focus on the observation that human players’ behavior tendency varies according to their strength and aim to improve the accuracy of strength estimation by taking this into account. Specifically, in addition to strength scores, we obtain policies for different skill levels from neural networks trained using human players’ match data. We then combine features based on these policies with the strength scores to estimate strength. We conducted experiments on Go and chess. For Go, our method achieved an accuracy of 80% in strength estimation when given 10 matches, which increased to 92% when given 20 matches. In comparison, the previous state-of-the-art method had an accuracy of 71% with 10 matches and 84% with 20 matches, demonstrating improvements of 8-9 percentage points. We observed similar improvements in chess. These results contribute to developing a more accurate strength estimation method and to improving human-AI interaction.

[LG-31] Field-scale soil moisture estimated from Sentinel-1 SAR data using a knowledge-guided deep learning approach

链接: https://arxiv.org/abs/2505.00265
作者: Yi Yu,Patrick Filippi,Thomas F. A. Bishop
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: Accepted by the 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025)

点击查看摘要

Abstract:Soil moisture (SM) estimation from active microwave data remains challenging due to the complex interactions between radar backscatter and surface characteristics. While the water cloud model (WCM) provides a semi-physical approach for understanding these interactions, its empirical component often limits performance across diverse agricultural landscapes. This research presents preliminary efforts for developing a knowledge-guided deep learning approach, which integrates WCM principles into a long short-term memory (LSTM) model, to estimate field SM using Sentinel-1 Synthetic Aperture Radar (SAR) data. Our proposed approach leverages LSTM’s capacity to capture spatiotemporal dependencies while maintaining physical consistency through a modified dual-component loss function, including a WCM-based semi-physical component and a boundary condition regularisation. The proposed approach is built upon the soil backscatter coefficients isolated from the total backscatter, together with Landsat-resolution vegetation information and surface characteristics. A four-fold spatial cross-validation was performed against in-situ SM data to assess the model performance. Results showed the proposed approach reduced SM retrieval uncertainties by 0.02 m^3/m^3 and achieved correlation coefficients (r) of up to 0.64 in areas with varying vegetation cover and surface conditions, demonstrating the potential to address the over-simplification in WCM.
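
A rough sketch of such a dual-component loss, with a deliberately simplified linear stand-in for the water cloud model; the coefficients, loss weights, and boundary limits are illustrative assumptions, not the paper's calibrated values.

```python
import numpy as np

def wcm_guided_loss(sm_pred, sm_obs, sigma_soil_db, a=0.1, b=0.5,
                    lam_phys=0.1, lam_bound=1.0, sm_min=0.02, sm_max=0.5):
    """Dual-component loss sketch: a data term against in-situ soil moisture,
    a semi-physical term tying predictions to the soil backscatter via a
    hypothetical linear WCM-style relation, and a boundary penalty keeping
    predictions inside physically plausible limits."""
    data = np.mean((sm_pred - sm_obs) ** 2)
    # Simplified WCM-style relation: soil backscatter ~ a + b * SM (in dB).
    phys = np.mean((sigma_soil_db - (a + b * sm_pred)) ** 2)
    bound = np.mean(np.clip(sm_min - sm_pred, 0, None) ** 2 +
                    np.clip(sm_pred - sm_max, 0, None) ** 2)
    return data + lam_phys * phys + lam_bound * bound

rng = np.random.default_rng(4)
sm_obs = rng.uniform(0.05, 0.4, size=32)
sm_pred = sm_obs + rng.normal(0, 0.02, size=32)
sigma_soil = 0.1 + 0.5 * sm_obs + rng.normal(0, 0.01, size=32)
print(wcm_guided_loss(sm_pred, sm_obs, sigma_soil))
```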

[LG-32] Graph Privacy: A Heterogeneous Federated GNN for Trans-Border Financial Data Circulation

链接: https://arxiv.org/abs/2505.00257
作者: Zhizhong Tan,Jiexin Zheng,Kevin Qi Zhang,Wenyong Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The sharing of external data has become a strong demand of financial institutions, but privacy concerns have made it difficult to interconnect different platforms and have kept the degree of data openness low. To effectively solve the privacy problem of financial data in trans-border flow and sharing, to ensure that the data is available but not visible, and to realize joint profiling across the heterogeneous data of business organizations in different industries, we propose a Heterogeneous Federated Graph Neural Network (HFGNN) approach. In this method, the distribution of heterogeneous business data of trans-border organizations is taken as subgraphs, and the sharing and circulation process among subgraphs is constructed as a statistically heterogeneous global graph through a central server. Each subgraph learns the corresponding personalized service model through local training to select and update the relevant subset of subgraphs with aggregated parameters, and effectively separates and combines topological and feature information among subgraphs. Finally, our simulation results show that the proposed method has higher accuracy and faster convergence speed than existing methods.

[LG-33] D-Tracker: Modeling Interest Diffusion in Social Activity Tensor Data Streams KDD2025

链接: https://arxiv.org/abs/2505.00242
作者: Shingo Higashiguchi,Yasuko Matsubara,Koki Kawabata,Taichi Murayama,Yasushi Sakurai
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: ACM SIGKDD 2025 (KDD2025)

点击查看摘要

Abstract:Large quantities of social activity data, such as weekly web search volumes and the number of new infections with infectious diseases, reflect people’s interests and activities. It is important to discover temporal patterns from such data and to forecast future activities accurately. However, modeling and forecasting social activity data streams is difficult because they are high-dimensional and composed of multiple time-varying dynamics such as trends, seasonality, and interest diffusion. In this paper, we propose D-Tracker, a method for continuously capturing time-varying temporal patterns within social activity tensor data streams and forecasting future activities. Our proposed method has the following properties: (a) Interpretable: it incorporates the partial differential equation into a tensor decomposition framework and captures time-varying temporal patterns such as trends, seasonality, and interest diffusion between locations in an interpretable manner; (b) Automatic: it has no hyperparameters and continuously models tensor data streams fully automatically; (c) Scalable: the computation time of D-Tracker is independent of the time series length. Experiments using web search volume data obtained from GoogleTrends, and COVID-19 infection data obtained from COVID-19 Open Data Repository show that our method can achieve higher forecasting accuracy in less computation time than existing methods while extracting the interest diffusion between locations. Our source code and datasets are available at this https URL.

[LG-34] Future-Oriented Navigation: Dynamic Obstacle Avoidance with One-Shot Energy-Based Multimodal Motion Prediction

链接: https://arxiv.org/abs/2505.00237
作者: Ze Zhang,Georg Hess,Junjie Hu,Emmanuel Dean,Lennart Svensson,Knut Åkesson
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Submitted to IEEE RA-L

点击查看摘要

Abstract:This paper proposes an integrated approach for the safe and efficient control of mobile robots in dynamic and uncertain environments. The approach consists of two key steps: one-shot multimodal motion prediction to anticipate motions of dynamic obstacles and model predictive control to incorporate these predictions into the motion planning process. Motion prediction is driven by an energy-based neural network that generates high-resolution, multi-step predictions in a single operation. The prediction outcomes are further utilized to create geometric shapes formulated as mathematical constraints. Instead of treating each dynamic obstacle individually, predicted obstacles are grouped by proximity in an unsupervised way to improve performance and efficiency. The overall collision-free navigation is handled by model predictive control with a specific design for proactive dynamic obstacle avoidance. The proposed approach allows mobile robots to navigate effectively in dynamic environments. Its performance is assessed across various scenarios that represent typical warehouse settings. The results demonstrate that the proposed approach outperforms other existing dynamic obstacle avoidance methods.

[LG-35] Node2Vec-DGI-EL: A Hierarchical Graph Representation Learning Model for Ingredient-Disease Association Prediction

链接: https://arxiv.org/abs/2505.00236
作者: Leifeng Zhang,Xin Dong,Shuaibing Jia,Jianhua Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional Chinese medicine, as an essential component of traditional medicine, contains active ingredients that serve as a crucial source for modern drug development, holding immense therapeutic potential and development value. A multi-layered, complex network linking Chinese medicine ingredients to diseases is constructed and used to predict the potential associations between Chinese medicine ingredients and diseases. This study proposes an ingredient-disease association prediction model (Node2Vec-DGI-EL) based on hierarchical graph representation learning. First, the model uses the Node2Vec algorithm to extract node embedding vectors from the network as the initial features of the nodes. Next, the network nodes are deeply represented and learned using the DGI algorithm to enhance the model’s expressive power. To improve prediction accuracy and robustness, an ensemble learning method is incorporated to achieve more accurate ingredient-disease association predictions. The effectiveness of the model is then evaluated through a series of theoretical verifications. The results demonstrated that the proposed model significantly outperformed existing methods, achieving an AUC of 0.9987 and an AUPR of 0.9545, thereby indicating superior predictive capability. Ablation experiments further revealed the contribution and importance of each module. Additionally, case studies explored potential associations, such as triptonide with hypertensive retinopathy and methyl ursolate with colorectal cancer. Molecular docking experiments validated these findings, showing that the triptonide-PGR and methyl ursolate-NFE2L2 interactions can form stable bindings. In conclusion, the Node2Vec-DGI-EL model focuses on TCM datasets and effectively predicts ingredient-disease associations, overcoming the reliance on node semantic information.

[LG-36] Generative Machine Learning in Adaptive Control of Dynamic Manufacturing Processes: A Review

链接: https://arxiv.org/abs/2505.00210
作者: Suk Ki Lee,Hyunwoong Ko
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
*备注: 12 pages, 1 figure, 1 table. This paper has been accepted for publication in the proceedings of ASME IDETC-CIE 2025

点击查看摘要

Abstract:Dynamic manufacturing processes exhibit complex characteristics defined by time-varying parameters, nonlinear behaviors, and uncertainties. These characteristics require sophisticated in-situ monitoring techniques utilizing multimodal sensor data and adaptive control systems that can respond to real-time feedback while maintaining product quality. Recently, generative machine learning (ML) has emerged as a powerful tool for modeling complex distributions and generating synthetic data while handling these manufacturing uncertainties. However, adopting these generative technologies in dynamic manufacturing systems lacks a functional control-oriented perspective to translate their probabilistic understanding into actionable process controls while respecting constraints. This review presents a functional classification of Prediction-Based, Direct Policy, Quality Inference, and Knowledge-Integrated approaches, offering a perspective for understanding existing ML-enhanced control systems and incorporating generative ML. The analysis of generative ML architectures within this framework demonstrates control-relevant properties and potential to extend current ML-enhanced approaches where conventional methods prove insufficient. We show generative ML’s potential for manufacturing control through decision-making applications, process guidance, simulation, and digital twins, while identifying critical research gaps: separation between generation and control functions, insufficient physical understanding of manufacturing phenomena, and challenges adapting models from other domains. To address these challenges, we propose future research directions aimed at developing integrated frameworks that combine generative ML and control technologies to address the dynamic complexities of modern manufacturing systems.

[LG-37] Mapping minds not averages: a scalable subject-specific manifold learning framework for neuroimaging data

链接: https://arxiv.org/abs/2505.00196
作者: Eloy Geenjaar,Vince Calhoun
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 20 pages, 6 figures

点击查看摘要

Abstract:Mental and cognitive representations are believed to reside on low-dimensional, non-linear manifolds embedded within high-dimensional brain activity. Uncovering these manifolds is key to understanding individual differences in brain function, yet most existing machine learning methods either rely on population-level spatial alignment or assume data that is temporally structured, either because data is aligned among subjects or because event timings are known. We introduce a manifold learning framework that can capture subject-specific spatial variations across both structured and temporally unstructured neuroimaging data. On simulated data and two naturalistic fMRI datasets (Sherlock and Forrest Gump), our framework outperforms group-based baselines by recovering more accurate and individualized representations. To test how the framework scales and generalizes, we apply it to a large, temporally unstructured resting-state fMRI dataset comprising individuals with schizophrenia and healthy controls, demonstrating that it scales efficiently to large populations and generalizes robustly to unseen subjects. The subject-specific spatial maps learned by our model reveal clinically relevant patterns, including increased activation in the basal ganglia, visual, auditory, and somatosensory regions, and decreased activation in the insula, inferior frontal gyrus, and angular gyrus. These findings suggest that our framework can uncover clinically relevant subject-specific brain activity patterns. Our approach thus provides a scalable and individualized framework for modeling brain activity, with applications in computational neuroscience and clinical research.

[LG-38] Algorithmic Collective Action with Two Collectives

链接: https://arxiv.org/abs/2505.00195
作者: Aditya Karan,Nicholas Vincent,Karrie Karahalios,Hari Sundaram
类目: Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Given that data-dependent algorithmic systems have become impactful in more domains of life, the need for individuals to promote their own interests and hold algorithms accountable has grown. To have meaningful influence, individuals must band together to engage in collective action. Groups that engage in such algorithmic collective action are likely to vary in size, membership characteristics, and crucially, objectives. In this work, we introduce a first-of-its-kind framework for studying collective action with two or more collectives that strategically behave to manipulate data-driven systems. With more than one collective acting on a system, unexpected interactions may occur. We use this framework to conduct experiments with language model-based classifiers and recommender systems where two collectives each attempt to achieve their own individual objectives. We examine how differing objectives, strategies, sizes, and homogeneity can impact a collective’s efficacy. We find that the unintentional interactions between collectives can be quite significant; a collective acting in isolation may be able to achieve their objective (e.g., improve classification outcomes for themselves or promote a particular item), but when a second collective acts simultaneously, the efficacy of the first group drops by as much as 75%. We find that, in the recommender system context, neither fully heterogeneous nor fully homogeneous collectives stand out as most efficacious and that heterogeneity’s impact is secondary compared to collective size. Our results signal the need for more transparency in both the underlying algorithmic models and the different behaviors individuals or collectives may take on these systems. This approach also allows collectives to hold algorithmic system developers accountable and provides a framework for people to actively use their own data to promote their own interests.

[LG-39] Chronic Diseases Prediction using Machine Learning and Deep Learning Methods

链接: https://arxiv.org/abs/2505.00189
作者: Houda Belhad,Asmae Bourbia,Salma Boughanja
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chronic diseases, such as cardiovascular disease, diabetes, chronic kidney disease, and thyroid disorders, are the leading causes of premature mortality worldwide. Early detection and intervention are crucial for improving patient outcomes, yet traditional diagnostic methods often fail due to the complex nature of these conditions. This study explores the application of machine learning (ML) and deep learning (DL) techniques to predict chronic disease and thyroid disorders. We used a variety of models, including Logistic Regression (LR), Random Forest (RF), Gradient Boosted Trees (GBT), Neural Networks (NN), Decision Trees (DT) and Naive Bayes (NB), to analyze and predict disease outcomes. Our methodology involved comprehensive data pre-processing, including handling missing values, categorical encoding, and feature aggregation, followed by model training and evaluation. Performance metrics such as precision, recall, accuracy, F1-score, and Area Under the Curve (AUC) were used to assess the effectiveness of each model. The results demonstrated that ensemble methods like Random Forest and Gradient Boosted Trees consistently outperformed the other models. Neural Networks also showed superior performance, particularly in capturing complex data patterns. The findings highlight the potential of ML and DL in revolutionizing chronic disease prediction, enabling early diagnosis and personalized treatment strategies. However, challenges remain, including data quality, model interpretability, and the need for advanced computational techniques, before these methods can be widely adopted in healthcare to improve patient outcomes and reduce the burden of chronic diseases. This study was conducted as part of a Big Data class project under the supervision of our professors Mr. Abderrahmane EZ-ZAHOUT and Mr. Abdessamad ESSAIDI.
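
The general workflow is easy to mirror with scikit-learn; the sketch below compares several of the listed model families on synthetic data using the same metrics, though not, of course, the study's datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for a pre-processed chronic-disease table.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "GBT": GradientBoostingClassifier(random_state=0),
    "NB": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    pred = model.predict(X_te)
    print(f"{name}: F1={f1_score(y_te, pred):.3f} "
          f"AUC={roc_auc_score(y_te, proba):.3f}")
```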

[LG-40] Stochastic Subspace Descent Accelerated via Bi-fidelity Line Search

链接: https://arxiv.org/abs/2505.00162
作者: Nuojin Cheng,Alireza Doostan,Stephen Becker
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Efficient optimization remains a fundamental challenge across numerous scientific and engineering domains, especially when objective function and gradient evaluations are computationally expensive. While zeroth-order optimization methods offer effective approaches when gradients are inaccessible, their practical performance can be limited by the high cost associated with function queries. This work introduces the bi-fidelity stochastic subspace descent (BF-SSD) algorithm, a novel zeroth-order optimization method designed to reduce this computational burden. BF-SSD leverages a bi-fidelity framework, constructing a surrogate model from a combination of computationally inexpensive low-fidelity (LF) and accurate high-fidelity (HF) function evaluations. This surrogate model facilitates an efficient backtracking line search for step size selection, for which we provide theoretical convergence guarantees under standard assumptions. We perform a comprehensive empirical evaluation of BF-SSD across four distinct problems: a synthetic optimization benchmark, dual-form kernel ridge regression, black-box adversarial attacks on machine learning models, and transformer-based black-box language model fine-tuning. Numerical results demonstrate that BF-SSD consistently achieves superior optimization performance while requiring significantly fewer HF function evaluations compared to relevant baseline methods. This study highlights the efficacy of integrating bi-fidelity strategies within zeroth-order optimization, positioning BF-SSD as a promising and computationally efficient approach for tackling large-scale, high-dimensional problems encountered in various real-world applications.

[LG-41] Kernel-Based Ensemble Gaussian Mixture Probability Hypothesis Density Filter

链接: https://arxiv.org/abs/2505.00131
作者: Dalton Durant,Renato Zanetti
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:In this work, a kernel-based Ensemble Gaussian Mixture Probability Hypothesis Density (EnGM-PHD) filter is presented for multi-target filtering applications. The EnGM-PHD filter combines the Gaussian-mixture-based techniques of the Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter with the particle-based techniques of the Sequential Monte Carlo Probability Hypothesis Density (SMC-PHD) filter. It achieves this by obtaining particles from the posterior intensity function, propagating them through the system dynamics, and then using Kernel Density Estimation (KDE) techniques to approximate the Gaussian mixture of the prior intensity function. This approach guarantees convergence to the true intensity function in the limit of the number of components. Moreover, in the special case of a single target with no births, deaths, clutter, and perfect detection probability, the EnGM-PHD filter reduces to the standard Ensemble Gaussian Mixture Filter (EnGMF). In the presented experiment, the results indicate that the EnGM-PHD filter achieves better multi-target filtering performance than both the GM-PHD and SMC-PHD filters while using the same number of components or particles.

[LG-42] From Lab to Wrist: Bridging Metabolic Monitoring and Consumer Wearables for Heart Rate and Oxygen Consumption Modeling

链接: https://arxiv.org/abs/2505.00101
作者: Barak Gahtan,Sanketh Vedula,Gil Samuelly Leichtag,Einat Kodesh,Alex M. Bronstein
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Understanding physiological responses during running is critical for performance optimization, tailored training prescriptions, and athlete health management. We introduce a comprehensive framework – what we believe to be the first capable of predicting instantaneous oxygen consumption (VO2) trajectories exclusively from consumer-grade wearable data. Our approach employs two complementary physiological models: (1) accurate modeling of heart rate (HR) dynamics via a physiologically constrained ordinary differential equation (ODE) and neural Kalman filter, trained on over 3 million HR observations, achieving 1-second interval predictions with mean absolute errors as low as 2.81 bpm (correlation 0.87); and (2) leveraging the principles of precise HR modeling, a novel VO2 prediction architecture requiring only the initial second of VO2 data for calibration, enabling robust, sequence-to-sequence metabolic demand estimation. Despite relying solely on smartwatch and chest-strap data, our method achieves mean absolute percentage errors of approximately 13%, effectively capturing rapid physiological transitions and steady-state conditions across diverse running intensities. Our synchronized dataset, complemented by blood lactate measurements, further lays the foundation for future noninvasive metabolic zone identification. By embedding physiological constraints within modern machine learning, this framework democratizes advanced metabolic monitoring, bridging laboratory-grade accuracy and everyday accessibility, thus empowering both elite athletes and recreational fitness enthusiasts.

[LG-43] An Inversion Theorem for Buffered Linear Toeplitz (BLT) Matrices and Applications to Streaming Differential Privacy

链接: https://arxiv.org/abs/2504.21413
作者: H. Brendan McMahan,Krishna Pillutla
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Buffered Linear Toeplitz (BLT) matrices are a family of parameterized lower-triangular matrices that play an important role in streaming differential privacy with correlated noise. Our main result is a BLT inversion theorem: the inverse of a BLT matrix is itself a BLT matrix with different parameters. We also present an efficient and differentiable O(d^3) algorithm to compute the parameters of the inverse BLT matrix, where d is the degree of the original BLT (typically d < 10). Our characterization enables direct optimization of BLT parameters for privacy mechanisms through automatic differentiation.

[LG-44] Bayes-Optimal Fair Classification with Multiple Sensitive Features

链接: https://arxiv.org/abs/2505.00631
作者: Yi Yang,Yinghui Huang,Xiangyu Chang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing theoretical work on Bayes-optimal fair classifiers usually considers a single (binary) sensitive feature. In practice, individuals are often defined by multiple sensitive features. In this paper, we characterize the Bayes-optimal fair classifier for multiple sensitive features under general approximate fairness measures, including mean difference and mean ratio. We show that these approximate measures for existing group fairness notions, including Demographic Parity, Equal Opportunity, Predictive Equality, and Accuracy Parity, are linear transformations of selection rates for specific groups defined by both labels and sensitive features. We then characterize that Bayes-optimal fair classifiers for multiple sensitive features become instance-dependent thresholding rules that rely on a weighted sum of these group membership probabilities. Our framework applies to both attribute-aware and attribute-blind settings and can accommodate composite fairness notions like Equalized Odds. Building on this, we propose two practical algorithms for Bayes-optimal fair classification via in-processing and post-processing. We show empirically that our methods compare favorably to existing methods.
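
The instance-dependent thresholding structure can be sketched as follows, with illustrative (not fitted) fairness weights applied to estimated group-membership probabilities.

```python
import numpy as np

def fair_threshold_classifier(eta, group_probs, weights, base_t=0.5):
    """Instance-dependent thresholding sketch: shift the Bayes threshold by a
    weighted sum of estimated group-membership probabilities, the general form
    the paper derives. The weights here are illustrative, not fitted values."""
    t = base_t + group_probs @ weights        # per-instance threshold
    return (eta > t).astype(int)

rng = np.random.default_rng(5)
n, n_groups = 10, 3
eta = rng.uniform(size=n)                               # P(Y=1 | X=x)
group_probs = rng.dirichlet(np.ones(n_groups), size=n)  # P(A=a | X=x)
weights = np.array([0.05, -0.05, 0.0])  # hypothetical fairness adjustments
print(fair_threshold_classifier(eta, group_probs, weights))
```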

[LG-45] SA-GAT-SR: Self-Adaptable Graph Attention Networks with Symbolic Regression for high-fidelity material property prediction

链接: https://arxiv.org/abs/2505.00625
作者: Liu Junchi,Tang Ying,Tretiak Sergei,Duan Wenhui,Zhou Liujiang
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in machine learning have demonstrated the enormous utility of deep learning approaches, particularly Graph Neural Networks (GNNs), for materials science. These methods have emerged as powerful tools for high-throughput prediction of material properties, offering a compelling enhancement and alternative to traditional first-principles calculations. While the community has predominantly focused on developing increasingly complex and universal models to enhance predictive accuracy, such approaches often lack physical interpretability and insights into materials behavior. Here, we introduce a novel computational paradigm, Self-Adaptable Graph Attention Networks integrated with Symbolic Regression (SA-GAT-SR), that synergistically combines the predictive capability of GNNs with the interpretative power of symbolic regression. Our framework employs a self-adaptable encoding algorithm that automatically identifies and adjusts attention weights so as to screen critical features from an expansive 180-dimensional feature space while maintaining O(n) computational scaling. The integrated SR module subsequently distills these features into compact analytical expressions that explicitly reveal quantum-mechanically meaningful relationships, achieving a 23-fold acceleration compared to conventional SR implementations that rely heavily on features derived from first-principles calculations as input. This work suggests a new framework for computational materials science, bridging the gap between predictive accuracy and physical interpretability, and offering valuable physical insights into material behavior.

[LG-46] Transition State Energies from Machine Learning: An Application to Reverse Water-Gas Shift on Single-Atom Alloys

链接: https://arxiv.org/abs/2505.00574
作者: Raffaele Cheula,Mie Andersen
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Obtaining accurate transition state (TS) energies is a bottleneck in computational screening of complex materials and reaction networks due to the high cost of TS search methods and first-principles methods such as density functional theory (DFT). Here we propose a machine learning (ML) model for predicting TS energies based on Gaussian process regression with the Wasserstein Weisfeiler-Lehman graph kernel (WWL-GPR). Applying the model to predict adsorption and TS energies for the reverse water-gas shift (RWGS) reaction on single-atom alloy (SAA) catalysts, we show that it can significantly improve the accuracy compared to traditional approaches based on scaling relations or ML models without a graph representation. Further benefitting from the low cost of model training, we train an ensemble of WWL-GPR models to obtain uncertainties through subsampling of the training data and show how these uncertainties propagate to turnover frequency (TOF) predictions through the construction of an ensemble of microkinetic models. Comparing the errors in model-based vs DFT-based TOF predictions, we show that the WWL-GPR model reduces errors by almost an order of magnitude compared to scaling relations. This demonstrates the critical impact of accurate energy predictions on catalytic activity estimation. Finally, we apply our model to screen new materials, identifying promising catalysts for RWGS. This work highlights the power of combining advanced ML techniques with DFT and microkinetic modeling for screening catalysts for complex reactions like RWGS, providing a robust framework for future catalyst design.

[LG-47] Hypothesis-free discovery from epidemiological data by automatic detection and local inference for tree-based nonlinearities and interactions

链接: https://arxiv.org/abs/2505.00571
作者: Giorgio Spadaccini,Marjolein Fokkema,Mark A. van de Wiel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Main body: 29 pages, 7 figures; Supplementary material: 39 pages, 14 figures

点击查看摘要

Abstract:In epidemiological settings, Machine Learning (ML) is gaining popularity for hypothesis-free discovery of risk (or protective) factors. Although ML is strong at discovering non-linearities and interactions, this power is currently compromised by a lack of reliable inference. Although local measures of feature effect can be combined with tree ensembles, uncertainty quantifications for these measures remain only partially available and oftentimes unsatisfactory. We propose RuleSHAP, a framework for using rule-based, hypothesis-free discovery that combines sparse Bayesian regression, tree ensembles and Shapley values in a one-step procedure that both detects and tests complex patterns at the individual level. To ease computation, we derive a formula that computes marginal Shapley values more efficiently for our setting. We demonstrate the validity of our framework on simulated data. To illustrate, we apply our machinery to data from an epidemiological cohort to detect and infer several effects for high cholesterol and blood pressure, such as nonlinear interaction effects between features like age, sex, ethnicity, BMI and glucose level.

[LG-48] Pre-Training Estimators for Structural Models: Application to Consumer Search

链接: https://arxiv.org/abs/2505.00526
作者: Yanhao ‘Max’ Wei,Zhenling Jiang
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We explore pretraining estimators for structural econometric models. The estimator is “pretrained” in the sense that the bulk of the computational cost and researcher effort occur during the construction of the estimator. Subsequent applications of the estimator to different datasets require little computational cost or researcher effort. The estimation leverages a neural net to recognize the structural model’s parameter from data patterns. As an initial trial, this paper builds a pretrained estimator for a sequential search model that is known to be difficult to estimate. We evaluate the pretrained estimator on 14 real datasets. The estimation takes seconds to run and shows high accuracy. We provide the estimator at this http URL. More generally, pretrained, off-the-shelf estimators can make structural models more accessible to researchers and practitioners.

[LG-49] Over-the-Air Inference over Multi-hop MIMO Networks

链接: https://arxiv.org/abs/2505.00430
作者: Chenghong Bian,Meng Hua,Deniz Gunduz
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages

点击查看摘要

Abstract:A novel over-the-air machine learning framework over multi-hop multiple-input and multiple-output (MIMO) networks is proposed. The core idea is to imitate fully connected (FC) neural network layers using multiple MIMO channels by carefully designing the precoding matrices at the transmitting nodes. A neural network dubbed PrototypeNet is employed consisting of multiple FC layers, with the number of neurons of each layer equal to the number of antennas of the corresponding terminal. To achieve satisfactory performance, we train PrototypeNet based on a customized loss function consisting of classification error and the power of latent vectors to satisfy transmit power constraints, with noise injection during training. Precoding matrices for each hop are then obtained by solving an optimization problem. We also propose a multiple-block extension when the number of antennas is limited. Numerical results verify that the proposed over-the-air transmission scheme can achieve satisfactory classification accuracy under a power constraint. The results also show that higher classification accuracy can be achieved with an increasing number of hops at a modest signal-to-noise ratio (SNR).

[LG-50] Statistical Learning for Heterogeneous Treatment Effects: Pretraining, Prognosis, and Prediction

链接: https://arxiv.org/abs/2505.00310
作者: Maximilian Schuessler,Erik Sverdrup,Robert Tibshirani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Robust estimation of heterogeneous treatment effects is a fundamental challenge for optimal decision-making in domains ranging from personalized medicine to educational policy. In recent years, predictive machine learning has emerged as a valuable toolbox for causal estimation, enabling more flexible effect estimation. However, accurately estimating conditional average treatment effects (CATE) remains a major challenge, particularly in the presence of many covariates. In this article, we propose pretraining strategies that leverage a phenomenon in real-world applications: factors that are prognostic of the outcome are frequently also predictive of treatment effect heterogeneity. In medicine, for example, components of the same biological signaling pathways frequently influence both baseline risk and treatment response. Specifically, we demonstrate our approach within the R-learner framework, which estimates the CATE by solving individual prediction problems based on a residualized loss. We use this structure to incorporate “side information” and develop models that can exploit synergies between risk prediction and causal effect estimation. In settings where these synergies are present, this cross-task learning enables more accurate signal detection: it yields lower estimation error, reduced false discovery rates, and higher power for detecting heterogeneity.
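
The sketch below shows the base R-learner with cross-fitted nuisances, the framework on which the proposed pretraining strategies build; the data-generating process, the nuisance models, and the linear CATE model are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(6)
n, p = 2000, 5
X = rng.normal(size=(n, p))
e = 1 / (1 + np.exp(-X[:, 0]))            # propensity depends on X0
W = rng.binomial(1, e)                    # treatment assignment
tau = 0.5 * X[:, 1]                       # true heterogeneous effect
Y = X[:, 0] + X[:, 1] + tau * W + rng.normal(size=n)

# Cross-fitted nuisances: m(x) = E[Y|X] and e(x) = E[W|X].
m_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, Y, cv=3)
e_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, W, cv=3)

# R-learner with a linear tau(x): regressing the Y-residual on features
# scaled by the W-residual minimizes the residualized loss.
Yr, Wr = Y - m_hat, W - e_hat
tau_model = Ridge().fit(X * Wr[:, None], Yr)
tau_hat = X @ tau_model.coef_
print(np.corrcoef(tau_hat, tau)[0, 1])    # should be clearly positive
```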

[LG-51] Reinforcement Learning with Continuous Actions Under Unmeasured Confounding

链接: https://arxiv.org/abs/2505.00304
作者: Yuhan Li,Eugene Han,Yifan Hu,Wenzhuo Zhou,Zhengling Qi,Yifan Cui,Ruoqing Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper addresses the challenge of offline policy learning in reinforcement learning with continuous action spaces when unmeasured confounders are present. While most existing research focuses on policy evaluation within partially observable Markov decision processes (POMDPs) and assumes discrete action spaces, we advance this field by establishing a novel identification result to enable the nonparametric estimation of policy value for a given target policy under an infinite-horizon framework. Leveraging this identification, we develop a minimax estimator and introduce a policy-gradient-based algorithm to identify the in-class optimal policy that maximizes the estimated policy value. Furthermore, we provide theoretical results regarding the consistency, finite-sample error bound, and regret bound of the resulting optimal policy. Extensive simulations and a real-world application using the German Family Panel data demonstrate the effectiveness of our proposed methodology.

[LG-52] A Unifying Framework for Robust and Efficient Inference with Unstructured Data

链接: https://arxiv.org/abs/2505.00282
作者: Jacob Carlson,Melissa Dell
类目: Econometrics (econ.EM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a general framework for conducting efficient and robust inference on parameters derived from unstructured data, which include text, images, audio, and video. Economists have long incorporated data extracted from texts and images into their analyses, a practice that has accelerated with advancements in deep neural networks. However, neural networks do not generically produce unbiased predictions, potentially propagating bias to estimators that use their outputs. To address this challenge, we reframe inference with unstructured data as a missing structured data problem, where structured data are imputed from unstructured inputs using deep neural networks. This perspective allows us to apply classic results from semiparametric inference, yielding valid, efficient, and robust estimators based on unstructured data. We formalize this approach with MARS (Missing At Random Structured Data), a unifying framework that integrates and extends existing methods for debiased inference using machine learning predictions, linking them to a variety of older, familiar problems such as causal inference. We develop robust and efficient estimators for both descriptive and causal estimands and address challenges such as inference using aggregated and transformed predictions from unstructured data. Importantly, MARS applies to common empirical settings that have received limited attention in the existing literature. Finally, we reanalyze prominent studies that use unstructured data, demonstrating the practical value of MARS.

[LG-53] Explorative Curriculum Learning for Strongly Correlated Electron Systems

链接: https://arxiv.org/abs/2505.00233
作者: Kimihiro Yamazaki,Takuya Konishi,Yoshinobu Kawahara
类目: rongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in neural network quantum states (NQS) have enabled high-accuracy predictions for complex quantum many-body systems such as strongly correlated electron systems. However, the computational cost remains prohibitive, making exploration of the diverse parameters of interaction strengths and other physical parameters inefficient. While transfer learning has been proposed to mitigate this challenge, achieving generalization to large-scale systems and diverse parameter regimes remains difficult. To address this limitation, we propose a novel curriculum learning framework based on transfer learning for NQS. This facilitates efficient and stable exploration across a vast parameter space of quantum many-body systems. In addition, by interpreting NQS transfer learning through a perturbative lens, we demonstrate how prior physical knowledge can be flexibly incorporated into the curriculum learning process. We also propose Pairing-Net, an architecture to practically implement this strategy for strongly correlated electron systems, and empirically verify its effectiveness. Our results show an approximately 200-fold speedup in computation and a marked improvement in optimization stability compared to conventional methods.

[LG-54] Inference for max-linear Bayesian networks with noise

链接: https://arxiv.org/abs/2505.00229
作者: Mark Adams,Kamillo Ferry,Ruriko Yoshida
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注: 18 pages, 10 figures. Short version to appear in the proceedings of the 13th Workshop on Uncertainty Processing

点击查看摘要

Abstract:Max-Linear Bayesian Networks (MLBNs) provide a powerful framework for causal inference in extreme-value settings; we consider MLBNs with noise parameters and a given topology, expressed in terms of the max-plus algebra by taking logarithms. Then, we show that an estimator of the parameter for each edge in a directed acyclic graph (DAG) is normally distributed. We end this paper with computational experiments using the expectation-maximization (EM) algorithm and quadratic optimization.

[LG-55] Toward Practical Quantum Machine Learning: A Novel Hybrid Quantum LSTM for Fraud Detection

链接: https://arxiv.org/abs/2505.00137
作者: Rushikesh Ubale,Sujan K.K.,Sangram Deshpande,Gregory T. Byrd
类目: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 11 pages, 8 figures

点击查看摘要

Abstract:We present a novel hybrid quantum-classical neural network architecture for fraud detection that integrates a classical Long Short-Term Memory (LSTM) network with a variational quantum circuit. By leveraging quantum phenomena such as superposition and entanglement, our model enhances the feature representation of sequential transaction data, capturing complex non-linear patterns that are challenging for purely classical models. A comprehensive data preprocessing pipeline is employed to clean, encode, balance, and normalize a credit card fraud dataset, ensuring a fair comparison with baseline models. Notably, our hybrid approach achieves per-epoch training times in the range of 45-65 seconds, which is significantly faster than similar architectures reported in the literature, where training typically requires several minutes per epoch. Both classical and quantum gradients are jointly optimized via a unified backpropagation procedure employing the parameter-shift rule for the quantum parameters. Experimental evaluations demonstrate competitive improvements in accuracy, precision, recall, and F1 score relative to a conventional LSTM baseline. These results underscore the promise of hybrid quantum-classical techniques in advancing the efficiency and performance of fraud detection systems. Keywords: Hybrid Quantum-Classical Neural Networks, Quantum Computing, Fraud Detection, Hybrid Quantum LSTM, Variational Quantum Circuit, Parameter-Shift Rule, Financial Risk Analysis
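
Among the listed components, the parameter-shift rule is the easiest to demonstrate in isolation: for a Pauli-rotation parameter it recovers the exact gradient from two shifted circuit evaluations. The single-qubit cosine expectation below is a stand-in for a full variational circuit.

```python
import numpy as np

def expectation(theta):
    """Stand-in for a variational-circuit expectation value <Z>(theta);
    for a single-qubit RY rotation on |0>, this is cos(theta)."""
    return np.cos(theta)

def parameter_shift_grad(f, theta, shift=np.pi / 2):
    """Parameter-shift rule: exact gradient of a Pauli-rotation parameter
    from two shifted evaluations, with no finite-difference error."""
    return 0.5 * (f(theta + shift) - f(theta - shift))

theta = 0.3
print(parameter_shift_grad(expectation, theta))   # ~ -sin(0.3)
print(-np.sin(theta))                             # analytic gradient, matches
```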

[LG-56] On the expressivity of deep Heaviside networks

链接: https://arxiv.org/abs/2505.00110
作者: Insung Kong,Juntong Chen,Sophie Langer,Johannes Schmidt-Hieber
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 61 pages, 16 figures

点击查看摘要

Abstract:We show that deep Heaviside networks (DHNs) have limited expressiveness but that this can be overcome by including either skip connections or neurons with linear activation. We provide lower and upper bounds for the Vapnik-Chervonenkis (VC) dimensions and approximation rates of these network classes. As an application, we derive statistical convergence rates for DHN fits in the nonparametric regression model.
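下面用一段极简的 numpy 代码直观对比纯 Heaviside 前向传播与加入线性捷径的变体(后者把预激活值线性地加回,保留连续信息,对应摘要中"跳跃连接或线性激活神经元"的补救思路);权重与网络结构均为随机示例,仅作说明,并非论文中的构造。

```python
# 极简示意:Heaviside 激活的深层网络,skip=True 时叠加线性捷径。
import numpy as np

def heaviside(z):
    return (z > 0).astype(float)

def forward(x, weights, biases, skip=False):
    """纯 Heaviside 前向传播;skip=True 时每层额外加回预激活值 z。"""
    h = x
    for w, b in zip(weights, biases):
        z = h @ w + b
        h = heaviside(z) + (z if skip else 0.0)  # 捷径保留了被阶跃函数丢弃的线性信息
    return h

rng = np.random.default_rng(0)
ws = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8))]
bs = [rng.normal(size=8), rng.normal(size=8)]
print(forward(rng.normal(size=(2, 4)), ws, bs, skip=True).shape)  # (2, 8)
```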

[LG-57] Can a Quantum Support Vector Machine algorithm be utilized to identify Key Biomarkers from Multi-Omics data of COVID19 patients?

链接: https://arxiv.org/abs/2505.00037
作者: Junggu Choi,Chansu Yu,Kyle L. Jung,Suan-Sin Foo,Weiqiang Chen,Suzy AA Comhair,Serpil C. Erzurum,Lara Jehi,Jae U. Jung
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 70 pages, 6 figures

点击查看摘要

Abstract:Identifying key biomarkers for COVID-19 from high-dimensional multi-omics data is critical for advancing both diagnostic and pathogenesis research. In this study, we evaluated the applicability of the Quantum Support Vector Machine (QSVM) algorithm for biomarker-based classification of COVID-19. Proteomic and metabolomic biomarkers from two independent datasets were ranked by importance using ridge regression and grouped accordingly. The top- and bottom-ranked biomarker sets were then used to train and evaluate both classical SVM (CSVM) and QSVM models, serving as predictive and negative control inputs, respectively. The QSVM was implemented with multiple quantum kernels, including amplitude encoding, angle encoding, the ZZ feature map, and the projected quantum kernel. Across various experimental settings, QSVM consistently achieved classification performance comparable to, and in some settings exceeding, that of CSVM, consistent with the importance rankings produced by ridge regression. Although the experiments were conducted via numerical simulation, our findings highlight the potential of QSVM as a promising approach for multi-omics data analysis in biomedical research.
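作为示意,下面给出一个可在经典模拟中闭式计算的角度编码(angle encoding)保真度核,并配合 scikit-learn 的预计算核 SVM 使用:单比特 RY 编码下核为 k(x, y) = ∏ cos²((x_i − y_i)/2)。数据为随机生成,仅演示接口,并非论文中的实验设置;论文还比较了 ZZ 特征映射等其他量子核。

```python
# 极简示意:角度编码保真度核 + 预计算核 SVM(经典数值模拟)。
import numpy as np
from sklearn.svm import SVC

def angle_encoding_kernel(A, B):
    """A: (n, d), B: (m, d);返回 (n, m) 保真度核矩阵 ∏ cos²((x_i - y_i) / 2)。"""
    diff = A[:, None, :] - B[None, :, :]
    return np.prod(np.cos(diff / 2.0) ** 2, axis=-1)

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40, 5)), rng.integers(0, 2, 40)
X_test = rng.normal(size=(10, 5))

clf = SVC(kernel="precomputed").fit(angle_encoding_kernel(X_train, X_train), y_train)
pred = clf.predict(angle_encoding_kernel(X_test, X_train))  # 测试核:测试样本 × 训练样本
```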

信息检索

[IR-0] Efficient Recommendation with Millions of Items by Dynamic Pruning of Sub-Item Embeddings SIGIR2025

链接: https://arxiv.org/abs/2505.00560
作者: Aleksandr V. Petrov,Craig Macdonald,Nicola Tonellotto
类目: Information Retrieval (cs.IR)
*备注: Accepted as a full research paper at SIGIR 2025

点击查看摘要

Abstract:A large item catalogue is a major challenge for deploying modern sequential recommender models, since it makes the memory footprint of the model large and increases inference latency. One promising approach to address this is RecJPQ, which replaces item embeddings with sub-item embeddings. However, slow inference remains problematic because finding the top-K highest-scored items usually requires scoring all items in the catalogue, which may not be feasible for large catalogues. By adapting dynamic pruning concepts from document retrieval, we propose the RecJPQPrune dynamic pruning algorithm to efficiently find the top-K highest-scored items without computing the scores of all items in the catalogue. Our RecJPQPrune algorithm is safe-up-to-rank K since it theoretically guarantees that no potentially high-scored item is excluded from the final top K recommendation list, thereby ensuring no impact on effectiveness. Our experiments on two large datasets and three recommendation models demonstrate the efficiency achievable using RecJPQPrune: for instance, on the Tmall dataset with 2.2M items, we can reduce the median model scoring time by a factor of 64 compared to the Transformer Default baseline, and by a factor of 5.3 compared to a recent scoring approach called PQTopK. Overall, this paper demonstrates the effective and efficient inference of Transformer-based recommendation models at catalogue scales not previously reported in the literature. Indeed, our RecJPQPrune algorithm can score 2 million items in under 10 milliseconds without GPUs, and without relying on Approximate Nearest Neighbour (ANN) techniques.
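下面用一段极简 Python 代码示意"安全剪枝至第 K 名"(safe-up-to-rank-K)的一般思想:按分数上界降序处理物品,一旦上界不超过当前第 K 名得分即可安全终止;这只是文档检索式动态剪枝的通用示意,并非 RecJPQPrune 的具体实现,`upper_bound`、`exact_score` 均为假设接口。

```python
# 极简示意:基于分数上界的安全 top-K 剪枝。
import heapq

def topk_with_pruning(items, upper_bound, exact_score, k):
    """按上界降序处理;上界一旦不超过当前第 K 名得分即可安全终止。"""
    heap = []  # 最小堆,保存当前 top-K 的 (score, item)
    for item in sorted(items, key=upper_bound, reverse=True):
        if len(heap) == k and upper_bound(item) <= heap[0][0]:
            break                      # 其余物品上界更低,跳过不影响 top-K 结果
        score = exact_score(item)      # 只有可能进入 top-K 的物品才被精确打分
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))
    return sorted(heap, reverse=True)

# 用法示意:上界取精确得分加一点松弛
scores = {0: 3.0, 1: 9.5, 2: 1.2, 3: 7.8, 4: 5.1}
print(topk_with_pruning(items=list(scores),
                        upper_bound=lambda i: scores[i] + 0.5,
                        exact_score=lambda i: scores[i],
                        k=2))
```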

附件下载

点击下载今日全部论文列表