This post contains the latest paper listings retrieved from Arxiv.org on 2025-04-25. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is retrieved from Arxiv.org and updated automatically at around 12:00 every day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-04-25)

A total of 419 new papers today, including:

  • Natural Language Processing: 44 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 105 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 87 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 111 papers (Machine Learning, cs.LG)

Natural Language Processing

[NLP-0] The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

【Quick Read】: The paper addresses how to extend the long-context capabilities of Transformer large language models (LLMs), focusing on the viability of sparse attention, its efficiency-accuracy trade-offs, and a systematic study of its scaling behavior. The key contribution is a comprehensive experimental analysis of training-free sparse attention methods across model scales, sequence lengths, and sparsity levels, which characterizes the performance and limitations of sparse attention on very long-sequence tasks and introduces novel scaling laws tailored to sparse attention. The results indicate that sparse attention is an important tool for enhancing Transformer LLMs' ability to process long sequences, but applying it requires careful evaluation of the trade-offs in performance-sensitive tasks.

Link: https://arxiv.org/abs/2504.17768
Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti
Affiliations: University of Edinburgh; Cohere; Meta
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its viability, its efficiency-accuracy trade-offs, and systematic scaling studies remain unexplored. To address this gap, we perform a careful comparison of training-free sparse attention methods at varying model scales, sequence lengths, and sparsity levels on a diverse collection of long-sequence tasks-including novel ones that rely on natural language while remaining controllable and easy to evaluate. Based on our experiments, we report a series of key findings: 1) an isoFLOPS analysis reveals that for very long sequences, larger and highly sparse models are preferable to smaller and dense ones. 2) The level of sparsity attainable while statistically guaranteeing accuracy preservation is higher during decoding than prefilling, and correlates with model size in the former. 3) There is no clear strategy that performs best across tasks and phases, with different units of sparsification or budget adaptivity needed for different scenarios. Even moderate sparsity levels often result in significant performance degradation on at least one task, highlighting that sparse attention is not a universal solution. 4) We introduce and validate novel scaling laws specifically tailored for sparse attention, providing evidence that our findings are likely to hold true beyond our range of experiments. Through these insights, we demonstrate that sparse attention is a key tool to enhance the capabilities of Transformer LLMs for processing longer sequences, but requires careful evaluation of trade-offs for performance-sensitive applications.
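
The training-free methods compared above all impose some sparsity pattern on the attention matrix at inference time. As a rough illustration of the general idea (not any specific method from the paper), the following sketch keeps only the top-k key positions per query; the shapes and the k budget are illustrative assumptions.

```python
import torch

def topk_sparse_attention(q, k, v, top_k=64):
    """Toy training-free sparse attention: each query attends to its top-k keys.

    q, k, v: (batch, heads, seq_len, head_dim); `top_k` is an illustrative budget.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale   # (B, H, Lq, Lk)
    top_k = min(top_k, scores.shape[-1])
    # Mask everything below each query's k-th largest score before softmax
    # (ties at the threshold may keep slightly more than k entries).
    kth = scores.topk(top_k, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return torch.matmul(scores.softmax(dim=-1), v)

if __name__ == "__main__":
    q = torch.randn(1, 2, 128, 32)
    k = torch.randn(1, 2, 128, 32)
    v = torch.randn(1, 2, 128, 32)
    print(topk_sparse_attention(q, k, v, top_k=16).shape)  # torch.Size([1, 2, 128, 32])
```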

[NLP-1] Conversational Assistants to support Heart Failure Patients: comparing a Neurosymbolic Architecture with ChatGPT

【Quick Read】: The paper asks how controlled evaluations can reveal the advantages and disadvantages of conversational assistants built on traditional versus generative-AI architectures in a real application setting. The study focuses on healthcare, specifically heart failure patients asking about the salt content of food. The key contribution is a within-group user study with real stakeholders that compares an in-house system built on a neurosymbolic architecture against a ChatGPT-based system, quantifying differences in accuracy, task completion, verbosity, and speech errors, and analyzing patients' subjective preferences, to comprehensively assess both systems.

Link: https://arxiv.org/abs/2504.17753
Authors: Anuja Tayal, Devika Salunke, Barbara Di Eugenio, Paula Allen-Meares, Eulalia Puig Abril, Olga Garcia, Carolyn Dickens, Andrew Boyd
Affiliations: Department of Computer Science; Department of Biomedical and Health Information Sciences; Department of Medicine; Department of Communications
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Conversational assistants are becoming more and more popular, including in healthcare, partly because of the availability and capabilities of Large Language Models. There is a need for controlled, probing evaluations with real stakeholders which can highlight advantages and disadvantages of more traditional architectures and those based on generative AI. We present a within-group user study to compare two versions of a conversational assistant that allows heart failure patients to ask about salt content in food. One version of the system was developed in-house with a neurosymbolic architecture, and one is based on ChatGPT. The evaluation shows that the in-house system is more accurate, completes more tasks and is less verbose than the one based on ChatGPT; on the other hand, the one based on ChatGPT makes fewer speech errors and requires fewer clarifications to complete the task. Patients show no preference for one over the other.

[NLP-2] Multilingual Performance Biases of Large Language Models in Education

【Quick Read】: The paper evaluates the suitability of large language models (LLMs) for educational use in non-English settings, assessing their effectiveness on educational tasks such as identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations. The key finding, from experiments in six non-English languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) plus English, is that task performance roughly tracks how well a language is represented in training data, with markedly worse results for lower-resource languages; although the models perform reasonably in most languages, the gap between English and the other languages is significant. The paper therefore recommends that practitioners verify that an LLM works well in the target language for their educational task before deployment.

Link: https://arxiv.org/abs/2504.17720
Authors: Vansh Gupta, Sankalan Pal Chowdhury, Vilém Zouhar, Donya Rooein, Mrinmaya Sachan
Affiliations: ETH Zurich; Bocconi University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly being adopted in educational settings. These applications expand beyond English, though current LLMs remain primarily English-centric. In this work, we ascertain if their use in education settings in non-English languages is warranted. We evaluated the performance of popular LLMs on four educational tasks: identifying student misconceptions, providing targeted feedback, interactive tutoring, and grading translations in six languages (Hindi, Arabic, Farsi, Telugu, Ukrainian, Czech) in addition to English. We find that the performance on these tasks somewhat corresponds to the amount of language represented in training data, with lower-resource languages having poorer task performance. Although the models perform reasonably well in most languages, the frequent performance drop from English is significant. Thus, we recommend that practitioners first verify that the LLM works well in the target language for their educational task before deployment.

[NLP-3] Safety in Large Reasoning Models: A Survey

【Quick Read】: This survey addresses the growing safety concerns and risks facing large reasoning models (LRMs) in real-world applications by systematically reviewing newly emerged safety threats, attacks, and defense strategies, and organizing them into a clear, structured taxonomy. The key contribution is a comprehensive survey that gives researchers and developers a clear picture of the current safety landscape, supporting future work to improve the security and reliability of these models and their real-world deployment.

Link: https://arxiv.org/abs/2504.17704
Authors: Cheng Wang, Yue Liu, Baolong Li, Duzhen Zhang, Zhongzhi Li, Junfeng Fang
Affiliations: National University of Singapore; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents a comprehensive survey of LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these powerful models.

[NLP-4] Ensemble Bayesian Inference: Leveraging Small Language Models to Achieve LLM-level Accuracy in Profile Matching Tasks

【Quick Read】: The paper explores whether ensembles of small language models (SLMs) can match the accuracy of proprietary large language models (LLMs), proposing a novel method called Ensemble Bayesian Inference (EBI). The key idea of EBI is to combine the judgments of multiple SLMs via Bayesian estimation, allowing the ensemble to exceed the performance limits of any individual model. The study validates EBI on diverse tasks (aptitude assessments and consumer profile analysis in both Japanese and English), analyzes cases where adding models with negative Lift values still improves overall ensemble performance, and examines how the method behaves across languages. These findings suggest new possibilities for building high-performance AI systems under limited computational resources while making effective use of individually weaker models.

Link: https://arxiv.org/abs/2504.17685
Authors: Haru-Tada Sato, Fuka Matsuzaki, Jun-ichiro Takahashi
Affiliations: Department of Data Science, i’s Factory Corporation, Ltd.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures

Abstract:This study explores the potential of small language model(SLM) ensembles to achieve accuracy comparable to proprietary large language models (LLMs). We propose Ensemble Bayesian Inference (EBI), a novel approach that applies Bayesian estimation to combine judgments from multiple SLMs, allowing them to exceed the performance limitations of individual models. Our experiments on diverse tasks(aptitude assessments and consumer profile analysis in both Japanese and English) demonstrate EBI’s effectiveness. Notably, we analyze cases where incorporating models with negative Lift values into ensembles improves overall performance, and we examine the method’s efficacy across different languages. These findings suggest new possibilities for constructing high-performance AI systems with limited computational resources and for effectively utilizing models with individually lower performance. Building on existing research on LLM performance evaluation, ensemble methods, and open-source LLM utilization, we discuss the novelty and significance of our approach.
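
The core of combining several weak classifiers with Bayesian estimation is accumulating per-model likelihood evidence. A minimal naive-Bayes-style sketch for binary judgments follows; the paper's actual EBI weighting and its Lift analysis may differ, and the confusion numbers below are invented for illustration.

```python
import math

def ensemble_bayes(prior, model_confusions, votes):
    """Combine binary judgments from several small models via Bayes' rule.

    prior: P(label=1).
    model_confusions: per model, (P(vote=1|label=1), P(vote=1|label=0)),
        e.g. estimated on a validation set.
    votes: the models' 0/1 judgments for one example.
    Returns P(label=1 | votes), assuming conditional independence.
    """
    log_odds = math.log(prior / (1 - prior))
    for (tpr, fpr), vote in zip(model_confusions, votes):
        p1 = tpr if vote == 1 else 1 - tpr   # P(vote | label=1)
        p0 = fpr if vote == 1 else 1 - fpr   # P(vote | label=0)
        log_odds += math.log(p1 / p0)
    return 1 / (1 + math.exp(-log_odds))

# Three individually mediocre models can be more confident jointly.
confusions = [(0.7, 0.3), (0.65, 0.4), (0.6, 0.35)]
print(ensemble_bayes(0.5, confusions, votes=[1, 1, 0]))  # ~0.70
```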

[NLP-5] Energy Considerations of Large Language Model Inference and Efficiency Optimizations

【Quick Read】: The paper addresses the rising computational and environmental costs of large language models (LLMs) as they scale in size and adoption. Prior benchmarking has mostly targeted latency in idealized settings, overlooking the diverse real-world inference workloads that shape actual energy use. The key contribution is a modeling approach that approximates real-world LLM workflows via a binning strategy over input-output token distributions and batch-size variations, combined with a systematic analysis of inference efficiency optimizations across NLP and generative AI workloads such as conversational AI and code generation. The study spans software frameworks, decoding strategies, GPU architectures, online/offline serving settings, and model parallelism, showing that the effectiveness of inference optimizations depends heavily on workload characteristics, software stack, and hardware accelerator, and that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world consumption. Properly applying the relevant optimizations can cut total energy use by up to 73% relative to unoptimized baselines, providing a basis for sustainable LLM deployment and guiding energy-efficient design of future AI infrastructure.

Link: https://arxiv.org/abs/2504.17674
Authors: Jared Fernandez, Clara Na, Vashisth Tiwari, Yonatan Bisk, Sasha Luccioni, Emma Strubell
Affiliations: Carnegie Mellon University; Hugging Face
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 16 pages

Abstract:As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is highly sensitive to workload geometry, software stack, and hardware accelerators, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.
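
The binning strategy is easy to picture: group requests by input/output token counts, then weight per-bin energy measurements by how often each bin occurs in the workload. In the sketch below, the bin edges and joules-per-bin values are made-up placeholders rather than measurements from the paper.

```python
import collections

# Hypothetical per-bin energy cost in joules, e.g. profiled offline per
# (input-token bin, output-token bin) on a given GPU and serving stack.
ENERGY_J = {(0, 0): 5.0, (0, 1): 18.0, (1, 0): 9.0, (1, 1): 30.0}
BIN_EDGES = [256]  # one edge -> two bins: short (<256) vs long (>=256)

def bin_of(n_tokens):
    return sum(n_tokens >= edge for edge in BIN_EDGES)

def estimate_energy(requests):
    """requests: iterable of (input_tokens, output_tokens) pairs."""
    counts = collections.Counter(
        (bin_of(i), bin_of(o)) for i, o in requests)
    return sum(ENERGY_J[b] * n for b, n in counts.items())

workload = [(120, 40), (900, 300), (80, 500), (1500, 20)]
print(f"estimated total energy: {estimate_energy(workload):.1f} J")  # 62.0 J
```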

[NLP-6] Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction

【Quick Read】: The paper targets the high-risk outputs caused by hallucination in Large Vision-Language Models (LVLMs) on Visual Question Answering (VQA) tasks. It proposes a model-agnostic uncertainty quantification method based on the Split Conformal Prediction (SCP) framework, mitigating hallucination via dynamic threshold calibration and cross-modal consistency verification. The key innovations are: (1) rigorous control of marginal coverage so that the empirical error rate stays strictly below a user-defined risk level α; (2) dynamic adjustment of prediction-set size with α to filter out low-confidence outputs; and (3) no reliance on prior distribution assumptions or retraining. Evaluations on benchmarks (ScienceQA, MMMU) with eight LVLMs show that the framework provides theoretical guarantees across α values and remains robust in practical deployment, making it suitable for safety-critical domains such as healthcare and autonomous driving.

Link: https://arxiv.org/abs/2504.17671
Authors: Yuanchang Ye, Weiyan Wen
Affiliations: School of Data Sciences, Zhejiang University of Finance & Economics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This study addresses the critical challenge of hallucination mitigation in Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks through a Split Conformal Prediction (SCP) framework. While LVLMs excel in multi-modal reasoning, their outputs often exhibit hallucinated content with high confidence, posing risks in safety-critical applications. We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. By partitioning data into calibration and test sets, the framework computes nonconformity scores to construct prediction sets with statistical guarantees under user-defined risk levels ( \alpha ). Key innovations include: (1) rigorous control of \textbfmarginal coverage to ensure empirical error rates remain strictly below \alpha ; (2) dynamic adjustment of prediction set sizes inversely with \alpha , filtering low-confidence outputs; (3) elimination of prior distribution assumptions and retraining requirements. Evaluations on benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces theoretical guarantees across all \alpha values. The framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.
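
Split conformal prediction itself is simple to state: calibrate a nonconformity threshold on held-out data at the user-specified risk level α, then include in each prediction set every candidate answer whose score clears the threshold. A generic sketch (without the paper's cross-modal consistency scoring) follows; the calibration scores are invented.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of nonconformity scores (higher = less conforming)."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))        # conformal correction
    return sorted(cal_scores)[min(rank, n) - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate answer whose nonconformity is within the threshold."""
    return {ans for ans, s in candidate_scores.items() if s <= threshold}

# Nonconformity = 1 - model confidence in the true answer, on calibration data.
cal = [0.1, 0.3, 0.25, 0.7, 0.4, 0.2, 0.55, 0.15, 0.35, 0.5]
tau = conformal_threshold(cal, alpha=0.2)
print(prediction_set({"A": 0.05, "B": 0.45, "C": 0.9}, tau))  # {'A', 'B'}
```

Under exchangeability, the resulting sets contain the true answer with probability at least 1 - α, which is the marginal coverage guarantee the paper relies on.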

[NLP-7] Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics

【Quick Read】: The paper addresses the lack of rigorous evaluation of the programs generated by code-assisted large language models (LLMs) on mathematical reasoning tasks, focusing on whether the generated code is effectively grounded in mathematical rules and how that affects end performance. The key contribution is a manual and automatic assessment of the generations of five different LLMs on two math datasets, analyzing the distribution of mathematical grounding and showing that grounding depends on model capability and problem difficulty, and that closed-source models exploit math rules better than open-source ones. The work highlights the need for evaluation beyond execution-accuracy metrics to better understand the capabilities and limits of code-assisted LLMs in the math domain.

Link: https://arxiv.org/abs/2504.17665
Authors: Zena Al-Khalili, Nick Howell, Dietrich Klakow
Affiliations: Saarland Informatics Campus, Saarland University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Assisting LLMs with code generation improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs’ generated programs in response to math reasoning tasks. Our evaluation focuses on the extent to which LLMs ground their programs to math rules, and how that affects their end performance. For this purpose, we assess the generations of five different LLMs, on two different math datasets, both manually and automatically. Our results reveal that the distribution of grounding depends on LLMs’ capabilities and the difficulty of math problems. Furthermore, mathematical grounding is more effective for closed-source models, while open-source models fail to employ math rules in their solutions correctly. On MATH500, the percentage of grounded programs decreased to half, while the ungrounded generations doubled in comparison to ASDiv grade-school problems. Our work highlights the need for in-depth evaluation beyond execution accuracy metrics, toward a better understanding of code-assisted LLMs’ capabilities and limits in the math domain.

[NLP-8] Towards a comprehensive taxonomy of online abusive language informed by machine learning

【Quick Read】: The paper addresses the serious risks that the proliferation of abusive language in online communication poses to individual wellbeing and community safety. Its key contribution is a taxonomy for distinguishing the main characteristics of abusive language in online text. The core of the approach is a systematic taxonomy-development method that integrates the classification schemes of 18 existing multi-label datasets into a hierarchical, faceted taxonomy of 5 categories and 17 dimensions, covering the context, target, intensity, directness, and theme of abuse. This shared understanding can foster closer collaboration, knowledge exchange, and faster progress on online abuse detection and mitigation among researchers, policy makers, platform owners, and other stakeholders.

Link: https://arxiv.org/abs/2504.17653
Authors: Samaneh Hosseini Moghaddam, Kelly Lyons, Cheryl Regehr, Vivek Goel, Kaitlyn Regehr
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The proliferation of abusive language in online communications has posed significant risks to the health and wellbeing of individuals and communities. The growing concern regarding online abuse and its consequences necessitates methods for identifying and mitigating harmful content and facilitating continuous monitoring, moderation, and early intervention. This paper presents a taxonomy for distinguishing key characteristics of abusive language within online text. Our approach uses a systematic method for taxonomy development, integrating classification systems of 18 existing multi-label datasets to capture key characteristics relevant to online abusive language classification. The resulting taxonomy is hierarchical and faceted, comprising 5 categories and 17 dimensions. It classifies various facets of online abuse, including context, target, intensity, directness, and theme of abuse. This shared understanding can lead to more cohesive efforts, facilitate knowledge exchange, and accelerate progress in the field of online abuse detection and mitigation among researchers, policy makers, online platform owners, and other stakeholders.

[NLP-9] RAGAT-Mind: A Multi-Granular Modeling Approach for Rumor Detection Based on MindSpore

【Quick Read】: The paper tackles the rumor-detection challenge created by the accelerating spread of false information on social platforms. The key to the solution is RAGAT-Mind, a multi-granular modeling approach that combines TextCNN for local semantic extraction, bidirectional GRUs for sequential context, Multi-Head Self-Attention for global dependency focusing, and Bidirectional Graph Convolutional Networks (BiGCN) for structural representation of word co-occurrence graphs, effectively fusing hierarchical linguistic features with graph-based semantic structure for highly accurate Chinese rumor classification.

Link: https://arxiv.org/abs/2504.17574
Authors: Zhenkai Qin, Guifang Yang, Dongze Wu
Affiliations: Guangxi Police College
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:As false information continues to proliferate across social media platforms, effective rumor detection has emerged as a pressing challenge in natural language processing. This paper proposes RAGAT-Mind, a multi-granular modeling approach for Chinese rumor detection, built upon the MindSpore deep learning framework. The model integrates TextCNN for local semantic extraction, bidirectional GRU for sequential context learning, Multi-Head Self-Attention for global dependency focusing, and Bidirectional Graph Convolutional Networks (BiGCN) for structural representation of word co-occurrence graphs. Experiments on the Weibo1-Rumor dataset demonstrate that RAGAT-Mind achieves superior classification performance, attaining 99.2% accuracy and a macro-F1 score of 0.9919. The results validate the effectiveness of combining hierarchical linguistic features with graph-based semantic structures. Furthermore, the model exhibits strong generalization and interpretability, highlighting its practical value for real-world rumor detection applications.
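
Since the architecture is a fusion of standard building blocks, its shape is easy to sketch. The toy version below is written in PyTorch rather than MindSpore, uses guessed hyperparameters, and reduces the BiGCN branch over word co-occurrence graphs to a single graph-convolution step.

```python
import torch
import torch.nn as nn

class HybridRumorNet(nn.Module):
    """Sketch of a TextCNN + BiGRU + self-attention + graph-conv hybrid."""

    def __init__(self, vocab, dim=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, 64, k, padding=k // 2) for k in (3, 4, 5)])
        self.gru = nn.GRU(dim, 64, bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(128, num_heads=4, batch_first=True)
        self.gcn = nn.Linear(dim, 64)   # one-hop graph convolution: A_hat @ X @ W
        self.fc = nn.Linear(64 * 3 + 128 + 64, n_classes)

    def forward(self, tokens, adj):
        x = self.emb(tokens)                                    # (B, L, D)
        local = torch.cat([c(x.transpose(1, 2)).amax(-1) for c in self.convs], -1)
        seq, _ = self.gru(x)                                    # (B, L, 128)
        ctx, _ = self.attn(seq, seq, seq)                       # global dependencies
        graph = torch.relu(self.gcn(adj @ x)).mean(1)           # co-occurrence graph
        feats = torch.cat([local, ctx.mean(1), graph], dim=-1)
        return self.fc(feats)

model = HybridRumorNet(vocab=5000)
tokens = torch.randint(0, 5000, (2, 40))
adj = torch.eye(40).expand(2, 40, 40)    # stand-in normalized adjacency
print(model(tokens, adj).shape)          # torch.Size([2, 2])
```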

[NLP-10] DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training

【Quick Read】: The paper addresses the academic community's limited understanding of base-model training processes and data quality for large language models (LLMs). The key to the solution is constructing a large-scale, difficulty-graded reasoning dataset and precisely selecting the most valuable training data to boost reasoning ability. Specifically, the authors select data using pass rate and the Coefficient of Variation (CV), and observe a training-pattern shift indicating that reasoning-focused training on base models requires a higher learning rate to be effective. The carefully selected data substantially improves the base model's reasoning, reaching a 79.2% pass rate on the AIME2024 mathematical reasoning benchmark, close to state-of-the-art. The core of the method lies in data processing and training-strategy optimization.

Link: https://arxiv.org/abs/2504.17565
Authors: Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, Xiangang Li
Affiliations: a-m-team
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Although large language models (LLMs) have recently achieved remarkable performance on various complex reasoning benchmarks, the academic community still lacks an in-depth understanding of base model training processes and data quality. To address this, we construct a large-scale, difficulty-graded reasoning dataset containing approximately 3.34 million unique queries of varying difficulty levels and about 40 million distilled responses generated by multiple models over several passes. Leveraging pass rate and Coefficient of Variation (CV), we precisely select the most valuable training data to enhance reasoning capability. Notably, we observe a training pattern shift, indicating that reasoning-focused training based on base models requires higher learning rates for effective training. Using this carefully selected data, we significantly improve the reasoning capabilities of the base model, achieving a pass rate of 79.2% on the AIME2024 mathematical reasoning benchmark. This result surpasses most current distilled models and closely approaches state-of-the-art performance. We provide detailed descriptions of our data processing, difficulty assessment, and training methodology, and have publicly released all datasets and methods to promote rapid progress in open-source long-reasoning LLMs. The dataset is available at: this https URL
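
The selection signal is straightforward to compute: for each query, a pass rate over multiple sampled responses, plus the coefficient of variation across passes or models. A sketch with hypothetical correctness records and illustrative thresholds (not the paper's values):

```python
import statistics

def pass_rate(results):
    """results: list of 0/1 correctness over multiple sampled responses."""
    return sum(results) / len(results)

def coeff_of_variation(values):
    mean = statistics.mean(values)
    return statistics.pstdev(values) / mean if mean else float("inf")

def select_queries(records, pr_lo=0.1, pr_hi=0.9, cv_min=0.2):
    """Keep queries that are neither trivial nor hopeless and show enough
    disagreement (CV) across passes to carry training signal."""
    selected = []
    for query, per_pass_results in records.items():
        prs = [pass_rate(r) for r in per_pass_results]
        pr = statistics.mean(prs)
        if pr_lo <= pr <= pr_hi and coeff_of_variation(prs) >= cv_min:
            selected.append(query)
    return selected

records = {
    "easy":     [[1, 1, 1, 1], [1, 1, 1, 1]],
    "useful":   [[1, 0, 0, 1], [0, 0, 1, 0]],
    "hopeless": [[0, 0, 0, 0], [0, 0, 0, 0]],
}
print(select_queries(records))  # ['useful']
```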

[NLP-11] When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

【Quick Read】: The paper investigates how prepending metadata to pre-training data affects language model performance, and why the technique improves some downstream tasks while failing to yield consistent gains elsewhere. The key finding is that the effect depends on whether the latent semantics can be inferred from the downstream task's prompt: when the given context is long enough to infer the latent semantics, training with metadata helps; when the context lacks the information needed for an accurate posterior inference, it hurts. Through experiments on data generated by probabilistic context-free grammars, the paper shows that metadata's usefulness hinges on how strongly it relates to the task and whether it can guide the model toward the relevant latent semantics.

Link: https://arxiv.org/abs/2504.17562
Authors: Rei Higuchi, Ryotaro Kawata, Naoki Nishikawa, Kazusato Oko, Shoichiro Yamaguchi, Sosuke Kobayashi, Seiya Tokui, Kohei Hayashi, Daisuke Okanohara, Taiji Suzuki
Affiliations: The University of Tokyo, RIKEN AIP; University of California, Berkeley, RIKEN AIP; Preferred Networks, Inc.; The University of Tokyo
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task’s prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve model’s performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.

[NLP-12] HalluLens: LLM Hallucination Benchmark

【Quick Read】: The paper addresses "hallucination", where large language models (LLMs) generate responses that deviate from user input or training data, undermining user trust and hindering the adoption of generative AI systems. It introduces a comprehensive hallucination benchmark comprising new extrinsic and existing intrinsic evaluation tasks, built on a clear taxonomy of hallucination. The key innovation is disentangling LLM hallucination from "factuality" via an explicit taxonomy that distinguishes extrinsic from intrinsic hallucinations, promoting consistency and comparability of research. The benchmark also uses dynamic test-set generation to mitigate data leakage and ensure robustness, and the paper analyzes the limitations and saturation of existing benchmarks. Overall, the work aims to establish a clear taxonomy, introduce new extrinsic hallucination tasks, and comprehensively analyze existing benchmarks as distinct from factuality evaluation.

Link: https://arxiv.org/abs/2504.17550
Authors: Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, Pascale Fung
Affiliations: Hong Kong University of Science and Technology; Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 42 pages

Abstract:Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as “hallucination.” These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is essential for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks, built upon clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from “factuality,” proposing a clear taxonomy that distinguishes between extrinsic and intrinsic hallucinations, to promote consistency and facilitate research. Extrinsic hallucinations, where the generated content is not consistent with the training data, are increasingly important as LLMs evolve. Our benchmark includes dynamic test set generation to mitigate data leakage and ensure robustness against such leakage. We also analyze existing benchmarks, highlighting their limitations and saturation. The work aims to: (1) establish a clear taxonomy of hallucinations, (2) introduce new extrinsic hallucination tasks, with data that can be dynamically regenerated to prevent saturation by leakage, (3) provide a comprehensive analysis of existing benchmarks, distinguishing them from factuality evaluations.

[NLP-13] Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation

【Quick Read】: The paper studies the robustness of watermarks against scrubbing attacks and their unforgeability against spoofing attacks under unauthorized knowledge distillation. Existing watermark attack methods either assume access to model internals or cannot support both attack types simultaneously. The key contribution is Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that uses contrastive decoding to extract corrupted or amplified watermark text by comparing outputs from the student model and weakly watermarked references, then applies bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively, enabling bidirectional attacks under unauthorized distillation. Experiments show CDG-KD performs the attacks effectively while preserving the distilled model's general performance, underscoring the need for watermarking schemes that are both robust and unforgeable.

Link: https://arxiv.org/abs/2504.17480
Authors: Xin Yi, Shunfan Zheng, Linlin Wang, Xiaoling Wang, Liang He
Affiliations: East China Normal University; School of Computer Science and Technology, East China Normal University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts via comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore critical need for developing watermarking schemes that are robust and unforgeable.
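
The contrastive-decoding step can be sketched generically: extrapolate between the student's next-token logits and those of a weakly watermarked reference, suppressing the watermark signal for scrubbing or amplifying it for spoofing. The mixing weight and the sign conventions below are illustrative assumptions, not CDG-KD's exact formulation.

```python
import torch

def contrastive_logits(student_logits, reference_logits, beta=1.0):
    """Extrapolate away from (beta > 0, scrubbing) or toward (beta < 0,
    spoofing/amplification) the signal carried by the reference model.

    Both arguments: (vocab,) next-token logits from the two models.
    """
    # The difference isolates what the weakly watermarked reference adds.
    return student_logits + beta * (student_logits - reference_logits)

vocab = 8
student = torch.randn(vocab)
reference = student + 0.5 * torch.randn(vocab)  # stand-in watermarked model
scrubbed = contrastive_logits(student, reference, beta=1.0)
amplified = contrastive_logits(student, reference, beta=-2.0)
print(scrubbed.softmax(-1).argmax(), amplified.softmax(-1).argmax())
```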

[NLP-14] HMI: Hierarchical Knowledge Management for Efficient Multi-Tenant Inference in Pretrained Language Models VLDB

【Quick Read】: The paper tackles the significant computational demands of serving pretrained language models (PLMs) efficiently in multi-tenant environments, particularly their reliance on dedicated hardware. The key innovation is HMI, a Hierarchical knowledge management-based Multi-tenant Inference system, which manages models resource-efficiently at three levels. First, PLM knowledge is categorized into general, domain-specific, and task-specific, and hierarchical PLMs (hPLMs) are built by extracting and storing knowledge at different levels, significantly reducing per-tenant GPU memory. Second, HMI manages domain-specific knowledge with acceptable storage growth via frequency-driven construction and updates of domain-specific knowledge trees, and manages task-specific knowledge within limited GPU memory via parameter swapping. Finally, system-level optimizations (fine-grained pipelining with hierarchical knowledge prefetching to overlap CPU and I/O with GPU computation, plus batched matrix multiplications) raise resource utilization and inference throughput. Together these allow HMI to serve up to 10,000 hPLMs on a single GPU with only negligible accuracy loss.

Link: https://arxiv.org/abs/2504.17449
Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Qin Xie, Guiming Xie, Xuejian Gong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by VLDBJ 2025

Abstract:The significant computational demands of pretrained language models (PLMs), which often require dedicated hardware, present a substantial challenge in serving them efficiently, especially in multi-tenant environments. To address this, we introduce HMI, a Hierarchical knowledge management-based Multi-tenant Inference system, designed to manage tenants with distinct PLMs resource-efficiently. Our approach is three-fold: Firstly, we categorize PLM knowledge into general, domain-specific, and task-specific. Leveraging insights on knowledge acquisition across different model layers, we construct hierarchical PLMs (hPLMs) by extracting and storing knowledge at different levels, significantly reducing GPU memory usage per tenant. Secondly, we establish hierarchical knowledge management for hPLMs generated by various tenants in HMI. We manage domain-specific knowledge with acceptable storage increases by constructing and updating domain-specific knowledge trees based on frequency. We manage task-specific knowledge within limited GPU memory through parameter swapping. Finally, we propose system optimizations to enhance resource utilization and inference throughput. These include fine-grained pipelining via hierarchical knowledge prefetching to overlap CPU and I/O operations with GPU computations, and optimizing parallel implementations with batched matrix multiplications. Our experimental results demonstrate that the proposed HMI can efficiently serve up to 10,000 hPLMs (hBERTs and hGPTs) on a single GPU, with only a negligible compromise in accuracy.
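
Of the three levels, parameter swapping is the most self-contained to illustrate: task-specific weights of inactive tenants live in host memory and are moved onto the GPU on demand, with an LRU policy evicting the least recently used tenant. A toy sketch (not HMI's implementation):

```python
import collections
import torch

class ParamSwapCache:
    """Keep at most `capacity` tenants' task-specific tensors on the GPU."""

    def __init__(self, capacity,
                 device="cuda" if torch.cuda.is_available() else "cpu"):
        self.capacity, self.device = capacity, device
        self.gpu = collections.OrderedDict()   # tenant_id -> tensor on device
        self.host = {}                         # tenant_id -> tensor on CPU

    def register(self, tenant_id, params):
        self.host[tenant_id] = params.cpu()

    def get(self, tenant_id):
        if tenant_id in self.gpu:
            self.gpu.move_to_end(tenant_id)    # mark as recently used
        else:
            if len(self.gpu) >= self.capacity:
                evicted, tensor = self.gpu.popitem(last=False)
                self.host[evicted] = tensor.cpu()   # swap out to host memory
            self.gpu[tenant_id] = self.host[tenant_id].to(self.device)
        return self.gpu[tenant_id]

cache = ParamSwapCache(capacity=2)
for t in ["a", "b", "c"]:
    cache.register(t, torch.randn(4, 4))
for t in ["a", "b", "a", "c"]:       # "b" gets evicted when "c" arrives
    cache.get(t)
print(list(cache.gpu))               # ['a', 'c']
```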

[NLP-15] Creating Targeted Interpretable Topic Models with LLM -Generated Text Augmentation

【Quick Read】: The paper addresses two limitations of topic models in social science applications: weak interpretability and difficulty answering targeted, domain-specific research questions. The key is using LLM-generated text augmentation to improve the usefulness of topic modeling output. In a political science case study, topic modeling with GPT-4 augmentations produced highly interpretable categories that can be used to investigate domain-specific research questions with minimal human guidance.

Link: https://arxiv.org/abs/2504.17445
Authors: Anna Lieb, Maneesh Arora, Eni Mustafaraj
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Presented at IC2S2 2024 in Philadelphia, USA

Abstract:Unsupervised machine learning techniques, such as topic modeling and clustering, are often used to identify latent patterns in unstructured text data in fields such as political science and sociology. These methods overcome common concerns about reproducibility and costliness involved in the labor-intensive process of human qualitative analysis. However, two major limitations of topic models are their interpretability and their practicality for answering targeted, domain-specific social science research questions. In this work, we investigate opportunities for using LLM-generated text augmentation to improve the usefulness of topic modeling output. We use a political science case study to evaluate our results in a domain-specific application, and find that topic modeling using GPT-4 augmentations creates highly interpretable categories that can be used to investigate domain-specific research questions with minimal human guidance.

[NLP-16] PicPersona-TOD: A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona NAACL2025

【Quick Read】: The paper addresses the problem that responses generated by existing task-oriented dialogue (TOD) systems are generic and monotonic, lacking individuality and failing to adapt to users' personal attributes. The proposed PicPersona-TOD dataset incorporates user images into the persona and combines first impressions, dialogue policy-guided prompting, and external knowledge to reduce hallucinations, enabling responses personalized to user-specific factors such as age or emotional context. The key is this policy-guided, multimodal use of image information together with external knowledge, which strengthens both personalization and reliability.

Link: https://arxiv.org/abs/2504.17390
Authors: Jihyun Lee, Yejin Jeon, Seungyeon Seo, Gary Geunbae Lee
Affiliations: Graduate School of Artificial Intelligence, POSTECH, Republic of Korea; Department of Computer Science and Engineering, POSTECH, Republic of Korea
Categories: Computation and Language (cs.CL)
Comments: Accepted in NAACL 2025 main

Abstract:Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonic responses that lack individuality and fail to adapt to users’ personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses, but also demonstrates robust performance across unseen domains this https URL.

[NLP-17] LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams

【Quick Read】: The paper addresses long-context understanding in natural language processing, especially for real-world dialogue marked by speech-based elements, high redundancy, and uneven information density. Although existing benchmarks have let large language models (LLMs) post impressive results, they do not reflect the complexity of such texts, which limits practical applicability. To bridge this gap, the authors construct the first spoken long-text dataset, derived from live streams, capturing the redundancy-rich, conversational character of real scenarios, and design tasks in three categories: retrieval-dependent, reasoning-dependent, and hybrid. Evaluating both popular LLMs and specialized methods reveals strong task-specific preferences and poor performance on highly redundant inputs, with no single method consistently best. The key contribution is a new baseline that better handles redundancy in spoken text and performs strongly across tasks. The work also pinpoints key limitations of current methods, suggests directions for improving long-context understanding, fills a gap in evaluating spoken long-context understanding, and lays a practical foundation for real-world e-commerce systems. The code and benchmark are available at the provided link.

Link: https://arxiv.org/abs/2504.17366
Authors: Yongxuan Wu, Runyu Chen, Peiyu Liu, Hongjin Qian
Affiliations: University of International Business and Economics; Beijing Academy of Artificial Intelligence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language models (LLMs) achieve impressive results on existing benchmarks, these datasets fail to reflect the complexities of such texts, limiting their applicability to practical scenarios. To bridge this gap, we construct the first spoken long-text dataset, derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-world scenarios. We construct tasks in three categories: retrieval-dependent, reasoning-dependent, and hybrid. We then evaluate both popular LLMs and specialized methods to assess their ability to understand long-contexts in these tasks. Our results show that current methods exhibit strong task-specific preferences and perform poorly on highly redundant inputs, with no single method consistently outperforming others. We propose a new baseline that better handles redundancy in spoken text and achieves strong performance across tasks. Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding. Finally, our benchmark fills a gap in evaluating long-context spoken language understanding and provides a practical foundation for developing real-world e-commerce systems. The code and benchmark are available at this https URL.

[NLP-18] TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

【Quick Read】: The paper addresses the shortcomings of existing multimodal large language model (MLLM)-based soccer commentary generation in long-range temporal understanding and end-to-end processing. Conventional soccer MLLMs rely on a temporal a priori for caption generation and cannot process soccer video end-to-end, while two-step approaches are complex and fail to capture global context, yielding suboptimal performance. TimeSoccer is the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) over full-match videos; its key innovation is jointly predicting timestamps and generating captions in a single pass, enabling global context modeling across 45-minute matches. To support long-video understanding, it introduces MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, combined with complementary training paradigms that strengthen long-sequence handling. Experiments show TimeSoccer achieves state-of-the-art (SoTA) performance on end-to-end SDVC, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.

Link: https://arxiv.org/abs/2504.17365
Authors: Ling You, Wenxuan Huang, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, Changbo Wang
Affiliations: East China Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. However, existing soccer MLLMs often rely on the temporal a priori for caption generation, so they cannot process the soccer video end-to-end. While some traditional approaches follow a two-step paradigm that is complex and fails to capture the global context to achieve suboptimal performance. To solve the above issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and incorporates complementary training paradigms to strengthen the model’s ability to handle long temporal sequences. Extensive experiments demonstrate that our TimeSoccer achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.

[NLP-19] PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare

【Quick Read】: The paper addresses the sensitive-data privacy risks of optimizing large language models (LLMs) for specific tasks in healthcare. Conventional fine-tuning improves performance but requires training on large amounts of annotated data that may contain sensitive information, raising significant privacy concerns that are especially acute in the medical domain.

The key of the proposed PatientDx framework is model merging: designing effective LLMs for health-predictive tasks without fine-tuning or adaptation on patient data. Building on recently proposed LLM merging techniques, it optimizes a building-block merging strategy, uses a pivotal model adapted to numerical reasoning, and tunes hyperparameters on examples based on a performance metric, without training the LLM on those data. Experiments on the mortality tasks of the MIMIC-IV dataset show AUROC improvements of up to 7% over the initial models, and compared with fine-tuned models, PatientDx is less prone to data leakage without hurting performance.

Link: https://arxiv.org/abs/2504.17360
Authors: Jose G. Moreno (IRIT-IRIS), Jesus Lovon (IRIT-IRIS), M’Rick Robin-Charlet (UT3), Christine Damase-Michel, Lynda Tamine (IRIT-IRIS)
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Fine-tuning of Large Language Models (LLMs) has become the default practice for improving model performance on a given task. However, performance improvement comes at the cost of training on vast amounts of annotated data which could be sensitive leading to significant data privacy concerns. In particular, the healthcare domain is one of the most sensitive domains exposed to data privacy issues. In this paper, we present PatientDx, a framework of model merging that allows the design of effective LLMs for health-predictive tasks without requiring fine-tuning nor adaptation on patient data. Our proposal is based on recently proposed techniques known as merging of LLMs and aims to optimize a building block merging strategy. PatientDx uses a pivotal model adapted to numerical reasoning and tunes hyperparameters on examples based on a performance metric but without training of the LLM on these data. Experiments using the mortality tasks of the MIMIC-IV dataset show improvements up to 7% in terms of AUROC when compared to initial models. Additionally, we confirm that when compared to fine-tuned models, our proposal is less prone to data leak problems without hurting performance. Finally, we qualitatively show the capabilities of our proposal through a case study. Our best model is publicly available at this https URL Jgmorenof/mistral_merged_0_4.

[NLP-20] M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction

【Quick Read】: The paper extends the Mutual Reinforcement Effect (MRE) from the textual domain to multimodal information extraction, exploring its applicability to visual and multimodal tasks. It introduces a new task, Multimodal Mutual Reinforcement Effect (M-MRE), and constructs a corresponding dataset. The key to addressing M-MRE's challenges is the proposed Prompt Format Adapter (PFA), which is fully compatible with various Large Vision-Language Models (LVLMs). Experiments show that MRE also holds in the M-MRE multimodal text-image understanding scenario, demonstrating mutual gains across three interrelated tasks and confirming MRE's generalizability beyond text.

Link: https://arxiv.org/abs/2504.17353
Authors: Chengguang Gan, Sunbowen Lee, Zhixi Cai, Yanbin Wei, Lei Zheng, Yunhao Liang, Shiwen Ni, Tatsunori Mori
Affiliations: Yokohama National University; Wuhan University of Science and Technology; Monash University; Southern University of Science and Technology; Hong Kong University of Science and Technology; Shanghai Jiao Tong University; University of Chinese Academy of Sciences; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Mutual Reinforcement Effect (MRE) is an emerging subfield at the intersection of information extraction and model interpretability. MRE aims to leverage the mutual understanding between tasks of different granularities, enhancing the performance of both coarse-grained and fine-grained tasks through joint modeling. While MRE has been explored and validated in the textual domain, its applicability to visual and multimodal domains remains unexplored. In this work, we extend MRE to the multimodal information extraction domain for the first time. Specifically, we introduce a new task: Multimodal Mutual Reinforcement Effect (M-MRE), and construct a corresponding dataset to support this task. To address the challenges posed by M-MRE, we further propose a Prompt Format Adapter (PFA) that is fully compatible with various Large Vision-Language Models (LVLMs). Experimental results demonstrate that MRE can also be observed in the M-MRE task, a multimodal text-image understanding scenario. This provides strong evidence that MRE facilitates mutual gains across three interrelated tasks, confirming its generalizability beyond the textual domain.

[NLP-21] Bridging Cognition and Emotion: Empathy-Driven Multimodal Misinformation Detection

【Quick Read】: The paper tackles the rapid spread of misinformation on social networks; traditional detection methods focus on surface-level features and overlook the crucial role of human empathy in propagation. The key contribution is the Dual-Aspect Empathy Framework (DAE), which integrates cognitive and emotional empathy to analyze misinformation from both the creator and reader perspectives: it examines creators' cognitive strategies and emotional appeals, and uses Large Language Models (LLMs) to simulate readers' cognitive judgments and emotional responses, offering a more comprehensive, human-centric approach to misinformation detection. An empathy-aware filtering mechanism further enhances response authenticity and diversity. Experiments on benchmark datasets show DAE outperforms existing methods, providing a new paradigm for multimodal misinformation detection.

Link: https://arxiv.org/abs/2504.17332
Authors: Zihan Wang, Lu Yuan, Zhengxuan Zhang, Qing Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In the digital era, social media has become a major conduit for information dissemination, yet it also facilitates the rapid spread of misinformation. Traditional misinformation detection methods primarily focus on surface-level features, overlooking the crucial roles of human empathy in the propagation process. To address this gap, we propose the Dual-Aspect Empathy Framework (DAE), which integrates cognitive and emotional empathy to analyze misinformation from both the creator and reader perspectives. By examining creators’ cognitive strategies and emotional appeals, as well as simulating readers’ cognitive judgments and emotional responses using Large Language Models (LLMs), DAE offers a more comprehensive and human-centric approach to misinformation detection. Moreover, we further introduce an empathy-aware filtering mechanism to enhance response authenticity and diversity. Experimental results on benchmark datasets demonstrate that DAE outperforms existing methods, providing a novel paradigm for multimodal misinformation detection.

[NLP-22] FLUKE: A Linguistically-Driven and Task-Agnostic Framework for Robustness Evaluation

【Quick Read】: The paper addresses how to evaluate model robustness to linguistic variation. The key is FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), which assesses robustness through systematic minimal variations of test data spanning linguistic levels from orthography to dialect and style, generating modifications with large language models (LLMs) plus human validation. Applied to fine-tuned models and LLMs across four NLP tasks, this task-agnostic approach reveals that sensitivity to linguistic variation is highly task-dependent, that LLMs are more robust overall than fine-tuned models yet still brittle to certain variations, and that all models are broadly vulnerable to negation modifications.

Link: https://arxiv.org/abs/2504.17311
Authors: Yulia Otmakhova, Hung Thinh Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, Jey Han Lau
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a task-agnostic framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels - from orthography to dialect and style varieties - and leverages large language models (LLMs) with human validation to generate modifications. We demonstrate FLUKE’s utility by evaluating both fine-tuned models and LLMs across four diverse NLP tasks, and reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) while LLMs have better overall robustness compared to fine-tuned models, they still exhibit significant brittleness to certain linguistic variations; (3) all models show substantial vulnerability to negation modifications across most tasks. These findings highlight the importance of systematic robustness testing for understanding model behaviors.
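
The kind of minimal, linguistically targeted variation FLUKE applies can be imitated with two toy perturbations, one orthographic and one negation-based. The framework itself generates modifications with LLMs plus human validation; these regex rules are illustrative stand-ins.

```python
import re

def orthographic_variant(text):
    """Introduce a minimal spelling change: drop one doubled consonant."""
    return re.sub(r"([b-df-hj-np-tv-z])\1", r"\1", text, count=1)

def negation_variant(text):
    """Flip a copula to its negated form, a minimal meaning-changing edit."""
    return re.sub(r"\bis\b", "is not", text, count=1)

original = "The committee is fully committed to the plan."
for variant in (orthographic_variant, negation_variant):
    print(variant(original))
# The comittee is fully committed to the plan.
# The committee is not fully committed to the plan.
```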

[NLP-23] CoheMark: A Novel Sentence-Level Watermark for Enhanced Text Quality ICLR2025

【Quick Read】: The paper addresses how to achieve robust watermark detection while preserving high text quality. Existing sentence-level watermarking typically relies on arbitrary segmentation or generation processes to embed watermarks, which can limit the pool of usable sentences and degrade response quality. The proposed CoheMark is an advanced sentence-level watermarking technique that exploits cohesive relationships between sentences for better logical fluency; its key is selecting sentences via trained fuzzy c-means clustering and applying specific next-sentence selection criteria. Experiments show CoheMark achieves strong watermark strength with minimal impact on text quality.

Link: https://arxiv.org/abs/2504.17309
Authors: Junyan Zhang, Shuliang Liu, Aiwei Liu, Yubo Gao, Jungang Li, Xiaojie Gu, Xuming Hu
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: Published at the 1st workshop on GenAI Watermarking, collocated with ICLR 2025

Abstract:Watermarking technology is a method used to trace the usage of content generated by large language models. Sentence-level watermarking aids in preserving the semantic integrity within individual sentences while maintaining greater robustness. However, many existing sentence-level watermarking techniques depend on arbitrary segmentation or generation processes to embed watermarks, which can limit the availability of appropriate sentences. This limitation, in turn, compromises the quality of the generated response. To address the challenge of balancing high text quality with robust watermark detection, we propose CoheMark, an advanced sentence-level watermarking technique that exploits the cohesive relationships between sentences for better logical fluency. The core methodology of CoheMark involves selecting sentences through trained fuzzy c-means clustering and applying specific next sentence selection criteria. Experimental evaluations demonstrate that CoheMark achieves strong watermark strength while exerting minimal impact on text quality.
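
The selection core, fuzzy c-means memberships deciding which candidate next sentences may carry the watermark, can be sketched as below, with random vectors standing in for real sentence embeddings and a simple "green cluster" rule standing in for CoheMark's actual next-sentence criteria.

```python
import numpy as np

def fuzzy_cmeans_memberships(x, centers, m=2.0):
    """Soft cluster memberships u_ik for points x given cluster centers."""
    d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1) + 1e-9
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)   # rows sum to 1

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 16))      # trained cluster centers
candidates = rng.normal(size=(5, 16))   # candidate next-sentence embeddings

u = fuzzy_cmeans_memberships(candidates, centers)
dominant = u.argmax(axis=1)
# Toy watermark rule: only sentences whose dominant cluster is "green" survive.
green_clusters = {0, 2}
allowed = [i for i, c in enumerate(dominant) if c in green_clusters]
print("allowed candidate sentences:", allowed)
```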

[NLP-24] Evaluating and Mitigating Bias in AI-Based Medical Text Generation

【Quick Read】: The paper addresses fairness in medical text generation, observing significant performance disparities across races, sexes, age groups, and intersectional groups, as well as across model scales and evaluation metrics. It proposes an algorithm that reduces bias by selectively optimizing underperforming groups while keeping the entire process fully differentiable for effective model training. The key is that the selection rules consider not only word-level accuracy but also pathology accuracy against the target reference. The method improves fairness without sacrificing overall performance, shrinking cross-group disparities by more than 30% while keeping the relative change in text generation accuracy typically within 2%.

Link: https://arxiv.org/abs/2504.17279
Authors: Xiuying Chen, Tairan Wang, Juexiao Zhou, Zirui Song, Xin Gao, Xiangliang Zhang
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); King Abdullah University of Science and Technology (KAUST); University of Notre Dame, IN, USA
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 8 figures, published in Nature Computational Science

Abstract:Artificial intelligence (AI) systems, particularly those based on deep learning models, have increasingly achieved expert-level performance in medical applications. However, there is growing concern that such AI systems may reflect and amplify human bias, and reduce the quality of their performance in historically under-served populations. The fairness issue has attracted considerable research interest in the medical imaging classification field, yet it remains understudied in the text generation domain. In this study, we investigate the fairness problem in text generation within the medical field and observe significant performance discrepancies across different races, sexes, and age groups, including intersectional groups, various model scales, and different evaluation metrics. To mitigate this fairness issue, we propose an algorithm that selectively optimizes those underperformed groups to reduce bias. The selection rules take into account not only word-level accuracy but also the pathology accuracy to the target reference, while ensuring that the entire process remains fully differentiable for effective model training. Our evaluations across multiple backbones, datasets, and modalities demonstrate that our proposed algorithm enhances fairness in text generation without compromising overall performance. Specifically, the disparities among various groups across different metrics were diminished by more than 30% with our algorithm, while the relative change in text generation accuracy was typically within 2%. By reducing the bias generated by deep learning models, our proposed approach can potentially alleviate concerns about the fairness and reliability of text generation diagnosis in medical domain. Our code is publicly available to facilitate further research at this https URL.
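
The selective-optimization idea, upweighting the loss of underperforming demographic groups while keeping the objective fully differentiable, can be sketched as a group-weighted loss. The softmax-over-deficits weighting below is an illustrative choice; the paper's selection rules also incorporate pathology accuracy.

```python
import torch

def group_weighted_loss(per_example_loss, group_ids, group_perf, temp=1.0):
    """Upweight examples from groups with worse running performance.

    per_example_loss: (N,) differentiable losses.
    group_ids: (N,) long tensor of group indices.
    group_perf: (G,) running accuracy per group in [0, 1] (no gradient).
    """
    # Softmax over performance *deficits*: worse groups get larger weights.
    weights = torch.softmax((1.0 - group_perf) / temp, dim=0)
    w = weights[group_ids]
    return (w * per_example_loss).sum() / w.sum()

loss = torch.rand(6, requires_grad=True)
groups = torch.tensor([0, 0, 1, 1, 2, 2])
perf = torch.tensor([0.90, 0.70, 0.55])   # group 2 is underperforming
total = group_weighted_loss(loss, groups, perf)
total.backward()                          # gradients flow through the losses
print(total.item())
```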

[NLP-25] JurisCTC: Enhancing Legal Judgment Prediction via Cross-Domain Transfer and Contrastive Learning IJCNN

【Quick Read】: The paper addresses the challenge of transferring knowledge across distinct legal domains, given lengthy, complex legal texts and the scarcity of large-scale annotated datasets. The key of the proposed JurisCTC model is using contrastive learning to distinguish samples from different domains and achieve effective knowledge transfer across legal domains, specifically enabling transfer between the civil and criminal law domains for Legal Judgment Prediction (LJP). Compared with other models and specific large language models (LLMs), JurisCTC shows notable advantages, achieving peak accuracies of 76.59% and 78.83%, respectively.

Link: https://arxiv.org/abs/2504.17264
Authors: Zhaolu Kang, Hongtian Cai, Xiangyang Ji, Jinzhe Li, Nanfei Gu
Affiliations: School of Software & Microelectronics, Peking University; School of Electrical & Electronic Engineering, Nanyang Technological University; College of Software, Jilin University; College of Computer Science and Technology, Jilin University; KoGuan School of Law, Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in International Joint Conference on Neural Networks (IJCNN) 2025

Abstract:In recent years, Unsupervised Domain Adaptation (UDA) has gained significant attention in the field of Natural Language Processing (NLP) owing to its ability to enhance model generalization across diverse domains. However, its application for knowledge transfer between distinct legal domains remains largely unexplored. To address the challenges posed by lengthy and complex legal texts and the limited availability of large-scale annotated datasets, we propose JurisCTC, a novel model designed to improve the accuracy of Legal Judgment Prediction (LJP) tasks. Unlike existing approaches, JurisCTC facilitates effective knowledge transfer across various legal domains and employs contrastive learning to distinguish samples from different domains. Specifically, for the LJP task, we enable knowledge transfer between civil and criminal law domains. Compared to other models and specific large language models (LLMs), JurisCTC demonstrates notable advancements, achieving peak accuracies of 76.59% and 78.83%, respectively.

[NLP-26] Low-Resource Neural Machine Translation Using Recurrent Neural Networks and Transfer Learning: A Case Study on English-to-Igbo

【Quick Read】: The paper addresses weak translation performance for English-to-Igbo, a low-resource African language spoken by over 40 million people in Nigeria and West Africa. The key is combining recurrent neural network (RNN) architectures, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), with attention mechanisms to improve translation accuracy, and further applying transfer learning with pre-trained MarianNMT models within the SimpleTransformers framework. With transfer learning, BLEU improves by +4.83 points, reaching an estimated translation accuracy of about 70%, demonstrating the effectiveness of combining RNNs with transfer learning for low-resource translation tasks.

Link: https://arxiv.org/abs/2504.17252
Authors: Ocheme Anthony Ekle, Biswarup Das
Affiliations: Tennessee Technological University; Moscow Institute of Physics and Technology; Huawei
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 25 pages, 14 combined figures (19 total), includes horizontal layouts. Submitted to arXiv for open access

Abstract:In this study, we develop Neural Machine Translation (NMT) and Transformer-based transfer learning models for English-to-Igbo translation - a low-resource African language spoken by over 40 million people across Nigeria and West Africa. Our models are trained on a curated and benchmarked dataset compiled from Bible corpora, local news, Wikipedia articles, and Common Crawl, all verified by native language experts. We leverage Recurrent Neural Network (RNN) architectures, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), enhanced with attention mechanisms to improve translation accuracy. To further enhance performance, we apply transfer learning using MarianNMT pre-trained models within the SimpleTransformers framework. Our RNN-based system achieves competitive results, closely matching existing English-Igbo benchmarks. With transfer learning, we observe a performance gain of +4.83 BLEU points, reaching an estimated translation accuracy of 70%. These findings highlight the effectiveness of combining RNNs with transfer learning to address the performance gap in low-resource language translation tasks.
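
The attention-augmented RNN decoder step is a standard construction. A minimal dot-product (Luong-style) attention over GRU encoder states, with illustrative dimensions, looks like:

```python
import torch
import torch.nn as nn

class AttnGRUDecoderStep(nn.Module):
    """One decoder step: the GRU hidden state attends over encoder outputs."""

    def __init__(self, dim=64, vocab=8000):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)
        self.out = nn.Linear(2 * dim, vocab)

    def forward(self, y_emb, h, enc_out):
        h = self.cell(y_emb, h)                          # (B, D)
        scores = torch.bmm(enc_out, h.unsqueeze(-1))     # (B, L, 1) dot attention
        weights = scores.softmax(dim=1)
        context = (weights * enc_out).sum(dim=1)         # (B, D) weighted sum
        return self.out(torch.cat([h, context], dim=-1)), h

step = AttnGRUDecoderStep()
logits, h = step(torch.randn(2, 64), torch.zeros(2, 64), torch.randn(2, 10, 64))
print(logits.shape)  # torch.Size([2, 8000])
```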

[NLP-27] Crisp: Cognitive Restructuring of Negative Thoughts through Multi-turn Supportive Dialogues

【Quick Read】: The paper addresses the failure of existing human-LLM interactive psychotherapy approaches for Cognitive Restructuring (CR) to align with the clinical psychotherapeutic process: current efforts implement CR via simple text rewriting, fixed-pattern dialogues, or one-shot CR workflows, which fall short of effective CR. The key is CRDial, a novel framework that creates multi-turn dialogues with specifically designed identification and restructuring stages for negative thoughts, integrates sentence-level supportive conversation strategies, and adopts a multi-channel loop mechanism for iterative CR. With CRDial, the authors distill Crisp, a large-scale, high-quality bilingual dialogue dataset, and train Crispers, Crisp-based conversational LLMs for CR at 7B and 14B scales. Extensive human studies show Crispers' superiority in pointwise, pairwise, and intervention evaluations.

Link: https://arxiv.org/abs/2504.17238
Authors: Jinfeng Zhou, Yuxuan Chen, Jianing Yin, Yongkang Huang, Yihan Shi, Xikun Zhang, Libiao Peng, Rongsheng Zhang, Tangjie Lv, Zhipeng Hu, Hongning Wang, Minlie Huang
Affiliations: The CoAI Group, DCST, Tsinghua University; University of Pennsylvania; Lingxin AI; Harvard Graduate School of Education, Harvard University; Department of Psychology and Behavioral Sciences, Zhejiang University; Fuxi AI Lab; https://peppy-ai.com/
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Cognitive Restructuring (CR) is a psychotherapeutic process aimed at identifying and restructuring an individual’s negative thoughts, arising from mental health challenges, into more helpful and positive ones via multi-turn dialogues. Clinician shortage and stigma urge the development of human-LLM interactive psychotherapy for CR. Yet, existing efforts implement CR via simple text rewriting, fixed-pattern dialogues, or a one-shot CR workflow, failing to align with the psychotherapeutic process for effective CR. To address this gap, we propose CRDial, a novel framework for CR, which creates multi-turn dialogues with specifically designed identification and restructuring stages of negative thoughts, integrates sentence-level supportive conversation strategies, and adopts a multi-channel loop mechanism to enable iterative CR. With CRDial, we distill Crisp, a large-scale and high-quality bilingual dialogue dataset, from LLM. We then train Crispers, Crisp-based conversational LLMs for CR, at 7B and 14B scales. Extensive human studies show the superiority of Crispers in pointwise, pairwise, and intervention evaluations.

[NLP-28] Does Knowledge Distillation Matter for Large Language Model based Bundle Generation?

【速读】: 本文旨在解决通过知识蒸馏(Knowledge Distillation, KD)方法在大规模语言模型(Large Language Models, LLMs)中进行高效束生成(bundle generation)的问题。由于LLMs参数规模庞大,在微调和推理阶段会带来显著的计算成本,因此需要一种能够有效降低计算需求同时保持性能的解决方案。论文的关键在于提出了一种综合性的知识蒸馏框架,该框架通过逐步提取知识(如模式、规则及深层见解)、采用不同的策略捕获不同量级的知识,并利用上下文学习(in-context learning)、监督微调(supervised fine-tuning)以及多种组合方式来适应小规模学生模型,从而实现特定领域的适配与效率提升。实验结果表明,这种框架能够显著提高基于LLMs的束生成任务的效率与效果。

链接: https://arxiv.org/abs/2504.17220
作者: Kaidong Feng,Zhu Sun,Jie Yang,Hui Fang,Xinghua Qu,Wenyuan Liu
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:LLMs are increasingly explored for bundle generation, thanks to their reasoning capabilities and knowledge. However, deploying large-scale LLMs introduces significant efficiency challenges, primarily high computational costs during fine-tuning and inference due to their massive parameterization. Knowledge distillation (KD) offers a promising solution, transferring expertise from large teacher models to compact student models. This study systematically investigates knowledge distillation approaches for bundle generation, aiming to minimize computational demands while preserving performance. We explore three critical research questions: (1) how does the format of KD impact bundle generation performance? (2) to what extent does the quantity of distilled knowledge influence performance? and (3) how do different ways of utilizing the distilled knowledge affect performance? We propose a comprehensive KD framework that (i) progressively extracts knowledge (patterns, rules, deep thoughts); (ii) captures varying quantities of distilled knowledge through different strategies; and (iii) exploits complementary LLM adaptation techniques (in-context learning, supervised fine-tuning, combination) to leverage distilled knowledge in small student models for domain-specific adaptation and enhanced efficiency. Extensive experiments provide valuable insights into how knowledge format, quantity, and utilization methodologies collectively shape LLM-based bundle generation performance, exhibiting KD’s significant potential for more efficient yet effective LLM-based bundle generation.
zh

[NLP-29] A RAG-Based Multi-Agent LLM System for Natural Hazard Resilience and Adaptation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在提供上下文特定信息方面的局限性,尤其是在需要专门知识的领域支持决策者应对极端自然危害事件(如野火、极端天气等)的问题。论文的关键解决方案是提出了一种基于检索增强生成(Retrieval-Augmented Generation, RAG)的多代理LLM系统,通过整合自然危害预测数据、观测数据集以及科学文献,确保提供的信息既准确又具有上下文相关性。作为概念验证,论文展示了WildfireGPT系统,该系统专注于野火风险分析,并采用以用户为中心的多代理架构,为不同利益相关方提供定制化的风险洞察。实验结果显示,WildfireGPT在决策支持任务中显著优于现有的基于LLM的解决方案。

链接: https://arxiv.org/abs/2504.17200
作者: Yangxinyu Xie,Bowen Jiang,Tanwi Mallick,Joshua David Bergerson,John K. Hutchison,Duane R. Verner,Jordan Branham,M. Ross Alexander,Robert B. Ross,Yan Feng,Leslie-Anne Levy,Weijie Su,Camillo J. Taylor
机构: Department of Statistics and Data Science, University of Pennsylvania (宾夕法尼亚大学); Department of Computer and Information Science, University of Pennsylvania (宾夕法尼亚大学); Mathematics and Computer Science Division, Argonne National Laboratory (阿贡国家实验室); Decision and Infrastructure Sciences Division, Argonne National Laboratory (阿贡国家实验室); Environmental Science Division, Argonne National Laboratory (阿贡国家实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are a transformational capability at the frontier of artificial intelligence and machine learning that can support decision-makers in addressing pressing societal challenges such as extreme natural hazard events. As generalized models, LLMs often struggle to provide context-specific information, particularly in areas requiring specialized knowledge. In this work we propose a retrieval-augmented generation (RAG)-based multi-agent LLM system to support analysis and decision-making in the context of natural hazards and extreme weather events. As a proof of concept, we present WildfireGPT, a specialized system focused on wildfire hazards. The architecture employs a user-centered, multi-agent design to deliver tailored risk insights across diverse stakeholder groups. By integrating natural hazard and extreme weather projection data, observational datasets, and scientific literature through an RAG framework, the system ensures both the accuracy and contextual relevance of the information it provides. Evaluation across ten expert-led case studies demonstrates that WildfireGPT significantly outperforms existing LLM-based solutions for decision support.
zh
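
为便于理解上述 RAG 框架中最核心的检索环节,下面给出一个极简的 Python 示意(并非 WildfireGPT 的官方实现;其中的 embed 占位函数、文档内容与 top-k 取值均为演示用假设),展示如何按余弦相似度检索上下文并拼接进提示词:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # 占位实现:用文本哈希生成伪嵌入并归一化,仅用于演示流程,
    # 实际系统应替换为真实的嵌入模型
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    scores = [float(q @ embed(d)) for d in docs]  # 向量已归一化,点积即余弦相似度
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

docs = ["野火风险预测数据……", "极端天气观测记录……", "相关科学文献摘要……"]
question = "本地区未来野火风险如何?"
context = "\n".join(retrieve(question, docs))
prompt = f"请基于以下资料回答问题:\n{context}\n\n问题:{question}"
print(prompt)  # 该提示词随后交给 LLM 生成最终回答
```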

[NLP-30] Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

【速读】: 该论文旨在解决机器学习研究快速发展但对应的代码实现往往不可用的问题,这导致研究人员难以高效复现结果或基于已有工作进行创新。为应对这一挑战,论文提出PaperCoder,这是一种多智能体大语言模型框架,能够将机器学习论文转化为功能完备的代码仓库。其解决方案的关键在于采用三阶段流程:首先通过规划阶段构建高层次路线图、设计系统架构、识别文件依赖关系并生成配置文件;其次在分析阶段解析特定实现细节;最后在生成阶段产出模块化且具备依赖感知的代码。此外,每个阶段均由专门设计的智能体实例化,并协作完成整个流程。论文通过基于模型与人为评估的方式验证了PaperCoder在从机器学习论文生成代码实现上的有效性,并展示了其在新发布的PaperBench基准测试中的卓越表现。

链接: https://arxiv.org/abs/2504.17192
作者: Minju Seo,Jinheon Baek,Seongyun Lee,Sung Ju Hwang
机构: KAIST (韩国科学技术院); DeepAuto.ai
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, specifically from the original paper authors, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.
zh

[NLP-31] MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation NAACL2025

【速读】: 该论文旨在解决 Retrieval-Augmented Generation (RAG) 系统评估难题,由于检索与生成组件之间的复杂交互,现有 RAG 系统缺乏能够进行详细且组件特定评估的基准数据集。为应对这一挑战,论文提出 MIRAGE 数据集,这是一个专门设计用于 RAG 系统评估的问题回答数据集,包含 7,560 条精心策划的实例以及与之关联的 37,800 条检索池条目,从而实现检索和生成任务的高效精确评估。此外,论文引入新的评价指标以衡量 RAG 的适应性,包括噪声鲁棒性、上下文可接受性、上下文无关性和上下文误解析等维度。关键在于通过构建 MIRAGE 数据集及其配套的评价机制,为不同检索器-大型语言模型 (Retriever-LLM) 配置提供全面实验支持,并揭示模型配对的最佳匹配方式及 RAG 系统内部的细微动态变化。

链接: https://arxiv.org/abs/2504.17137
作者: Chanhee Park,Hyeonseok Moon,Chanjun Park,Heuiseok Lim
机构: Korea University (韩国高丽大学), Republic of Korea (韩国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to NAACL2025 Findings

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings. The MIRAGE code and data are available at this https URL.
zh

[NLP-32] Steering the CensorShip: Uncovering Representation Vectors for LLM “Thought” Control

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在安全调优过程中“审查”机制的可检测性和可控性问题。论文的核心目标是揭示和量化这些模型在响应有害请求或调整输出以符合特定偏好时的“审查”行为,并提供方法来检测和调节这种行为的强度。解决方案的关键在于提出了一种通过表示工程(representation engineering)技术找到拒绝-合规向量(refusal-compliance vector)的方法,该向量能够检测并控制模型输出中的审查水平。此外,研究进一步分析了从DeepSeek-R1蒸馏而来的推理型LLMs,发现了一种通过“思维抑制”(thought suppression)实现的额外维度的审查,并展示了如何利用类似方法找到一个抑制模型推理过程的向量,从而通过应用该向量的负倍数来移除审查。

链接: https://arxiv.org/abs/2504.17130
作者: Hannah Cyberey,David Evans
机构: University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this “censorship” works, we use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal–compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through “thought suppression”. We show a similar approach can be used to find a vector that suppresses the model’s reasoning process, allowing us to remove censorship by applying the negative multiples of this vector.
zh
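
摘要中“找到拒绝-合规向量并以其负倍数去除审查”的做法,可以用表示工程里常见的均值差方向向量来示意。下面是一个最小草图(非论文官方代码;隐藏状态以随机数据代替,实际应取自开源模型某一层在两类提示下的激活,alpha 的取值也是演示用假设):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
h_refuse = rng.standard_normal((32, d)) + 1.0   # 假设:拒绝类提示在某层的激活
h_comply = rng.standard_normal((32, d)) - 1.0   # 假设:合规类提示在同一层的激活

# 两类激活的均值之差给出“拒绝-合规”方向,再做归一化
v = h_refuse.mean(axis=0) - h_comply.mean(axis=0)
v = v / np.linalg.norm(v)

def steer(h: np.ndarray, alpha: float) -> np.ndarray:
    """alpha > 0 增强拒绝(审查),alpha < 0 削弱审查。"""
    return h + alpha * v

h_new = rng.standard_normal(d)
# 施加负倍数后,激活在拒绝方向上的投影被显著压低
print(float(h_new @ v), float(steer(h_new, -4.0) @ v))
```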

[NLP-33] The Rise of Small Language Models in Healthcare: A Comprehensive Survey

【速读】: 本文旨在解决在资源受限环境下,如何实现高效且可扩展的医疗健康信息学应用的问题。随着大型语言模型(Large Language Models, LLMs)在医疗领域的广泛应用,数据隐私问题和资源限制成为关注焦点,而小语言模型(Small Language Models, SLMs)提供了一种具有临床可行性的解决方案。关键在于通过构建分类框架识别和归类SLMs,基于此框架从三个维度——自然语言处理任务、利益相关者角色以及连续性护理——分析模型,并提出从零构建模型的架构基础、通过提示工程、指令微调和推理提升临床精度的方法,以及利用压缩技术提高模型的可访问性和可持续性。最终目标是为医疗专业人士提供全面的调查研究,介绍模型优化的最新创新,并提供精选资源以支持未来的研究与发展。

链接: https://arxiv.org/abs/2504.17119
作者: Muskan Garg,Shaina Raza,Shebuti Rayana,Xingyi Liu,Sunghwan Sohn
机构: Mayo Clinic (梅奥诊所), Rochester, Minnesota, USA; Vector Institute, Toronto, Canada; SUNY at Old Westbury (纽约州立大学旧韦斯特伯里分校), Old Westbury, New York, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 35 pages, 7 tables, 5 figures

点击查看摘要

Abstract:Amid substantial progress in healthcare applications driven by large language models (LLMs), growing concerns around data privacy and limited resources make small language models (SLMs) a scalable and clinically viable solution for efficient performance in resource-constrained environments in next-generation healthcare informatics. Our comprehensive survey presents a taxonomic framework to identify and categorize them for healthcare professionals and informaticians. The timeline of healthcare SLM contributions establishes a foundational framework for analyzing models across three dimensions: NLP tasks, stakeholder roles, and the continuum of care. We present a taxonomic framework to identify the architectural foundations for building models from scratch; adapting SLMs to clinical precision through prompting, instruction fine-tuning, and reasoning; and accessibility and sustainability through compression techniques. Our primary objective is to offer a comprehensive survey for healthcare professionals, introducing recent innovations in model optimization and equipping them with curated resources to support future research and development in the field. Aiming to showcase the groundbreaking advancements in SLMs for healthcare, we present a comprehensive compilation of experimental results across widely studied NLP tasks in healthcare to highlight the transformative potential of SLMs in healthcare. The updated repository is available on GitHub.
zh

[NLP-34] Co-CoT: A Prompt-Based Framework for Collaborative Chain-of-Thought Reasoning

【速读】: 该论文旨在解决因短形式内容的普及和人工智能快速采用导致的深度反思机会减少问题,这削弱了用户的批判性思维能力,并降低了对人工智能生成输出背后推理的参与度。为了解决这一问题,论文提出了一个交互式链式思维(Interactive Chain-of-Thought, CoT)框架。该框架的关键在于通过使模型的推理过程透明、模块化且可由用户编辑,从而增强以人为中心的可解释性和负责任的人工智能使用。框架将推理分解为用户可以检查、修改和重新执行的明确定义的块,鼓励主动的认知参与而非被动消费。此外,它集成了轻量级的基于偏好学习的编辑-适应机制,以适应不同的认知风格和用户意图,同时通过明确的元数据披露、内置的偏见检查点功能和隐私保护措施确保伦理透明性。

链接: https://arxiv.org/abs/2504.17091
作者: Seunghyun Yoo
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages

点击查看摘要

Abstract:Due to the proliferation of short-form content and the rapid adoption of AI, opportunities for deep, reflective thinking have significantly diminished, undermining users’ critical thinking and reducing engagement with the reasoning behind AI-generated outputs. To address this issue, we propose an Interactive Chain-of-Thought (CoT) Framework that enhances human-centered explainability and responsible AI usage by making the model’s inference process transparent, modular, and user-editable. The framework decomposes reasoning into clearly defined blocks that users can inspect, modify, and re-execute, encouraging active cognitive engagement rather than passive consumption. It further integrates a lightweight edit-adaptation mechanism inspired by preference learning, allowing the system to align with diverse cognitive styles and user intentions. Ethical transparency is ensured through explicit metadata disclosure, built-in bias checkpoint functionality, and privacy-preserving safeguards. This work outlines the design principles and architecture necessary to promote critical engagement, responsible interaction, and inclusive adaptation in AI systems aimed at addressing complex societal challenges.
zh
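
框架把推理分解为可检查、可修改、可重新执行的块,这一数据结构思路可用如下最小示意说明(非论文官方实现;块名与各块的“推理函数”均为演示用假设):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Block:
    name: str
    fn: Callable[[str], str]   # 每个块以上一块的输出为输入

def run(blocks: list[Block], question: str) -> str:
    state = question
    for b in blocks:
        state = b.fn(state)
        print(f"[{b.name}] {state}")   # 过程透明:逐块展示中间结论,供用户检查
    return state

blocks = [
    Block("拆解", lambda s: s + " -> 先列出已知条件"),
    Block("推理", lambda s: s + " -> 根据条件逐步推导"),
    Block("结论", lambda s: s + " -> 给出最终答案"),
]
run(blocks, "问题:……")

# 用户修改中间某个块后,整条推理链重新执行(可重跑)
blocks[1] = Block("推理(用户修改)", lambda s: s + " -> 换一种推导路径")
run(blocks, "问题:……")
```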

[NLP-35] How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study

【速读】: 该论文试图解决的问题是:探究语言风格(Language Style)在大型语言模型(LLM)与用户交互中的作用及其对用户偏好的影响,特别是评估语言风格如何以及为何会在不同用户群体和个人特质下产生差异化的偏好,并分析其潜在的双刃剑效应(提升用户体验的同时增加误用风险)。

解决方案的关键在于通过一系列探索性及实验性的用户研究,验证LLM的语言风格是否确实会影响用户的偏好,并进一步探讨这种影响如何受到用户群体特征和个体特质的调节。论文强调,理解语言风格与个人特质之间的交互作用对于全面评估LLM的用户体验至关重要,同时指出当前研究结果需要更大样本量和更广泛的代表性人群以增强结论的普适性。未来的研究方向将致力于解决这些局限性,从而更深入地揭示语言风格、个人特质与用户偏好之间的因果关系及相关机制。

链接: https://arxiv.org/abs/2504.17083
作者: Rendi Chevi,Kentaro Inui,Thamar Solorio,Alham Fikri Aji
机构: MBZUAI (MBZUAI); Abu Dhabi (阿布扎比); UAE (阿拉伯联合酋长国)
类目: Computation and Language (cs.CL)
备注: Accepted at GenAICHI 2025 @ ACM CHI 2025

点击查看摘要

Abstract:What makes an interaction with the LLM more preferable for the user? While it is intuitive to assume that information accuracy in the LLM’s responses would be one of the influential variables, recent studies have found that inaccurate LLM’s responses could still be preferable when they are perceived to be more authoritative, certain, well-articulated, or simply verbose. These variables interestingly fall under the broader category of language style, implying that the style in the LLM’s responses might meaningfully influence users’ preferences. This hypothesized dynamic could have double-edged consequences: enhancing the overall user experience while simultaneously increasing their susceptibility to risks such as LLM’s misinformation or hallucinations. In this short paper, we present our preliminary studies in exploring this subject. Through a series of exploratory and experimental user studies, we found that LLM’s language style does indeed influence user’s preferences, but how and which language styles influence the preference varied across different user populations, and more interestingly, moderated by the user’s very own individual traits. As a preliminary work, the findings in our studies should be interpreted with caution, particularly given the limitations in our samples, which still need wider demographic diversity and larger sample sizes. Our future directions will first aim to address these limitations, which would enable a more comprehensive joint effect analysis between the language style, individual traits, and preferences, and further investigate the potential causal relationship between and beyond these variables.
zh

[NLP-36] Agree to Disagree? A Meta-Evaluation of LLM Misgendering

【速读】: 该论文试图解决不同评估方法在衡量大型语言模型(LLMs)性别误指(misgendering)问题上的收敛效度(convergent validity)问题,即这些方法得出的结果是否一致。论文的关键解决方案在于提出了一种方法,能够将三个现有数据集进行转换,以实现概率性评估与生成性评估的平行化,从而系统性地比较概率性和生成性评估方法在实例、数据集和模型层面的一致性。通过自动评估六种来自三个不同家族的模型,发现这些方法在20.2%的评估实例上存在分歧。此外,结合人类评估进一步表明,性别误指行为比代词更为复杂,而当前的自动评估方法无法充分捕捉这一点,这揭示了其与人类评估之间的本质差异。

链接: https://arxiv.org/abs/2504.17075
作者: Arjun Subramonian,Vagrant Gautam,Preethi Seshadri,Dietrich Klakow,Kai-Wei Chang,Yizhou Sun
机构: UCLA(加州大学洛杉矶分校, USA); Saarland University(萨尔兰大学, Germany); UC Irvine(加州大学欧文分校, USA)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Work in progress

点击查看摘要

Abstract:Numerous methods have been proposed to measure LLM misgendering, including probability-based evaluations (e.g., automatically with templatic sentences) and generation-based evaluations (e.g., with automatic heuristics or human validation). However, it has gone unexamined whether these evaluation methods have convergent validity, that is, whether their results align. Therefore, we conduct a systematic meta-evaluation of these methods across three existing datasets for LLM misgendering. We propose a method to transform each dataset to enable parallel probability- and generation-based evaluation. Then, by automatically evaluating a suite of 6 models from 3 families, we find that these methods can disagree with each other at the instance, dataset, and model levels, conflicting on 20.2% of evaluation instances. Finally, with a human evaluation of 2400 LLM generations, we show that misgendering behaviour is complex and goes far beyond pronouns, which automatic evaluations are not currently designed to capture, suggesting essential disagreement with human evaluations. Based on our findings, we provide recommendations for future evaluations of LLM misgendering. Our results are also more widely relevant, as they call into question broader methodological conventions in LLM evaluation, which often assume that different evaluation methods agree.
zh

[NLP-37] Do Words Reflect Beliefs? Evaluating Belief Depth in Large Language Models

【速读】: 该论文试图解决的问题是:大型语言模型(LLMs)在政治话语中日益重要,但其响应的一致性在仔细审查时往往不足。现有研究主要通过将LLMs的输出归类为左倾或右倾来评估其政治立场,然而一个关键问题仍未解决——这些响应是否反映了模型真正的内在信念,还是仅仅是对训练数据的表面级对齐?
解决方案的关键在于提出了一种新的框架,通过分析(1)论证一致性与(2)不确定性量化来评估信念深度。论文通过对来自Political Compass Test的19项经济政策测试12个LLMs,并挑战其信念稳定性,揭示了LLMs表现出主题特定的信念稳定性而非统一的政治意识形态。此外,论文引入语义熵(semantic entropy)作为评估工具,实现了高精度(AUROC=0.78),有效区分了表面级对齐与真实的信念表达。这表明LLMs的信念稳定性需进行主题特定的可靠性评估,而不是简单地假设它们具有稳定的人类化政治意识形态。

链接: https://arxiv.org/abs/2504.17052
作者: Shariar Kabir,Kevin Esterling,Yue Dong
机构: Bangladesh University of Engineering and Technology (孟加拉国工程技术大学); University of California Riverside (加州大学河滨分校)
类目: Computation and Language (cs.CL)
备注: 20 pages, 9 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly shaping political discourse, yet their responses often display inconsistency when subjected to scrutiny. While prior research has primarily categorized LLM outputs as left- or right-leaning to assess their political stances, a critical question remains: Do these responses reflect genuine internal beliefs or merely surface-level alignment with training data? To address this, we propose a novel framework for evaluating belief depth by analyzing (1) argumentative consistency and (2) uncertainty quantification. We evaluate 12 LLMs on 19 economic policies from the Political Compass Test, challenging their belief stability with both supportive and opposing arguments. Our analysis reveals that LLMs exhibit topic-specific belief stability rather than a uniform ideological stance. Notably, up to 95% of left-leaning models’ responses and 89% of right-leaning models’ responses remain consistent under the challenge, enabling semantic entropy to achieve high accuracy (AUROC=0.78), effectively distinguishing between surface-level alignment from genuine belief. These findings call into question the assumption that LLMs maintain stable, human-like political ideologies, emphasizing the importance of conducting topic-specific reliability assessments for real-world applications.
zh
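
论文用语义熵(semantic entropy)区分表面级对齐与真实信念,其计算思路是:把对同一问题多次采样的回答按语义等价聚类,再对各簇的概率质量求香农熵,熵越低说明信念越稳定。下面的最小示意用字符串精确匹配代替真实的语义聚类(演示用简化假设,实际应使用 NLI 或嵌入相似度判定等价):

```python
import math
from collections import Counter

def semantic_entropy(samples: list[str]) -> float:
    clusters = Counter(samples)      # 假设:字符串相同即语义等价
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

answers = ["支持", "支持", "支持", "反对", "支持"]
print(semantic_entropy(answers))     # ≈ 0.500;回答越集中,熵越低
```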

[NLP-38] SCALAR: A Part-of-speech Tagger for Identifiers

【速读】: 该论文旨在解决源代码标识符名称与其对应的词性标注序列(语法模式)之间的映射问题,即开发一种专门用于标识符命名分析的工具。解决方案的关键在于构建了一个名为SCALAR的工具,其内部模型通过结合scikit-learn的GradientBoostingClassifier与人工精心编纂的标识符及其语法模式的语料库进行训练,从而能够识别开发者在创建各类标识符(如函数名、变量名等)时使用的自然语言独特结构。这一方法显著提升了标识符注释的准确性,优于其他现有的词性标注器。

链接: https://arxiv.org/abs/2504.17038
作者: Christian D. Newman,Brandon Scholten,Sophia Testa,Joshua A. C. Behler,Syreen Banabilah,Michael L. Collard,Michael J. Decker,Mohamed Wiem Mkaouer,Marcos Zampieri,Eman Abdullah AlOmar,Reem Alsuhaibani,Anthony Peruma,Jonathan I. Maletic
机构: Rochester Institute of Technology (罗切斯特理工学院); Kent State University (肯特州立大学); University of Akron (阿克伦大学); Bowling Green State University (鲍灵格林州立大学); University of Michigan Flint (密歇根大学弗林特分校); George Mason University (乔治梅森大学); Stevens Institute of Technology (史蒂文斯理工学院); Prince Sultan University (苏丹王子大学); University of Hawaii at Manoa (夏威夷大学马诺阿分校)
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR’s internal model is trained using scikit-learn’s GradientBoostingClassifier in conjunction with a manually-curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names, etc.). SCALAR’s output is compared with a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger, to show how it improves upon other taggers’ output for annotating identifiers. The code is available on GitHub.
zh
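
摘要提到 SCALAR 的内部模型基于 scikit-learn 的 GradientBoostingClassifier 训练。下面按这一思路给出最小示意(特征设计、标签集合与训练样本均为演示用假设,并非 SCALAR 的真实特征工程或标签体系):

```python
from sklearn.ensemble import GradientBoostingClassifier

def features(word: str, pos_in_name: int, total: int) -> list[float]:
    return [
        len(word),
        pos_in_name,                 # 单词在标识符中的位置
        total,                       # 标识符拆分出的单词总数
        float(word.endswith("s")),   # 粗略的复数线索
        float(word.endswith("ed")),  # 粗略的过去分词线索
    ]

# 演示用训练数据:如标识符 get_user_names 拆为 get/user/names
train = [("get", 0, 3, "V"), ("user", 1, 3, "N"), ("names", 2, 3, "NPL"),
         ("is", 0, 2, "V"), ("valid", 1, 2, "ADJ"),
         ("count", 0, 1, "N"), ("sorted", 0, 2, "VPP"), ("list", 1, 2, "N")]
X = [features(w, i, t) for w, i, t, _ in train]
y = [tag for *_, tag in train]

clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict([features("items", 1, 2)]))   # 期望输出接近 'NPL'
```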

[NLP-39] Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

【速读】: 该论文旨在解决现有预训练大语言模型(Pretrained Large Language Models, LLMs)在非英语语言任务中的低效编码(高token“生育率”)和较慢推理速度的问题。尽管最先进的LLMs能够处理多种语言,但由于语言污染或部分多语言预训练数据的影响,它们并未针对非英语语言进行优化。论文的关键解决方案是提出了一种名为语义对齐词汇适应(Semantic Alignment Vocabulary Adaptation, SAVA)的新方法,该方法利用神经映射实现词汇替换。通过这一方法,论文展示了如何有效优化英语LLMs以适应意大利语,并在多个下游任务中取得了具有竞争力的表现。此外,研究还表明,经过词汇适配后,这些模型仅需有限的连续训练阶段即可恢复其性能。

链接: https://arxiv.org/abs/2504.17025
作者: Luca Moroni,Giovanni Puccetti,Pere-Lluis Huguet Cabot,Andrei Stefan Bejgu,Edoardo Barba,Alessio Miaschi,Felice Dell’Orletta,Andrea Esuli,Roberto Navigli
机构: Sapienza University of Rome (罗马第一大学); ISTI-CNR (意大利国家研究委员会信息学研究所); ILC-CNR (意大利国家研究委员会语言与文化研究所); Babelscape (Babelscape)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for the English language. While state-of-the-art LLMs can handle other languages, due to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, leading to inefficient encoding (high token “fertility”) and slower inference speed. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for the Italian language, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7b-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following the adaptation of the vocabulary, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on various multi-choice and generative tasks.
zh
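
摘要中的 token“生育率”(fertility)通常定义为分词后的 token 数与按空白切分的词数之比,可用如下小函数直观计算(分词器以可调用对象传入,便于替换成任意真实分词器;示例里的字符级分词器仅为演示假设):

```python
from typing import Callable

def fertility(text: str, tokenize: Callable[[str], list[str]]) -> float:
    """token 生育率 = token 数 / 词数;数值越高,编码越低效。"""
    words = text.split()
    return len(tokenize(text)) / max(len(words), 1)

# 假设的字符级分词器:意大利语长词会显著推高 fertility
char_tokenizer = lambda s: [c for c in s if not c.isspace()]
sample = "precipitevolissimevolmente parlando"
print(fertility(sample, char_tokenizer))   # 字符级:每词 token 数很高
print(fertility(sample, str.split))        # 词级:fertility = 1.0
```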

[NLP-40] (Im)possibility of Automated Hallucination Detection in Large Language Models

【速读】: 该论文试图解决的问题是:是否可以通过自动化方法检测大型语言模型(Large Language Models, LLMs)生成的幻觉(hallucinations)。研究基于经典Gold-Angluin语言识别框架及其在语言生成中的最新适应性,探讨算法能否在仅访问目标语言集合中未知语言 K 的正确示例以及LLM输出的情况下,可靠地区分LLM输出是正确的还是幻觉。

解决方案的关键在于引入专家标注反馈。论文首先证明了幻觉检测与传统语言识别任务在理论上是等价的,且两者可以相互转换。然而,由于语言识别任务本身的难度,若仅使用目标语言中的正确示例训练检测器,则对于大多数语言集合,幻觉检测本质上是不可能实现的。进一步地,论文指出通过结合正样本(正确陈述)和负样本(明确标注的错误陈述)的专家标注反馈进行训练,可以在所有可数语言集合上实现自动化的幻觉检测。这一结果强调了专家标注示例在训练幻觉检测器中的核心作用,并为基于反馈的方法(如带有人类反馈的强化学习,Reinforcement Learning with Human Feedback, RLHF)提供了理论支持,这些方法已被证明对LLM的可靠部署至关重要。

链接: https://arxiv.org/abs/2504.17004
作者: Amin Karbasi,Omar Montasser,John Sous,Grigoris Velegkas
机构: Yale University (耶鲁大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Is automated hallucination detection possible? In this work, we introduce a theoretical framework to analyze the feasibility of automatically detecting hallucinations produced by large language models (LLMs). Inspired by the classical Gold-Angluin framework for language identification and its recent adaptation to language generation by Kleinberg and Mullainathan, we investigate whether an algorithm, trained on examples drawn from an unknown target language K (selected from a countable collection) and given access to an LLM, can reliably determine whether the LLM’s outputs are correct or constitute hallucinations. First, we establish an equivalence between hallucination detection and the classical task of language identification. We prove that any hallucination detection method can be converted into a language identification method, and conversely, algorithms solving language identification can be adapted for hallucination detection. Given the inherent difficulty of language identification, this implies that hallucination detection is fundamentally impossible for most language collections if the detector is trained using only correct examples from the target language. Second, we show that the use of expert-labeled feedback, i.e., training the detector with both positive examples (correct statements) and negative examples (explicitly labeled incorrect statements), dramatically changes this conclusion. Under this enriched training regime, automated hallucination detection becomes possible for all countable language collections. These results highlight the essential role of expert-labeled examples in training hallucination detectors and provide theoretical support for feedback-based methods, such as reinforcement learning with human feedback (RLHF), which have proven critical for reliable LLM deployment.
zh

[NLP-41] Tokenization Matters: Improving Zero-Shot NER for Indic Languages

【速读】: 该论文旨在解决低资源印度语言(Indic languages)中命名实体识别(Named Entity Recognition, NER)任务中的词元化(tokenization)策略选择问题,特别是针对形态学复杂度较高的语言。论文的关键在于比较三种词元化方法——Byte Pair Encoding (BPE)、SentencePiece 和字符级词元化(Character Level),评估其在低资源及极低资源印度语言(如阿萨姆语、孟加拉语、马拉地语、奥里亚语、桑塔利语、曼尼普尔语和信德语)中的性能表现。研究不仅考察了内在的语言学特性(如词元化效率、词汇外(Out-of-Vocabulary, OOV)率和形态学保留能力),还评估了下游任务的表现(包括微调和零样本跨语言迁移)。结果表明,相较于BPE,SentencePiece在低资源印度语言的NER任务中表现出更一致的优越性,尤其是在零样本跨语言设置下,能够更好地保持实体一致性。关键在于SentencePiece不仅在形态学丰富的语言中提供更好的结构保留,还能实现更高的跨书写系统(如阿拉伯文中的信德语)泛化能力,从而成为多语言低资源印度自然语言处理应用中更有效的NER词元化策略。

链接: https://arxiv.org/abs/2504.16977
作者: Priyaranjan Pattnayak,Hitesh Laxmichand Patel,Amit Agarwal
机构: University of Washington (华盛顿大学); New York University (纽约大学); Liverpool John Moores University (利物浦约翰摩尔斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Tokenization is a critical component of Natural Language Processing (NLP), especially for low resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is a standard tokenization method in multilingual language models, its suitability for Named Entity Recognition (NER) in low resource Indic languages remains underexplored due to its limitations in handling morphological complexity. In this work, we systematically compare BPE, SentencePiece, and Character Level tokenization strategies using IndicBERT for NER tasks in low resource Indic languages like Assamese, Bengali, Marathi, and Odia, as well as extremely low resource Indic languages like Santali, Manipuri, and Sindhi. We assess both intrinsic linguistic properties tokenization efficiency, out of vocabulary (OOV) rates, and morphological preservation as well as extrinsic downstream performance, including fine tuning and zero shot cross lingual transfer. Our experiments show that SentencePiece is a consistently better performing approach than BPE for NER in low resource Indic Languages, particularly in zero shot cross lingual settings, as it better preserves entity consistency. While BPE provides the most compact tokenization form, it is not capable of generalization because it misclassifies or even fails to recognize entity labels when tested on unseen languages. In contrast, SentencePiece constitutes a better linguistic structural preservation model, benefiting extremely low resource and morphologically rich Indic languages, such as Santali and Manipuri, for superior entity recognition, as well as high generalization across scripts, such as Sindhi, written in Arabic. The results point to SentencePiece as the more effective tokenization strategy for NER within multilingual and low resource Indic NLP applications.
zh
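
作为参考,下面给出用 sentencepiece 库训练并使用 SentencePiece 分词器的最小示意(corpus.txt、词表大小与覆盖率等参数均为演示用假设,并非论文的实验设置):

```python
import sentencepiece as spm

# 在单语语料上训练 SentencePiece 分词器
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # 每行一句的原始文本(需自行准备)
    model_prefix="indic_sp",   # 输出 indic_sp.model / indic_sp.vocab
    vocab_size=8000,
    character_coverage=0.9995, # 字符种类丰富的语言宜取接近 1 的覆盖率
)

sp = spm.SentencePieceProcessor(model_file="indic_sp.model")
pieces = sp.encode("এটি একটি উদাহরণ বাক্য", out_type=str)  # 孟加拉语示例句
print(pieces)   # 子词序列,可进一步与 NER 标签对齐
```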

[NLP-42] Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity

【速读】: 本文旨在解决单细胞RNA测序(scRNA-seq)数据分析中因高维度、稀疏性和批次效应引起的复杂性所带来的主要计算挑战。传统基于Transformer的方法虽然在该领域取得了显著进展,但通常受到二次复杂度以及对长距离依赖处理不够优化的限制。论文的关键解决方案是引入GeneMamba,这是一种基于状态空间建模的可扩展且高效的单细胞转录组学基础模型。GeneMamba利用Bi-Mamba架构以线性时间复杂度捕捉双向基因上下文,在计算效率上显著优于Transformer基线模型。此外,GeneMamba通过近3000万个细胞的预训练,并结合生物学导向的目标函数,如路径感知对比损失和基于排名的基因编码,进一步增强了其性能。实验结果表明,GeneMamba在多批次整合、细胞类型注释和基因-基因相关性等任务上表现出色,具有良好的性能、可解释性和鲁棒性,成为一种实用且强大的替代方案,推动了生物基础、可扩展工具的发展用于大规模单细胞数据解析。

链接: https://arxiv.org/abs/2504.16956
作者: Cong Qi,Hanzhang Fang,Tianxing Hu,Siqi Jiang,Wei Zhi
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Genomics (q-bio.GN)
备注:

点击查看摘要

Abstract:Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed objectives, including pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.
zh

[NLP-43] A Desideratum for Conversational Agents : Capabilities Challenges and Future Directions

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)驱动的对话代理在实现接近人类智能的系统过程中所面临的挑战与局限性。论文试图回答关于这些模型能力、限制及未来发展路径的基本问题,并提出下一代对话代理的理想需求——即已取得的成就、持续存在的挑战以及为了构建更具扩展性的系统所需完成的工作。关键在于从推理(Reasoning)、监控(Monitor)和控制(Control)三个主要维度系统分析LLM驱动的对话代理的能力,并基于此引入新的分类法来组织相关研究工作。通过识别重要的研究空白并概述关键方向,如现实评估、长期多轮推理技能、自我进化能力、协作与多智能体任务完成、个性化及主动性等,论文为对话代理的发展提供了结构化的基础,指出了现有局限,并为未来的研究方向提供了洞见,从而推动人工通用智能(Artificial General Intelligence, AGI)的进步。

链接: https://arxiv.org/abs/2504.16939
作者: Emre Can Acikgoz,Cheng Qian,Hongru Wang,Vardhan Dongre,Xiusi Chen,Heng Ji,Dilek Hakkani-Tür,Gokhan Tur
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. Yet, fundamental questions about their capabilities, limitations, and paths forward remain open. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence. To that end, we systematically analyze LLM-driven Conversational Agents by organizing their capabilities into three primary dimensions: (i) Reasoning - logical, systematic thinking inspired by human intelligence for decision making, (ii) Monitor - encompassing self-awareness and user interaction monitoring, and (iii) Control - focusing on tool utilization and policy following. Building upon this, we introduce a novel taxonomy by classifying recent work on Conversational Agents around our proposed desideratum. We identify critical research gaps and outline key directions, including realistic evaluations, long-term multi-turn reasoning skills, self-evolution capabilities, collaborative and multi-agent task completion, personalization, and proactivity. This work aims to provide a structured foundation, highlight existing limitations, and offer insights into potential future research directions for Conversational Agents, ultimately advancing progress toward Artificial General Intelligence (AGI). We maintain a curated repository of papers at: this https URL.
zh

计算机视觉

[CV-0] LiDPM: Rethinking Point Diffusion for Lidar Scene Completion

【速读】:该论文旨在解决在室外场景尺度下直接对激光雷达点云进行扩散模型训练的挑战,主要困难在于从白噪声中生成大视场下的精细细节。传统方法通常基于对象级别,使用标准去噪扩散概率模型(DDPM),而最新工作通过将原始DDPM重构成局部扩散过程来解决场景补全问题。本文填补了这两种方法之间的空白,指出局部扩散公式中的近似并非必要,并证明只需选择合适的起始点,标准DDPM即可实现场景级别的补全任务。关键解决方案在于采用一种名为LiDPM的方法,其在SemanticKITTI数据集上的实验结果表明优于现有方法。

链接: https://arxiv.org/abs/2504.17791
作者: Tetiana Martyniuk,Gilles Puy,Alexandre Boulch,Renaud Marlet,Raoul de Charette
机构: Inria; valeo.ai; LIGM, ENPC, Univ Gustave Eiffel, CNRS (法国国家科学研究中心), France
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to IEEE IV 2025

点击查看摘要

Abstract:Training diffusion models that work directly on lidar points at the scale of outdoor scenes is challenging due to the difficulty of generating fine-grained details from white noise over a broad field of view. The latest works addressing scene completion with diffusion models tackle this problem by reformulating the original DDPM as a local diffusion process. It contrasts with the common practice of operating at the level of objects, where vanilla DDPMs are currently used. In this work, we close the gap between these two lines of work. We identify approximations in the local diffusion formulation, show that they are not required to operate at the scene level, and that a vanilla DDPM with a well-chosen starting point is enough for completion. Finally, we demonstrate that our method, LiDPM, leads to better results in scene completion on SemanticKITTI. The project page is this https URL .
zh

[CV-1] Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

【速读】:该论文旨在解决基于自回归(Autoregressive, AR)模型在图像合成任务中因所需图像tokens数量庞大而导致训练与推理效率低下以及图像分辨率受限的问题。为了解决这一挑战,论文提出了一种名为Token-Shuffle的新方法。其关键是利用多模态大型语言模型(Multimodal Large Language Models, MLLMs)中视觉词汇的维度冗余特性,即视觉编码器输出的低维视觉代码被直接映射到高维语言词汇表上。基于此洞察,论文设计了两个关键操作:token-shuffle通过沿通道维度合并空间局部tokens来减少输入tokens的数量;token-unshuffle则在Transformer块后的推断tokens上进行逆操作,以恢复空间排列用于输出。该策略联合文本提示进行训练,无需额外的预训练文本编码器,使MLLMs能够在统一的下一token预测方式下支持极高分辨率(如2048x2048)的图像合成,同时保持高效的训练和推理性能。

链接: https://arxiv.org/abs/2504.17789
作者: Xu Ma,Peize Sun,Haoyu Ma,Hao Tang,Chih-Yao Ma,Jialiang Wang,Kunpeng Li,Xiaoliang Dai,Yujun Shi,Xuan Ju,Yushi Hu,Artsiom Sanakoyeu,Felix Juefei-Xu,Ji Hou,Junjiao Tian,Tao Xu,Tingbo Hou,Yen-Cheng Liu,Zecheng He,Zijian He,Matt Feiszli,Peizhao Zhang,Peter Vajda,Sam Tsai,Yun Fu
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autoregressive (AR) models, long dominant in language generation, are increasingly applied to image synthesis but are often considered less competitive than Diffusion-based models. A primary limitation is the substantial number of image tokens required for AR models, which constrains both training and inference efficiency, as well as image resolution. To address this, we present Token-Shuffle, a novel yet simple method that reduces the number of image tokens in Transformer. Our key insight is the dimensional redundancy of visual vocabularies in Multimodal Large Language Models (MLLMs), where low-dimensional visual codes from visual encoder are directly mapped to high-dimensional language vocabularies. Leveraging this, we consider two key operations: token-shuffle, which merges spatially local tokens along channel dimension to decrease the input token number, and token-unshuffle, which untangles the inferred tokens after Transformer blocks to restore the spatial arrangement for output. Jointly training with textual prompts, our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis in a unified next-token prediction way while maintaining efficient training and inference. For the first time, we push the boundary of AR text-to-image generation to a resolution of 2048x2048 with gratifying generation performance. In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15. Exhaustive large-scale human evaluations also demonstrate our prominent image generation ability in terms of text-alignment, visual flaw, and visual appearance. We hope that Token-Shuffle can serve as a foundational design for efficient high-resolution image generation within MLLMs.
zh
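
token-shuffle 与 token-unshuffle 这对操作在形状变换上与 pixel (un)shuffle 类似,可用如下 numpy 草图说明(按摘要描述写的概念性示意,并非官方实现;通道融合所需的线性投影从略):

```python
import numpy as np

def token_shuffle(x: np.ndarray, s: int) -> np.ndarray:
    """把 s×s 空间邻域的 token 沿通道维合并,token 数缩小为 1/s²。"""
    H, W, C = x.shape
    x = x.reshape(H // s, s, W // s, s, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // s, W // s, s * s * C)

def token_unshuffle(x: np.ndarray, s: int, C: int) -> np.ndarray:
    """逆操作:把合并的通道还原为 s×s 的空间排列。"""
    h, w, _ = x.shape
    x = x.reshape(h, w, s, s, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(h * s, w * s, C)

x = np.random.randn(8, 8, 16)                     # 8×8 个 token,通道数 16
y = token_shuffle(x, 2)                           # -> (4, 4, 64):token 数缩小 4 倍
assert np.allclose(token_unshuffle(y, 2, 16), x)  # 逆操作可精确还原
```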

[CV-2] Dynamic Camera Poses and Where to Find Them CVPR2025

【速读】:该论文旨在解决大规模动态互联网视频中相机姿态标注的问题,这对于推动真实感视频生成和模拟等领域的进步至关重要。然而,由于大多数互联网视频不适合姿态估计,收集这样的数据集极具挑战性,即使对于最先进的方法而言,对动态视频进行标注也存在显著困难。论文的关键解决方案在于引入了DynPose-100K,这是一个包含动态互联网视频及其相机姿态标注的大规模数据集。其数据收集流程通过结合特定任务模型与通用模型来实现过滤优化,并在姿态估计过程中综合运用点跟踪(point tracking)、动态掩蔽(dynamic masking)以及运动结构恢复(structure-from-motion)等最新技术,从而超越现有最先进方法。分析和实验表明,DynPose-100K不仅规模庞大,而且在多个关键属性上具有多样性,为下游应用的发展开辟了新的路径。

链接: https://arxiv.org/abs/2504.17788
作者: Chris Rockwell,Joseph Tung,Tsung-Yi Lin,Ming-Yu Liu,David F. Fouhey,Chen-Hsuan Lin
机构: NVIDIA(英伟达); University of Michigan(密歇根大学); New York University(纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025. Project Page: this https URL

点击查看摘要

Abstract:Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over the state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.
zh

[CV-3] The Fourth Monocular Depth Estimation Challenge CVPR

【速读】:该论文致力于解决单目深度估计在零样本泛化(zero-shot generalization)场景下的性能提升问题,具体聚焦于SYNS-Patches基准数据集中的自然与室内复杂环境。论文的关键解决方案在于改进评估协议,采用具有两个自由度的最小二乘对齐方法,以支持视差和仿射不变预测;同时引入了流行的现成方法作为基线,并鼓励参赛者开发依赖于仿射不变预测的先进模型。最终,挑战赛的获胜方案通过优化上述策略,将3D F-Score从上一届的最佳结果22.58%提升至23.05%。

链接: https://arxiv.org/abs/2504.17787
作者: Anton Obukhov,Matteo Poggi,Fabio Tosi,Ripudaman Singh Arora,Jaime Spencer,Chris Russell,Simon Hadfield,Richard Bowden,Shuaihang Wang,Zhenxin Ma,Weijie Chen,Baobei Xu,Fengyu Sun,Di Xie,Jiang Zhu,Mykola Lavreniuk,Haining Guan,Qun Wu,Yupei Zeng,Chao Lu,Huanran Wang,Guangyuan Zhou,Haotian Zhang,Jianxiong Wang,Qiang Rao,Chunjie Wang,Xiao Liu,Zhiqiang Lou,Hualie Jiang,Yihao Chen,Rui Xu,Minglang Tan,Zihan Qin,Yifan Mao,Jiayang Liu,Jialei Xu,Yifan Yang,Wenbo Zhao,Junjun Jiang,Xianming Liu,Mingshuai Zhao,Anlong Ming,Wu Chen,Feng Xue,Mengying Yu,Shida Gao,Xiangfeng Wang,Gbenga Omotara,Ramy Farag,Jacket Demby,Seyed Mohamad Ali Tousi,Guilherme N DeSouza,Tuan-Anh Yang,Minh-Quang Nguyen,Thien-Phuc Tran,Albert Luginov,Muhammad Shahzad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in CVPRW2025

点击查看摘要

Abstract:This paper presents the results of the fourth edition of the Monocular Depth Estimation Challenge (MDEC), which focuses on zero-shot generalization to the SYNS-Patches benchmark, a dataset featuring challenging environments in both natural and indoor settings. In this edition, we revised the evaluation protocol to use least-squares alignment with two degrees of freedom to support disparity and affine-invariant predictions. We also revised the baselines and included popular off-the-shelf methods: Depth Anything v2 and Marigold. The challenge received a total of 24 submissions that outperformed the baselines on the test set; 10 of these included a report describing their approach, with most leading methods relying on affine-invariant predictions. The challenge winners improved the 3D F-Score over the previous edition’s best result, raising it from 22.58% to 23.05%.
zh
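
本届挑战采用的“两个自由度的最小二乘对齐”即求尺度 s 与平移 t,使 ||s*pred + t - gt||^2 最小,从而兼容仿射不变的深度/视差预测。下面是该对齐的一个 numpy 示意(数据为演示用假设,评测协议的其他细节从略):

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray):
    """最小二乘求解 argmin_{s,t} ||s * pred + t - gt||^2。"""
    A = np.stack([pred, np.ones_like(pred)], axis=1)   # 设计矩阵 [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s, t

gt = np.array([1.0, 2.0, 3.0, 4.0])
pred = 0.5 * gt + 0.3 + np.random.default_rng(0).normal(0, 0.01, 4)
s, t = align_scale_shift(pred, gt)
aligned = s * pred + t
print(s, t, np.abs(aligned - gt).max())   # 对齐后误差接近噪声水平
```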

[CV-4] Step1X-Edit: A Practical Framework for General Image Editing

【速读】:该论文旨在解决开源图像编辑算法与闭源高级多模态模型(如GPT-4o和Gemini2 Flash)之间的性能差距问题。论文的关键解决方案是提出了一种名为Step1X-Edit的最新图像编辑模型,其通过采用多模态大型语言模型(Multimodal LLM)处理参考图像和用户的编辑指令,提取潜在嵌入并向扩散图像解码器集成,从而生成目标图像。为了训练该模型,构建了一个高质量数据生成管道;同时,开发了基于真实用户指令的新型基准GEdit-Bench进行评估,证明Step1X-Edit在性能上显著优于现有开源基线,并接近领先的闭源模型,从而推动图像编辑领域的发展。

链接: https://arxiv.org/abs/2504.17761
作者: Shiyu Liu,Yucheng Han,Peng Xing,Fukun Yin,Rui Wang,Wei Cheng,Jiaqi Liao,Yingming Wang,Honghao Fu,Chunrui Han,Guopeng Li,Yuang Peng,Quan Sun,Jingwei Wu,Yan Cai,Zheng Ge,Ranchen Ming,Lei Xia,Xianfang Zeng,Yibo Zhu,Binxing Jiao,Xiangyu Zhang,Gang Yu,Daxin Jiang
机构: Step1X-Image Team (Step1X-Image 团队); StepFun (StepFun)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: code: this https URL

点击查看摘要

Abstract:In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide performance comparable to closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt the Multimodal LLM to process the reference image and the user’s editing instruction. A latent embedding has been extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
zh

[CV-5] EgoCHARM: Resource-Efficient Hierarchical Activity Recognition using an Egocentric IMU Sensor

【速读】:本文旨在解决基于头戴式惯性测量单元(Inertial Measurement Unit, IMU)的自视角(egocentric)人体活动识别(Human Activity Recognition, HAR)中存在的性能低下或资源消耗过高的问题。为应对这一挑战,论文提出了一种名为EgoCHARM的高效机器学习算法,其关键在于采用半监督学习策略,主要依赖高阶活动标签进行训练,从而学习可泛化的低阶运动嵌入表示,用于低阶活动的精确识别。这种分层算法在仅使用有限模型参数(高阶活动63k,低阶活动22k)的情况下,实现了高阶活动F1分数0.826和低阶活动F1分数0.855的良好性能,同时支持低阶编码器直接部署于现有的IMU芯片上,兼顾计算效率与功耗优化。

链接: https://arxiv.org/abs/2504.17735
作者: Akhil Padmanabha,Saravanan Govindarajan,Hwanmun Kim,Sergio Ortiz,Rahul Rajan,Doruk Senkal,Sneha Kadetotad
机构: Carnegie Mellon University (卡内基梅隆大学); Meta Reality Labs (Meta现实实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Human activity recognition (HAR) on smartglasses has various use cases, including health/fitness tracking and input for context-aware AI assistants. However, current approaches for egocentric activity recognition suffer from low performance or are resource-intensive. In this work, we introduce a resource (memory, compute, power, sample) efficient machine learning algorithm, EgoCHARM, for recognizing both high level and low level activities using a single egocentric (head-mounted) Inertial Measurement Unit (IMU). Our hierarchical algorithm employs a semi-supervised learning strategy, requiring primarily high level activity labels for training, to learn generalizable low level motion embeddings that can be effectively utilized for low level activity recognition. We evaluate our method on 9 high level and 3 low level activities achieving 0.826 and 0.855 F1 scores on high level and low level activity recognition respectively, with just 63k high level and 22k low level model parameters, allowing the low level encoder to be deployed directly on current IMU chips with compute. Lastly, we present results and insights from a sensitivity analysis and highlight the opportunities and limitations of activity recognition using egocentric IMUs.
zh

[CV-6] DPMambaIR: All-in-One Image Restoration via Degradation-Aware Prompt State Space Model

【速读】:该论文致力于解决传统图像修复方法因针对每种退化类型设计专用模型而导致的高训练成本和部署复杂性问题。为实现单一模型处理多种图像退化问题(即All-in-One图像修复),现有方法通常依赖于特定退化模型或粗粒度退化提示,但存在对退化信息建模不足以及多任务冲突平衡困难的问题。论文提出的关键解决方案是DPMambaIR框架,其通过整合Degradation-Aware Prompt State Space Model (DP-SSM) 和High-Frequency Enhancement Block (HEB),实现了复杂退化信息的细粒度建模与高效全局整合,同时缓解了任务竞争引起的高频细节损失。具体而言,DP-SSM利用预训练的退化特征提取器捕获细粒度退化特征,并动态融入状态空间建模过程以增强模型对多样化退化类型的适应性;HEB则补充高频信息,有效应对多任务场景下边缘和纹理等关键细节的丢失问题。实验结果表明,DPMambaIR在包含七种退化类型的混合数据集上达到最优性能,PSNR和SSIM分别为27.69 dB和0.893,凸显了其作为统一解决方案的潜力与优势。

链接: https://arxiv.org/abs/2504.17732
作者: Zhanwen Liu,Sai Zhou,Yuchao Dai,Yang Wang,Yisheng An,Xiangmo Zhao
机构: Chang’an University (长安大学)(西安, China); Northwestern Polytechnical University (西北工业大学)(西安, China)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:All-in-One image restoration aims to address multiple image degradation problems using a single model, significantly reducing training costs and deployment complexity compared to traditional methods that design dedicated models for each degradation type. Existing approaches typically rely on Degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework. By integrating a Degradation-Aware Prompt State Space Model (DP-SSM) and a High-Frequency Enhancement Block (HEB), DPMambaIR enables fine-grained modeling of complex degradation information and efficient global integration, while mitigating the loss of high-frequency details caused by task competition. Specifically, the DP-SSM utilizes a pre-trained degradation extractor to capture fine-grained degradation features and dynamically incorporates them into the state space modeling process, enhancing the model’s adaptability to diverse degradation types. Concurrently, the HEB supplements high-frequency information, effectively addressing the loss of critical details, such as edges and textures, in multi-task image restoration scenarios. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with 27.69dB and 0.893 in PSNR and SSIM, respectively. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.
zh
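
摘要中报告的 PSNR(27.69 dB)是图像复原的标准指标,定义为 10·log10(MAX²/MSE)。下面几行代码给出其计算方式(data_range 取决于图像取值范围,此处按 [0,1] 假设):

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """峰值信噪比:10 * log10(MAX^2 / MSE),单位 dB。"""
    mse = float(np.mean((x - y) ** 2))
    return 10.0 * np.log10(data_range ** 2 / mse)

rng = np.random.default_rng(0)
clean = rng.random((64, 64, 3))
restored = np.clip(clean + rng.normal(0, 0.02, clean.shape), 0, 1)
print(round(psnr(restored, clean), 2))   # 噪声越小,PSNR 越高
```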

[CV-7] CasualHDRSplat: Robust High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos

【速读】:该论文试图解决从随意捕获的视频中重建高质量三维高动态范围(HDR)场景的问题,传统方法通常依赖于固定相机位置下的多曝光图像,而本文提出了一种名为\textbf{CasualHDRSplat}的一阶段方法,能够在自动曝光开启的情况下,即使存在严重运动模糊和未知曝光时间变化时,也能鲁棒地进行重建。其关键在于引入了一个统一的可微物理成像模型,通过在成像过程中施加连续时间轨迹约束,从而能够同时优化曝光时间、相机响应函数(CRF)、相机姿态以及清晰的三维HDR场景。

链接: https://arxiv.org/abs/2504.17728
作者: Shucheng Gong,Lingzhe Zhao,Wenpu Li,Hong Xie,Yin Zhang,Shiyu Zhao,Peidong Liu
机构: Westlake University (西湖大学); Wuhan University (武汉大学); ETH Zürich (瑞士苏黎世联邦理工学院); Zhejiang University (浙江大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Source Code: this https URL

点击查看摘要

Abstract:Recently, photo-realistic novel view synthesis from multi-view images, such as neural radiance field (NeRF) and 3D Gaussian Splatting (3DGS), have garnered widespread attention due to their superior performance. However, most works rely on low dynamic range (LDR) images, which limits the capturing of richer scene details. Some prior works have focused on high dynamic range (HDR) scene reconstruction, typically require capturing of multi-view sharp images with different exposure times at fixed camera positions during exposure times, which is time-consuming and challenging in practice. For a more flexible data acquisition, we propose a one-stage method, CasualHDRSplat, to easily and robustly reconstruct the 3D HDR scene from casually captured videos with auto-exposure enabled, even in the presence of severe motion blur and varying unknown exposure time. CasualHDRSplat contains a unified differentiable physical imaging model, which first applies a continuous-time trajectory constraint to the imaging process so that we can jointly optimize exposure time, camera response function (CRF), camera poses, and sharp 3D HDR scene. Extensive experiments demonstrate that our approach outperforms existing methods in terms of robustness and rendering quality. Our source code will be available at this https URL
zh

[CV-8] Generative Fields: Uncovering Hierarchical Feature Control for StyleGAN via Inverted Receptive Fields

【速读】:该论文旨在解决基于生成对抗网络(GANs)图像生成中对生成图像特征控制困难的问题,主要源于潜在空间的低维纠缠特性。尽管已有工作通过图像或文本提示在W潜空间中调节采样以实现一定程度的可控性,但W潜空间仍受限于其无法直接控制特征合成,并且需要预训练过程来重构风格信号,限制了其广泛应用。为了解决这些问题,论文引入“生成场”(Generative Fields)的概念,受卷积神经网络(CNNs)感受野的启发,解释StyleGAN中的分层特征合成机制。同时,提出了一种新的基于生成场理论和通道式风格潜空间S的StyleGAN图像编辑流水线,利用CNN的内在结构特性,在合成过程中实现特征生成的解耦控制。关键在于利用生成场理论与S潜空间,直接针对特征合成进行精确控制,从而绕过传统方法中对W潜空间的依赖及其局限性。

链接: https://arxiv.org/abs/2504.17712
作者: Zhuo He,Paul Henderson,Nicolas Pugeault
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:StyleGAN has demonstrated the ability of GANs to synthesize highly-realistic faces of imaginary people from random noise. One limitation of GAN-based image generation is the difficulty of controlling the features of the generated image, due to the strong entanglement of the low-dimensional latent space. Previous work that aimed to control StyleGAN with image or text prompts modulated sampling in W latent space, which is more expressive than Z latent space. However, W space still has restricted expressivity since it does not control the feature synthesis directly; also the feature embedding in W space requires a pre-training process to reconstruct the style signal, limiting its application. This paper introduces the concept of “generative fields” to explain the hierarchical feature synthesis in StyleGAN, inspired by the receptive fields of convolution neural networks (CNNs). Additionally, we propose a new image editing pipeline for StyleGAN using generative field theory and the channel-wise style latent space S, utilizing the intrinsic structural feature of CNNs to achieve disentangled control of feature synthesis at synthesis time.
zh

[CV-9] Hierarchical and Multimodal Data for Daily Activity Understanding

【速读】:该论文试图解决如何通过多模态数据理解和建模复杂的人类活动,并揭示人类中心应用中的重要挑战。为实现这一目标,论文构建了一个名为Daily Activity Recordings for Artificial Intelligence (DARai) 的多层级注释数据集,包含超过200小时的多传感器数据,覆盖多种环境和活动场景。DARai的关键在于其多层次注释结构(L1高阶活动、L2低阶动作和L3细粒度步骤),并通过共享标注设计捕捉活动间的复杂关联性,同时利用未标注数据支持反事实活动的生成。基于此数据集,论文开展了一系列单模态与多模态传感器融合实验,涵盖识别、时间定位及未来动作预测等任务,并通过跨领域变体实验验证了不同传感器的局限性。因此,DARai的核心解决方案在于其多层次注释与多模态数据设计,以及由此带来的对人类活动复杂性的全面建模能力。

链接: https://arxiv.org/abs/2504.17696
作者: Ghazal Kaviani,Yavuz Yarici,Seulgi Kim,Mohit Prabhushankar,Ghassan AlRegib,Mashhour Solh,Ameya Patil
机构: Georgia Institute of Technology(乔治亚理工学院); Amazon Lab126 (亚马逊实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Daily Activity Recordings for Artificial Intelligence (DARai, pronounced “Dahr-ree”) is a multimodal, hierarchically annotated dataset constructed to understand human activities in real-world settings. DARai consists of continuous scripted and unscripted recordings of 50 participants in 10 different environments, totaling over 200 hours of data from 20 sensors including multiple camera views, depth and radar sensors, wearable inertial measurement units (IMUs), electromyography (EMG), insole pressure sensors, biomonitor sensors, and gaze tracker. To capture the complexity in human activities, DARai is annotated at three levels of hierarchy: (i) high-level activities (L1) that are independent tasks, (ii) lower-level actions (L2) that are patterns shared between activities, and (iii) fine-grained procedures (L3) that detail the exact execution steps for actions. The dataset annotations and recordings are designed so that 22.7% of L2 actions are shared between L1 activities and 14.2% of L3 procedures are shared between L2 actions. The overlap and unscripted nature of DARai allows counterfactual activities in the dataset. Experiments with various machine learning models showcase the value of DARai in uncovering important challenges in human-centered applications. Specifically, we conduct unimodal and multimodal sensor fusion experiments for recognition, temporal localization, and future action anticipation across all hierarchical annotation levels. To highlight the limitations of individual sensors, we also conduct domain-variant experiments that are enabled by DARai’s multi-sensor and counterfactual activity design setup. The code, documentation, and dataset are available at the dedicated DARai website: this https URL

[CV-10] PICO: Reconstructing 3D People In Contact with Objects CVPR’25

Quick Read: This paper tackles the recovery of 3D Human-Object Interaction (HOI) from a single color image, a task challenged by depth ambiguity, occlusion, and the huge variation in object shape and appearance. Previous methods typically rely on controlled settings (e.g., known object shapes and contact points) and handle only a limited set of object classes. The proposed approach instead aims to generalize to natural images and novel object categories.

The solution rests on two components. (1) A new dataset, PICO-db, pairs natural images with dense 3D contact annotations on both body and object meshes. It starts from DAMON images that already carry contact annotations, but extends those annotations to object surfaces rather than only a canonical body model: a suitable 3D object mesh is retrieved with vision foundation models, and a novel method projects DAMON's body contact patches onto the object with just two clicks per patch, establishing rich body-object contact correspondences. (2) A novel render-and-compare fitting method, PICO-fit, exploits these correspondences to recover the interacting 3D body and object meshes: it infers contact for the SMPL-X body, retrieves a likely object mesh and its contact from PICO-db, and iteratively fits both meshes to the image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle, which is crucial for scaling HOI understanding to the wild. The data and code are publicly released.

Link: https://arxiv.org/abs/2504.17695
Authors: Alpár Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Arjun Lakshmipathy, Agniv Chatterjee, Michael J. Black, Dimitrios Tzionas
Affiliations: Max Planck Institute for Intelligent Systems, Germany; Meshcapade; Carnegie Mellon University, USA; UT Austin, USA; University of Amsterdam, the Netherlands
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR'25. Project Page: this https URL

Abstract:Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON’s body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild. Our data and code are available at this https URL.

[CV-11] BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction Monitoring

Quick Read: This paper addresses the challenges that augmented reality (AR) construction monitoring poses for traditional tracking, where featureless surfaces, dynamic changes, and drift accumulation cause misalignment between the digital model and the physical world. It proposes a BIM-aware (Building Information Modeling) drift-correction method whose key is to exploit the prior structural knowledge in BIM: detected "as-built" planes are robustly matched to "as-planned" BIM planes, and an optimization computes the transformation between the SLAM and BIM origin frames, minimizing drift over long runs. Experiments show an average reduction of 52.24% in angular deviation and 60.8% in wall-alignment distance error compared with the initial manual alignment.

Link: https://arxiv.org/abs/2504.17693
Authors: Asier Bikandi, Muhammad Shaheer, Hriday Bavle, Jayan Jevanesan, Holger Voos, Jose Luis Sanchez-Lopez
Affiliations: Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg, Luxembourg; GAMMA Tech
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Augmented reality (AR) applications for construction monitoring rely on real-time environmental tracking to visualize architectural elements. However, construction sites present significant challenges for traditional tracking methods due to featureless surfaces, dynamic changes, and drift accumulation, leading to misalignment between digital models and the physical world. This paper proposes a BIM-aware drift correction method to address these challenges. Instead of relying solely on SLAM-based localization, we align "as-built" detected planes from the real-world environment with "as-planned" architectural planes in BIM. Our method performs robust plane matching and computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using optimization techniques, minimizing drift over time. By incorporating BIM as prior structural knowledge, we can achieve improved long-term localization and enhanced AR visualization accuracy in noisy construction environments. The method is evaluated through real-world experiments, showing significant reductions in drift-induced errors and optimized alignment consistency. On average, our system achieves a reduction of 52.24% in angular deviations and a reduction of 60.8% in the distance error of the matched walls compared to the initial manual alignment by the user.
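
To make the plane-based alignment concrete, here is a minimal sketch (not the authors' implementation) of how a rigid SLAM-to-BIM transform can be recovered from already-matched plane pairs: rotation from the normals via the Kabsch algorithm, translation from the plane offsets via least squares. The (n, d) plane parameterization with n . x = d and the function names are assumptions for illustration.

```python
import numpy as np

def align_slam_to_bim(slam_planes, bim_planes):
    """Recover the SE(3) transform (R, t) mapping the SLAM frame to the
    BIM frame from matched plane pairs (n, d), plane equation n . x = d.
    Under x_B = R @ x_S + t a plane maps to (R @ n, d + (R @ n) . t),
    so R must align the normals and t solves a small linear system."""
    n_s = np.array([n for n, _ in slam_planes], float)   # (k, 3) normals
    d_s = np.array([d for _, d in slam_planes], float)   # (k,)  offsets
    n_b = np.array([n for n, _ in bim_planes], float)
    d_b = np.array([d for _, d in bim_planes], float)

    # Kabsch: rotation best mapping SLAM normals onto BIM normals.
    H = n_s.T @ n_b
    U, _, Vt = np.linalg.svd(H)
    sign = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T

    # Translation: (R @ n_s_i) . t = d_b_i - d_s_i  (least squares).
    A = n_s @ R.T                                        # rotated normals
    t, *_ = np.linalg.lstsq(A, d_b - d_s, rcond=None)
    return R, t  # needs >= 3 planes with linearly independent normals
```

Three non-parallel planes (e.g., two walls and a floor) already fix all six degrees of freedom, which is why plane priors from BIM are so effective against drift.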

[CV-12] DiMeR: Disentangled Mesh Reconstruction Model

Quick Read: This paper addresses sparse-view mesh reconstruction, where RGB inputs create conflicting training objectives and lack the clarity needed for geometry. The remedy is DiMeR, a novel disentangled dual-stream feed-forward model that separates both the input and the framework into geometry and texture parts, reducing the training difficulty of each part in the spirit of Occam's razor. Since normal maps are strictly consistent with geometry and precisely capture surface variation, they serve as the exclusive input of the geometry branch, simplifying the mapping between network input and output; the mesh-extraction algorithm is also improved to introduce 3D ground-truth supervision. The texture branch takes RGB images to produce the textured mesh. Experiments show that DiMeR performs strongly on sparse-view reconstruction, single-image-to-3D, and text-to-3D, improving Chamfer Distance by over 30% on the GSO and OmniObject3D datasets compared with prior methods.

Link: https://arxiv.org/abs/2504.17670
Authors: Lutao Jiang, Jiantao Lin, Kanghao Chen, Wenhang Ge, Xin Yang, Yifan Jiang, Yuanhuiyi Lyu, Xu Zheng, Yingcong Chen
Affiliations: HKUST (GZ); HKUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:With the advent of large-scale 3D datasets, feed-forward 3D generative models, such as the Large Reconstruction Model (LRM), have gained significant attention and achieved remarkable success. However, we observe that RGB images often lead to conflicting training objectives and lack the necessary clarity for geometry reconstruction. In this paper, we revisit the inductive biases associated with mesh reconstruction and introduce DiMeR, a novel disentangled dual-stream feed-forward model for sparse-view mesh reconstruction. The key idea is to disentangle both the input and framework into geometry and texture parts, thereby reducing the training difficulty for each part according to the Principle of Occam’s Razor. Given that normal maps are strictly consistent with geometry and accurately capture surface variations, we utilize normal maps as exclusive input for the geometry branch to reduce the complexity between the network’s input and output. Moreover, we improve the mesh extraction algorithm to introduce 3D ground truth supervision. As for texture branch, we use RGB images as input to obtain the textured mesh. Overall, DiMeR demonstrates robust capabilities across various tasks, including sparse-view reconstruction, single-image-to-3D, and text-to-3D. Numerous experiments show that DiMeR significantly outperforms previous methods, achieving over 30% improvement in Chamfer Distance on the GSO and OmniObject3D dataset.

[CV-13] Aerial Image Classification in Scarce and Unconstrained Environments via Conformal Prediction

【速读】:该论文旨在解决复杂现实环境中数据稀缺且高度变化条件下,如何通过生成可靠的不确定性估计来提升模型预测性能的问题。论文的关键解决方案在于利用预训练模型(如MobileNet、DenseNet和ResNet)结合少量标注数据生成信息丰富的预测集,并通过深入分析校准方法(包括温度缩放与无校准两种管道)对预测可靠性与效率之间权衡的影响。研究发现即使在小样本标注和简单非一致性评分情况下,结合校准的非形式化预测仍能提供有价值的不确定性估计。此外,论文强调了在校准方法选择上的谨慎性以及模型压缩技术在资源受限环境中的潜力。

链接: https://arxiv.org/abs/2504.17655
作者: Farhad Pourkamali-Anaraki
机构: Department of Mathematical and Statistical Sciences (数学与统计科学系), University of Colorado Denver (丹佛大学)COUSA
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 17 pages, 5 figures, and 2 tables

Abstract:This paper presents a comprehensive empirical analysis of conformal prediction methods on a challenging aerial image dataset featuring diverse events in unconstrained environments. Conformal prediction is a powerful post-hoc technique that takes the output of any classifier and transforms it into a set of likely labels, providing a statistical guarantee on the coverage of the true label. Unlike evaluations on standard benchmarks, our study addresses the complexities of data-scarce and highly variable real-world settings. We investigate the effectiveness of leveraging pretrained models (MobileNet, DenseNet, and ResNet), fine-tuned with limited labeled data, to generate informative prediction sets. To further evaluate the impact of calibration, we consider two parallel pipelines (with and without temperature scaling) and assess performance using two key metrics: empirical coverage and average prediction set size. This setup allows us to systematically examine how calibration choices influence the trade-off between reliability and efficiency. Our findings demonstrate that even with relatively small labeled samples and simple nonconformity scores, conformal prediction can yield valuable uncertainty estimates for complex tasks. Moreover, our analysis reveals that while temperature scaling is often employed for calibration, it does not consistently lead to smaller prediction sets, underscoring the importance of careful consideration in its application. Furthermore, our results highlight the significant potential of model compression techniques within the conformal prediction pipeline for deployment in resource-constrained environments. Based on our observations, we advocate for future research to delve into the impact of noisy or ambiguous labels on conformal prediction performance and to explore effective model reduction strategies.
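
The split conformal procedure the paper builds on is easy to reproduce. Below is a minimal NumPy sketch with the simple nonconformity score s = 1 - p(true class); the array names are placeholders, and temperature scaling would simply be applied to the logits before the softmax that produces these probabilities.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction with score s = 1 - p(true class).
    Returns a boolean (n_test, n_classes) membership mask whose sets
    cover the true label with probability ~ 1 - alpha (marginally)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile of the calibration scores.
    level = np.ceil((n + 1) * (1 - alpha)) / n   # needs n comfortably > 1/alpha
    qhat = np.quantile(scores, level, method="higher")
    return (1.0 - test_probs) <= qhat

def coverage_and_size(sets, test_labels):
    cov = sets[np.arange(len(test_labels)), test_labels].mean()
    return cov, sets.sum(axis=1).mean()   # empirical coverage, avg set size
```

The two returned numbers are exactly the metrics the paper reports, which makes the coverage/efficiency trade-off directly observable when calibration choices change.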

[CV-14] CLIPSE – a minimalistic CLIP-based image search engine for research

Quick Read: This paper addresses efficiency and scalability in image search engines over large datasets. The key is to use CLIP embeddings to process both images and text queries uniformly, within a deliberately simple and easily extensible framework. Two benchmark scenarios covering indexing and query time are described and evaluated, showing that CLIPSE handles smaller datasets efficiently, while a distributed deployment with several instances is recommended for larger ones.

Link: https://arxiv.org/abs/2504.17643
Authors: Steve Göring
Affiliations: Technische Universität Ilmenau
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A brief overview of CLIPSE, a self-hosted image search engine with the main application of research, is provided. In general, CLIPSE uses CLIP embeddings to process the images and also the text queries. The overall framework is designed with simplicity to enable easy extension and usage. Two benchmark scenarios are described and evaluated, covering indexing and querying time. It is shown that CLIPSE is capable of handling smaller datasets; for larger datasets, a distributed approach with several instances should be considered.
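
The core of such an engine is small enough to sketch. The following illustrative snippet (not the CLIPSE codebase) indexes images and serves text queries with the Hugging Face CLIP implementation; the model checkpoint and function names are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def index_images(paths):
    """Embed a list of image files into a unit-normalized index matrix."""
    images = [Image.open(p).convert("RGB") for p in paths]
    feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def search(query, index, paths, k=5):
    """Rank indexed images against a free-text query by cosine similarity."""
    q = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (index @ q.T).squeeze(1)
    top = scores.topk(min(k, len(paths)))
    return [(paths[i], scores[i].item()) for i in top.indices]
```

For the larger datasets the paper mentions, the same index matrix would be sharded across instances and the per-shard top-k results merged.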

[CV-15] A Guide to Structureless Visual Localization

Quick Read: This paper addresses the inflexibility of structure-based visual localization when the underlying 3D model must be adjusted after scene changes. It focuses on structureless methods, which represent the scene as a database of posed images, a much more flexible representation that can be updated simply by adding or removing images. The key contribution is, to the best of the authors' knowledge, the first comprehensive discussion and comparison of structureless approaches. Extensive experiments show that methods employing a higher degree of classical geometric reasoning generally achieve higher pose accuracy; in particular, approaches based on classical absolute or semi-generalized relative pose estimation outperform recent pose-regression methods by a wide margin. Compared with state-of-the-art structure-based approaches, the flexibility of structureless methods comes at the cost of slightly lower pose accuracy, pointing to an interesting direction for future work.

Link: https://arxiv.org/abs/2504.17636
Authors: Vojtech Panek, Qunjie Zhou, Yaqing Ding, Sérgio Agostinho, Zuzana Kukelova, Torsten Sattler, Laura Leal-Taixé
Affiliations: Faculty of Electrical Engineering, Czech Technical University (CTU) in Prague; Czech Institute of Informatics, Robotics and Cybernetics, CTU in Prague; Visual Recognition Group, Faculty of Electrical Engineering, CTU in Prague; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual localization algorithms, i.e., methods that estimate the camera pose of a query image in a known scene, are core components of many applications, including self-driving cars and augmented / mixed reality systems. State-of-the-art visual localization algorithms are structure-based, i.e., they store a 3D model of the scene and use 2D-3D correspondences between the query image and 3D points in the model for camera pose estimation. While such approaches are highly accurate, they are also rather inflexible when it comes to adjusting the underlying 3D model after changes in the scene. Structureless localization approaches represent the scene as a database of images with known poses and thus offer a much more flexible representation that can be easily updated by adding or removing images. Although there is a large amount of literature on structure-based approaches, there is significantly less work on structureless methods. Hence, this paper is dedicated to providing the, to the best of our knowledge, first comprehensive discussion and comparison of structureless methods. Extensive experiments show that approaches that use a higher degree of classical geometric reasoning generally achieve higher pose accuracy. In particular, approaches based on classical absolute or semi-generalized relative pose estimation outperform very recent methods based on pose regression by a wide margin. Compared with state-of-the-art structure-based approaches, the flexibility of structureless methods comes at the cost of (slightly) lower pose accuracy, indicating an interesting direction for future work.
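
As a flavor of the classical geometric reasoning the survey finds most accurate, here is a hedged sketch of structureless localization via relative pose: match features between the query and one retrieved database image, estimate the essential matrix, and chain the recovered relative pose with the known database pose. The interface and pose conventions are assumptions to verify against your own setup, and from a single pair the translation is known only up to scale.

```python
import cv2
import numpy as np

def localize_query(query_gray, db_gray, K, T_world_db):
    """Pose the query against one retrieved database image of known pose.
    T_world_db is the camera-to-world pose of the database image."""
    orb = cv2.ORB_create(4000)
    kq, dq = orb.detectAndCompute(query_gray, None)
    kd, dd = orb.detectAndCompute(db_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(dq, dd)
    pts_q = np.float32([kq[m.queryIdx].pt for m in matches])
    pts_d = np.float32([kd[m.trainIdx].pt for m in matches])

    E, _ = cv2.findEssentialMat(pts_q, pts_d, K, method=cv2.RANSAC,
                                prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_q, pts_d, K)
    # OpenCV convention: x_db = R @ x_query + t (translation up to scale).
    T_db_q = np.eye(4)
    T_db_q[:3, :3], T_db_q[:3, 3] = R, t.ravel()
    return T_world_db @ T_db_q    # camera-to-world pose of the query
```

Semi-generalized methods resolve the scale ambiguity by reasoning over several database images at once, which is part of why they score higher in the paper's comparison.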

[CV-16] Improving Open-World Object Localization by Discovering Background

Quick Read: This work addresses open-world object localization: given bounding boxes for a limited set of classes at training time, the goal is to localize all objects in an image at inference, from both training and unseen classes. Recent work improves object characterization either explicitly, with new objective functions (localization quality), or implicitly, with object-centric auxiliary information (depth, pixel/region affinity maps, etc.). This paper instead proposes a new framework that uses background information to guide the learning of "objectness". The key is to cast background discovery as identifying non-discriminative image regions (redundant, low-information content) and to train the object proposal network not to detect any objects there. Experiments on standard benchmarks show significant improvements over previous state-of-the-art methods.

Link: https://arxiv.org/abs/2504.17626
Authors: Ashish Singh, Michael J. Jones, Kuan-Chuan Peng, Anoop Cherian, Moitreya Chatterjee, Erik Learned-Miller
Affiliations: Univ. of Mass.-Amherst; Mitsubishi Electric Research Labs (MERL)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Our work addresses the problem of learning to localize objects in an open-world setting, i.e., given the bounding box information of a limited number of object classes during training, the goal is to localize all objects, belonging to both the training and unseen classes in an image, during inference. Towards this end, recent work in this area has focused on improving the characterization of objects either explicitly by proposing new objective functions (localization quality) or implicitly using object-centric auxiliary-information, such as depth information, pixel/region affinity map etc. In this work, we address this problem by incorporating background information to guide the learning of the notion of objectness. Specifically, we propose a novel framework to discover background regions in an image and train an object proposal network to not detect any objects in these regions. We formulate the background discovery task as that of identifying image regions that are not discriminative, i.e., those that are redundant and constitute low information content. We conduct experiments on standard benchmarks to showcase the effectiveness of our proposed approach and observe significant improvements over the previous state-of-the-art approaches for this task.

[CV-17] Enhancing CNNs robustness to occlusions with bioinspired filters for border completion

Quick Read: This paper designs custom filters for CNNs from a mathematical model of the border-completion mechanism in the visual cortex, in order to improve recognition accuracy on occluded MNIST images. The key is the bio-inspired modification of LeNet-5: the custom filters yield a consistent improvement in performance, particularly in classification accuracy, when the network is tested on occluded images.

Link: https://arxiv.org/abs/2504.17619
Authors: Catarina P. Coutinho, Aneeqa Merhab, Janko Petkovic, Ferdinando Zanchetta, Rita Fioresi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to the 7th International Conference on Geometric Science of Information

Abstract:We exploit the mathematical modeling of the visual cortex mechanism for border completion to define custom filters for CNNs. We see a consistent improvement in performance, particularly in accuracy, when our modified LeNet 5 is tested with occluded MNIST images.

[CV-18] The effects of Hessian eigenvalue spectral density type on the applicability of Hessian analysis to generalization capability assessment of neural networks

Quick Read: This paper studies the behavior of the Hessian eigenvalue spectral density (HESD) during neural network (NN) training and its impact on generalization assessment. The questions are how the HESD type can be used to estimate a network's generalization potential more accurately, and under which conditions existing Hessian analysis may fail. The key is a unified HESD analysis methodology that combines mainly-positive (MP-HESD) and mainly-negative (MN-HESD) spectra: a wide range of experiments shows how optimizers, datasets, preprocessing, augmentation, and external gradient manipulation affect the HESD type. The paper further proposes criteria and conditions for determining the HESD type and estimating generalization potential, and discusses the occurrence of quasi-singular (QS) HESD and its challenge to conventional assumptions about the relation between Hessian eigenvalues and the curvature of the NN loss landscape.

Link: https://arxiv.org/abs/2504.17618
Authors: Nikita Gabdullin
Affiliations: JSC "Research and Production Company Kryptonite"
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 10 figures, 4 tables, 4 equations

Abstract:Hessians of neural network (NN) contain essential information about the curvature of NN loss landscapes which can be used to estimate NN generalization capabilities. We have previously proposed generalization criteria that rely on the observation that Hessian eigenvalue spectral density (HESD) behaves similarly for a wide class of NNs. This paper further studies their applicability by investigating factors that can result in different types of HESD. We conduct a wide range of experiments showing that HESD mainly has positive eigenvalues (MP-HESD) for NN training and fine-tuning with various optimizers on different datasets with different preprocessing and augmentation procedures. We also show that mainly negative HESD (MN-HESD) is a consequence of external gradient manipulation, indicating that the previously proposed Hessian analysis methodology cannot be applied in such cases. We also propose criteria and corresponding conditions to determine HESD type and estimate NN generalization potential. These HESD types and previously proposed generalization criteria are combined into a unified HESD analysis methodology. Finally, we discuss how HESD changes during training, and show the occurrence of quasi-singular (QS) HESD and its influence on the proposed methodology and on the conventional assumptions about the relation between Hessian eigenvalues and NN loss landscape curvature.
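	
HESD analyses rest on Hessian-vector products, which PyTorch provides via double backpropagation without ever forming the Hessian. The sketch below (an illustration, not the paper's code) uses power iteration to probe the dominant eigenvalue; full spectral-density estimators run Lanczos or stochastic quadrature over many such products.

```python
import torch

def hvp(loss, params, vec):
    """Hessian-vector product via double backprop (Pearlmutter's trick)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])

def dominant_eigenvalue(loss, params, iters=30):
    """Power iteration on the Hessian: converges to the eigenvalue of
    largest magnitude; its sign hints at MP- vs MN-type spectra."""
    n = sum(p.numel() for p in params)
    v = torch.randn(n, device=params[0].device)
    v /= v.norm()
    eig = 0.0
    for _ in range(iters):
        hv_ = hvp(loss, params, v)
        eig = torch.dot(v, hv_).item()        # Rayleigh quotient
        v = hv_ / (hv_.norm() + 1e-12)
    return eig
```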

[CV-19] STCL: Curriculum learning Strategies for deep learning image steganography models

Quick Read: To address the poor quality of steganographic images and the slow convergence of deep-learning image steganography models, this paper proposes a Steganography Curriculum Learning training strategy (STCL). Its key is to combine a teacher-model-based difficulty evaluation with a knee-point-based training schedule: multiple teacher models are trained first, and the consistency of steganographic image quality across them serves as a difficulty score for building easy-to-hard training subsets; a knee-point-based training control strategy then reduces the risk of overfitting on small subsets and accelerates training. Experiments show that the scheme improves model performance under several algorithmic frameworks, producing steganographic images with high PSNR, SSIM, and decoding accuracy as well as low steganalysis scores.

Link: https://arxiv.org/abs/2504.17609
Authors: Fengchun Liu, Tong Zhang, Chunying Zhang
Affiliations: Qianan College, North China University of Science and Technology; School of Cyberspace Security, Beijing University of Posts and Telecommunications; College of Science, North China University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Aiming at the problems of poor quality of steganographic images and slow network convergence of image steganography models based on deep learning, this paper proposes a Steganography Curriculum Learning training strategy (STCL) for deep learning image steganography models, so that only easy images are selected for training when the model has poor fitting ability at the initial stage, gradually expanding to more difficult images. The strategy includes a difficulty evaluation strategy based on teacher models and a knee point-based training scheduling strategy. Firstly, multiple teacher models are trained, and the consistency of the quality of steganographic images under multiple teacher models is used as the difficulty score to construct the training subsets from easy to difficult. Secondly, a training control strategy based on knee points is proposed to reduce the possibility of overfitting on small training sets and accelerate the training process. Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed image steganography scheme is able to improve the model performance under multiple algorithmic frameworks, with high PSNR, SSIM scores, and decoding accuracy, while the steganographic images generated by models trained under the STCL strategy also obtain low steganalysis scores. Our code is available at this https URL.
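
A minimal sketch of the curriculum construction described above might look as follows; `teacher.embed` is a hypothetical interface for a pretrained steganography teacher, and using the standard deviation of PSNR across teachers as the consistency-based difficulty score is an illustrative assumption.

```python
import numpy as np

def psnr(a, b):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / max(mse, 1e-12))

def curriculum_stages(cover_images, teachers, n_stages=3):
    """Rank cover images by how consistently several pretrained teacher
    steganography models embed into them, then split easy -> hard."""
    difficulty = []
    for img in cover_images:
        quality = [psnr(img, t.embed(img)) for t in teachers]  # hypothetical API
        difficulty.append(np.std(quality))    # disagreement = difficulty
    order = np.argsort(difficulty)            # most consistent first
    return np.array_split(order, n_stages)    # stage i adds harder samples
```

The knee-point schedule would then watch the validation curve on each stage and advance to the next subset once improvement flattens, instead of training each stage for a fixed budget.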

[CV-20] RGB-D Tracking via Hierarchical Modality Aggregation and Distribution Network

Quick Read: This paper addresses the low efficiency of current RGB-Depth (RGB-D) trackers, which focus only on single-level features, leading to weaker fusion robustness and speeds too slow for real-world use. The key is HMAD (Hierarchical Modality Aggregation and Distribution), a novel network that exploits the distinct representational strengths of the RGB and depth modalities and adopts a hierarchical approach to feature distribution and fusion, thereby strengthening the robustness of RGB-D tracking. Experiments show state-of-the-art performance on several RGB-D datasets, and real-world tests confirm that HMAD handles a spectrum of tracking challenges in real time.

Link: https://arxiv.org/abs/2504.17595
Authors: Boyue Xu, Yi Xu, Ruichao Hou, Jia Bei, Tongwei Ren, Gangshan Wu
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The integration of dual-modal features has been pivotal in advancing RGB-Depth (RGB-D) tracking. However, current trackers are less efficient and focus solely on single-level features, resulting in weaker robustness in fusion and slower speeds that fail to meet the demands of real-world applications. In this paper, we introduce a novel network, denoted as HMAD (Hierarchical Modality Aggregation and Distribution), which addresses these challenges. HMAD leverages the distinct feature representation strengths of RGB and depth modalities, giving prominence to a hierarchical approach for feature distribution and fusion, thereby enhancing the robustness of RGB-D tracking. Experimental results on various RGB-D datasets demonstrate that HMAD achieves state-of-the-art performance. Moreover, real-world experiments further validate HMAD’s capacity to effectively handle a spectrum of tracking challenges in real-time scenarios.

[CV-21] Tamper-evident Image using JPEG Fixed Points

Quick Read: This paper asks how the fixed-point property of JPEG compression can be used to detect image tampering. It proves that fixed points exist in the core JPEG processing steps and can be reached within a few iterations while preserving visual quality with minimal distortion. The key is an analysis of the JPEG compression and decompression pipeline that establishes the existence of these fixed points, and a method built on it that turns an authentic image into a tamper-evident one: tampering is exposed by deviations from the fixed-point image.

Link: https://arxiv.org/abs/2504.17594
Authors: Zhaofeng Si, Siwei Lyu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 6 figures

Abstract:An intriguing phenomenon about JPEG compression has been observed for two decades: repeating JPEG compression and decompression eventually leads to a stable image that does not change anymore, which is a fixed point. In this work, we prove the existence of fixed points in the essential JPEG procedures. We analyze JPEG compression and decompression processes, revealing the existence of fixed points that can be reached within a few iterations. These fixed points are diverse and preserve the image's visual quality, ensuring minimal distortion. This result is used to develop a method to create a tamper-evident image from the original authentic image, which can expose tampering operations by showing deviations from the fixed point image.
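
The fixed-point iteration itself is a few lines with Pillow. The sketch below drives an image to its JPEG fixed point and derives a simple tamper map by checking whether one further compression round leaves the pixels unchanged; the quality setting and the use of a raw difference map are illustrative assumptions, not the paper's exact procedure.

```python
from io import BytesIO
import numpy as np
from PIL import Image

def recompress(arr, quality=90):
    """One JPEG compress/decompress round on a uint8 RGB array."""
    buf = BytesIO()
    Image.fromarray(arr).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB"))

def to_fixed_point(img, quality=90, max_iters=20):
    """Iterate compress/decompress until the pixels stop changing;
    at a fixed quality this typically converges in a few rounds."""
    cur = np.asarray(img.convert("RGB"))
    for i in range(max_iters):
        nxt = recompress(cur, quality)
        if np.array_equal(nxt, cur):
            return cur, i                  # fixed point after i rounds
        cur = nxt
    return cur, max_iters

def tamper_map(suspect_img, quality=90):
    """A fixed-point image survives one more round unchanged, so any
    nonzero deviation after recompression flags suspected edits."""
    arr = np.asarray(suspect_img.convert("RGB"))
    diff = arr.astype(int) - recompress(arr, quality).astype(int)
    return np.abs(diff).sum(-1)
```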

[CV-22] Occlusion-Aware Self-Supervised Monocular Depth Estimation for Weak-Texture Endoscopic Images

Quick Read: This paper targets monocular depth estimation in endoscopic scenes, inferring depth within the gastrointestinal tract from monocular images. Existing methods, though accurate, typically assume consistent illumination, an assumption routinely violated by the dynamic lighting and occlusions caused by GI motility; the resulting geometric misinterpretations and unreliable self-supervised signals degrade depth reconstruction. The remedy is an occlusion-aware self-supervised framework with two key ingredients: first, an occlusion mask used for data augmentation generates pseudo-labels that simulate viewpoint-dependent occlusion scenarios, strengthening robust depth feature learning under partial visibility; second, semantic segmentation guided by non-negative matrix factorization clusters convolutional activations to produce pseudo-labels in texture-deprived regions, improving segmentation accuracy and mitigating information loss from lighting changes. Results on the SCARED dataset show state-of-the-art self-supervised depth estimation, and evaluations on Endo-SLAM and SERV-CT demonstrate strong generalization across diverse endoscopic environments.

Link: https://arxiv.org/abs/2504.17582
Authors: Zebo Huang, Yinghui Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose a self-supervised monocular depth estimation network tailored for endoscopic scenes, aiming to infer depth within the gastrointestinal tract from monocular images. Existing methods, though accurate, typically assume consistent illumination, which is often violated due to dynamic lighting and occlusions caused by GI motility. These variations lead to incorrect geometric interpretations and unreliable self-supervised signals, degrading depth reconstruction quality. To address this, we introduce an occlusion-aware self-supervised framework. First, we incorporate an occlusion mask for data augmentation, generating pseudo-labels by simulating viewpoint-dependent occlusion scenarios. This enhances the model’s ability to learn robust depth features under partial visibility. Second, we leverage semantic segmentation guided by non-negative matrix factorization, clustering convolutional activations to generate pseudo-labels in texture-deprived regions, thereby improving segmentation accuracy and mitigating information loss from lighting changes. Experimental results on the SCARED dataset show that our method achieves state-of-the-art performance in self-supervised depth estimation. Additionally, evaluations on the Endo-SLAM and SERV-CT datasets demonstrate strong generalization across diverse endoscopic environments.

[CV-23] Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical Prior

Quick Read: This paper addresses the limited precision of urban land-use classification and mapping in complex urban environments, where existing remote-sensing techniques lack ground-level detail. The key is an unsupervised contrastive clustering model for street view images with a built-in geographical prior; combined with a simple visual assignment of the clusters, it offers a flexible, customizable land-use mapping solution tailored to the specific needs of urban planners. Because the method relies on the universal spatial coherence of geospatial data (Tobler's law), it can generate land-use maps from geotagged street view datasets (demonstrated on two cities) and adapt to various settings where street view images are available, enabling scalable, unsupervised land-use mapping and updating.

Link: https://arxiv.org/abs/2504.17551
Authors: Lin Che, Yizi Chen, Tanhua Jin, Martin Raubal, Konrad Schindler, Peter Kiefer
Affiliations: ETH Zurich, Switzerland; Ghent University, Belgium
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, preprint version

Abstract:Urban land use classification and mapping are critical for urban planning, resource management, and environmental monitoring. Existing remote sensing techniques often lack precision in complex urban environments due to the absence of ground-level details. Unlike aerial perspectives, street view images provide a ground-level view that captures more human and social activities relevant to land use in complex urban scenes. Existing street view-based methods primarily rely on supervised classification, which is challenged by the scarcity of high-quality labeled data and the difficulty of generalizing across diverse urban landscapes. This study introduces an unsupervised contrastive clustering model for street view images with a built-in geographical prior, to enhance clustering performance. When combined with a simple visual assignment of the clusters, our approach offers a flexible and customizable solution to land use mapping, tailored to the specific needs of urban planners. We experimentally show that our method can generate land use maps from geotagged street view image datasets of two cities. As our methodology relies on the universal spatial coherence of geospatial data (“Tobler’s law”), it can be adapted to various settings where street view images are available, to enable scalable, unsupervised land use mapping and updating. The code will be available at this https URL.

[CV-24] A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task

Quick Read: This paper fills the lack of a systematic survey of knowledge-based visual question answering (KB-VQA): no comprehensive review had yet organized and examined existing KB-VQA methods. The key is a structured taxonomy that categorizes KB-VQA systems into three main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring the various knowledge-integration techniques and identifying persistent challenges, the survey also outlines promising future research directions, laying a foundation for advancing KB-VQA models and their applications.

Link: https://arxiv.org/abs/2504.17547
Authors: Jiaqi Deng, Zonghan Wu, Huan Huo, Guandong Xu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: 20 pages, 5 figures, 4 tables

Abstract:Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by not only requiring the understanding of visual and textual inputs but also extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, where LLMs serve as powerful knowledge repositories, retrieval-augmented generators and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches, and categorizing the systems into main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.

[CV-25] When Gaussian Meets Surfel: Ultra-fast High-fidelity Radiance Field Rendering

Quick Read: This paper targets fast, high-fidelity radiance field rendering. It proposes Gaussian-enhanced Surfels (GESs), a bi-scale representation in which 2D opaque surfels capture coarse-scale geometry and appearance while a few 3D Gaussians around the surfels supplement fine-scale detail. Rendering proceeds in two passes: the surfels are first rasterized through a standard graphics pipeline to produce depth and color maps, and the Gaussians are then splatted with depth testing and color accumulation, order-independently per pixel. GESs are optimized from multi-view images with a careful coarse-to-fine procedure that faithfully captures rich scene appearance. The entirely sorting-free rendering is not only extremely fast but also view-consistent, avoiding popping artifacts under view changes, and the basic representation extends readily to anti-aliasing (Mip-GES), higher speed (Speedy-GES), compact storage (Compact-GES), and better geometry by replacing 3D Gaussians with 2D Gaussians (2D-GES).

Link: https://arxiv.org/abs/2504.17545
Authors: Keyang Ye, Tianjia Shao, Kun Zhou
Affiliations: State Key Lab of CAD&CG, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce Gaussian-enhanced Surfels (GESs), a bi-scale representation for radiance field rendering, wherein a set of 2D opaque surfels with view-dependent colors represent the coarse-scale geometry and appearance of scenes, and a few 3D Gaussians surrounding the surfels supplement fine-scale appearance details. The rendering with GESs consists of two passes – surfels are first rasterized through a standard graphics pipeline to produce depth and color maps, and then Gaussians are splatted with depth testing and color accumulation order-independently on each pixel. The optimization of GESs from multi-view images is performed through an elaborate coarse-to-fine procedure, faithfully capturing rich scene appearance. The entirely sorting-free rendering of GESs not only achieves very fast rates, but also produces view-consistent images, successfully avoiding popping artifacts under view changes. The basic GES representation can be easily extended to achieve anti-aliasing in rendering (Mip-GES), boosted rendering speeds (Speedy-GES) and compact storage (Compact-GES), and reconstruct better scene geometries by replacing 3D Gaussians with 2D Gaussians (2D-GES). Experimental results show that GESs advance the state-of-the-arts as a compelling representation for ultra-fast high-fidelity radiance field rendering.

[CV-26] An Explainable Nature-Inspired Framework for Monkeypox Diagnosis: Xception Features Combined with NGBoost and African Vultures Optimization Algorithm

Quick Read: Motivated by the rapid spread of monkeypox into regions where it has not historically been prevalent, this paper tackles early diagnosis with an automated deep-learning framework that detects monkeypox from skin lesion images efficiently and accurately. The key is to combine transfer learning, Principal Component Analysis (PCA), and Natural Gradient Boosting (NGBoost), with the African Vultures Optimization Algorithm (AVOA) introduced for hyperparameter tuning to optimize performance and generalization, while Grad-CAM and LIME techniques enhance the interpretability of the model.

Link: https://arxiv.org/abs/2504.17540
Authors: Ahmadreza Shateri, Negar Nourani, Morteza Dorrigiv, Hamid Nasiri
Affiliations: Electrical and Computer Engineering Department, Semnan University, Iran; School of Computing and Communications, Lancaster University, UK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:The recent global spread of monkeypox, particularly in regions where it has not historically been prevalent, has raised significant public health concerns. Early and accurate diagnosis is critical for effective disease management and control. In response, this study proposes a novel deep learning-based framework for the automated detection of monkeypox from skin lesion images, leveraging the power of transfer learning, dimensionality reduction, and advanced machine learning techniques. We utilize the newly developed Monkeypox Skin Lesion Dataset (MSLD), which includes images of monkeypox, chickenpox, and measles, to train and evaluate our models. The proposed framework employs the Xception architecture for deep feature extraction, followed by Principal Component Analysis (PCA) for dimensionality reduction, and the Natural Gradient Boosting (NGBoost) algorithm for classification. To optimize the model’s performance and generalization, we introduce the African Vultures Optimization Algorithm (AVOA) for hyperparameter tuning, ensuring efficient exploration of the parameter space. Our results demonstrate that the proposed AVOA-NGBoost model achieves state-of-the-art performance, with an accuracy of 97.53%, F1-score of 97.72% and an AUC of 97.47%. Additionally, we enhance model interpretability using Grad-CAM and LIME techniques, providing insights into the decision-making process and highlighting key features influencing classification. This framework offers a highly precise and efficient diagnostic tool, potentially aiding healthcare providers in early detection and diagnosis, particularly in resource-constrained environments.
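
The Xception → PCA → NGBoost pipeline can be assembled directly from public libraries; the sketch below is a plausible reconstruction, not the authors' code. The hyperparameters stand in for the AVOA-tuned values, and `train_images`/`train_labels`/`test_images` are assumed preprocessed arrays with images resized to 299×299.

```python
import tensorflow as tf
from sklearn.decomposition import PCA
from ngboost import NGBClassifier
from ngboost.distns import k_categorical

# Frozen Xception backbone yielding 2048-d average-pooled features.
backbone = tf.keras.applications.Xception(weights="imagenet",
                                          include_top=False, pooling="avg")

def extract(images):   # images: float array (N, 299, 299, 3), values 0..255
    x = tf.keras.applications.xception.preprocess_input(images.copy())
    return backbone.predict(x, verbose=0)

feats = extract(train_images)                       # placeholder arrays
pca = PCA(n_components=64).fit(feats)               # dimensionality reduction
clf = NGBClassifier(Dist=k_categorical(3),          # monkeypox/chickenpox/measles
                    n_estimators=500, learning_rate=0.01)
clf.fit(pca.transform(feats), train_labels)
probs = clf.predict_proba(pca.transform(extract(test_images)))
```

AVOA would wrap the `fit`/validate loop, searching over values such as `n_components`, `n_estimators`, and `learning_rate` instead of the fixed choices above.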

[CV-27] Text-to-Image Alignment in Denoising-Based Models through Step Selection

Quick Read: This paper addresses the text-image alignment and reasoning limitations of visual generative AI models. The key is a novel method that selectively enhances the signal at critical denoising steps, optimizing image generation according to the input semantics. In contrast to early-stage signal modifications, the work shows that adjustments made at later stages yield superior results, producing higher-quality, semantically aligned images and reaching state-of-the-art performance on both Diffusion and Flow Matching models.

Link: https://arxiv.org/abs/2504.17525
Authors: Paul Grimal, Hervé Le Borgne, Olivier Ferret
Affiliations: Université Paris-Saclay; CEA, List, F-91120 Palaiseau, France
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual generative AI models often encounter challenges related to text-image alignment and reasoning limitations. This paper presents a novel method for selectively enhancing the signal at critical denoising steps, optimizing image generation based on input semantics. Our approach addresses the shortcomings of early-stage signal modifications, demonstrating that adjustments made at later stages yield superior results. We conduct extensive experiments to validate the effectiveness of our method in producing semantically aligned images on Diffusion and Flow Matching model, achieving state-of-the-art performance. Our results highlight the importance of a judicious choice of sampling stage to improve performance and overall image alignment.
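
One way to realize "enhancing the signal at later denoising steps" is to raise the classifier-free guidance scale only on the final portion of the schedule. The loop below follows the standard diffusers UNet/scheduler interfaces and is an illustrative stand-in for the paper's semantics-driven step selection; it assumes `scheduler.set_timesteps(...)` has already been called and the prompt embeddings are prepared.

```python
import torch

@torch.no_grad()
def denoise_with_late_boost(unet, scheduler, cond_emb, uncond_emb, latents,
                            base_scale=7.5, boost_scale=12.0, boost_from=0.6):
    """Classifier-free guidance loop that amplifies the text signal only
    on the late (semantics-critical) portion of the schedule."""
    emb = torch.cat([uncond_emb, cond_emb])
    steps = scheduler.timesteps
    latents = latents * scheduler.init_noise_sigma
    for i, t in enumerate(steps):
        inp = scheduler.scale_model_input(torch.cat([latents] * 2), t)
        noise = unet(inp, t, encoder_hidden_states=emb).sample
        uncond, cond = noise.chunk(2)
        # Selectively enhance the conditional signal on late steps only.
        scale = boost_scale if i >= int(boost_from * len(steps)) else base_scale
        latents = scheduler.step(uncond + scale * (cond - uncond), t, latents).prev_sample
    return latents
```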

[CV-28] ESDiff: Encoding Strategy-inspired Diffusion Model with Few-shot Learning for Color Image Inpainting

Quick Read: This paper aims to overcome the weakness of traditional inpainting methods on complex details and structures, as well as the heavy training-data requirements of deep-learning models, by proposing an encoding strategy-inspired diffusion model with few-shot learning for color image inpainting. The key is a "virtual mask" that builds high-dimensional objects through mutual perturbations between channels, enabling the diffusion model to capture diverse image representations and detailed features from limited training samples. The encoding strategy also exploits redundancy between channels, integrates low-rank methods during iterative inpainting, and incorporates the diffusion model for accurate information output. Experiments show that the method surpasses current techniques on quantitative metrics and improves texture and structural integrity, producing more precise and coherent reconstructions.

Link: https://arxiv.org/abs/2504.17524
Authors: Junyan Zhang, Yan Li, Mengxiao Geng, Liu Shi, Qiegen Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 10 figures, submitted to TCSVT

Abstract:Image inpainting is a technique used to restore missing or damaged regions of an image. Traditional methods primarily utilize information from adjacent pixels for reconstructing missing areas, while they struggle to preserve complex details and structures. Simultaneously, models based on deep learning necessitate substantial amounts of training data. To address this challenge, an encoding strategy-inspired diffusion model with few-shot learning for color image inpainting is proposed in this paper. The main idea of this novel encoding strategy is the deployment of a “virtual mask” to construct high-dimensional objects through mutual perturbations between channels. This approach enables the diffusion model to capture diverse image representations and detailed features from limited training samples. Moreover, the encoding strategy leverages redundancy between channels, integrates with low-rank methods during iterative inpainting, and incorporates the diffusion model to achieve accurate information output. Experimental results indicate that our method exceeds current techniques in quantitative metrics, and the reconstructed images quality has been improved in aspects of texture and structural integrity, leading to more precise and coherent results.

[CV-29] Towards One-Stage End-to-End Table Structure Recognition with Parallel Regression for Diverse Scenarios

Quick Read: This paper addresses table structure recognition, i.e., parsing tables in unstructured data into machine-understandable formats. Existing methods either train multiple networks serially with time-consuming sequential decoding, or rely on complex post-processing algorithms to parse the logical structure, and struggle to balance cross-scenario adaptability, robustness, and computational efficiency. The key is TableCenterNet, a one-stage end-to-end network that, for the first time, unifies the prediction of spatial and logical table structure into a parallel regression task and implicitly learns the spatial-logical location mapping of cells through a synergistic architecture of shared feature-extraction layers and task-specific decoding. Compared with two-stage methods, it is easier to train and faster at inference; experiments show effective parsing across diverse scenarios and state-of-the-art performance on the TableGraph-24k dataset.

Link: https://arxiv.org/abs/2504.17522
Authors: Anyi Xiao, Cihui Yang
Affiliations: Nanchang Hangkong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Table structure recognition aims to parse tables in unstructured data into machine-understandable formats. Recent methods address this problem through a two-stage process or optimized one-stage approaches. However, these methods either require multiple networks to be serially trained and perform more time-consuming sequential decoding, or rely on complex post-processing algorithms to parse the logical structure of tables. They struggle to balance cross-scenario adaptability, robustness, and computational efficiency. In this paper, we propose a one-stage end-to-end table structure parsing network called TableCenterNet. This network unifies the prediction of table spatial and logical structure into a parallel regression task for the first time, and implicitly learns the spatial-logical location mapping laws of cells through a synergistic architecture of shared feature extraction layers and task-specific decoding. Compared with two-stage methods, our method is easier to train and faster to infer. Experiments on benchmark datasets show that TableCenterNet can effectively parse table structures in diverse scenarios and achieve state-of-the-art performance on the TableGraph-24k dataset. Code is available at this https URL.

[CV-30] Mamba-Sea: A Mamba-based Framework with Global-to-Local Sequence Augmentation for Generalizable Medical Image Segmentation

Quick Read: This paper addresses domain generalization (DG) for medical image segmentation under distribution shifts. Existing DG methods mainly build on CNN or ViT architectures; inspired by the success of advanced state space models such as Mamba in supervised medical image segmentation, the paper proposes Mamba-Sea, a novel Mamba-based framework with global-to-local sequence augmentation. Its key is twofold: at the global level, an augmentation mechanism simulates potential appearance variations across sites, suppressing the model's learning of domain-specific information; at the local level, a sequence-wise augmentation perturbs the style of tokens within random continuous sub-sequences by modeling and resampling style statistics associated with domain shifts. To the authors' knowledge this is the first work to explore Mamba's generalization for medical image segmentation, and it is the first to surpass a Dice coefficient of 90% on the Prostate dataset, exceeding the previous SOTA of 88.61%.

Link: https://arxiv.org/abs/2504.17515
Authors: Zihan Cheng, Jintao Guo, Jian Zhang, Lei Qi, Luping Zhou, Yinghuan Shi, Yang Gao
Affiliations: Shanghai Jiao Tong University School of Medicine; National Institute of Healthcare Data Science, Nanjing University; State Key Laboratory for Novel Software Technology; School of Computer Science and Engineering, Key Lab of Computer Network and Information Integration, Southeast University; School of Electrical and Information Engineering, The University of Sydney
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TMI 2025. The code is available at this https URL

Abstract:To segment medical images with distribution shifts, domain generalization (DG) has emerged as a promising setting to train models on source domains that can generalize to unseen target domains. Existing DG methods are mainly based on CNN or ViT architectures. Recently, advanced state space models, represented by Mamba, have shown promising results in various supervised medical image segmentation. The success of Mamba is primarily owing to its ability to capture long-range dependencies while keeping linear complexity with input sequence length, making it a promising alternative to CNNs and ViTs. Inspired by the success, in the paper, we explore the potential of the Mamba architecture to address distribution shifts in DG for medical image segmentation. Specifically, we propose a novel Mamba-based framework, Mamba-Sea, incorporating global-to-local sequence augmentation to improve the model’s generalizability under domain shift issues. Our Mamba-Sea introduces a global augmentation mechanism designed to simulate potential variations in appearance across different sites, aiming to suppress the model’s learning of domain-specific information. At the local level, we propose a sequence-wise augmentation along input sequences, which perturbs the style of tokens within random continuous sub-sequences by modeling and resampling style statistics associated with domain shifts. To our best knowledge, Mamba-Sea is the first work to explore the generalization of Mamba for medical image segmentation, providing an advanced and promising Mamba-based architecture with strong robustness to domain shifts. Remarkably, our proposed method is the first to surpass a Dice coefficient of 90% on the Prostate dataset, which exceeds previous SOTA of 88.61%. The code is available at this https URL.

[CV-31] RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Quick Read: This paper addresses the lack of reliable automatic evaluation for subject-driven text-to-image (T2I) generation: existing methods assess only one aspect of the task (textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. The key is RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, it outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), with gains of up to 6.4 points in textual alignment and 8.5 points in subject consistency, and it agrees with human preferences at over 87% accuracy on lesser-known concepts.

Link: https://arxiv.org/abs/2504.17502
Authors: Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Itay Laish, Dani Lischinski, Idan Szpektor
Affiliations: Google Research; Ben Gurion University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability – ranging from enhanced personalization in image generation to consistent character representation in video rendering – progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single prediction. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 8.5-point gains in subject consistency. It also excels with lesser-known concepts, aligning with human preferences at over 87% accuracy.

[CV-32] Enhanced Sample Selection with Confidence Tracking: Identifying Correctly Labeled yet Hard-to-Learn Samples in Noisy Data

Quick Read: This paper studies sample selection for image classification with noisy labels. Existing methods treat small-loss samples as correctly labeled, but some correctly labeled samples are inherently hard to learn and show high loss similar to mislabeled samples, especially early in training. Thresholding the per-sample loss therefore trades precision against recall: a low threshold misses many hard-to-learn clean samples (low recall), while a high threshold admits mislabeled ones (low precision). The key is to distinguish correctly labeled yet hard-to-learn samples from mislabeled ones by following trends in prediction confidence rather than relying on loss values alone. Empirically, only for correctly labeled samples does the model's confidence in the annotated label typically grow faster than for any other class. The paper therefore tracks the confidence gaps between the annotated label and all other classes during training and evaluates their trends with the Mann-Kendall test: a sample is deemed potentially correctly labeled if all of its gaps tend to increase. The method is a plug-and-play component for existing sample-selection techniques, and experiments on standard benchmarks and real-world datasets show that it improves existing methods for learning with noisy labels.

Link: https://arxiv.org/abs/2504.17474
Authors: Weiran Pan, Wei Wei, Feida Zhu, Yong Deng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a novel sample selection method for image classification in the presence of noisy labels. Existing methods typically consider small-loss samples as correctly labeled. However, some correctly labeled samples are inherently difficult for the model to learn and can exhibit high loss similar to mislabeled samples in the early stages of training. Consequently, setting a threshold on per-sample loss to select correct labels results in a trade-off between precision and recall in sample selection: a lower threshold may miss many correctly labeled hard-to-learn samples (low recall), while a higher threshold may include many mislabeled samples (low precision). To address this issue, our goal is to accurately distinguish correctly labeled yet hard-to-learn samples from mislabeled ones, thus alleviating the trade-off dilemma. We achieve this by considering the trends in model prediction confidence rather than relying solely on loss values. Empirical observations show that only for correctly labeled samples, the model’s prediction confidence for the annotated labels typically increases faster than for any other classes. Based on this insight, we propose tracking the confidence gaps between the annotated labels and other classes during training and evaluating their trends using the Mann-Kendall Test. A sample is considered potentially correctly labeled if all its confidence gaps tend to increase. Our method functions as a plug-and-play component that can be seamlessly integrated into existing sample selection techniques. Experiments on several standard benchmarks and real-world datasets demonstrate that our method enhances the performance of existing methods for learning with noisy labels.
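
The selection rule is straightforward to implement once per-epoch confidences are logged. Below is a small sketch with a textbook Mann-Kendall test (no tie correction); the significance threshold and the history format are illustrative assumptions.

```python
import numpy as np

def mann_kendall_z(x):
    """Normalized Mann-Kendall statistic (no tie correction);
    Z > 1.645 indicates a significant increasing trend at ~5%."""
    x = np.asarray(x, float)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var = n * (n - 1) * (2 * n + 5) / 18.0
    return 0.0 if s == 0 else (s - np.sign(s)) / np.sqrt(var)

def probably_clean(conf_history, label, z_crit=1.645):
    """conf_history: (n_epochs, n_classes) softmax confidences for ONE
    sample. Keep it iff the gap between the annotated class and every
    other class trends upward over training."""
    gaps = conf_history[:, label][:, None] - np.delete(conf_history, label, axis=1)
    return all(mann_kendall_z(gaps[:, k]) > z_crit for k in range(gaps.shape[1]))
```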

[CV-33] Unveiling Hidden Vulnerabilities in Digital Human Generation via Adversarial Attacks

Quick Read: This paper addresses the insufficient robustness and security of existing expressive human pose and shape estimation (EHPS) models, which are vulnerable to adversarial attacks. The key is Tangible Attack (TBA), a new framework for generating adversarial examples capable of compromising any digital human generation model. It introduces a Dual Heterogeneous Noise Generator (DHNG) that combines Variational Autoencoders (VAE) and ControlNet to produce diverse, targeted noise tailored to the original image features, together with a custom adversarial loss that optimizes the noise for high controllability and strong disruption. By iteratively refining adversarial samples with multi-gradient signals from both the noise and state-of-the-art EHPS models, TBA substantially improves attack effectiveness: experiments show a 41.0% increase in estimation error, with an average improvement of about 17.0%, exposing serious security vulnerabilities in current EHPS models and underscoring the need for stronger defenses in digital human generation systems.

Link: https://arxiv.org/abs/2504.17457
Authors: Zhiying Li, Yeying Jin, Fan Shen, Zhi Liu, Weibin Chen, Pengju Zhang, Xiaomei Zhang, Boyu Chen, Michael Shen, Kejian Wu, Zhaoxin Fan, Jin Dong
Affiliations: College of Cyber Security, Jinan University; Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University; Hangzhou International Innovation Institute, Beihang University; Department of Electrical and Computer Engineering, National University of Singapore; Department of Electrical and Computer Science, University of Pittsburgh; State Key Laboratory of Multimodal Artificial Intelligence, Institute of Automation, Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; University College London; Mingdu Tech; Xreal; Beijing Academy of Blockchain and Edge Computing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures

Abstract:Expressive human pose and shape estimation (EHPS) is crucial for digital human generation, especially in applications like live streaming. While existing research primarily focuses on reducing estimation errors, it largely neglects robustness and security aspects, leaving these systems vulnerable to adversarial attacks. To address this significant challenge, we propose the Tangible Attack (TBA), a novel framework designed to generate adversarial examples capable of effectively compromising any digital human generation model. Our approach introduces a Dual Heterogeneous Noise Generator (DHNG), which leverages Variational Autoencoders (VAE) and ControlNet to produce diverse, targeted noise tailored to the original image features. Additionally, we design a custom adversarial loss function to optimize the noise, ensuring both high controllability and potent disruption. By iteratively refining the adversarial sample through multi-gradient signals from both the noise and the state-of-the-art EHPS model, TBA substantially improves the effectiveness of adversarial attacks. Extensive experiments demonstrate TBA's superiority, achieving a remarkable 41.0% increase in estimation error, with an average improvement of approximately 17.0%. These findings expose significant security vulnerabilities in current EHPS models and highlight the need for stronger defenses in digital human generation systems.

[CV-34] FRAG : Frame Selection Augmented Generation for Long Video and Long Document Understanding

Quick Read: This paper addresses the limits on model size and performance when extending Large Multimodal Models (LMMs) to long inputs such as multi-page documents and long videos, which stem from the computational cost of training and inference. The key is Frame Selection Augmented Generation (FRAG): instead of long-context processing, the model first selects relevant frames within the input by scoring each frame independently, then generates the final output from only the Top-K selected frames. This simple mechanism avoids the burden of long context and, without any fine-tuning of existing models, markedly improves both long-video and multi-page document understanding: FRAG improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME, and achieves over 20% improvement on MP-DocVQA compared with recent LMMs specialized in long documents.

Link: https://arxiv.org/abs/2504.17447
Authors: De-An Huang, Subhashree Radhakrishnan, Zhiding Yu, Jan Kautz
Affiliations: NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:There has been impressive progress in Large Multimodal Models (LMMs). Recent works extend these models to long inputs, including multi-page documents and long videos. However, the model size and performance of these long context models are still limited due to the computational cost in both training and inference. In this work, we explore an orthogonal direction and process long inputs without long context LMMs. We propose Frame Selection Augmented Generation (FRAG), where the model first selects relevant frames within the input, and then only generates the final outputs based on the selected frames. The core of the selection process is done by scoring each frame independently, which does not require long context processing. The frames with the highest scores are then selected by a simple Top-K selection. We show that this frustratingly simple framework is applicable to both long videos and multi-page documents using existing LMMs without any fine-tuning. We consider two models, LLaVA-OneVision and InternVL2, in our experiments and show that FRAG consistently improves the performance and achieves state-of-the-art performances for both long video and long document understanding. For videos, FRAG substantially improves InternVL2-76B by 5.8% on MLVU and 3.7% on Video-MME. For documents, FRAG achieves over 20% improvements on MP-DocVQA compared with recent LMMs specialized in long document understanding. Code is available at: this https URL
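
Since the frames are scored independently, the whole FRAG control flow reduces to a few lines; in the sketch below, `score_fn` and `answer_fn` are hypothetical stand-ins for the underlying LMM calls.

```python
import numpy as np

def frag_answer(frames, question, score_fn, answer_fn, k=8):
    """Score every frame independently (no long-context pass), keep the
    Top-K in temporal order, and answer from the selected frames only."""
    scores = np.array([score_fn(f, question) for f in frames])
    keep = np.sort(np.argsort(scores)[-k:])    # Top-K, original order
    return answer_fn([frames[i] for i in keep], question)
```

Because each `score_fn` call sees a single frame, the selection stage parallelizes trivially and its cost grows linearly, not quadratically, with input length.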

[CV-35] Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding

Quick Read: This paper asks how longer observation can yield better understanding of the 4D (3D space plus time) configuration of unseen objects. Existing systems either optimize an underlying representation from multi-view observations or train a feed-forward predictor on supervised datasets. The key is Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle. Given a multi-view object scan and a long monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses the predictor to initialize a global optimization that refines the poses via inverse rendering, and finally distills the optimized results back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, a virtuous cycle that bootstraps its own training data to learn an object's pose configurations. A quasi-multiview mining strategy additionally exploits long videos to reduce depth ambiguity, further improving performance.

Link: https://arxiv.org/abs/2504.17441
Authors: Mingxuan Wu, Huang Huang, Justin Kerr, Chung Min Kim, Anthony Zhang, Brent Yi, Angjoo Kanazawa
Affiliations: University of California, Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: See our website at: this https URL. First two authors contributed equally.

Abstract:Humans can resort to long-form inspection to build intuition on predicting the 3D configurations of unseen objects. The more we observe the object motion, the better we get at predicting its 3D state immediately. Existing systems either optimize underlying representations from multi-view observations or train a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multiview mining strategy for reducing depth ambiguity by leveraging long video. We evaluate POD on 14 real-world and 5 synthetic objects with various joint types, including revolute and prismatic joints as well as multi-body configurations where parts detach or reattach independently. POD demonstrates significant improvement over a pure optimization baseline which gets stuck in local minima, particularly for longer videos. We also find that POD’s performance improves with both video length and successive iterations of the self-improving cycle, highlighting its ability to scale performance with additional observations and looped refinement.

[CV-36] Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLM s

Quick Read: This paper addresses three key limitations of existing cross-modal pretraining frameworks such as CLIP for multimodal representation learning: text token truncation, isolated image-text encoding, and deficient compositionality caused by bag-of-words behavior. Although Multimodal Large Language Models (MLLMs) have advanced generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. The key is UniME (Universal Multimodal Embedding), a novel two-stage framework: first, textual discriminative knowledge distillation from a powerful LLM-based teacher enhances the embedding capability of the MLLM's language component; second, hard negative enhanced instruction tuning mitigates false-negative contamination and samples multiple hard negatives per instance, forcing the model to focus on challenging samples and further strengthening discriminative representation learning. Experiments show consistent performance gains across all tasks, with superior discriminative and compositional abilities.

Link: https://arxiv.org/abs/2504.17432
Authors: Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
Affiliations: The University of Sydney; DeepGlint; Tongyi Lab, Alibaba Group; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures, Project page: this https URL

Abstract:The Contrastive Language-Image Pre-training (CLIP) framework has become a widely used approach for multimodal representation learning, particularly in image-text retrieval and clustering. However, its efficacy is constrained by three key limitations: (1) text token truncation, (2) isolated image-text encoding, and (3) deficient compositionality due to bag-of-words behavior. While recent Multimodal Large Language Models (MLLMs) have demonstrated significant advances in generalized vision-language understanding, their potential for learning transferable multimodal representations remains underexplored. In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. In the first stage, we perform textual discriminative knowledge distillation from a powerful LLM-based teacher model to enhance the embedding capability of the MLLM's language component. In the second stage, we introduce hard negative enhanced instruction tuning to further advance discriminative representation learning. Specifically, we initially mitigate false negative contamination and then sample multiple hard negatives per instance within each batch, forcing the model to focus on challenging samples. This approach not only improves discriminative power but also enhances instruction-following ability in downstream tasks. We conduct extensive experiments on the MMEB benchmark and multiple retrieval tasks, including short and long caption retrieval and compositional retrieval. Results demonstrate that UniME achieves consistent performance improvement across all tasks, exhibiting superior discriminative and compositional capabilities.
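
The second-stage objective can be illustrated with an in-batch InfoNCE variant that filters suspected false negatives and keeps only the k hardest negatives per query; the threshold and the exact mining rule here are assumptions, not the paper's specification.

```python
import torch
import torch.nn.functional as F

def hard_negative_info_nce(q, p, tau=0.05, k=4, fn_sim=0.95):
    """In-batch InfoNCE with hard-negative mining. q, p: (B, D) paired
    query/target embeddings; requires batch size B > k."""
    q, p = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
    sim = q @ p.T / tau                               # (B, B) similarities
    pos = sim.diag()
    eye = torch.eye(len(q), dtype=torch.bool, device=q.device)
    neg = sim.masked_fill(eye, float("-inf"))
    # False-negative filtering: near-duplicates of the positive are dropped.
    neg = neg.masked_fill(neg > fn_sim / tau, float("-inf"))
    hard = neg.topk(k, dim=1).values                  # k hardest negatives
    logits = torch.cat([pos.unsqueeze(1), hard], dim=1)
    target = torch.zeros(len(q), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, target)            # positive at index 0
```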
zh
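
结合上文的难负样本增强思路,下面给出一个最小的 Python/PyTorch 示意(非论文官方实现:函数名、难负样本数量与温度系数均为假设,且省略了论文中的假负样本过滤与指令微调环节),用于说明“按相似度挑选难负样本并计算 InfoNCE 损失”的核心逻辑:

```python
import torch
import torch.nn.functional as F

def hard_negative_info_nce(query, candidates, pos_idx, num_hard=4, tau=0.07):
    """难负样本增强对比损失示意。

    query:      (D,)   查询嵌入
    candidates: (N, D) 候选嵌入(其中包含 1 个正样本)
    pos_idx:    正样本在 candidates 中的下标
    """
    q = F.normalize(query, dim=-1)
    c = F.normalize(candidates, dim=-1)
    sims = c @ q                                    # (N,) 余弦相似度
    neg_sims = sims.clone()
    neg_sims[pos_idx] = float('-inf')               # 屏蔽正样本
    hard = torch.topk(neg_sims, k=num_hard).values  # 相似度最高的难负样本
    logits = torch.cat([sims[pos_idx:pos_idx + 1], hard]) / tau
    # 正样本位于 logits[0],损失等价于标签为 0 的交叉熵
    return -F.log_softmax(logits, dim=0)[0]

# 用法示意
q = torch.randn(128)
cands = torch.randn(32, 128)
loss = hard_negative_info_nce(q, cands, pos_idx=5)
```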

[CV-37] 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models

【速读】:该论文旨在解决视频试穿(Video Try-on)领域中处理复杂服装图案和多样人体姿态时难以生成高质量且时间一致性良好的结果的问题。论文的关键解决方案是提出了一种基于扩散模型的新框架3DV-TON,其核心在于利用生成的可动画化纹理化3D网格作为显式的帧级引导,从而缓解了模型过度关注外观保真度而牺牲运动一致性的弊端。这一目标通过直接引用整个视频序列中一致的服装纹理运动得以实现。此外,论文还引入了一种鲁棒的矩形掩码策略,有效减少了动态人体与服装运动过程中因衣物信息泄漏导致的伪影传播问题。

链接: https://arxiv.org/abs/2504.17414
作者: Min Wei,Chaohui Yu,Jingkai Zhou,Fan Wang
机构: DAMO Academy, Alibaba Group (达摩院,阿里巴巴集团); Hupan Lab (湖畔实验室); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Video try-on replaces clothing in videos with target garments. Existing methods struggle to generate high-quality and temporally consistent results when handling complex clothing patterns and diverse body poses. We present 3DV-TON, a novel diffusion-based framework for generating high-fidelity and temporally consistent video try-on results. Our approach employs generated animatable textured 3D meshes as explicit frame-level guidance, alleviating the issue of models over-focusing on appearance fidelity at the expense of motion coherence. This is achieved by enabling direct reference to consistent garment texture movements throughout video sequences. The proposed method features an adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe for initial 2D image try-on, followed by (2) reconstructing and animating a textured 3D mesh synchronized with original video poses. We further introduce a robust rectangular masking strategy that successfully mitigates artifact propagation caused by leaking clothing information during dynamic human and garment movements. To advance video try-on research, we introduce HR-VVT, a high-resolution benchmark dataset containing 130 videos with diverse clothing types and scenarios. Quantitative and qualitative results demonstrate our superior performance over existing methods. The project page is at this https URL
zh
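
关于文中提到的矩形掩码策略,下面给出一个“用外接矩形替代精细分割轮廓”的最小示意(numpy 实现;仅用于演示这一思路,margin 等参数为假设,与论文具体实现无关):

```python
import numpy as np

def rectangular_mask(seg_mask, margin=8):
    """由服装分割掩码生成矩形掩码。

    用外接矩形(外扩 margin 像素)替代精细轮廓,可以避免逐帧轮廓
    抖动导致的服装信息泄漏。seg_mask: (H, W) 的 0/1 数组。
    """
    ys, xs = np.nonzero(seg_mask)
    if len(ys) == 0:
        return np.zeros_like(seg_mask)
    h, w = seg_mask.shape
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin, h - 1)
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin, w - 1)
    rect = np.zeros_like(seg_mask)
    rect[y0:y1 + 1, x0:x1 + 1] = 1
    return rect
```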

[CV-38] StereoMamba: Real-time and Robust Intraoperative Stereo Disparity Estimation via Long-range Spatial Dependencies

【速读】:该论文旨在解决机器人辅助微创手术(RAMIS)中立体视差估计面临的挑战,特别是在准确性、鲁棒性和推理速度之间实现最优平衡的问题。为了解决这些问题,论文提出了一种名为StereoMamba的新架构。其关键在于两个创新模块:首先,Feature Extraction Mamba (FE-Mamba) 模块通过增强左右图像之间的长程空间依赖性来改进特征提取;其次,Multidimensional Feature Fusion (MFF) 模块有效整合多尺度特征,进一步提升模型性能。实验结果表明,StereoMamba在多个指标上实现了卓越表现,同时保持了高效的推理速度,验证了其在精度、鲁棒性和效率之间的最佳权衡能力。

链接: https://arxiv.org/abs/2504.17401
作者: Xu Wang,Jialang Xu,Shuai Zhang,Baoru Huang,Danail Stoyanov,Evangelos B. Mazomenos
机构: UCL Hawkes Institute (伦敦大学学院霍克斯研究所), University College London, London, UK (伦敦大学学院, 伦敦, 英国); Department of Medical Physics and Biomedical Engineering (医学物理与生物医学工程系), University College London, London, UK (伦敦大学学院, 伦敦, 英国); Department of Computer Science (计算机科学系), University College London, London, UK (伦敦大学学院, 伦敦, 英国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Stereo disparity estimation is crucial for obtaining depth information in robot-assisted minimally invasive surgery (RAMIS). While current deep learning methods have made significant advancements, challenges remain in achieving an optimal balance between accuracy, robustness, and inference speed. To address these challenges, we propose the StereoMamba architecture, which is specifically designed for stereo disparity estimation in RAMIS. Our approach is based on a novel Feature Extraction Mamba (FE-Mamba) module, which enhances long-range spatial dependencies both within and across stereo images. To effectively integrate multi-scale features from FE-Mamba, we then introduce a novel Multidimensional Feature Fusion (MFF) module. Experiments against the state-of-the-art on the ex-vivo SCARED benchmark demonstrate that StereoMamba achieves superior performance on EPE of 2.64 px and depth MAE of 2.55 mm, the second-best performance on Bad2 of 41.49% and Bad3 of 26.99%, while maintaining an inference speed of 21.28 FPS for a pair of high-resolution images (1280*1024), striking the optimum balance between accuracy, robustness, and efficiency. Furthermore, by comparing synthesized right images, generated from warping left images using the generated disparity maps, with the actual right image, StereoMamba achieves the best average SSIM (0.8970) and PSNR (16.0761), exhibiting strong zero-shot generalization on the in-vivo RIS2017 and StereoMIS datasets.
zh
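
摘要中报告的 EPE 与 Bad2/Bad3 是立体视差估计的标准指标,其常见计算方式可示意如下(numpy;有效像素的判定约定为假设):

```python
import numpy as np

def disparity_metrics(pred, gt, valid=None, bad_thresholds=(2, 3)):
    """计算视差估计常用指标:EPE(平均端点误差)与 Bad-n(误差超过 n 像素的比例)。"""
    if valid is None:
        valid = gt > 0              # 约定:gt <= 0 为无效像素(属假设)
    err = np.abs(pred - gt)[valid]
    metrics = {'EPE': float(err.mean())}
    for t in bad_thresholds:
        metrics[f'Bad{t}'] = float((err > t).mean() * 100)  # 百分比
    return metrics

# 用法示意:1280x1024 的视差图
pred = np.random.rand(1024, 1280) * 100
gt = np.random.rand(1024, 1280) * 100
print(disparity_metrics(pred, gt))
```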

[CV-39] S2S-Net: Addressing the Domain Gap of Heterogeneous Sensor Systems in LiDAR-Based Collective Perception

【速读】:本文旨在解决在车到车(V2V)集体感知(Collective Perception, CP)中因连接与自动化车辆(Connected and Automated Vehicles, CAVs)采用异构传感器系统而导致的Sensor2Sensor域间隙(domain gap)问题。这一问题长期未被充分解决,主要由于缺乏包含异构传感器配置的公开数据集。为应对这一挑战,SCOPE数据集提供了每辆CAV的三种不同LiDAR传感器数据。论文的关键解决方案是提出了一种名为S2S-Net的传感器域鲁棒架构,它能够适应不同的传感器域并在SCOPE数据集上实现了最先进的性能,特别是在未见传感器域中保持了极高的性能表现。

链接: https://arxiv.org/abs/2504.17399
作者: Sven Teufel,Jörg Gamerdinger,Oliver Bringmann
机构: University of Tübingen (蒂宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Collective Perception (CP) has emerged as a promising approach to overcome the limitations of individual perception in the context of autonomous driving. Various approaches have been proposed to realize collective perception; however, the Sensor2Sensor domain gap that arises from the utilization of different sensor systems in Connected and Automated Vehicles (CAVs) remains mostly unaddressed. This is primarily due to the paucity of datasets containing heterogeneous sensor setups among the CAVs. The recently released SCOPE dataset addresses this issue by providing data from three different LiDAR sensors for each CAV. This study is the first to tackle the Sensor2Sensor domain gap in vehicle to vehicle (V2V) collective perception. First, we present our sensor-domain robust architecture S2S-Net. Then an in-depth analysis of the Sensor2Sensor domain adaptation capabilities of S2S-Net on the SCOPE dataset is conducted. S2S-Net demonstrates the capability to maintain very high performance in unseen sensor domains and achieved state-of-the-art results on the SCOPE dataset.
zh

[CV-40] Fine-tune Smarter, Not Harder: Parameter-Efficient Fine-Tuning for Geospatial Foundation Models

【速读】:该论文试图解决大尺寸基础模型在遥感图像分析领域因全量微调导致的计算资源消耗高、成本高昂以及模型泛化能力下降的问题。解决方案的关键在于采用Parameter-Efficient Fine-Tuning (PEFT) 技术,通过减少参数更新量实现高效且经济的模型适配,同时保持甚至提升模型性能与泛化能力至未见过的地理区域,并显著降低训练时间和内存需求。实验结果表明,推荐的架构配置包括UNet解码器和不使用元数据的微调方式。

链接: https://arxiv.org/abs/2504.17397
作者: Francesc Marti-Escofet,Benedikt Blumenstiel,Linus Scheibenreif,Paolo Fraccaro,Konrad Schindler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code available at this https URL

点击查看摘要

Abstract:Earth observation (EO) is crucial for monitoring environmental changes, responding to disasters, and managing natural resources. In this context, foundation models facilitate remote sensing image analysis to retrieve relevant geoinformation accurately and efficiently. However, as these models grow in size, fine-tuning becomes increasingly challenging due to the associated computational resources and costs, limiting their accessibility and scalability. Furthermore, full fine-tuning can lead to forgetting pre-trained features and even degrade model generalization. To address this, Parameter-Efficient Fine-Tuning (PEFT) techniques offer a promising solution. In this paper, we conduct extensive experiments with various foundation model architectures and PEFT techniques to evaluate their effectiveness on five different EO datasets. Our results provide a comprehensive comparison, offering insights into when and how PEFT methods support the adaptation of pre-trained geospatial models. We demonstrate that PEFT techniques match or even exceed full fine-tuning performance and enhance model generalisation to unseen geographic regions, while reducing training time and memory requirements. Additional experiments investigate the effect of architecture choices such as the decoder type or the use of metadata, suggesting UNet decoders and fine-tuning without metadata as the recommended configuration. We have integrated all evaluated foundation models and techniques into the open-source package TerraTorch to support quick, scalable, and cost-effective model adaptation.
zh
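
文中评估了多种 PEFT 技术。以 LoRA 为例,其“冻结预训练权重、只训练低秩增量”的核心思想可用如下 PyTorch 最小示意说明(非该论文或 TerraTorch 的实际实现,秩与缩放系数均为假设):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA 式参数高效微调示意:冻结原线性层,只训练低秩增量 BA。"""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # 冻结预训练权重
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # 初始增量为 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T) @ self.lora_b.T * self.scale

# 用法示意:包装基础模型中的某个线性层
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```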

[CV-41] SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting

【速读】:本文旨在解决开放世界目标计数(Open-world Object Counting)任务中,现有方法因仅关注训练集中已知类别而导致对未知类别泛化能力有限的问题。为应对这一挑战,论文提出了一种即插即用的语义驱动视觉提示微调框架(Semantic-Driven Visual Prompt Tuning, SDVPT)。该框架的关键在于引入了一种两阶段视觉提示学习策略:首先通过类别特定视觉提示初始化(Category-Specific Prompt Initialization, CSPI)生成针对特定类别的视觉提示;然后利用拓扑引导的提示优化(Topology-Guided Prompt Refinement, TGPR),从预训练视觉语言模型(Vision-Language Model, VLM)的文本编码器中蒸馏潜在结构模式以进一步优化这些提示。在推理阶段,SDVPT能够基于未见类别与训练集类别之间的语义相关性动态合成视觉提示,从而实现对未知类别的稳健文本-图像对齐。实验结果表明,该方法在FSC-147、CARPK和PUCPR+三个常用数据集上的有效性与适应性。

链接: https://arxiv.org/abs/2504.17395
作者: Yiming Zhao,Guorong Li,Laiyun Qing,Amin Beheshti,Jian Yang,Michael Sheng,Yuankai Qi,Qingming Huang
机构: UCAS(中国科学院大学)(北京, 中国); Macquarie University(麦考瑞大学)(悉尼, 澳大利亚)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-world object counting leverages the robust text-image alignment of pre-trained vision-language models (VLMs) to enable counting of arbitrary categories in images specified by textual queries. However, widely adopted naive fine-tuning strategies concentrate exclusively on text-image consistency for categories contained in training, which leads to limited generalizability for unseen categories. In this work, we propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories with minimal overhead in parameters and inference time. First, we introduce a two-stage visual prompt learning strategy composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided Prompt Refinement (TGPR). The CSPI generates category-specific visual prompts, and then TGPR distills latent structural patterns from the VLM’s text encoder to refine these prompts. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories, facilitating robust text-image alignment for unseen categories. Extensive experiments integrating SDVPT with all available open-world object counting models demonstrate its effectiveness and adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.
zh

[CV-42] Highly Accurate and Diverse Traffic Data: The DeepScenario Open 3D Dataset

【速读】:本文旨在解决传统3D轨迹数据集在自动驾驶领域中存在的两个主要问题:一是由于固定传感器导致的遮挡(occlusion)问题;二是仅能精确重建测量车辆附近动态环境,而忽视远处物体。为了解决这些问题,论文提出了一种基于单目相机无人机跟踪管道的新型方法,构建了一个高质量且无遮挡的DeepScenario Open 3D Dataset (DSC3D),包含超过175,000条来自14类交通参与者的6自由度边界框轨迹数据。该方案的关键在于利用无人机视角克服地面传感器的局限性,从而显著提升数据的多样性和规模,并涵盖复杂场景如城市街道上的车辆-行人交互及全面的停车操作。通过提供详细的三维环境表示,DSC3D有望改善障碍物交互能力和安全性,广泛应用于运动预测、规划、场景挖掘以及生成式反应性交通代理等领域。

链接: https://arxiv.org/abs/2504.17371
作者: Oussema Dhaouadi,Johannes Meier,Luca Wahl,Jacques Kaiser,Luca Scalerandi,Nick Wandelburg,Zhuolun Zhou,Nijanthan Berinpanathan,Holger Banzhaf,Daniel Cremers
机构: DeepScenario (深度场景); TU Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment in the close vicinity of the measurement vehicle only, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6 degrees of freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in terms of diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interaction on highly populated urban streets and comprehensive parking maneuvers from entry to exit. DSC3D dataset was captured in five various locations in Europe and the United States and include: a parking lot, a crowded inner-city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed environmental 3D representations, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available at this http URL, facilitating research in motion prediction, behavior modeling, and safety validation.
zh

[CV-43] I-INR: Iterative Implicit Neural Representations

【速读】:该论文试图解决传统隐式神经表示(Implicit Neural Representations, INRs)在信号重建过程中容易回归到均值、难以捕捉精细细节、保留高频信息以及有效处理噪声的问题。为了解决这些问题,论文提出了一种名为迭代隐式神经表示(Iterative Implicit Neural Representations, I-INRs)的新框架。I-INRs通过引入迭代细化过程,显著提升了信号重建的质量,增强了对噪声的鲁棒性,并有效恢复了高频率细节。其关键在于通过迭代优化机制逐步改善表示精度,同时与现有INR架构无缝集成,实现了性能的全面提升。实验结果表明,I-INRs在图像修复、图像去噪和物体占用预测等任务中优于基线方法(如WIRE、SIREN和Gauss)。

链接: https://arxiv.org/abs/2504.17364
作者: Ali Haider,Muhammad Salman Ali,Maryam Qamar,Tahir Khalil,Soo Ye Kim,Jihyong Oh,Enzo Tartaglione,Sung-Ho Bae
机构: Kyung Hee University (庆熙大学), Republic of Korea; Adobe Research (Adobe 研究院); Chung-Ang University (中央大学), Republic of Korea; LTCI, Télécom Paris, Institut Polytechnique de Paris (Télécom 巴黎研究院, 巴黎高等理工学院), France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Implicit Neural Representations (INRs) have revolutionized signal processing and computer vision by modeling signals as continuous, differentiable functions parameterized by neural networks. However, their inherent formulation as a regression problem makes them prone to regression to the mean, limiting their ability to capture fine details, retain high-frequency information, and handle noise effectively. To address these challenges, we propose Iterative Implicit Neural Representations (I-INRs), a novel plug-and-play framework that enhances signal reconstruction through an iterative refinement process. I-INRs effectively recover high-frequency details, improve robustness to noise, and achieve superior reconstruction quality. Our framework seamlessly integrates with existing INR architectures, delivering substantial performance gains across various tasks. Extensive experiments show that I-INRs outperform baseline methods, including WIRE, SIREN, and Gauss, in diverse computer vision applications such as image restoration, image denoising, and object occupancy prediction.
zh
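
下面用一维信号给出“迭代细化”思想的一个简化示意(PyTorch;这是按“迭代残差拟合”理解的草图,并非论文原始算法,网络结构与迭代轮数均为假设):

```python
import torch
import torch.nn as nn

def make_inr(hidden=64):
    """一个极简的坐标 MLP,用于拟合一维信号 x -> y。"""
    return nn.Sequential(nn.Linear(1, hidden), nn.Tanh(),
                         nn.Linear(hidden, hidden), nn.Tanh(),
                         nn.Linear(hidden, 1))

# 目标信号:包含高频成分与不连续点
x = torch.linspace(-1, 1, 256).unsqueeze(1)
y = torch.sin(8 * torch.pi * x) + 0.3 * torch.sign(x)

residual, stages = y.clone(), []
for stage in range(3):                 # 每一轮用一个小 INR 拟合上一轮的残差
    net = make_inr()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(500):
        opt.zero_grad()
        loss = ((net(x) - residual) ** 2).mean()
        loss.backward()
        opt.step()
    with torch.no_grad():
        residual = residual - net(x)   # 剩余的高频细节交给下一轮
    stages.append(net)

recon = sum(net(x) for net in stages)  # 各轮输出之和即最终重建
```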

[CV-44] DRC: Enhancing Personalized Image Generation via Disentangled Representation Composition

【速读】:该论文致力于解决个性化图像生成中用户风格偏好与语义意图难以被准确捕捉和融合的问题。现有方法(无论是基于扩散模型、大语言模型还是大规模多模态模型)在处理这一任务时,容易出现视觉特征纠缠以及指导崩溃(Guidance Collapse),导致生成的图像无法保留用户的偏好风格或正确反映指定的语义信息。为此,论文提出了一种名为DRC的新框架,其关键在于通过解缠表示组合(Disentangled Representation Composition)增强大规模多模态模型。DRC通过双塔解缠器显式分离历史图像中的风格特征和参考图像中的语义特征,并形成特定于用户的潜在指令来引导生成过程。该方案包含两个核心学习阶段:1)解缠学习,采用基于重建驱动范式的难度感知重要性采样优化方法;2)个性化建模,利用语义保持增强技术有效调整解缠表示以实现鲁棒的个性化生成。实验结果表明,DRC在缓解指导崩溃问题的同时展现出竞争力,强调了解缠表示学习对于可控且有效的个性化图像生成的重要性。

链接: https://arxiv.org/abs/2504.17349
作者: Yiyan Xu,Wuqiang Zheng,Wenjie Wang,Fengbin Zhu,Xinting Hu,Yang Zhang,Fuli Feng,Tat-Seng Chua
机构: University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Personalized image generation has emerged as a promising direction in multimodal content creation. It aims to synthesize images tailored to individual style preferences (e.g., color schemes, character appearances, layout) and semantic intentions (e.g., emotion, action, scene contexts) by leveraging user-interacted history images and multimodal instructions. Despite notable progress, existing methods – whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) – struggle to accurately capture and fuse user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics. To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. DRC explicitly extracts user style preferences and semantic intentions from history images and the reference image, respectively, to form user-specific latent instructions that guide image generation within LMMs. Specifically, it involves two critical learning stages: 1) Disentanglement learning, which employs a dual-tower disentangler to explicitly separate style and semantic features, optimized via a reconstruction-driven paradigm with difficulty-aware importance sampling; and 2) Personalized modeling, which applies semantic-preserving augmentations to effectively adapt the disentangled representations for robust personalized generation. Extensive experiments on two benchmarks demonstrate that DRC shows competitive performance while effectively mitigating the guidance collapse issue, underscoring the importance of disentangled representation learning for controllable and effective personalized image generation.
zh

[CV-45] TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

【速读】:该论文旨在解决在线视频平台,特别是直播服务中实时视频理解系统的迫切需求。现有VideoLLMs在处理完整视频方面表现出色,但在流媒体场景下因无法高效处理密集冗余帧而面临显著限制。为应对这一挑战,论文提出了TimeChat-Online,这是一种创新的在线VideoLLM,其核心是Differential Token Drop (DTD)模块,该模块通过过滤静态冗余内容并保留有意义的时间变化来解决流媒体视频中的视觉冗余问题。DTD的设计灵感来源于人类视觉感知的Change Blindness现象。实验表明,DTD在将视频令牌减少82.8%的同时,在StreamingBench上保持了98%的性能,证明超过80%的流媒体视频内容本质上是冗余的且无需语言指导。此外,TimeChat-Online-139K作为一个包含多样化交互模式的全面流媒体视频数据集,进一步支持了实时交互能力。TimeChat-Online的独特之处在于其Proactive Response能力,这得益于DTD对视频场景转换的持续监控。论文的广泛评估显示,TimeChat-Online在流媒体基准(StreamingBench和OvOBench)上表现优异,并在长视频任务(如Video-MME和MLVU)中保持竞争力。

链接: https://arxiv.org/abs/2504.17343
作者: Linli Yao,Yicheng Li,Yuancheng Wei,Lei Li,Shuhuai Ren,Yuanxin Liu,Kun Ouyang,Lean Wang,Shicheng Li,Sida Li,Lingpeng Kong,Qi Liu,Yuanxing Zhang,Xu Sun
机构: Peking University (北京大学); South China University of Technology (华南理工大学); The University of Hong Kong (香港大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid growth of online video platforms, particularly live streaming services, has created an urgent need for real-time video understanding systems. These systems must process continuous video streams and respond to user queries instantaneously, presenting unique challenges for current Video Large Language Models (VideoLLMs). While existing VideoLLMs excel at processing complete videos, they face significant limitations in streaming scenarios due to their inability to handle dense, redundant frames efficiently. We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. At its core lies our innovative Differential Token Drop (DTD) module, which addresses the fundamental challenge of visual redundancy in streaming videos. Drawing inspiration from human visual perception’s Change Blindness phenomenon, DTD preserves meaningful temporal changes while filtering out static, redundant content between frames. Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance. To enable seamless real-time interaction, we present TimeChat-Online-139K, a comprehensive streaming video dataset featuring diverse interaction patterns including backward-tracing, current-perception, and future-responding scenarios. TimeChat-Online’s unique Proactive Response capability, naturally achieved through continuous monitoring of video scene transitions via DTD, sets it apart from conventional approaches. Our extensive evaluation demonstrates TimeChat-Online’s superior performance on streaming benchmarks (StreamingBench and OvOBench) and maintaining competitive results on long-form video tasks such as Video-MME and MLVU.
zh
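
DTD 模块“保留有意义的时间变化、丢弃静态冗余 token”的核心思想可用如下 PyTorch 草图示意(非官方实现,保留比例等参数为假设):

```python
import torch

def differential_token_drop(prev_tokens, cur_tokens, keep_ratio=0.2):
    """按帧间 token 差异幅度,仅保留变化最大的部分。

    prev_tokens / cur_tokens: (N, D),同一空间位置的视觉 token。
    返回保留的 token 及其下标(静态冗余 token 被丢弃)。
    """
    diff = (cur_tokens - prev_tokens).norm(dim=-1)      # 每个 token 的变化量
    k = max(1, int(keep_ratio * cur_tokens.shape[0]))
    keep_idx = torch.topk(diff, k=k).indices.sort().values  # 保持空间顺序
    return cur_tokens[keep_idx], keep_idx

# 用法示意:仅前 20 个位置发生变化,其余 token 将被丢弃
prev = torch.randn(196, 768)
cur = prev.clone()
cur[:20] += torch.randn(20, 768)
kept, idx = differential_token_drop(prev, cur)
```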

[CV-46] DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model

【速读】:该论文旨在解决“端到端复杂版式文档图像机器翻译”问题,具体是开发一种能够处理包含复杂布局信息的文档图像翻译的系统。论文的关键解决方案在于提出了一种结合多任务学习(Multi-task Learning)与感知链式思维(Perceptual Chain-of-Thought)的训练框架,并基于先进的开源视觉-语言大模型(LVLM),构建了一个统一的端到端文档翻译系统。此外,在推理阶段,通过最小贝叶斯解码(Minimum Bayesian Decoding)和后处理策略进一步优化翻译性能,同时实现了基于OCR和无OCR两种文档图像翻译任务的有效融合。

链接: https://arxiv.org/abs/2504.17315
作者: Zhanglin Wu,Tengfei Song,Ning Xie,Weidong Zhang,Pengfei Li,Shuang Wu,Chong Li,Junhao Zhu,Hao Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figures, 2 tables

点击查看摘要

Abstract:This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the “End-to-End Document Image Machine Translation for Complex Layouts” competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayesian decoding and post-processing strategies to further enhance the system’s translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
zh
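
推理阶段用到的最小贝叶斯(风险)解码,其通用形式是:在候选译文中选出与其余候选平均一致性最高的一条。下面是一个简化示意(Python;效用函数此处用词级 F1 代替,实际系统常用 BLEU/COMET 等指标,属假设):

```python
def mbr_decode(candidates, utility):
    """最小贝叶斯风险解码示意:选取期望效用最高的候选译文。"""
    best, best_score = None, float('-inf')
    for hyp in candidates:
        score = sum(utility(hyp, ref) for ref in candidates if ref is not hyp)
        if score > best_score:
            best, best_score = hyp, score
    return best

def token_f1(a, b):
    """假设的效用函数:词级 F1。"""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    overlap = len(sa & sb)
    p, r = overlap / len(sa), overlap / len(sb)
    return 2 * p * r / (p + r) if p + r else 0.0

cands = ["the cat sat on the mat", "a cat sat on the mat", "the dog ran away"]
print(mbr_decode(cands, token_f1))   # 与其余候选最一致的译文胜出
```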

[CV-47] Class-Conditional Distribution Balancing for Group Robust Classification

【速读】:该论文旨在解决由虚假相关性(spurious correlations)引起的模型正确预测却基于错误理由的问题,这是实现鲁棒实际泛化(robust real-world generalization)的关键挑战。现有研究将此问题归因于组别不平衡(group imbalance),并通过最大化组平衡或最差组准确率来应对,但这些方法高度依赖昂贵的偏差注释(bias annotations)。另一种折中方案是利用大规模预训练的基础模型预测偏差信息,但这需要大量数据,在资源受限的稀有领域变得不切实际。

论文的关键解决方案在于重新定义虚假相关性为类别条件分布(class-conditional distributions)的不平衡或不匹配,并提出了一种简单而有效的鲁棒学习方法,无需依赖偏差注释或预测。该方法通过减少虚假因素与标签信息之间的互信息(mutual information),采用样本重加权策略实现类别条件分布的平衡,从而自动突出少数群体和类别,有效消除虚假相关性并生成去偏的数据分布用于分类任务。实验结果表明,该方法在多个任务中达到了最先进的性能水平,且表现优于依赖偏差监督的方法。

链接: https://arxiv.org/abs/2504.17314
作者: Miaoyun Zhao,Qiang Zhang,Chenrong Li
机构: Key Laboratory of Social Computing and Cognitive Intelligence (Dalian University of Technology), Ministry of Education (大连理工大学社会计算与认知智能重点实验室,教育部); School of Computer Science and Technology, Dalian University of Technology (大连理工大学计算机科学与技术学院), Dalian 116024, China
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spurious correlations that lead models to correct predictions for the wrong reasons pose a critical challenge for robust real-world generalization. Existing research attributes this issue to group imbalance and addresses it by maximizing group-balanced or worst-group accuracy, which heavily relies on expensive bias annotations. A compromise approach involves predicting bias information using extensively pretrained foundation models, which requires large-scale data and becomes impractical for resource-limited rare domains. To address these challenges, we offer a novel perspective by reframing the spurious correlations as imbalances or mismatches in class-conditional distributions, and propose a simple yet effective robust learning method that eliminates the need for both bias annotations and predictions. With the goal of reducing the mutual information between spurious factors and label information, our method leverages a sample reweighting strategy to achieve class-conditional distribution balancing, which automatically highlights minority groups and classes, effectively dismantling spurious correlations and producing a debiased data distribution for classification. Extensive experiments and analysis demonstrate that our approach consistently delivers state-of-the-art performance, rivaling methods that rely on bias supervision.
zh
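
文中“类别条件分布平衡”的目标可以用一个简化的逆频率重加权草图来直观理解(numpy;注意:原方法无需偏差标注,此处为了演示,假设虚假因子已被某种方式估计出来,仅用于说明“平衡类别条件分布等价于上调少数组合权重”这一目标):

```python
import numpy as np

def balance_weights(labels, factors):
    """按 (类别, 因子) 组合的逆频率为样本重加权。

    labels:  每个样本的类别标签
    factors: 每个样本的(估计)虚假因子取值
    """
    labels, factors = np.asarray(labels), np.asarray(factors)
    w = np.empty(len(labels), dtype=float)
    for y in np.unique(labels):
        for g in np.unique(factors):
            m = (labels == y) & (factors == g)
            if m.any():
                w[m] = 1.0 / m.sum()        # 组合越少见,单样本权重越大
    return w * (len(labels) / w.sum())      # 归一化,保持平均权重为 1

w = balance_weights([0, 0, 0, 1, 1], [0, 0, 1, 0, 1])
```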

[CV-48] Advanced Segmentation of Diabetic Retinopathy Lesions Using DeepLabv3+

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)病变(微动脉瘤、出血、硬性渗出和软性渗出)分割精度不足的问题。为应对数据集限制和标注复杂性的挑战,论文提出了一种针对每种病变类型的二值化分割方法,并在后处理阶段将各模型输出融合为单一图像以优化分析。解决方案的关键在于结合特定的预处理步骤(如LAB颜色空间L通道的裁剪与对比度受限自适应直方图均衡化)、针对性的数据增强技术以及基于DeepLabv3+模型的创新分割策略,最终实现了99%的分割精度,显著提升了病变分割的准确性与鲁棒性。

链接: https://arxiv.org/abs/2504.17306
作者: Meher Boulaabi,Takwa Ben Aïcha Gader,Afef Kacem Echi,Sameh Mbarek
机构: University of Monastir (蒙斯特大学), FSM; University of Tunis (突尼斯大学), ENSIT-LaTICE, Tunisia; Faculty of Medicine of Monastir (蒙斯特医学院), Tunisia; Department of Ophthalmology, Taher Sfar Mahdia Hospital (塔赫尔·斯法尔医院), Tunisia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This work was accepted at the ACS/IEEE International Conference on Computer Systems and Applications (AICCSA) 2024

点击查看摘要

Abstract:To improve the segmentation of diabetic retinopathy lesions (microaneurysms, hemorrhages, exudates, and soft exudates), we implemented a binary segmentation method specific to each type of lesion. As post-segmentation, we combined the individual model outputs into a single image to better analyze the lesion types. This approach facilitated parameter optimization and improved accuracy, effectively overcoming challenges related to dataset limitations and annotation complexity. Specific preprocessing steps included cropping and applying contrast-limited adaptive histogram equalization to the L channel of the LAB image. Additionally, we employed targeted data augmentation techniques to further refine the model’s efficacy. Our methodology utilized the DeepLabv3+ model, achieving a segmentation accuracy of 99%. These findings highlight the efficacy of innovative strategies in advancing medical image analysis, particularly in the precise segmentation of diabetic retinopathy lesions. The IDRID dataset was utilized to validate and demonstrate the robustness of our approach.
zh
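
摘要中描述的预处理步骤(在 LAB 色彩空间的 L 通道上做对比度受限自适应直方图均衡化)可用 OpenCV 示意如下(clipLimit、tileGridSize 为常见默认取值,属假设,论文未给出具体参数):

```python
import cv2
import numpy as np

def preprocess_fundus(bgr_image):
    """在 LAB 色彩空间的 L(亮度)通道上应用 CLAHE,再转回 BGR。"""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge([clahe.apply(l), a, b])
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

# 用法示意
img = (np.random.rand(512, 512, 3) * 255).astype(np.uint8)
out = preprocess_fundus(img)
```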

[CV-49] EdgePoint2: Compact Descriptors for Superior Efficiency and Accuracy

【速读】:本文旨在解决轻量级关键点检测与描述在边缘计算应用中的效率与准确性平衡问题,同时应对高维特征描述符带来的分布式通信挑战。为实现这一目标,论文的关键创新在于提出EdgePoint2系列网络,其架构优化了效率而不牺牲精度,并通过结合正交Procrustes损失和相似性损失训练紧凑型描述符,提供了一种通用的超球嵌入蒸馏方法。此外,提供了14种子模型以满足多样化需求。实验表明,EdgePoint2在多种复杂场景下实现了最先进的准确性和效率,同时使用低维描述符(32/48/64维)。

链接: https://arxiv.org/abs/2504.17280
作者: Haodi Yao,Fenghua He,Ning Hao,Chen Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The field of keypoint extraction, which is essential for vision applications like Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), has evolved from relying on handcrafted methods to leveraging deep learning techniques. While deep learning approaches have significantly improved performance, they often incur substantial computational costs, limiting their deployment in real-time edge applications. Efforts to create lightweight neural networks have seen some success, yet they often result in trade-offs between efficiency and accuracy. Additionally, the high-dimensional descriptors generated by these networks pose challenges for distributed applications requiring efficient communication and coordination, highlighting the need for compact yet competitively accurate descriptors. In this paper, we present EdgePoint2, a series of lightweight keypoint detection and description neural networks specifically tailored for edge computing applications on embedded systems. The network architecture is optimized for efficiency without sacrificing accuracy. To train compact descriptors, we introduce a combination of Orthogonal Procrustes loss and similarity loss, which can serve as a general approach for hypersphere embedding distillation tasks. Additionally, we offer 14 sub-models to satisfy diverse application requirements. Our experiments demonstrate that EdgePoint2 consistently achieves state-of-the-art (SOTA) accuracy and efficiency across various challenging scenarios while employing lower-dimensional descriptors (32/48/64). Beyond its accuracy, EdgePoint2 offers significant advantages in flexibility, robustness, and versatility. Consequently, EdgePoint2 emerges as a highly competitive option for visual tasks, especially in contexts demanding adaptability to diverse computational and communication constraints.
zh
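
文中用于超球嵌入蒸馏的正交 Procrustes 损失,其一般形式是:先用 SVD 求出对齐学生与教师描述子的最优正交变换,再度量对齐后的残差。一个 PyTorch 草图如下(非官方实现;假设学生与教师描述子维度相同,维度不同时需先投影,且此处省略了与相似性损失的组合):

```python
import torch

def orthogonal_procrustes_loss(student, teacher):
    """正交 Procrustes 蒸馏损失示意。student / teacher: (N, d)。"""
    m = student.T @ teacher                 # (d, d) 交叉协方差
    u, _, vh = torch.linalg.svd(m)
    omega = u @ vh                          # 最优正交变换:student @ omega ≈ teacher
    return ((student @ omega - teacher) ** 2).mean()

# 用法示意
s = torch.randn(256, 64)
t = torch.randn(256, 64)
loss = orthogonal_procrustes_loss(s, t)
```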

[CV-50] Towards Generalized and Training-Free Text-Guided Semantic Manipulation

【速读】:该论文试图解决文本引导的语义操作(Text-guided Semantic Manipulation)中存在的效率低下、扩展性差以及泛化能力有限的问题。现有方法通常需要耗时的微调(低效)、难以实现多种语义操作(扩展性差),或缺乏跨模态任务的支持(泛化能力有限)。论文的关键发现是扩散模型中的噪声几何特性与语义变化之间存在强相关性,基于此提出了一个新的通用且无需训练的文本引导语义操作框架(\textit{GTF})。其关键解决方案在于通过控制噪声之间的几何关系来实现高保真度的语义编辑,而无需进行参数调整或优化,从而实现了对多种语义操作的支持,并能够无缝集成到不同模态的扩散模型方法中(即插即用,Plug-and-play)。

链接: https://arxiv.org/abs/2504.17269
作者: Yu Hong,Xiao Cai,Pengpeng Zeng,Shuai Zhang,Jingkuan Song,Lianli Gao,Heng Tao Shen
机构: UESTC(电子科技大学); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-guided semantic manipulation refers to semantically editing an image generated from a source prompt to match a target prompt, enabling the desired semantic changes (e.g., addition, removal, and style transfer) while preserving irrelevant contents. With the powerful generative capabilities of the diffusion model, the task has shown the potential to generate high-fidelity visual content. Nevertheless, existing methods either typically require time-consuming fine-tuning (inefficient), fail to accomplish multiple semantic manipulations (poorly extensible), and/or lack support for different modality tasks (limited generalizability). Upon further investigation, we find that the geometric properties of noises in the diffusion model are strongly correlated with the semantic changes. Motivated by this, we propose a novel \textit{GTF} for text-guided semantic manipulation, which has the following attractive capabilities: 1) \textbf{Generalized}: our \textit{GTF} supports multiple semantic manipulations (e.g., addition, removal, and style transfer) and can be seamlessly integrated into all diffusion-based methods (i.e., Plug-and-play) across different modalities (i.e., modality-agnostic); and 2) \textbf{Training-free}: \textit{GTF} produces high-fidelity results via simply controlling the geometric relationship between noises without tuning or optimization. Our extensive experiments demonstrate the efficacy of our approach, highlighting its potential to advance the state-of-the-art in semantics manipulation.
zh

[CV-51] Precision Neural Network Quantization via Learnable Adaptive Modules

【速读】:该论文旨在解决在量化感知训练(Quantization Aware Training, QAT)中,通过使量化参数可训练以提升性能的同时,导致推理阶段灵活性降低的问题,尤其是在处理具有显著不同分布的激活值时。此外,论文还关注传统基于2的幂次(Power of Two, POT)量化方法中存在的刚性分辨率问题。
论文的关键解决方案是提出了一种名为自适应步长量化(Adaptive Step Size Quantization, ASQ)的方法。首先,ASQ 方法通过一个经过训练的模块动态调整量化缩放因子,以适配不同的激活值分布;其次,引入基于平方根的2的幂次(Power Of Square root of Two, POST)非均匀量化方案,有效应对神经网络权重在多种位宽下的钟形分布,并通过查找表(Look-Up Table, LUT)方法保持计算效率。实验结果表明,ASQ 方法在性能上优于现有最先进的 QAT 方法,并且在某些情况下甚至与全精度模型相当,例如其4位量化的ResNet34模型在ImageNet上的准确率提升了1.2%。

链接: https://arxiv.org/abs/2504.17263
作者: Wenqiang Zhou,Zhendong Yu,Xinyu Liu,Jiaming Yang,Rong Xiao,Tao Wang,Chenwei Tang,Jiancheng Lv
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Complexity (cs.CC)
备注:

点击查看摘要

Abstract:Quantization Aware Training (QAT) is a neural network quantization technique that compresses model size and improves operational efficiency while effectively maintaining model performance. The paradigm of QAT is to introduce fake quantization operators during the training process, allowing the model to autonomously compensate for information loss caused by quantization. Making quantization parameters trainable can significantly improve the performance of QAT, but at the cost of compromising the flexibility during inference, especially when dealing with activation values with substantially different distributions. In this paper, we propose an effective learnable adaptive neural network quantization method, called Adaptive Step Size Quantization (ASQ), to resolve this conflict. Specifically, the proposed ASQ method first dynamically adjusts quantization scaling factors through a trained module capable of accommodating different activations. Then, to address the rigid resolution issue inherent in Power of Two (POT) quantization, we propose an efficient non-uniform quantization scheme. We utilize the Power Of Square root of Two (POST) as the basis for exponential quantization, effectively handling the bell-shaped distribution of neural network weights across various bit-widths while maintaining computational efficiency through a Look-Up Table method (LUT). Extensive experimental results demonstrate that the proposed ASQ method is superior to the state-of-the-art QAT approaches. Notably, ASQ is even competitive compared to full precision baselines, with its 4-bit quantized ResNet34 model improving accuracy by 1.2% on ImageNet.
zh
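
以 √2 为底的指数量化(POST)的基本思路可示意如下(numpy;档位选取与裁剪策略为简化假设,实际实现会结合查找表与 QAT 中的伪量化算子):

```python
import numpy as np

def post_quantize(w, num_levels=8):
    """POST 非均匀量化示意:把权重量化到 ±(√2)^k 的格点上。

    相比 2 的幂(POT),√2 的幂在各档位之间分辨率更细,
    更贴合神经网络权重的钟形分布。
    """
    sign = np.sign(w)
    mag = np.abs(w) + 1e-12
    k = np.round(np.log2(mag) * 2)           # log_{√2}(x) = 2 * log2(x)
    k_max = k.max()
    k = np.clip(k, k_max - num_levels + 1, k_max)  # 限定档位数量
    return sign * np.power(np.sqrt(2.0), k)

# 用法示意
w = np.random.randn(1000) * 0.1
wq = post_quantize(w)
```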

[CV-52] Group Downsampling with Equivariant Anti-aliasing

【速读】:该论文旨在研究均匀下采样层在群等变架构(如G-CNNs)中的泛化能力,特别是针对一般有限群信号(特征图)的抗混叠下采样方法。论文的关键在于:(a) 针对给定的有限群和下采样率,提出了一种选择合适子群的算法;(b) 在给定群及其子群的情况下,探讨带限性的概念,并提出如何实现抗混叠处理。该方法基于经典采样理论扩展了传统下采样的概念,在信号为循环群(即周期性)时,恢复了理想低通滤波器后接子采样操作的标准下采样过程。实验表明,将所提出的下采样操作集成到G-等变网络中,可以提高图像分类任务的精度,更好地保持等变性并减小模型规模。

链接: https://arxiv.org/abs/2504.17258
作者: Md Ashiqur Rahman,Raymond A. Yeh
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Group Theory (math.GR)
备注:

点击查看摘要

Abstract:Downsampling layers are crucial building blocks in CNN architectures, which help to increase the receptive field for learning high-level features and reduce the amount of memory/computation in the model. In this work, we study the generalization of the uniform downsampling layer for group equivariant architectures, e.g., G-CNNs. That is, we aim to downsample signals (feature maps) on general finite groups with anti-aliasing. This involves the following: (a) Given a finite group and a downsampling rate, we present an algorithm to form a suitable choice of subgroup. (b) Given a group and a subgroup, we study the notion of bandlimited-ness and propose how to perform anti-aliasing. Notably, our method generalizes the notion of downsampling based on classical sampling theory. When the signal is on a cyclic group, i.e., periodic, our method recovers the standard downsampling of an ideal low-pass filter followed by a subsampling operation. Finally, we conducted experiments on image classification tasks demonstrating that the proposed downsampling operation improves accuracy, better preserves equivariance, and reduces model size when incorporated into G-equivariant networks.
zh
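
摘要指出:当信号定义在循环群(即周期信号)上时,该方法退化为“理想低通滤波 + 子采样”的经典抗混叠下采样。这一特例可用 FFT 直接验证(numpy 示意,并非论文在一般有限群上的实现):

```python
import numpy as np

def cyclic_antialias_downsample(x, rate):
    """循环群(周期信号)上的抗混叠下采样:理想低通滤波后再子采样。"""
    n = len(x)
    assert n % rate == 0, "要求长度能被下采样率整除"
    m = n // rate                      # 下采样后的长度
    xf = np.fft.fft(x)
    cutoff = m // 2                    # 输出信号的 Nyquist 频率
    lp = np.zeros(n, dtype=complex)
    lp[:cutoff] = xf[:cutoff]          # 保留低频:k = 0 .. cutoff-1
    if cutoff > 1:
        lp[-(cutoff - 1):] = xf[-(cutoff - 1):]  # 对称的负频率部分
    x_lp = np.fft.ifft(lp).real        # 理想低通滤波后的信号
    return x_lp[::rate]                # 子采样

# 频率 3 低于输出 Nyquist(8),被保留;频率 13 本会混叠,被滤除
t = np.arange(32)
x = np.sin(2 * np.pi * 3 * t / 32) + np.sin(2 * np.pi * 13 * t / 32)
y = cyclic_antialias_downsample(x, rate=2)
```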

[CV-53] DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks

【速读】:该论文试图解决如何利用预训练的扩散模型(Pretrained Diffusion Models)执行判别任务的问题。具体而言,研究通过“反转”一个预训练的布局到图像扩散模型(Layout-to-Image Diffusion Model),将预训练冻结的生成式扩散模型的判别能力从分类任务扩展到更复杂的对象检测任务。解决方案的关键在于提出了一种基于梯度的离散优化方法以替代繁重的预测枚举过程,并设计了一种先验分布模型以更准确地应用贝叶斯规则。实验结果表明,该方法在COCO数据集上的性能与基本的对象检测基线相当,同时能够显著加速基于扩散模型的分类方法而不牺牲准确性。

链接: https://arxiv.org/abs/2504.17253
作者: Yinqi Li,Hong Chang,Ruibing Hou,Shiguang Shan,Xilin Chen
机构: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所智能信息处理重点实验室), Beijing, 100190, China; University of Chinese Academy of Sciences (中国科学院大学), Beijing 100049, China
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by IEEE Transactions on Multimedia

点击查看摘要

Abstract:Diffusion models have shown remarkable progress in various generative tasks such as image and video generation. This paper studies the problem of leveraging pretrained diffusion models for performing discriminative tasks. Specifically, we extend the discriminative capability of pretrained frozen generative diffusion models from the classification task to the more complex object detection task, by “inverting” a pretrained layout-to-image diffusion model. To this end, a gradient-based discrete optimization approach for replacing the heavy prediction enumeration process, and a prior distribution model for making more accurate use of the Bayes’ rule, are proposed respectively. Empirical results show that this method is on par with basic discriminative object detection baselines on the COCO dataset. In addition, our method can greatly speed up the previous diffusion-based method for classification without sacrificing accuracy. Code and models are available at this https URL.
zh

[CV-54] Scene Perceived Image Perceptual Score (SPIPS): combining global and local perception for image quality assessment

【速读】:该论文旨在解决现代图像质量评估(IQA)方法在应对深度神经网络(DNN)驱动的图像生成、增强和修复技术时面临的挑战,即传统IQA技术主要依赖于空间特征(如信噪比、局部结构失真和纹理不一致),难以准确反映人类视觉感知,并且在处理由DNN生成或优化的图像时表现不佳。论文的关键解决方案在于提出了一种结合深度学习与人类感知的新颖IQA方法:通过将深度特征解耦为高层语义信息和低层知觉细节,并分别处理这两条流,再与传统的IQA指标相结合,构建了一个更全面的评估框架。这种混合设计能够同时评价全局上下文和图像细节,更好地模拟人类视觉过程。最终,利用多层感知机(MLP)将整合后的特征映射为简洁的质量评分,从而实现与人类感知判断更高的一致性。

链接: https://arxiv.org/abs/2504.17234
作者: Zhiqiang Lao,Heather Yu
机构: Futurewei Technologies Inc (未来wei技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of artificial intelligence and widespread use of smartphones have resulted in an exponential growth of image data, both real (camera-captured) and virtual (AI-generated). This surge underscores the critical need for robust image quality assessment (IQA) methods that accurately reflect human visual perception. Traditional IQA techniques primarily rely on spatial features - such as signal-to-noise ratio, local structural distortions, and texture inconsistencies - to identify artifacts. While effective for unprocessed or conventionally altered images, these methods fall short in the context of modern image post-processing powered by deep neural networks (DNNs). The rise of DNN-based models for image generation, enhancement, and restoration has significantly improved visual quality, yet made accurate assessment increasingly complex. To address this, we propose a novel IQA approach that bridges the gap between deep learning methods and human perception. Our model disentangles deep features into high-level semantic information and low-level perceptual details, treating each stream separately. These features are then combined with conventional IQA metrics to provide a more comprehensive evaluation framework. This hybrid design enables the model to assess both global context and intricate image details, better reflecting the human visual process, which first interprets overall structure before attending to fine-grained elements. The final stage employs a multilayer perceptron (MLP) to map the integrated features into a concise quality score. Experimental results demonstrate that our method achieves improved consistency with human perceptual judgments compared to existing IQA models.
zh

[CV-55] Range Image-Based Implicit Neural Compression for LiDAR Point Clouds

【速读】:本文旨在解决轻量级格式(如2D范围图像,Range Images, RIs)表示的3D LiDAR点云高效压缩的问题。传统图像压缩技术虽可应用于RIs,但受限于其浮点像素值精度及与自然图像不同的像素分布特性,实际性能有限。为克服这些限制,论文提出了一种基于隐式神经表示(Implicit Neural Representation, INR)的新型RI压缩方法。该方法的关键在于将RIs分解为深度图和掩膜图,并分别通过基于块的INR架构和基于像素的INR架构进行压缩,同时结合模型剪枝和量化技术以进一步提升效率。实验结果表明,该方法在低比特率下实现了优于现有图像、点云、RI及INR基方法的3D重建与检测质量,同时保持较低的解码延迟。

链接: https://arxiv.org/abs/2504.17229
作者: Akihiro Kuwabara,Sorachi Kato,Takuya Fujihashi,Toshiaki Koike-Akino,Takashi Watanabe
机构: Graduate School of Information Science and Technology, Osaka University (大阪大学信息科学与技术研究生院); Mitsubishi Electric Research Laboratories (MERL) (三菱电机研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a novel scheme to efficiently compress Light Detection and Ranging (LiDAR) point clouds, enabling high-precision 3D scene archives, and such archives pave the way for a detailed understanding of the corresponding 3D scenes. We focus on 2D range images (RIs) as a lightweight format for representing 3D LiDAR observations. Although conventional image compression techniques can be adapted to improve compression efficiency for RIs, their practical performance is expected to be limited due to differences in bit precision and the distinct pixel value distribution characteristics between natural images and RIs. We propose a novel implicit neural representation (INR)-based RI compression method that effectively handles floating-point valued pixels. The proposed method divides RIs into depth and mask images and compresses them using patch-wise and pixel-wise INR architectures with model pruning and quantization, respectively. Experiments on the KITTI dataset show that the proposed method outperforms existing image, point cloud, RI, and INR-based compression methods in terms of 3D reconstruction and detection quality at low bitrates and decoding latency.
zh
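
将 LiDAR 点云投影为 2D 范围图像(RI)是该方法的输入表示,常见的球面投影做法可示意如下(numpy;垂直视场角等参数为假设值,并非论文的具体配置):

```python
import numpy as np

def points_to_range_image(points, h=64, w=1024, fov_up=3.0, fov_down=-25.0):
    """将 (N, 3) 点云按球面投影生成 (h, w) 的浮点范围图像。"""
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up - fov_down
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1) + 1e-9
    yaw = np.arctan2(y, x)                         # 水平角
    pitch = np.arcsin(z / r)                       # 垂直角
    u = ((0.5 * (yaw / np.pi + 1.0)) * w).astype(int) % w
    v = ((1.0 - (pitch - fov_down) / fov) * h).astype(int)
    v = np.clip(v, 0, h - 1)
    ri = np.zeros((h, w), dtype=np.float32)        # 浮点深度值像素
    ri[v, u] = r                                   # 同一像素取后写入的点,实际可取最近点
    return ri

# 用法示意
pts = np.random.randn(10000, 3) * 10
ri = points_to_range_image(pts)
```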

[CV-56] Visual and textual prompts for enhancing emotion recognition in video

【速读】:该论文致力于解决视频情感识别领域中视觉大语言模型(Vision Large Language Models, VLLMs)因空间和上下文意识不足导致的应用局限性问题。传统方法侧重于孤立的面部特征,而忽视了身体语言、环境上下文及社会互动等关键非言语线索,从而在真实场景中表现出较低的鲁棒性。为弥补这一差距,论文提出了一种名为Set-of-Vision-Text Prompting (SoVTP) 的新框架,其关键是通过整合空间标注(如边界框、面部标志)、生理信号(面部动作单元)以及上下文线索(身体姿势、场景动态、他人情绪)到统一的提示策略中,以增强零样本情感识别能力。这种方法不仅保留了场景的整体信息,还实现了对面部肌肉运动和人际动态的细粒度分析。实验结果表明,SoVTP显著优于现有视觉提示方法,验证了其在提升VLLMs视频情感识别能力方面的有效性。

链接: https://arxiv.org/abs/2504.17224
作者: Zhifeng Wang,Qixuan Zhang,Peter Zhang,Wenjia Niu,Kaihao Zhang,Ramesh Sankaranarayana,Sabrina Caldwell,Tom Gedeon
机构: School of Computing, Australian National University (ANU) Canberra, ACT 2601, Australia (澳大利亚国立大学计算学院); Quriosity Pty Ltd (Quriosity Pty Ltd); Human-Centric Advancements Chair in AI, Curtin University and Australian National University (Curtin University 和 澳大利亚国立大学人本智能卓越研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 10 figures

点击查看摘要

Abstract:Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions, leading to reduced robustness in real-world scenarios. To address this gap, we propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations (e.g., bounding boxes, facial landmarks), physiological signals (facial action units), and contextual cues (body posture, scene dynamics, others’ emotions) into a unified prompting strategy. SoVTP preserves holistic scene information while enabling fine-grained analysis of facial muscle movements and interpersonal dynamics. Extensive experiments show that SoVTP achieves substantial improvements over existing visual prompting methods, demonstrating its effectiveness in enhancing VLLMs’ video emotion recognition capabilities.
zh

[CV-57] Towards Generalizable Deepfake Detection with Spatial-Frequency Collaborative Learning and Hierarchical Cross-Modal Fusion

【速读】:该论文旨在解决深度伪造(Deepfake)检测领域中,因深度生成模型快速演进导致的传统检测方法在应对未见过的伪造内容时性能显著下降的问题。现有方法主要依赖于空间域分析,而频域操作局限于特征级增强,未能充分挖掘频域固有的伪造痕迹及其与空间域的交互关系。为此,论文提出了一种新颖的通用深度伪造检测框架,其关键是通过多尺度空间-频域分析来实现更全面的伪造检测能力。该框架包含三个关键组件:(1) 结合分块离散余弦变换与多尺度级联卷积的局部频谱特征提取管道,用于捕捉细微的频谱伪造线索;(2) 利用尺度不变差积累机制的全局频谱特征提取管道,以识别整体伪造分布模式;(3) 多阶段跨模态融合机制,结合浅层注意力增强与深层动态调制,以建模空间-频域交互关系。广泛的基准测试表明,该方法在准确性和泛化能力方面均超越了当前最先进的深度伪造检测方法。

链接: https://arxiv.org/abs/2504.17223
作者: Mengyu Qiao,Runze Tian,Yang Wang
机构: North China University of Technology (华北理工大学); Ultramain Systems, Inc. (Ultramain 系统公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid evolution of deep generative models poses a critical challenge to deepfake detection, as detectors trained on forgery-specific artifacts often suffer significant performance degradation when encountering unseen forgeries. While existing methods predominantly rely on spatial domain analysis, frequency domain operations are primarily limited to feature-level augmentation, leaving frequency-native artifacts and spatial-frequency interactions insufficiently exploited. To address this limitation, we propose a novel detection framework that integrates multi-scale spatial-frequency analysis for universal deepfake detection. Our framework comprises three key components: (1) a local spectral feature extraction pipeline that combines block-wise discrete cosine transform with cascaded multi-scale convolutions to capture subtle spectral artifacts; (2) a global spectral feature extraction pipeline utilizing scale-invariant differential accumulation to identify holistic forgery distribution patterns; and (3) a multi-stage cross-modal fusion mechanism that incorporates shallow-layer attention enhancement and deep-layer dynamic modulation to model spatial-frequency interactions. Extensive evaluations on widely adopted benchmarks demonstrate that our method outperforms state-of-the-art deepfake detection methods in both accuracy and generalizability.
zh
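
文中局部频谱特征提取的第一步是分块离散余弦变换,其基本操作可用 scipy 示意如下(8×8 分块为常见选择,属假设;论文在此之后还级联了多尺度卷积):

```python
import numpy as np
from scipy.fft import dctn

def blockwise_dct(gray, block=8):
    """逐 block x block 块做二维 DCT,得到局部频谱表示。"""
    h, w = gray.shape
    h, w = h - h % block, w - w % block           # 裁剪到块大小的整数倍
    gray = gray[:h, :w].astype(np.float32)
    out = np.zeros_like(gray)
    for i in range(0, h, block):
        for j in range(0, w, block):
            out[i:i + block, j:j + block] = dctn(
                gray[i:i + block, j:j + block], norm='ortho')
    return out  # 后续可接卷积网络进一步编码频谱伪造线索

# 用法示意
img = np.random.rand(256, 256)
spec = blockwise_dct(img)
```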

[CV-58] MCAF: Efficient Agent-based Video Understanding Framework through Multimodal Coarse-to-Fine Attention Focusing

【速读】:该论文旨在解决视频理解,尤其是长视频理解中的高挑战性问题。与文本或图像信息相比,视频通常包含冗余且更丰富的信息,这要求大模型在全局层面进行策略性的注意力分配以实现准确理解。为应对这一挑战,论文提出了一种名为MCAF(基于代理、无需训练的多模态粗到细注意力聚焦框架)。其关键创新在于能够感知并优先处理与理解任务高度相关的视频片段。具体而言,MCAF首先通过多模态信息分层关注高度相关的帧,增强获取上下文信息与查询之间的相关性;其次,采用扩张时间扩展机制减少从集中帧中提取信息时遗漏关键细节的风险。此外,框架还结合了利用模型响应置信度作为反馈的自省机制。通过迭代应用这两种创新的聚焦策略,MCAF能够自适应调整注意力,捕捉与查询高度关联的上下文信息,从而提升响应准确性,并在多个数据集上表现出色。

链接: https://arxiv.org/abs/2504.17213
作者: Shiwen Cao,Zhaoxing Zhang,Junming Jiao,Juyi Qiao,Guowen Song,Rong Shen
机构: Li Auto Inc. (理想汽车)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Even in the era of rapid advances in large models, video understanding, particularly long videos, remains highly challenging. Compared with textual or image-based information, videos commonly contain more information with redundancy, requiring large models to strategically allocate attention at a global level for accurate comprehension. To address this, we propose MCAF, an agent-based, training-free framework that performs video understanding through Multimodal Coarse-to-fine Attention Focusing. The key innovation lies in its ability to sense and prioritize segments of the video that are highly relevant to the understanding task. First, MCAF hierarchically concentrates on highly relevant frames through multimodal information, enhancing the correlation between the acquired contextual information and the query. Second, it employs a dilated temporal expansion mechanism to mitigate the risk of missing crucial details when extracting information from these concentrated frames. In addition, our framework incorporates a self-reflection mechanism utilizing the confidence level of the model’s responses as feedback. By iteratively applying these two creative focusing strategies, it adaptively adjusts attention to capture highly query-connected context and thus improves response accuracy. MCAF outperforms comparable state-of-the-art methods on average. On the EgoSchema dataset, it achieves a remarkable 5% performance gain over the leading approach. Meanwhile, on Next-QA and IntentQA datasets, it outperforms the current state-of-the-art standard by 0.2% and 0.3% respectively. On the Video-MME dataset, which features videos averaging nearly an hour in length, MCAF also outperforms other agent-based methods.
zh

[CV-59] Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在视角感知推理(perspective-aware reasoning)方面的能力不足问题。尽管现有VLMs在空间推理方面取得了进展,但研究表明它们显著缺乏从不同视角理解环境的能力,并倾向于以自我中心(egocentric)方式进行解读。为弥合这一差距,论文提出了一种名为抽象视角变化(Abstract Perspective Change, APC)的框架,其关键是利用视觉基础模型(如目标检测、分割和方向估计)构建场景抽象表示,并支持视角转换。通过在合成和真实图像基准上的实验,该框架在视角感知推理任务中表现出显著改进,优于微调的空间推理模型及基于新视图合成的方法。

链接: https://arxiv.org/abs/2504.17207
作者: Phillip Y. Lee,Jihyeon Je,Chanho Park,Mikaela Angelina Uy,Leonidas Guibas,Minhyuk Sung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, compared with various VLMs, demonstrate significant improvements in perspective-aware reasoning with our framework, further outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
zh

[CV-60] We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback

【速读】:该论文旨在解决现有文本到视频(Text-to-Video, T2V)生成模型在处理长且复杂的提示(包含多个对象或顺序事件)时,难以生成语义一致性和时间连贯性视频的问题。此外,由于训练或微调这些模型所需的高计算成本,直接优化方法在实际应用中不可行。为克服这些限制,论文提出了一种名为(\projectname)的新颖零样本视频精炼管道,其关键在于利用神经符号反馈(neuro-symbolic feedback)来自动增强视频生成效果。具体而言,该方法通过分析视频的正式表示形式提取神经符号反馈,识别出语义不一致的事件、对象及其对应的帧,并以此指导对原始视频的针对性编辑,从而显著提升视频与提示之间的时序和逻辑一致性,实验表明其性能提升了近40%。

链接: https://arxiv.org/abs/2504.17180
作者: Minkyu Choi,S P Sharan,Harsh Goel,Sahil Shah,Sandeep Chinchali
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current text-to-video (T2V) generation models are increasingly popular due to their ability to produce coherent videos from textual prompts. However, these models often struggle to generate semantically and temporally consistent videos when dealing with longer, more complex prompts involving multiple objects or sequential events. Additionally, the high computational costs associated with training or fine-tuning make direct improvements impractical. To overcome these limitations, we introduce (\projectname), a novel zero-training video refinement pipeline that leverages neuro-symbolic feedback to automatically enhance video generation, achieving superior alignment with the prompts. Our approach first derives the neuro-symbolic feedback by analyzing a formal video representation and pinpoints semantically inconsistent events, objects, and their corresponding frames. This feedback then guides targeted edits to the original video. Extensive empirical evaluations on both open-source and proprietary T2V models demonstrate that (\projectname) significantly enhances temporal and logical alignment across diverse prompts by almost 40% .
zh

[CV-61] AUTHENTICATION: Identifying Rare Failure Modes in Autonomous Vehicle Perception Systems using Adversarially Guided Diffusion Models

【速读】:该论文旨在解决自动驾驶车辆(Autonomous Vehicles, AVs)在处理罕见失败模式(Rare Failure Modes, RFMs)时面临的“长尾挑战”(long-tail challenge),即现有模型因缺乏对少见场景的数据覆盖而难以检测和应对这些罕见情况。论文的关键创新在于结合先进的生成式 AI(Generative AI)与可解释 AI(Explainable AI)技术,通过创建多样化且针对性强的合成环境来揭示和模拟罕见失败模式,从而增强自动驾驶系统的鲁棒性和可靠性。具体而言,论文首先提取感兴趣对象(如汽车)的分割掩码并反转生成环境掩码,再结合精心设计的文本提示输入到定制的扩散模型中,利用对抗性噪声优化引导的稳定扩散补丁模型生成包含复杂场景的图像,以测试和暴露现有 AI 系统的漏洞,并最终生成自然语言描述,为开发者和政策制定者提供改进建议。

链接: https://arxiv.org/abs/2504.17179
作者: Mohammad Zarei,Melanie A Jutras,Eliana Evans,Mike Tan,Omid Aaramoon
机构: MITRE CORP. (MITRE公司); Cranium; Booz Allen Hamilton (博斯·艾伦·汉密尔顿公司)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 8 pages, 10 figures. Accepted to IEEE Conference on Artificial Intelligence (CAI), 2025

点击查看摘要

Abstract:Autonomous Vehicles (AVs) rely on artificial intelligence (AI) to accurately detect objects and interpret their surroundings. However, even when trained using millions of miles of real-world data, AVs are often unable to detect rare failure modes (RFMs). The problem of RFMs is commonly referred to as the “long-tail challenge”, due to the distribution of data including many instances that are very rarely seen. In this paper, we present a novel approach that utilizes advanced generative and explainable AI techniques to aid in understanding RFMs. Our methods can be used to enhance the robustness and reliability of AVs when combined with both downstream model training and testing. We extract segmentation masks for objects of interest (e.g., cars) and invert them to create environmental masks. These masks, combined with carefully crafted text prompts, are fed into a custom diffusion model. We leverage the Stable Diffusion inpainting model guided by adversarial noise optimization to generate images containing diverse environments designed to evade object detection models and expose vulnerabilities in AI systems. Finally, we produce natural language descriptions of the generated RFMs that can guide developers and policymakers to improve the safety and reliability of AV systems.
zh

[CV-62] A Genealogy of Multi-Sensor Foundation Models in Remote Sensing

【速读】:该论文试图解决在遥感领域中基于基础模型(foundation models)的表示学习所面临的挑战,包括现有方法的优势与局限性,以及如何进一步优化遥感特定的基础模型。论文的关键在于探讨不同方法的特性及其在计算机视觉领域的根源,以评估潜在的优点与风险,并提出未来发展方向。此外,论文关注如何提升学到的表示质量同时减少对大量计算资源的依赖,强调多传感器数据的应用,并探索如何更有效地利用海量未标注、季节性和多源遥感数据。解决方案的关键在于综合分析现有方法的优劣,结合多模态学习策略,推动遥感领域基础模型的发展。

链接: https://arxiv.org/abs/2504.17177
作者: Kevin Lane,Morteza Karimzadeh
机构: University of Colorado, Boulder (科罗拉多大学博尔德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, submitted to ACM SigSpatial, currently under peer review

点击查看摘要

Abstract:Foundation models have garnered increasing attention for representation learning in remote sensing, primarily adopting approaches that have demonstrated success in computer vision with minimal domain-specific modification. However, the development and application of foundation models in this field are still burgeoning, as there are a variety of competing approaches that each come with significant benefits and drawbacks. This paper examines these approaches along with their roots in the computer vision field in order to characterize potential advantages and pitfalls while outlining future directions to further improve remote sensing-specific foundation models. We discuss the quality of the learned representations and methods to alleviate the need for massive compute resources. We place emphasis on the multi-sensor aspect of Earth observations, and the extent to which existing approaches leverage multiple sensors in training foundation models in relation to multi-modal foundation models. Finally, we identify opportunities for further harnessing the vast amounts of unlabeled, seasonal, and multi-sensor remote sensing observations.
zh

[CV-63] PhysioSync: Temporal and Cross-Modal Contrastive Learning Inspired by Physiological Synchronization for EEG-Based Emotion Recognition

【速读】:该论文旨在解决基于脑电图(EEG)信号的情绪识别面临的挑战,包括信号噪声、个体间差异以及与行为线索相比的优势不足等问题。此外,尽管多模态方法引入了如皮肤电反应(GSR)等周边生理信号(Peripheral Physiological Signals, PPS)来补充EEG,但这些方法往往忽略了模态间的动态同步性和一致性语义。同时,PPS中情绪波动在不同时间分辨率下的时序特性尚未得到充分探索。

为了解决上述问题,论文提出了一种名为PhysioSync的新预训练框架,其关键在于结合时间对比学习和跨模态对比学习。具体而言,PhysioSync通过跨模态一致性对齐(Cross-Modal Consistency Alignment, CM-CA)建模EEG与互补PPS之间的动态关系,以实现模态间的情绪相关同步化;同时引入长短期时序对比学习(Long- and Short-Term Temporal Contrastive Learning, LS-TCL),捕捉模态内不同时间分辨率下的情绪同步性。预训练后,通过层次化特征融合与微调,进一步提升情绪识别性能。实验结果表明,PhysioSync在单模态和跨模态条件下均表现出色,尤其适用于以EEG为中心的情绪识别任务。

链接: https://arxiv.org/abs/2504.17163
作者: Kai Cui,Jia Li,Yu Liu,Xuesong Zhang,Zhenzhen Hu,Meng Wang
机构: School of Instrument Science and Opto-electronics Engineering, Hefei University of Technology (合肥工业大学仪器科学与光电工程学院), China; School of Computer Science and Information Engineering, Hefei University of Technology (合肥工业大学计算机科学与信息工程学院), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The source code will be publicly available at this https URL

点击查看摘要

Abstract:Electroencephalography (EEG) signals provide a promising and involuntary reflection of brain activity related to emotional states, offering significant advantages over behavioral cues like facial expressions. However, EEG signals are often noisy, affected by artifacts, and vary across individuals, complicating emotion recognition. While multimodal approaches have used Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often overlook the dynamic synchronization and consistent semantics between the modalities. Additionally, the temporal dynamics of emotional fluctuations across different time resolutions in PPS remain underexplored. To address these challenges, we propose PhysioSync, a novel pre-training framework leveraging temporal and cross-modal contrastive learning, inspired by physiological synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and complementary PPS, enabling emotion-related synchronizations across modalities. Besides, it introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions within modalities. After pre-training, cross-resolution and cross-modal features are hierarchically fused and fine-tuned to enhance emotion recognition. Experiments on DEAP and DREAMER datasets demonstrate PhysioSync’s advanced performance under uni-modal and cross-modal conditions, highlighting its effectiveness for EEG-centered emotion recognition.
zh
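CM-CA 的具体实现细节摘要中未给出;下面用一个对称 InfoNCE 损失示意“把同一试次的 EEG 与 PPS 表征拉近、不同试次推远”这类跨模态对齐的常见做法,仅为假设性草图,并非论文原始损失。

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_eeg: torch.Tensor, z_pps: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """对称 InfoNCE: z_eeg, z_pps 形状均为 (batch, dim), 第 i 行来自同一试次。"""
    z_eeg = F.normalize(z_eeg, dim=-1)
    z_pps = F.normalize(z_pps, dim=-1)
    logits = z_eeg @ z_pps.t() / temperature       # (B, B) 跨模态相似度矩阵
    targets = torch.arange(z_eeg.size(0), device=z_eeg.device)
    # EEG->PPS 与 PPS->EEG 两个方向取平均
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# 用法示例(随机特征仅作演示)
loss = cross_modal_infonce(torch.randn(32, 128), torch.randn(32, 128))
```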

[CV-64] A Comprehensive Review on RNA Subcellular Localization Prediction

【速读】:该论文旨在解决通过传统实验方法(如原位杂交)确定RNA亚细胞定位耗时、资源密集且成本高的问题。论文的关键解决方案是利用基于人工智能(AI)和机器学习(ML)的计算方法,这些方法能够实现RNA亚细胞定位的大规模预测。具体而言,论文综述了针对不同类型RNA(包括长链非编码RNA、信使RNA等)的最新AI技术进展,涵盖了基于序列、基于图像以及混合型方法,并强调了这些方法在加速RNA研究、揭示分子机制及指导疾病治疗中的潜力。同时,论文也讨论了当前AI/ML方法面临的挑战,如数据稀缺性和缺乏标准化基准,并提出了可能的改进方向。

链接: https://arxiv.org/abs/2504.17162
作者: Cece Zhang,Xuehuan Zhu,Nick Peterson,Jieqiong Wang,Shibiao Wan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Genomics (q-bio.GN); Subcellular Processes (q-bio.SC)
备注:

点击查看摘要

Abstract:The subcellular localization of RNAs, including long non-coding RNAs (lncRNAs), messenger RNAs (mRNAs), microRNAs (miRNAs) and other smaller RNAs, plays a critical role in determining their biological functions. For instance, lncRNAs are predominantly associated with chromatin and act as regulators of gene transcription and chromatin structure, while mRNAs are distributed across the nucleus and cytoplasm, facilitating the transport of genetic information for protein synthesis. Understanding RNA localization sheds light on processes like gene expression regulation with spatial and temporal precision. However, traditional wet lab methods for determining RNA localization, such as in situ hybridization, are often time-consuming, resource-demanding, and costly. To overcome these challenges, computational methods leveraging artificial intelligence (AI) and machine learning (ML) have emerged as powerful alternatives, enabling large-scale prediction of RNA subcellular localization. This paper provides a comprehensive review of the latest advancements in AI-based approaches for RNA subcellular localization prediction, covering various RNA types and focusing on sequence-based, image-based, and hybrid methodologies that combine both data types. We highlight the potential of these methods to accelerate RNA research, uncover molecular pathways, and guide targeted disease treatments. Furthermore, we critically discuss the challenges in AI/ML approaches for RNA subcellular localization, such as data scarcity and lack of benchmarks, and opportunities to address them. This review aims to serve as a valuable resource for researchers seeking to develop innovative solutions in the field of RNA subcellular localization and beyond.
zh

[CV-65] OUI Need to Talk About Weight Decay: A New Perspective on Overfitting Detection

【速读】:该论文试图解决深度神经网络(Deep Neural Networks, DNNs)训练过程中超参数调节困难以及过拟合与欠拟合难以实时监控的问题。论文的关键解决方案是提出了一种名为过拟合-欠拟合指标(Overfitting-Underfitting Indicator, OUI)的新工具,它能够通过指示模型在训练过程中的状态(过拟合或欠拟合),无需依赖验证数据即可有效指导权值衰减(Weight Decay, WD)等正则化超参数的选择。实验表明,保持OUI在特定区间内与模型泛化能力及验证分数的提升高度相关,并且其收敛速度显著快于传统的损失或准确率指标,从而实现早期精确的WD调优,确保模型在训练初期即能找到最优的超参数配置以最大化验证集表现。

链接: https://arxiv.org/abs/2504.17160
作者: Alberto Fernández-Hernández,Jose I. Mestre,Manuel F. Dolz,Jose Duato,Enrique S. Quintana-Ortí
机构: Universitat Politècnica de València (瓦伦西亚理工大学); Universitat Jaume I (哈乌米一世大学); Qsimov Quantum Computing S.L. (Qsimov量子计算有限公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:We introduce the Overfitting-Underfitting Indicator (OUI), a novel tool for monitoring the training dynamics of Deep Neural Networks (DNNs) and identifying optimal regularization hyperparameters. Specifically, we validate that OUI can effectively guide the selection of the Weight Decay (WD) hyperparameter by indicating whether a model is overfitting or underfitting during training without requiring validation data. Through experiments on DenseNet-BC-100 with CIFAR-100, EfficientNet-B0 with TinyImageNet and ResNet-34 with ImageNet-1K, we show that maintaining OUI within a prescribed interval correlates strongly with improved generalization and validation scores. Notably, OUI converges significantly faster than traditional metrics such as loss or accuracy, enabling practitioners to identify optimal WD (hyperparameter) values within the early stages of training. By leveraging OUI as a reliable indicator, we can determine early in training whether the chosen WD value leads the model to underfit the training data, overfit, or strike a well-balanced trade-off that maximizes validation scores. This enables more precise WD tuning for optimal performance on the tested datasets and DNNs. All code for reproducing these experiments is available at this https URL.
zh

[CV-66] Latent Video Dataset Distillation

【速读】:该论文旨在解决视频数据集蒸馏中过度关注像素空间压缩而忽视潜在空间(latent space)技术的问题。现有方法在处理冗余丰富的视频数据集时,未能充分利用现代文本到图像或文本到视频模型广泛采用的潜在表征优势。论文的关键创新在于提出了一种新颖的基于潜在空间的视频数据集蒸馏方法,利用最先进的变分编码器(Variational Encoder)实现高效蒸馏,并结合一种考虑多样性的数据选择策略以确保所选样本的代表性和多样性。此外,还引入了一种无需额外训练的简单方法进一步压缩蒸馏后的潜在数据集。通过这些技术的综合应用,该研究实现了数据集蒸馏性能的新突破,在多个数据集上的表现均优于先前方法,例如在HMDB51数据集上提升了2.6%,MiniUCF数据集上提升了7.8%。

链接: https://arxiv.org/abs/2504.17132
作者: Ning Li,Antai Andy Liu,Jingran Zhang,Justin Cui
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Dataset distillation has demonstrated remarkable effectiveness in high-compression scenarios for image datasets. While video datasets inherently contain greater redundancy, existing video dataset distillation methods primarily focus on compression in the pixel space, overlooking advances in the latent space that have been widely adopted in modern text-to-image and text-to-video models. In this work, we bridge this gap by introducing a novel video dataset distillation approach that operates in the latent space using a state-of-the-art variational encoder. Furthermore, we employ a diversity-aware data selection strategy to select both representative and diverse samples. Additionally, we introduce a simple, training-free method to further compress the distilled latent dataset. By combining these techniques, our approach achieves a new state-of-the-art performance in dataset distillation, outperforming prior methods on all datasets, e.g. on HMDB51 IPC 1, we achieve a 2.6% performance increase; on MiniUCF IPC 5, we achieve a 7.8% performance increase.
zh
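摘要未披露“考虑多样性的数据选择策略”的具体形式;作为一种常见替代,下面以 k-center 贪心算法在潜在空间中挑选既具代表性又彼此多样的样本,属假设性示意,并非论文原始策略。

```python
import numpy as np

def k_center_greedy(latents: np.ndarray, k: int) -> list[int]:
    """在潜在向量集合中贪心选取 k 个彼此距离尽量远的样本索引。"""
    n = latents.shape[0]
    selected = [int(np.random.randint(n))]          # 随机选第一个中心
    dist = np.linalg.norm(latents - latents[selected[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))                  # 取距已选集合最远的点
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(latents - latents[idx], axis=1))
    return selected

# 假设每段视频已由变分编码器压缩为一个潜在向量
video_latents = np.random.randn(1000, 256)
coreset = k_center_greedy(video_latents, k=50)
```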

[CV-67] Transferring Spatial Filters via Tangent Space Alignment in Motor Imagery BCIs

【速读】:该论文试图解决运动想象脑机接口(Motor Imagery BCI)中受试者间差异导致的适应性问题。解决方案的关键在于通过在黎曼流形上对协方差矩阵进行对齐操作,随后基于新的共空间模式(Common Spatial Patterns, CSP)计算空间滤波器,从而提升跨受试者的迁移性能。此外,论文探索了多种多受试者信息整合方法,并在三个数据集上的实验表明,当训练数据受限时,该方法较标准CSP有更显著的性能提升。

链接: https://arxiv.org/abs/2504.17111
作者: Tekin Gunasar,Virginia de Sa
机构: UC San Diego (加州大学圣地亚哥分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:We propose a method to improve subject transfer in motor imagery BCIs by aligning covariance matrices on a Riemannian manifold, followed by computing a new common spatial patterns (CSP) based spatial filter. We explore various ways to integrate information from multiple subjects and show improved performance compared to standard CSP. Across three datasets, our method shows marginal improvements over standard CSP; however, when training data are limited, the improvements become more significant.
zh
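下面给出“协方差对齐 + CSP”核心计算的一个简化 numpy/scipy 草图:先用均值协方差的 -1/2 次幂做白化对齐(此处以欧氏均值近似黎曼均值以简化),再通过广义特征分解求 CSP 空间滤波器;均为示意性假设,非论文原始实现。

```python
import numpy as np
from scipy.linalg import eigh, fractional_matrix_power

def align_covariances(covs: np.ndarray) -> np.ndarray:
    """covs: (n_trials, C, C)。用均值协方差的 -1/2 次幂对各试次做白化对齐。"""
    mean_cov = covs.mean(axis=0)                     # 简化: 欧氏均值近似黎曼均值
    w = fractional_matrix_power(mean_cov, -0.5)
    return np.einsum('ij,njk,kl->nil', w, covs, w)

def csp_filters(covs_a: np.ndarray, covs_b: np.ndarray, n_filters: int = 6):
    """对两类已对齐协方差求 CSP: 解广义特征问题 C_a v = λ (C_a + C_b) v。"""
    ca, cb = covs_a.mean(axis=0), covs_b.mean(axis=0)
    eigvals, eigvecs = eigh(ca, ca + cb)
    order = np.argsort(eigvals)                      # 取特征值两端的向量
    picks = np.concatenate([order[:n_filters // 2], order[-n_filters // 2:]])
    return eigvecs[:, picks].T                       # (n_filters, C)

# 示例: 两类各 40 个试次, 22 个 EEG 通道(随机数据仅作演示)
rng = np.random.default_rng(0)
covs_a = np.stack([np.cov(rng.standard_normal((22, 250))) for _ in range(40)])
covs_b = np.stack([np.cov(rng.standard_normal((22, 250))) for _ in range(40)])
W = csp_filters(align_covariances(covs_a), align_covariances(covs_b))
```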

[CV-68] Scene-Aware Location Modeling for Data Augmentation in Automotive Object Detection

【速读】:该论文旨在解决现有生成式数据增强方法在场景布局(scene layout)建模方面的不足,即忽视了对象在场景中实际位置的合理性。传统方法要么直接复用原始帧布局,要么随机放置生成对象,导致增强效果受限或缺乏真实性。论文的关键创新在于引入了一种场景感知的概率位置模型(scene-aware probabilistic location model),用于预测生成对象在现有场景中的合理位置,并结合生成模型对这些位置进行inpainting操作。这种方法显著提升了生成式数据增强的效果,在两项汽车目标检测任务中实现了比现有最佳方法高出2.8倍的平均精度均值(mAP)提升,并在实例分割任务中也表现出显著改进。

链接: https://arxiv.org/abs/2504.17076
作者: Jens Petersen,Davide Abati,Amirhossein Habibian,Auke Wiggers
机构: Qualcomm Technologies, Inc. (高通技术公司); Qualcomm AI Research (高通人工智能研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generative image models are increasingly being used for training data augmentation in vision tasks. In the context of automotive object detection, methods usually focus on producing augmented frames that look as realistic as possible, for example by replacing real objects with generated ones. Others try to maximize the diversity of augmented frames, for example by pasting lots of generated objects onto existing backgrounds. Both perspectives pay little attention to the locations of objects in the scene. Frame layouts are either reused with little or no modification, or they are random and disregard realism entirely. In this work, we argue that optimal data augmentation should also include realistic augmentation of layouts. We introduce a scene-aware probabilistic location model that predicts where new objects can realistically be placed in an existing scene. By then inpainting objects in these locations with a generative model, we obtain much stronger augmentation performance than existing approaches. We set a new state of the art for generative data augmentation on two automotive object detection tasks, achieving up to 2.8\times higher gains than the best competing approach ( +1.4 vs. +0.5 mAP boost). We also demonstrate significant improvements for instance segmentation.
zh
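位置模型的网络结构摘要未披露;以下草图仅示意“从场景条件下的概率热图采样新对象位置,再交给生成模型 inpaint”这一流程,热图来源与后续 inpaint 接口均为假设。

```python
import numpy as np

def sample_location(heatmap: np.ndarray, temperature: float = 1.0):
    """heatmap: (H, W) 的非归一化得分图, 按 softmax 概率采样一个像素位置。"""
    logits = heatmap.astype(np.float64) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    flat_idx = np.random.choice(p.size, p=p.ravel())
    y, x = np.unravel_index(flat_idx, heatmap.shape)
    return int(y), int(x)

# 假设位置模型已对当前帧输出 (H, W) 热图(此处用随机值代替)
heatmap = np.random.rand(96, 160)
y, x = sample_location(heatmap)
# 随后以 (y, x) 为锚点构造目标框掩码, 调用生成模型 inpaint 出新对象(从略)
```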

[CV-69] Distilling semantically aware orders for autoregressive image generation

【速读】:该论文试图解决传统自回归图像生成模型中因固定扫描顺序(raster-scan order)导致的因果关系不匹配问题,即生成过程中未能充分利用图像内容的自然因果关系。例如,在生成夕阳场景时,传统方法可能先生成云朵再生成太阳,而忽略了云朵的颜色应由太阳决定的因果逻辑。解决方案的关键在于首先训练一个可以以任意顺序生成图像块的模型,从而同时推断每个图像块的内容及其生成顺序(位置),然后利用这些提取出的顺序微调模型,以生成更高质量的图像。这种方法在无需额外标注且训练成本相似的情况下,显著提升了生成图像的质量。

链接: https://arxiv.org/abs/2504.17069
作者: Rishav Pramanik,Antoine Poupon,Juan A. Rodriguez,Masih Aminbeidokhti,David Vazquez,Christopher Pal,Zhaozheng Yin,Marco Pedersoli
机构: Stony Brook University (石溪大学); International Laboratory on Learning Systems (ILLS); Université Paris-Saclay (巴黎萨克雷大学), CentraleSupélec (中央理工学院), France; ServiceNow Research; Mila-Quebec AI Institute; École de technologie supérieure (魁北克高等技术学院), QC, Canada; Polytechnique Montréal; Canada CIFAR AI Chair
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoregressive patch-based image generation has recently shown competitive results in terms of image quality and scalability. It can also be easily integrated and scaled within Vision-Language models. Nevertheless, autoregressive models require a defined order for patch generation. While a natural order based on the dictation of the words makes sense for text generation, there is no inherent generation order that exists for image generation. Traditionally, a raster-scan order (from top-left to bottom-right) guides autoregressive image generation models. In this paper, we argue that this order is suboptimal, as it fails to respect the causality of the image content: for instance, when conditioned on a visual description of a sunset, an autoregressive model may generate clouds before the sun, even though the color of clouds should depend on the color of the sun and not the inverse. In this work, we show that first by training a model to generate patches in any-given-order, we can infer both the content and the location (order) of each patch during generation. Secondly, we use these extracted orders to finetune the any-given-order model to produce better-quality images. Through our experiments, we show on two datasets that this new generation method produces better images than the traditional raster-scan approach, with similar training costs and no extra annotations.
zh

[CV-70] PPS-Ctrl: Controllable Sim-to-Real Translation for Colonoscopy Depth Estimation

【速读】:该论文旨在解决临床环境中获取真实深度标签困难的问题,同时克服现有合成数据集在实际场景中泛化能力不足的局限。论文提出了一种新颖的图像到图像翻译框架,能够在保留结构的同时从临床数据中生成逼真的纹理。解决方案的关键在于将Stable Diffusion与ControlNet相结合,并以从Per-Pixel Shading (PPS) 图中提取的潜在表示作为条件。PPS能够捕捉表面光照效果,相比深度图提供更强的结构约束。实验结果表明,该方法生成的翻译结果更加逼真,并且在深度估计任务上优于基于GAN的MI-CycleGAN方法。

链接: https://arxiv.org/abs/2504.17067
作者: Xinqi Xiong,Andrea Dunn Beltran,Jun Myeong Choi,Marc Niethammer,Roni Sengupta
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate depth estimation enhances endoscopy navigation and diagnostics, but obtaining ground-truth depth in clinical settings is challenging. Synthetic datasets are often used for training, yet the domain gap limits generalization to real data. We propose a novel image-to-image translation framework that preserves structure while generating realistic textures from clinical data. Our key innovation integrates Stable Diffusion with ControlNet, conditioned on a latent representation extracted from a Per-Pixel Shading (PPS) map. PPS captures surface lighting effects, providing a stronger structural constraint than depth maps. Experiments show our approach produces more realistic translations and improves depth estimation over GAN-based MI-CycleGAN. Our code is publicly accessible at this https URL.
zh
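按内窥镜文献中常见的“点光源近似位于相机光心”假设,PPS 可由深度图近似为 max(0, n·l)/d²,其中 n 为表面法线、l 为指向光源的单位方向、d 为到光源的距离;下面的 numpy 草图即基于这一假设,法线估计与归一化方式均为示意。

```python
import numpy as np

def pps_from_depth(depth: np.ndarray, fx: float, fy: float,
                   cx: float, cy: float) -> np.ndarray:
    """由深度图近似计算 PPS 图, 假设点光源与相机光心重合。"""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # 反投影得到相机坐标系下的 3D 点
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = np.stack([x, y, depth], axis=-1)
    # 用相邻像素差分近似表面法线
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    n = np.cross(du, dv)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    d = np.linalg.norm(pts, axis=-1) + 1e-8
    l = -pts / d[..., None]                      # 指向光源(光心)的单位向量
    shading = np.clip((n * l).sum(-1), 0, None) / d**2
    return shading / (shading.max() + 1e-8)      # 归一化到 [0,1], 仅为示意

pps = pps_from_depth(np.random.rand(240, 320) + 0.5, fx=300, fy=300, cx=160, cy=120)
```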

[CV-71] ePBR: Extended PBR Materials in Image Synthesis CVPR

【速读】:该论文旨在解决传统基于物理的渲染(Physically Based Rendering, PBR)在处理复杂表面模型(尤其是高镜面反射和透明表面)时存在的挑战,同时克服学习方法缺乏物理一致性的局限。论文的关键在于扩展内在图像表示(Intrinsic Image Representation),将其与反射和透射属性相结合,提出了一种明确的内在合成框架(Explicit Intrinsic Compositing Framework)。这一方案能够实现可控且可解释的图像合成,尤其适用于透明材料(如玻璃和窗户)的生成,并通过扩展的PBR材料(Extended PBR, ePBR)提供了精确的材质编辑能力。

链接: https://arxiv.org/abs/2504.17062
作者: Yu Guo,Zhiqiang Lao,Xiyun Song,Yubin Zhou,Zongfang Lin,Heather Yu
机构: Futurewei Technologies (未来wei技术)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages without references, 7 figures, accepted in CVPRW 2025

点击查看摘要

Abstract:Realistic indoor or outdoor image synthesis is a core challenge in computer vision and graphics. The learning-based approach is easy to use but lacks physical consistency, while traditional Physically Based Rendering (PBR) offers high realism but is computationally expensive. Intrinsic image representation offers a well-balanced trade-off, decomposing images into fundamental components (intrinsic channels) such as geometry, materials, and illumination for controllable synthesis. However, existing PBR materials struggle with complex surface models, particularly high-specular and transparent surfaces. In this work, we extend intrinsic image representations to incorporate both reflection and transmission properties, enabling the synthesis of transparent materials such as glass and windows. We propose an explicit intrinsic compositing framework that provides deterministic, interpretable image synthesis. With the Extended PBR (ePBR) Materials, we can effectively edit the materials with precise controls.
zh

[CV-72] DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在处理计算负担时效率低下且固定长度输出导致的资源浪费问题,同时保持高任务性能。论文提出的解决方案关键在于两个组件:首先,动态令牌合并(Dynamic Token Merging, DToMe)通过基于图像复杂度合并相似的视觉标记嵌入,减少了视觉标记的数量;其次,虚拟令牌解压缩(Virtual Token Unmerging, VTU)通过高效重建完整序列的注意力动态,模拟大型语言模型(Large Language Models, LLMs)的预期标记序列,从而在不进行额外微调的情况下保持下游性能。与现有方法不同,DyMU能够根据图像内容动态调整标记压缩,并且完全无需训练,使其易于应用于大多数最先进的VLM架构中。

链接: https://arxiv.org/abs/2504.17040
作者: Zhenhailong Wang,Senthil Purushwalkam,Caiming Xiong,Silvio Savarese,Heng Ji,Ran Xu
机构: University of Illinois Urbana-Champaign (伊利诺伊大学香槟分校); Salesforce Research (Salesforce 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present DyMU, an efficient, training-free framework that dynamically reduces the computational burden of vision-language models (VLMs) while maintaining high task performance. Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity, addressing the inherent inefficiency of fixed-length outputs in vision transformers. Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence, thus preserving the downstream performance without additional fine-tuning. Unlike previous approaches, our method dynamically adapts token compression to the content of the image and operates completely training-free, making it readily applicable to most state-of-the-art VLM architectures. Extensive experiments on image and video understanding tasks demonstrate that DyMU can reduce the average visual token count by 32%-85% while achieving comparable performance to full-length models across diverse VLM architectures, including the recently popularized AnyRes-based visual encoders. Furthermore, through qualitative analyses, we demonstrate that DToMe effectively adapts token reduction based on image complexity and, unlike existing systems, provides users more control over computational costs. Project page: this https URL.
zh
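DToMe 的具体匹配规则摘要未给出;下面借用 ToMe 风格的二分图近邻匹配,示意“按余弦相似度把最相近的 r 对视觉 token 平均合并”的基本操作,r 可随图像复杂度动态设定,属假设性草图。

```python
import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """x: (N, D) 的视觉 token; 近似地将最相似的 r 对 token 平均合并。"""
    a, b = x[::2], x[1::2]                         # 交替划分为两组做二分匹配
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()
    score, match = sim.max(dim=-1)                 # a 中每个 token 在 b 中的最近邻
    out, used_b, n_merged = [], set(), 0
    for i in score.argsort(descending=True).tolist():
        j = int(match[i])
        if n_merged < r and j not in used_b:       # 合并一对相似 token
            out.append(0.5 * (a[i] + b[j]))
            used_b.add(j)
            n_merged += 1
        else:
            out.append(a[i])                       # 保留未合并的 a token
    out += [b[j] for j in range(b.size(0)) if j not in used_b]
    return torch.stack(out)                        # 形状约为 (N - r, D)

tokens = torch.randn(576, 768)                     # 例如 24x24 个 ViT patch token
compressed = merge_tokens(tokens, r=200)           # r 可随图像复杂度动态设定
```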

[CV-73] Dense Air Pollution Estimation from Sparse in-situ Measurements and Satellite Data

【速读】:该论文旨在解决基于卫星估算环境二氧化氮(NO₂)浓度这一关键的公共健康与环境政策挑战。传统方法通过在特定点位建模卫星与现场测量数据之间的关系,在全球范围内提供了空气质量估算能力,但存在计算强度高、难以扩展的问题。论文的关键创新在于提出了一种密集估算技术,通过均匀随机偏移采样策略将地面真实数据的像素位置分散到更大的区域网格中,从而在推理阶段能够一次性生成整个网格的估算结果,显著降低了大范围估算所需的计算资源。这种方法不仅提高了计算效率,还以9.45%的优势超越现有逐点估算方法,实现了4.98 μg/m³的平均绝对误差(MAE),证明了其在高精度与高效性方面的优越性,为全球环境评估提供了一种可行的解决方案。

链接: https://arxiv.org/abs/2504.17039
作者: Ruben Gonzalez Avilés,Linus Scheibenreif,Damian Borth
机构: University of St. Gallen (圣加仑大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper addresses the critical environmental challenge of estimating ambient Nitrogen Dioxide (NO₂) concentrations, a key issue in public health and environmental policy. Existing methods for satellite-based air pollution estimation model the relationship between satellite and in-situ measurements at select point locations. While these approaches have advanced our ability to provide air quality estimations on a global scale, they come with inherent limitations. The most notable limitation is the computational intensity required for generating comprehensive estimates over extensive areas. Motivated by these limitations, this study introduces a novel dense estimation technique. Our approach seeks to balance the accuracy of high-resolution estimates with the practicality of computational constraints, thereby enabling efficient and scalable global environmental assessment. By utilizing a uniformly random offset sampling strategy, our method disperses the ground truth data pixel location evenly across a larger patch. At inference, the dense estimation method can then generate a grid of estimates in a single step, significantly reducing the computational resources required to provide estimates for larger areas. Notably, our approach also surpasses the results of existing point-wise methods by a significant margin of 9.45%, achieving a Mean Absolute Error (MAE) of 4.98 μg/m³. This demonstrates both high accuracy and computational efficiency, highlighting the applicability of our method for global environmental assessment. Furthermore, we showcase the method’s adaptability and robustness by applying it to diverse geographic regions. Our method offers a viable solution to the computational challenges of large-scale environmental monitoring.
zh
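“均匀随机偏移采样”的核心思想,是训练时让地面真值测点在图像块内的位置均匀随机,而非固定于块中心,从而使网络学会对整块区域逐像素回归;下面给出该采样策略的一个示意实现(patch 尺寸与边界处理方式为假设)。

```python
import numpy as np

def random_offset_crop(image: np.ndarray, station_yx: tuple[int, int],
                       patch: int = 128):
    """以测站像素为锚点裁剪 patch, 使测站在 patch 内的位置均匀随机。"""
    sy, sx = station_yx
    oy = np.random.randint(0, patch)            # 测站在 patch 内的目标行偏移
    ox = np.random.randint(0, patch)
    top = int(np.clip(sy - oy, 0, image.shape[0] - patch))   # 边界截断
    left = int(np.clip(sx - ox, 0, image.shape[1] - patch))
    crop = image[top:top + patch, left:left + patch]
    return crop, (sy - top, sx - left)          # patch 及测站在其中的位置

# 假设 image 为多通道卫星输入, (512, 512, C)
image = np.random.rand(512, 512, 12)
crop, target_pos = random_offset_crop(image, station_yx=(300, 260))
# 训练时仅在 target_pos 处有监督; 推理时网络一次输出整个 patch 的密集估计
```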

[CV-74] Seeing The Words: Evaluating AI-generated Biblical Art

【速读】:该论文试图解决的问题是如何系统性地评估基于圣经文本生成图像的准确性及其在宗教和美学视角下的表现。论文的关键解决方案在于构建了一个包含超过7000张图像的大规模数据集,并使用多种基于神经网络的工具从多个方面对这些生成的图像进行评估,同时结合宗教与美学角度提供综合分析,从而为生成式 AI (Generative AI) 在此类任务中的性能表现提供全面的评估与反思。

链接: https://arxiv.org/abs/2504.16974
作者: Hidde Makimei,Shuai Wang,Willem van Peursen
机构: 未知
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:The past years witnessed a significant amount of Artificial Intelligence (AI) tools that can generate images from texts. This triggers the discussion of whether AI can generate accurate images using text from the Bible with respect to the corresponding biblical contexts and backgrounds. Despite some existing attempts at a small scale, little work has been done to systematically evaluate these generated images. In this work, we provide a large dataset of over 7K images using biblical text as prompts. These images were evaluated with multiple neural network-based tools on various aspects. We provide an assessment of accuracy and some analysis from the perspective of religion and aesthetics. Finally, we discuss the use of the generated images and reflect on the performance of the AI generators.
zh

[CV-75] Unsupervised Time-Series Signal Analysis with Autoencoders and Vision Transformers: A Review of Architectures and Applications

【速读】:该论文旨在解决无监督信号分析领域中,如何有效利用大量未标注时序数据的问题。解决方案的关键在于结合自编码器(Autoencoders)和视觉变换器(Vision Transformers)的架构创新,通过特征提取、异常检测和分类等能力,实现对多种信号类型(如心电图、雷达波形和物联网传感器数据)的高效分析。论文强调了混合架构与自监督学习的优势,并指出了解释性、可扩展性和领域泛化能力等方面的挑战。通过这些方法论的创新与实际应用的结合,论文为开发鲁棒且适应性强的信号智能模型提供了方向。

链接: https://arxiv.org/abs/2504.16972
作者: Hossein Ahmadi,Sajjad Emdadi Mahdimahalleh,Arman Farahat,Banafsheh Saffari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:The rapid growth of unlabeled time-series data in domains such as wireless communications, radar, biomedical engineering, and the Internet of Things (IoT) has driven advancements in unsupervised learning. This review synthesizes recent progress in applying autoencoders and vision transformers for unsupervised signal analysis, focusing on their architectures, applications, and emerging trends. We explore how these models enable feature extraction, anomaly detection, and classification across diverse signal types, including electrocardiograms, radar waveforms, and IoT sensor data. The review highlights the strengths of hybrid architectures and self-supervised learning, while identifying challenges in interpretability, scalability, and domain generalization. By bridging methodological innovations and practical applications, this work offers a roadmap for developing robust, adaptive models for signal intelligence.
zh

[CV-76] S2Vec: Self-Supervised Geospatial Embeddings

【速读】:该论文试图解决构建适用于地理人工智能(Geospatial Artificial Intelligence)应用的可扩展通用建筑环境表征的问题。解决方案的关键在于提出了一种名为S2Vec的新颖自监督框架,它利用S2 Geometry库将大区域划分为离散的S2单元格,将单元格内的建筑环境特征向量栅格化为图像,并对这些栅格化图像进行掩码自动编码(Masked Autoencoding),以编码特征向量。这种方法生成的任务无关嵌入能够捕捉局部特征特性和更广泛的空间关系。

链接: https://arxiv.org/abs/2504.16942
作者: Shushman Choudhury,Elad Aharoni,Chandrakumari Suvarna,Iveel Tsogsuren,Abdul Rahman Kreidieh,Chun-Ta Lu,Neha Arora
机构: Google (谷歌)
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: To be submitted to ACM Transactions on Spatial Algorithms and Systems

点击查看摘要

Abstract:Scalable general-purpose representations of the built environment are crucial for geospatial artificial intelligence applications. This paper introduces S2Vec, a novel self-supervised framework for learning such geospatial embeddings. S2Vec uses the S2 Geometry library to partition large areas into discrete S2 cells, rasterizes built environment feature vectors within cells as images, and applies masked autoencoding on these rasterized images to encode the feature vectors. This approach yields task-agnostic embeddings that capture local feature characteristics and broader spatial relationships. We evaluate S2Vec on three large-scale socioeconomic prediction tasks, showing its competitive performance against state-of-the-art image-based embeddings. We also explore the benefits of combining S2Vec embeddings with image-based embeddings downstream, showing that such multimodal fusion can often improve performance. Our results highlight how S2Vec can learn effective general-purpose geospatial representations and how it can complement other data modalities in geospatial artificial intelligence.
zh

[CV-77] Multifaceted Evaluation of Audio-Visual Capability for MLLM s: Effectiveness Efficiency Generalizability and Robustness

【速读】:该论文旨在解决多模态大型语言模型(MLLMs)在音频-视觉能力方面的综合评估不足的问题,特别是在多样化场景下(如数据分布偏移和对抗攻击)的性能评价。论文的关键在于提出了一种从有效性(effectiveness)、效率(efficiency)、通用性(generalizability)和鲁棒性(robustness)四个维度对MLLMs的音频-视觉能力进行全面评估的方法。通过大量实验,研究发现MLLMs在零样本和少样本场景下具有较强的泛化能力,但其表现严重依赖视觉模态,当视觉输入受损或缺失时性能显著下降。此外,尽管MLLMs对对抗样本较为敏感,但在鲁棒性方面优于传统模型。这些发现为改进MLLMs的音频-视觉能力提供了方向,并为未来研究提供了指导。

链接: https://arxiv.org/abs/2504.16936
作者: Yusheng Zhao,Junyu Luo,Xiao Luo,Weizhi Zhang,Zhiping Xiao,Wei Ju,Philip S. Yu,Ming Zhang
机构: Peking University (北京大学); University of California, Los Angeles (加州大学洛杉矶分校); University of Illinois Chicago (伊利诺伊大学芝加哥分校); University of Washington (华盛顿大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have recently achieved great success in processing and understanding information from diverse modalities (e.g., text, audio, and visual signals). Despite their growing popularity, there remains a lack of comprehensive evaluation measuring the audio-visual capabilities of these models, especially in diverse scenarios (e.g., distribution shifts and adversarial attacks). In this paper, we present a multifaceted evaluation of the audio-visual capability of MLLMs, focusing on four key dimensions: effectiveness, efficiency, generalizability, and robustness. Through extensive experiments, we find that MLLMs exhibit strong zero-shot and few-shot generalization abilities, enabling them to achieve great performance with limited data. However, their success relies heavily on the vision modality, which impairs performance when visual input is corrupted or missing. Additionally, while MLLMs are susceptible to adversarial samples, they demonstrate greater robustness compared to traditional models. The experimental results and our findings provide insights into the audio-visual capabilities of MLLMs, highlighting areas for improvement and offering guidance for future research.
zh

[CV-78] Plasma State Monitoring and Disruption Characterization using Multimodal VAEs

【速读】:该论文旨在解决托卡马克装置中等离子体破裂(disruption)这一关键挑战,通过数据驱动的方法找到一种可解释的等离子体状态表征方式,以实现对破裂的识别与预测。论文的关键在于利用潜在变量模型将诊断测量数据映射为低维潜在表示,并基于变分自编码器(Variational Autoencoder, VAE)框架进行扩展,提出了一种能够连续投影等离子体轨迹、分离操作模式以及区分不同破裂类型的解决方案。此外,通过统计分析测量数据的特性,论文提出了连续的破裂率和破裂倾向指标。这种方法在TCV装置约1600次放电的数据集上进行了验证,展示了其在解释性分类等离子体操作模式及关联破裂风险方面的有效性。

链接: https://arxiv.org/abs/2504.17710
作者: Yoeri Poels,Alessandro Pau,Christian Donner,Giulio Romanelli,Olivier Sauter,Cristina Venturini,Vlado Menkovski, theTCV team, theWPTE team
机构: 未知
类目: Plasma Physics (physics.plasm-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:When a plasma disrupts in a tokamak, significant heat and electromagnetic loads are deposited onto the surrounding device components. These forces scale with plasma current and magnetic field strength, making disruptions one of the key challenges for future devices. Unfortunately, disruptions are not fully understood, with many different underlying causes that are difficult to anticipate. Data-driven models have shown success in predicting them, but they only provide limited interpretability. On the other hand, large-scale statistical analyses have been a great asset to understanding disruptive patterns. In this paper, we leverage data-driven methods to find an interpretable representation of the plasma state for disruption characterization. Specifically, we use a latent variable model to represent diagnostic measurements as a low-dimensional, latent representation. We build upon the Variational Autoencoder (VAE) framework, and extend it for (1) continuous projections of plasma trajectories; (2) a multimodal structure to separate operating regimes; and (3) separation with respect to disruptive regimes. Subsequently, we can identify continuous indicators for the disruption rate and the disruptivity based on statistical properties of measurement data. The proposed method is demonstrated using a dataset of approximately 1600 TCV discharges, selecting for flat-top disruptions or regular terminations. We evaluate the method with respect to (1) the identified disruption risk and its correlation with other plasma properties; (2) the ability to distinguish different types of disruptions; and (3) downstream analyses. For the latter, we conduct a demonstrative study on identifying parameters connected to disruptions using counterfactual-like analysis. Overall, the method can adequately identify distinct operating regimes characterized by varying proximity to disruptions in an interpretable manner.
zh

[CV-79] Beyond Labels: Zero-Shot Diabetic Foot Ulcer Wound Segmentation with Self-attention Diffusion Models and the Potential for Text-Guided Customization

【速读】:本文针对糖尿病足溃疡(Diabetic Foot Ulcers, DFUs)伤口评估需求复杂且缺乏高质量标注数据的挑战,试图解决传统深度学习模型在医疗场景中对大量标注数据依赖的问题。论文提出了一种名为Attention Diffusion Zero-shot Unsupervised System (ADZUS) 的新型文本引导扩散模型,其关键在于通过零样本学习(zero-shot learning)实现无需标注数据的伤口分割任务。ADZUS 利用描述性提示动态调整分割结果,显著提升了临床应用中的灵活性与适应性。实验表明,ADZUS 在慢性伤口数据集上的 Intersection over Union (IoU) 达到 86.68%,精度 (Precision) 高达 94.69%,优于监督学习方法如 FUSegNet,并在自定义 DFU 数据集上实现了更高的 Dice 相似系数 (Dice Similarity Coefficient, DSC)。该模型的核心创新在于其文本引导的分割能力,可实时定制化输出以支持基于临床描述的针对性分析。尽管 ADZUS 表现出色,但其基于扩散机制的推理计算成本及可能需要的进一步微调仍是未来改进的方向。

链接: https://arxiv.org/abs/2504.17628
作者: Abderrachid Hamrani,Daniela Leizaola,Renato Sousa,Jose P. Ponce,Stanley Mathis,David G. Armstrong,Anuradha Godavarty
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures, journal article

点击查看摘要

Abstract:Diabetic foot ulcers (DFUs) pose a significant challenge in healthcare, requiring precise and efficient wound assessment to enhance patient outcomes. This study introduces the Attention Diffusion Zero-shot Unsupervised System (ADZUS), a novel text-guided diffusion model that performs wound segmentation without relying on labeled training data. Unlike conventional deep learning models, which require extensive annotation, ADZUS leverages zero-shot learning to dynamically adapt segmentation based on descriptive prompts, offering enhanced flexibility and adaptability in clinical applications. Experimental evaluations demonstrate that ADZUS surpasses traditional and state-of-the-art segmentation models, achieving an IoU of 86.68% and the highest precision of 94.69% on the chronic wound dataset, outperforming supervised approaches such as FUSegNet. Further validation on a custom-curated DFU dataset reinforces its robustness, with ADZUS achieving a median DSC of 75%, significantly surpassing FUSegNet’s 45%. The model’s text-guided segmentation capability enables real-time customization of segmentation outputs, allowing targeted analysis of wound characteristics based on clinical descriptions. Despite its competitive performance, the computational cost of diffusion-based inference and the need for potential fine-tuning remain areas for future improvement. ADZUS represents a transformative step in wound segmentation, providing a scalable, efficient, and adaptable AI-driven solution for medical imaging.
zh

[CV-80] A Spatially-Aware Multiple Instance Learning Framework for Digital Pathology

【速读】:该论文旨在解决传统基于注意力的深度多重实例学习(Attention-Based Deep Multiple Instance Learning, ABMIL)方法在病理学全片图像(Whole Slide Images, WSIs)弱监督分类中忽视病灶 patches 之间空间交互的问题。尽管近期基于 Transformer 的多重实例学习(Transformer-based MIL, TransMIL)方法通过引入空间上下文和 patch 间关系取得了进展,但其高昂的计算复杂度限制了广泛应用。同时,尚不清楚在仅依赖多层感知机(Multi-Layer Perceptrons, MLPs)的 ABMIL 框架中显式建模 patch 关系是否能带来类似的性能提升。

论文的关键创新在于提出了一种增强的 ABMIL 框架,名为全局 ABMIL(Global ABMIL, GABMIL),通过集成交互感知表示来显式捕获实例间的依赖关系,同时保持计算效率。具体而言,GABMIL 在不显著增加计算开销的情况下,通过设计能够捕捉 patch 间相互作用的机制,在乳腺癌和肺癌病理亚型分类任务中实现了平均精确率-召回曲线下的面积(AUPRC)提高多达 7 个百分点,Kappa 系数提升 5 个百分点的性能改进。这表明在多重实例学习框架中显式建模 patch 交互的重要性。

链接: https://arxiv.org/abs/2504.17379
作者: Hassan Keshvarikhojasteh,Mihail Tifrea,Sibylle Hess,Josien P.W. Pluim,Mitko Veta
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multiple instance learning (MIL) is a promising approach for weakly supervised classification in pathology using whole slide images (WSIs). However, conventional MIL methods such as Attention-Based Deep Multiple Instance Learning (ABMIL) typically disregard spatial interactions among patches that are crucial to pathological diagnosis. Recent advancements, such as Transformer based MIL (TransMIL), have incorporated spatial context and inter-patch relationships. However, it remains unclear whether explicitly modeling patch relationships yields similar performance gains in ABMIL, which relies solely on Multi-Layer Perceptrons (MLPs). In contrast, TransMIL employs Transformer-based layers, introducing a fundamental architectural shift at the cost of substantially increased computational complexity. In this work, we enhance the ABMIL framework by integrating interaction-aware representations to address this question. Our proposed model, Global ABMIL (GABMIL), explicitly captures inter-instance dependencies while preserving computational efficiency. Experimental results on two publicly available datasets for tumor subtyping in breast and lung cancers demonstrate that GABMIL achieves up to a 7 percentage point improvement in AUPRC and a 5 percentage point increase in the Kappa score over ABMIL, with minimal or no additional computational overhead. These findings underscore the importance of incorporating patch interactions within MIL frameworks.
zh
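作为背景,ABMIL 的门控注意力池化可按 Ilse 等人的经典公式实现,GABMIL 即在此类骨架上引入交互感知表示(其具体模块摘要未披露);下面仅给出注意力池化部分的最小 PyTorch 草图,维度设置为假设。

```python
import torch
import torch.nn as nn

class GatedAttentionPooling(nn.Module):
    """ABMIL 门控注意力: a_k ∝ w^T (tanh(V h_k) ⊙ sigmoid(U h_k))。"""
    def __init__(self, dim: int = 512, hidden: int = 128):
        super().__init__()
        self.V = nn.Linear(dim, hidden)
        self.U = nn.Linear(dim, hidden)
        self.w = nn.Linear(hidden, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n_patches, dim), 一张 WSI 的全部 patch 特征
        a = self.w(torch.tanh(self.V(h)) * torch.sigmoid(self.U(h)))  # (n, 1)
        a = torch.softmax(a, dim=0)
        return (a * h).sum(dim=0)               # 聚合为 slide 级表示 (dim,)

pool = GatedAttentionPooling()
slide_repr = pool(torch.randn(3000, 512))       # 3000 个 patch 的示例
```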

[CV-81] Physiological neural representation for personalised tracer kinetic parameter estimation from dynamic PET

【速读】:该论文旨在解决动态正电子发射断层成像(PET)中基于[ ^18 F]FDG的葡萄糖代谢定量分析中存在的计算复杂度高、空间分辨率受限以及传统方法对数据需求大的问题。论文的关键解决方案在于提出了一种基于隐式神经表示(INRs)的生理神经表征方法,用于个性化动力学参数估计。与深度神经网络(DNNs)相比,INRs能够以更高效的方式学习连续函数,从而实现高分辨率的参数成像,并减少对大量训练数据的需求。此外,该方法还结合了来自3D CT基础模型的解剖先验信息,以增强动力学建模的鲁棒性和精确性。通过在[ ^18 F]FDG动态PET/CT数据集上的评估,结果显示该方法具有更高的空间分辨率、更低的均方误差以及更好的解剖一致性,特别是在肿瘤和高度血管化的区域表现突出。这些结果表明,INRs在个性化且数据高效的示踪剂动力学建模方面具有巨大潜力,可应用于肿瘤特征描述、分割及预后评估等领域。

链接: https://arxiv.org/abs/2504.17122
作者: Kartikay Tehlan,Thomas Wendler
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: The code is available at: this https URL

点击查看摘要

Abstract:Dynamic positron emission tomography (PET) with [¹⁸F]FDG enables non-invasive quantification of glucose metabolism through kinetic analysis, often modelled by the two-tissue compartment model (TCKM). However, voxel-wise kinetic parameter estimation using conventional methods is computationally intensive and limited by spatial resolution. Deep neural networks (DNNs) offer an alternative but require large training datasets and significant computational resources. To address these limitations, we propose a physiological neural representation based on implicit neural representations (INRs) for personalized kinetic parameter estimation. INRs, which learn continuous functions, allow for efficient, high-resolution parametric imaging with reduced data requirements. Our method also integrates anatomical priors from a 3D CT foundation model to enhance robustness and precision in kinetic modelling. We evaluate our approach on an [¹⁸F]FDG dynamic PET/CT dataset and compare it to state-of-the-art DNNs. Results demonstrate superior spatial resolution, lower mean-squared error, and improved anatomical consistency, particularly in tumour and highly vascularized regions. Our findings highlight the potential of INRs for personalized, data-efficient tracer kinetic modelling, enabling applications in tumour characterization, segmentation, and prognostic assessment.
zh
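下面给出“用隐式神经表示把体素坐标映射到动力学参数”的最小草图:随机傅里叶特征加小型 MLP,输出如 TCKM 的 K1、k2、k3、k4;网络结构、特征维度与非负约束方式均为假设,正向房室模型与损失函数从略。

```python
import torch
import torch.nn as nn

class KineticINR(nn.Module):
    """坐标 (x, y, z) -> 体素级动力学参数的隐式神经表示。"""
    def __init__(self, n_params: int = 4, n_feats: int = 128, sigma: float = 10.0):
        super().__init__()
        self.register_buffer("B", torch.randn(3, n_feats) * sigma)  # 随机傅里叶投影
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_feats, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_params), nn.Softplus(),  # 保证动力学参数非负
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        proj = 2 * torch.pi * xyz @ self.B
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        return self.mlp(feats)

model = KineticINR()
coords = torch.rand(4096, 3)            # 归一化体素坐标
params = model(coords)                  # (4096, 4): 每个坐标的 K1, k2, k3, k4
# 训练时将 params 代入房室模型正向生成 TAC, 与动态 PET 实测 TAC 做损失(从略)
```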

[CV-82] Anatomy-constrained modelling of image-derived input functions in dynamic PET using multi-organ segmentation

【速读】:本文旨在解决动态正电子发射断层成像(PET)中基于图像衍生输入函数(Image-Derived Input Functions, IDIFs)的药代动力学分析准确性受解剖变异和复杂血管贡献影响的问题。传统方法仅从主动脉获取IDIFs,忽略了这些因素。为了解决这一问题,研究提出了一种基于多器官分割的方法,通过整合来自主动脉、门静脉、肺动脉和输尿管的IDIFs,并结合肝脏、肺、肾脏和膀胱的高分辨率CT分割数据,引入特定器官的血液供应来源,从而改进药代动力学建模。关键在于利用多器官信息提升模型的解剖学精确性,最终在九名患者的动态[¹⁸F]FDG PET数据上验证了该方法的有效性,分别实现了肝脏和肺部均方误差(Mean Squared Error, MSE)降低13.39%和10.42%,初步证明了多IDIFs方法在改善解剖建模及充分利用动态PET成像方面的潜力。

链接: https://arxiv.org/abs/2504.17114
作者: Valentin Langer,Kartikay Tehlan,Thomas Wendler
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: The code is available under this https URL

点击查看摘要

Abstract:Accurate kinetic analysis of [¹⁸F]FDG distribution in dynamic positron emission tomography (PET) requires anatomically constrained modelling of image-derived input functions (IDIFs). Traditionally, IDIFs are obtained from the aorta, neglecting anatomical variations and complex vascular contributions. This study proposes a multi-organ segmentation-based approach that integrates IDIFs from the aorta, portal vein, pulmonary artery, and ureters. Using high-resolution CT segmentations of the liver, lungs, kidneys, and bladder, we incorporate organ-specific blood supply sources to improve kinetic modelling. Our method was evaluated on dynamic [¹⁸F]FDG PET data from nine patients, resulting in a mean squared error (MSE) reduction of 13.39% for the liver and 10.42% for the lungs. These initial results highlight the potential of multiple IDIFs in improving anatomical modelling and fully leveraging dynamic PET imaging. This approach could facilitate the integration of tracer kinetic modelling into clinical routine.
zh

[CV-83] Automating tumor-infiltrating lymphocyte assessment in breast cancer histopathology images using QuPath: a transparent and accessible machine learning pipeline

【速读】:本文旨在解决乳腺癌HE染色全片图像(WSI)中肿瘤浸润淋巴细胞(TILs)评估的自动化问题。解决方案的关键在于构建了一个端到端的TILs评估流程,利用QuPath平台实现全自动操作。首先,通过训练像素分类器分割肿瘤、肿瘤相关间质和其他组织区域,以提取肿瘤相关间质用于后续分析;其次,应用预训练的StarDist深度学习模型进行细胞检测,并基于提取的细胞特征训练二元分类器区分TILs与其他细胞类型。最终,通过计算每张WSI中的TIL密度并将其分为低、中、高三个等级,验证了该流程的有效性,与病理学家评分的一致性达到了Cohen’s kappa值为0.71,证明现有软件可提供一种实用的解决方案。

链接: https://arxiv.org/abs/2504.16979
作者: Masoud Tafavvoghi,Lars Ailo Bongo,André Berli Delgado,Nikita Shvetsov,Anders Sildnes,Line Moi,Lill-Tove Rasmussen Busund,Kajsa Møllersen
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 Pages, 9 Figures, 3 tables

点击查看摘要

Abstract:In this study, we built an end-to-end tumor-infiltrating lymphocytes (TILs) assessment pipeline within QuPath, demonstrating the potential of easily accessible tools to perform complex tasks in a fully automatic fashion. First, we trained a pixel classifier to segment tumor, tumor-associated stroma, and other tissue compartments in breast cancer H&E-stained whole-slide images (WSI) to isolate tumor-associated stroma for subsequent analysis. Next, we applied a pre-trained StarDist deep learning model in QuPath for cell detection and used the extracted cell features to train a binary classifier distinguishing TILs from other cells. To evaluate our TILs assessment pipeline, we calculated the TIL density in each WSI and categorized them as low, medium, or high TIL levels. Our pipeline was evaluated against pathologist-assigned TIL scores, achieving a Cohen’s kappa of 0.71 on the external test set, corroborating previous research findings. These results confirm that existing software can offer a practical solution for the assessment of TILs in H&E-stained WSIs of breast cancer.
zh

[CV-84] Can deep neural networks learn biological vision?

【速读】:该论文试图解决的问题是:随着深度神经网络(DNNs)在计算机视觉基准测试中的表现不断提升,尽管早期其与灵长类神经反应的对齐度逐渐增加,但近年来这一趋势逆转,现代DNNs通过人类或超人水平的识别精度的方式与灵长类视觉系统所依赖的视觉特征不同,这可能导致更优的生物视觉计算模型的发展停滞。论文指出,目前基于互联网数据基准训练的人工智能方法可能无法有效模拟生物视觉系统的核心特性。

解决方案的关键在于:论文提出,视觉科学应脱离人工智能领域,设计专门针对生物视觉系统的算法,而非继续沿用现有的互联网数据基准。具体而言,下一代生物视觉深度学习模型需要采用更接近塑造人类视觉的数据集、训练流程以及目标函数,从而更好地捕捉生物视觉系统的本质特征。

链接: https://arxiv.org/abs/2504.16940
作者: Drew Linsley,Pinyuan Feng,Thomas Serre
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) once showed increasing alignment with primate neural responses as they improved on computer vision benchmarks. This trend raised the exciting possibility that better models of biological vision would come as a byproduct of the deep learning revolution in artificial intelligence. However, the trend has reversed over recent years as DNNs have scaled to human or superhuman recognition accuracy, a divergence that may stem from modern DNNs learning to rely on different visual features than primates to solve tasks. Where will better computational models of biological vision come from? We propose that vision science must break from artificial intelligence to develop algorithms that are designed with biological visual systems in mind instead of internet data benchmarks. We predict that the next generation of deep learning models of biological vision will be trained with data diets, training routines, and objectives that are closer to those that shape human vision than those that are in use today.
zh

人工智能

[AI-0] Integrating Learning-Based Manipulation and Physics-Based Locomotion for Whole-Body Badminton Robot Control ICRA2025

【速读】:本文旨在解决基于学习的方法(如模仿学习 IL 和强化学习 RL)在敏捷羽毛球机器人控制中的训练复杂性高以及安全性和稳定性保障不足的问题。为应对这一挑战,论文提出了一种名为 \ourmethod 的新型混合控制系统。其关键在于结合基于模型的方法与基于学习的方法:通过设计一种基于模型的底盘运动策略作为机械臂策略的基础,并引入包含特权信息的基于物理的“IL+RL”训练框架来优化机械臂策略;同时,在IL阶段训练批评者模型以缓解从IL过渡到RL时的性能下降问题。这些创新点共同确保了系统的高效性和可靠性。

链接: https://arxiv.org/abs/2504.17771
作者: Haochen Wang,Zhiwei Shi,Chengxi Zhu,Yafei Qiao,Cheng Zhang,Fan Yang,Pengjie Ren,Lan Lu,Dong Xuan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICRA 2025. Project page: this https URL

点击查看摘要

Abstract:Learning-based methods, such as imitation learning (IL) and reinforcement learning (RL), can produce excellent control policies for challenging agile robot tasks, such as sports robots. However, no existing work has harmonized learning-based policies with model-based methods to reduce training complexity and ensure safety and stability for agile badminton robot control. In this paper, we introduce \ourmethod, a novel hybrid control system for agile badminton robots. Specifically, we propose a model-based strategy for chassis locomotion which provides a base for the arm policy. We introduce a physics-informed ``IL+RL’’ training framework for the learning-based arm policy. In this training framework, a model-based strategy with privileged information is used to guide arm policy training during both IL and RL phases. In addition, we train the critic model during the IL phase to alleviate the performance drop issue when transitioning from IL to RL. We present results on our self-engineered badminton robot, achieving a 94.5% success rate against the serving machine and a 90.7% success rate against human players. Our system can be easily generalized to other agile mobile manipulation tasks such as agile catching and table tennis. Our project website: this https URL.
zh

[AI-1] Revisiting Reset Mechanisms in Spiking Neural Networks for Sequential Modeling: Specialized Discretization for Binary Activated RNN

【速读】:该论文旨在解决基于二值化激活的脉冲神经网络(SNNs)在序列建模中的三个根本性挑战:(1) 传统模型缺乏有效的长程序列记忆机制;(2) SNNs 中受生物启发的组件(如重置机制和不应期应用)在序列任务中的理论探索不足;(3) SNN 的循环神经网络(RNN)计算范式阻碍了跨不同时间步的并行训练。为应对这些挑战,研究系统分析了二值化激活 RNN 基础上的 SNN 序列模型中重置操作与不应期的基本机制,重新审视了这些生物机制是否严格必要以生成稀疏脉冲模式,并提供了新的理论解释与见解。最终,论文提出了固定不应期 SNN 架构(fixed-refractory-period SNN architecture)作为序列建模的解决方案。关键在于通过理论分析与创新设计,优化 SNN 在序列任务中的性能与效率。

链接: https://arxiv.org/abs/2504.17751
作者: Enqi Zhang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the field of image recognition, spiking neural networks (SNNs) have achieved performance comparable to conventional artificial neural networks (ANNs). In such applications, SNNs essentially function as traditional neural networks with quantized activation values. This article focuses on another alternative perspective, viewing SNNs as binary-activated recurrent neural networks (RNNs) for sequential modeling tasks. Under this viewpoint, current SNN architectures face several fundamental challenges in sequence modeling: (1) Traditional models lack effective memory mechanisms for long-range sequence modeling; (2) The biological-inspired components in SNNs (such as reset mechanisms and refractory period applications) remain theoretically under-explored for sequence tasks; (3) The RNN-like computational paradigm in SNNs prevents parallel training across different timesteps. To address these challenges, this study conducts a systematic analysis of the fundamental mechanisms underlying reset operations and refractory periods in binary-activated RNN-based SNN sequence models. We re-examine whether such biological mechanisms are strictly necessary for generating sparse spiking patterns, provide new theoretical explanations and insights, and ultimately propose the fixed-refractory-period SNN architecture for sequence modeling.
zh

[AI-2] Conformal Segmentation in Industrial Surface Defect Detection with Statistical Guarantees

【速读】:该论文旨在解决工业环境中基于深度学习的自动化表面缺陷检测模型在实际应用中可靠性不足的问题。传统方法依赖人工检测效率低且成本高,而基于卷积神经网络(Convolutional Neural Networks, CNN)的方法虽发展迅速,但其可靠性受制于训练过程中数据标注不确定性及过拟合问题,导致新样本检测结果可能出现偏差,影响自动化检测过程的可靠性。

解决方案的关键在于提出了一种基于校准数据的统计严格方法。论文通过满足独立同分布(independent and identically distributed, i.i.d)条件的校准数据评估检测模型的实际性能,并定义损失函数量化测试样本中的检测误差率(如召回率的补数与误发现率)。进一步,基于用户定义的风险水平推导出严格的阈值,用于识别测试图像中高概率缺陷像素,构建预测集(如缺陷区域),确保测试集上的预期误差率被严格控制在预设风险水平以内。此外,论文还观察到预测集大小与测试集风险水平之间存在负相关关系,建立了评估检测模型不确定性的统计严格指标,并验证了该方法在不同校准-测试划分比例下的鲁棒性和高效性。

链接: https://arxiv.org/abs/2504.17721
作者: Cheng Shen,Yuewei Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:In industrial settings, surface defects on steel can significantly compromise its service life and elevate potential safety risks. Traditional defect detection methods predominantly rely on manual inspection, which suffers from low efficiency and high costs. Although automated defect detection approaches based on Convolutional Neural Networks (e.g., Mask R-CNN) have advanced rapidly, their reliability remains challenged due to data annotation uncertainties during deep model training and overfitting issues. These limitations may lead to detection deviations when processing the given new test samples, rendering automated detection processes unreliable. To address this challenge, we first evaluate the detection model’s practical performance through calibration data that satisfies the independent and identically distributed (i.i.d) condition with test data. Specifically, we define a loss function for each calibration sample to quantify detection error rates, such as the complement of recall rate and false discovery rate. Subsequently, we derive a statistically rigorous threshold based on a user-defined risk level to identify high-probability defective pixels in test images, thereby constructing prediction sets (e.g., defect regions). This methodology ensures that the expected error rate (mean error rate) on the test set remains strictly bounded by the predefined risk level. Additionally, we observe a negative correlation between the average prediction set size and the risk level on the test set, establishing a statistically rigorous metric for assessing detection model uncertainty. Furthermore, our study demonstrates robust and efficient control over the expected test set error rate across varying calibration-to-test partitioning ratios, validating the method’s adaptability and operational effectiveness.
zh
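阈值推导可按保形风险控制(conformal risk control)的常见做法实现:在候选阈值上扫描,取使校准集平均损失(含有限样本修正)不超过风险水平 α 的最保守阈值;下面以“漏检率(1-召回率)”为损失给出示意,阈值网格与修正形式均为常见假设,非论文原始实现。

```python
import numpy as np

def pick_threshold(cal_probs, cal_masks, alpha=0.1, grid=None):
    """cal_probs/cal_masks: 各校准样本的缺陷概率图与真值掩码(列表)。
    返回最大的阈值 λ, 使校准集平均漏检率 ≤ α(含 (n+1) 有限样本修正)。"""
    n = len(cal_probs)
    grid = np.linspace(0, 1, 101) if grid is None else grid
    best = 0.0
    for lam in grid:                       # 漏检率随 λ 单调不减, 可直接扫描
        losses = []
        for p, m in zip(cal_probs, cal_masks):
            pred = p >= lam                # λ 越大, 预测集越小, 漏检越多
            missed = np.logical_and(m, ~pred).sum() / max(m.sum(), 1)
            losses.append(missed)          # 即 1 - recall
        if (np.sum(losses) + 1.0) / (n + 1) <= alpha:
            best = lam                     # 取满足约束的最大 λ(最小预测集)
    return best

# 随机数据示例
cal_probs = [np.random.rand(64, 64) for _ in range(20)]
cal_masks = [np.random.rand(64, 64) > 0.9 for _ in range(20)]
lam = pick_threshold(cal_probs, cal_masks, alpha=0.2)
```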

[AI-3] Early Detection of Multidrug Resistance Using Multivariate Time Series Analysis and Interpretable Patient-Similarity Representations

【速读】:该论文旨在解决多重耐药性(Multidrug Resistance, MDR)这一全球性的健康挑战,通过提出一种可解释的机器学习框架来实现MDR的准确预测与增强的可解释性。解决方案的关键在于将患者建模为多变量时间序列(Multivariate Time Series, MTS),利用基于MTS的方法(如描述性统计、动态时间规整和时间聚类核)量化患者间的相似性,并将其作为输入特征,结合逻辑回归、随机森林和支持向量机等分类方法进行MDR预测。同时,通过维度约简和核变换优化模型性能,并利用图结构的患者相似性网络进行谱聚类和t-SNE可视化分析,以揭示临床相关的模式和高风险集群,从而识别关键风险因素并提供可解释的洞察。

链接: https://arxiv.org/abs/2504.17717
作者: Óscar Escudero-Arnanz,Antonio G. Marques,Inmaculada Mora-Jiménez,Joaquín Álvarez-Rodríguez,Cristina Soguero-Ruiz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background and Objectives: Multidrug Resistance (MDR) is a critical global health issue, causing increased hospital stays, healthcare costs, and mortality. This study proposes an interpretable Machine Learning (ML) framework for MDR prediction, aiming for both accurate inference and enhanced explainability. Methods: Patients are modeled as Multivariate Time Series (MTS), capturing clinical progression and patient-to-patient interactions. Similarity among patients is quantified using MTS-based methods: descriptive statistics, Dynamic Time Warping, and Time Cluster Kernel. These similarity measures serve as inputs for MDR classification via Logistic Regression, Random Forest, and Support Vector Machines, with dimensionality reduction and kernel transformations improving model performance. For explainability, patient similarity networks are constructed from these metrics. Spectral clustering and t-SNE are applied to identify MDR-related subgroups and visualize high-risk clusters, enabling insight into clinically relevant patterns. Results: The framework was validated on ICU Electronic Health Records from the University Hospital of Fuenlabrada, achieving an AUC of 81%. It outperforms baseline ML and deep learning models by leveraging graph-based patient similarity. The approach identifies key risk factors – prolonged antibiotic use, invasive procedures, co-infections, and extended ICU stays – and reveals clinically meaningful clusters. Code and results are available at this https URL. Conclusions: Patient similarity representations combined with graph-based analysis provide accurate MDR prediction and interpretable insights. This method supports early detection, risk factor identification, and patient stratification, highlighting the potential of explainable ML in critical care.
zh
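作为文中 MTS 相似度度量之一,DTW 可用经典动态规划实现;下面给出多变量情形(按时间点向量的欧氏距离)的最小示意,并用负距离构造患者间相似度矩阵,供下游分类器作为输入特征;数据为随机演示,非论文原始实现。

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (Ta, d), b: (Tb, d) 的多变量时间序列, 返回 DTW 距离。"""
    ta, tb = len(a), len(b)
    D = np.full((ta + 1, tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # 时间点间的局部代价
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[ta, tb])

# 患者时序长度不等(5~14 个时间点, 每点 8 个临床特征)
patients = [np.random.rand(np.random.randint(5, 15), 8) for _ in range(10)]
# 负 DTW 距离作为相似度矩阵, 可直接作为分类器的输入特征
S = np.array([[-dtw_distance(p, q) for q in patients] for p in patients])
```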

[AI-4] Federated Learning: A Survey on Privacy-Preserving Collaborative Intelligence

【速读】:本文试图解决分布式机器学习领域中数据隐私、安全性和监管合规性等日益增长的挑战。解决方案的关键在于联邦学习(Federated Learning, FL)这一去中心化的协作训练范式,通过让多个客户端(如移动设备、边缘节点或组织)在不集中敏感数据的情况下共同训练共享的全局模型来实现。论文重点探讨了FL的核心架构、通信协议以及标准生命周期(包括本地训练、模型聚合和全局更新)。关键挑战包括处理非独立同分布(non-IID)数据、应对系统与硬件异构性、降低通信开销,以及通过差分隐私(differential privacy)和安全聚合等机制确保隐私保护。此外,论文还关注个性化联邦学习、跨设备与跨孤岛设置的研究趋势,并强调了真实世界应用、基准数据集及评估指标的重要性,最后提出了开放研究问题与未来发展方向,以推动可扩展、高效且可信的FL系统的实现。

链接: https://arxiv.org/abs/2504.17703
作者: Edward Collins,Michel Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a transformative paradigm in the field of distributed machine learning, enabling multiple clients such as mobile devices, edge nodes, or organizations to collaboratively train a shared global model without the need to centralize sensitive data. This decentralized approach addresses growing concerns around data privacy, security, and regulatory compliance, making it particularly attractive in domains such as healthcare, finance, and smart IoT systems. This survey provides a concise yet comprehensive overview of Federated Learning, beginning with its core architecture and communication protocol. We discuss the standard FL lifecycle, including local training, model aggregation, and global updates. A particular emphasis is placed on key technical challenges such as handling non-IID (non-independent and identically distributed) data, mitigating system and hardware heterogeneity, reducing communication overhead, and ensuring privacy through mechanisms like differential privacy and secure aggregation. Furthermore, we examine emerging trends in FL research, including personalized FL, cross-device versus cross-silo settings, and integration with other paradigms such as reinforcement learning and quantum computing. We also highlight real-world applications and summarize benchmark datasets and evaluation metrics commonly used in FL research. Finally, we outline open research problems and future directions to guide the development of scalable, efficient, and trustworthy FL systems.
zh
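针对文中描述的“本地训练、模型聚合、全局更新”标准生命周期,下面给出 FedAvg 的一个自包含 numpy 示意:各客户端本地训练后按样本量加权平均;本地模型取线性回归仅作演示,非任何真实系统的实现。

```python
import numpy as np

def local_train(weights, data, lr=0.1, epochs=1):
    """占位的本地训练: 此处用若干步梯度下降的线性回归示意。"""
    X, y = data
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg(global_w, client_data):
    """一轮 FedAvg: 各客户端本地训练后按样本量加权平均得到新全局模型。"""
    sizes = np.array([len(y) for _, y in client_data], dtype=float)
    updates = [local_train(global_w, d) for d in client_data]
    return np.average(updates, axis=0, weights=sizes)

# 3 个客户端, 各自数据分布的均值不同(模拟非独立同分布, 仅作演示)
rng = np.random.default_rng(0)
clients = []
for shift in (0.0, 1.0, 2.0):
    X = rng.normal(shift, 1.0, size=(50, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.1, 50)
    clients.append((X, y))

w = np.zeros(4)
for _ in range(20):                 # 多轮: 本地训练 -> 聚合 -> 全局更新
    w = fedavg(w, clients)
```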

[AI-5] INSIGHT: Bridging the Student-Teacher Gap in Times of Large Language Models

【速读】:该论文旨在解决人工智能(AI),特别是大型语言模型(LLMs),在融入课堂教学过程中带来的挑战与机遇问题。论文关注如何利用AI技术辅助教学人员优化教学方法,同时应对学生-教师互动质量下降及用户隐私保护等潜在风险。解决方案的关键在于INSIGHT系统的设计,它通过模块化架构实现与多种高等教育课程的集成,并采用提取关键词的方法分析学生提出的问题,动态构建常见问题解答(FAQ)以提供新的见解,从而支持教学人员开展更个性化的面对面辅导。未来可进一步扩展INSIGHT,利用收集的数据实现适应性学习,根据学生的学习进度和风格调整教学内容,提供更具交互性和包容性的学习体验。

链接: https://arxiv.org/abs/2504.17677
作者: Jarne Thys,Sebe Vanbrabant,Davy Vanacken,Gustavo Rovelo Ruiz
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of AI, especially Large Language Models, presents challenges and opportunities to integrate such technology into the classroom. AI has the potential to revolutionize education by helping teaching staff with various tasks, such as personalizing their teaching methods, but it also raises concerns, for example, about the degradation of student-teacher interactions and user privacy. This paper introduces INSIGHT, a proof of concept to combine various AI tools to assist teaching staff and students in the process of solving exercises. INSIGHT has a modular design that allows it to be integrated into various higher education courses. We analyze students’ questions to an LLM by extracting keywords, which we use to dynamically build an FAQ from students’ questions and provide new insights for the teaching staff to use for more personalized face-to-face support. Future work could build upon INSIGHT by using the collected data to provide adaptive learning and adjust content based on student progress and learning styles to offer a more interactive and inclusive learning experience.
zh

[AI-6] Optimized Cloud Resource Allocation Using Genetic Algorithms for Energy Efficiency and QoS Assurance

【Quick Read】: This paper targets dynamic resource management in cloud computing: minimizing energy consumption through virtual machine (VM) placement and consolidation while satisfying quality-of-service (QoS) constraints. The key is a Genetic Algorithm (GA)-based method that dynamically adjusts VM allocation to real-time workload variations, outperforming traditional heuristics such as First Fit Decreasing (FFD) and Best Fit Decreasing (BFD). Experiments confirm notable reductions in energy consumption, VM migrations, SLA violation rates, and execution time, and a correlation heatmap shows strong relationships among these key performance indicators.

Link: https://arxiv.org/abs/2504.17675
Authors: Caroline Panggabean, Devaraj Verma C, Bhagyashree Gogoi, Ranju Limbu, Rhythm Sarker
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 7 pages, 5 figures, accepted for publication (not yet published)

Abstract:Cloud computing environments demand dynamic and efficient resource management to ensure optimal performance, reduced energy consumption, and adherence to Service Level Agreements (SLAs). This paper presents a Genetic Algorithm (GA)-based approach for Virtual Machine (VM) placement and consolidation, aiming to minimize power usage while maintaining QoS constraints. The proposed method dynamically adjusts VM allocation based on real-time workload variations, outperforming traditional heuristics such as First Fit Decreasing (FFD) and Best Fit Decreasing (BFD). Experimental results show notable reductions in energy consumption, VM migrations, SLA violation rates, and execution time. A correlation heatmap further illustrates strong relationships among these key performance indicators, confirming the effectiveness of our approach in optimizing cloud resource utilization.
zh
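As a rough illustration of the GA loop described above, the sketch below evolves VM-to-host assignments with a fitness that trades an energy proxy (the number of powered-on hosts) against an SLA-violation penalty. All sizes, loads, operators, and the fitness weighting are invented for the example; the paper's actual encoding is not specified here.

```python
import random

NUM_VMS, NUM_HOSTS, HOST_CAP = 20, 8, 100               # illustrative sizes
VM_LOAD = [random.randint(10, 40) for _ in range(NUM_VMS)]

def fitness(assign):
    """Lower is better: energy proxy (active hosts) + SLA penalty (overload)."""
    used = [0] * NUM_HOSTS
    for vm, host in enumerate(assign):
        used[host] += VM_LOAD[vm]
    active = sum(1 for u in used if u > 0)               # each active host costs energy
    overload = sum(max(0, u - HOST_CAP) for u in used)   # QoS/SLA violation proxy
    return active + 10.0 * overload

def crossover(a, b):
    cut = random.randrange(1, NUM_VMS)
    return a[:cut] + b[cut:]

def mutate(assign, rate=0.05):
    return [random.randrange(NUM_HOSTS) if random.random() < rate else h for h in assign]

pop = [[random.randrange(NUM_HOSTS) for _ in range(NUM_VMS)] for _ in range(50)]
for gen in range(200):
    pop.sort(key=fitness)
    elite = pop[:10]                                     # keep the best placements
    pop = elite + [mutate(crossover(*random.sample(elite, 2))) for _ in range(40)]
print("best fitness:", fitness(min(pop, key=fitness)))
```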

[AI-7] Towards a HIPAA Compliant Agentic AI System in Healthcare

【Quick Read】: This paper addresses the compliance challenges that LLM-powered agentic AI systems face in healthcare, particularly meeting the requirements of the Health Insurance Portability and Accountability Act (HIPAA) when handling Protected Health Information (PHI). The key is a HIPAA-compliant agentic AI framework that enforces regulatory compliance through dynamic, context-aware policy enforcement. Its core mechanisms are: (1) Attribute-Based Access Control (ABAC) for fine-grained PHI governance; (2) a hybrid PHI sanitization pipeline that combines regex patterns with a BERT-based model to minimize leakage; and (3) immutable audit trails for compliance verification.

Link: https://arxiv.org/abs/2504.17669
Authors: Subash Neupane, Shaswata Mitra, Sudip Mittal, Shahram Rahimi
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:Agentic AI systems powered by Large Language Models (LLMs) as their foundational reasoning engine, are transforming clinical workflows such as medical report generation and clinical summarization by autonomously analyzing sensitive healthcare data and executing decisions with minimal human oversight. However, their adoption demands strict compliance with regulatory frameworks such as Health Insurance Portability and Accountability Act (HIPAA), particularly when handling Protected Health Information (PHI). This work-in-progress paper introduces a HIPAA-compliant Agentic AI framework that enforces regulatory compliance through dynamic, context-aware policy enforcement. Our framework integrates three core mechanisms: (1) Attribute-Based Access Control (ABAC) for granular PHI governance, (2) a hybrid PHI sanitization pipeline combining regex patterns and BERT-based model to minimize leakage, and (3) immutable audit trails for compliance verification.
zh
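A minimal sketch of the hybrid sanitization idea, assuming a rule-based regex pass followed by a model-based pass. The patterns and the `ner_redact` stub are illustrative placeholders, not the paper's pipeline; a real deployment would plug a fine-tuned token-classification model into the second stage.

```python
import re

# Illustrative regex patterns for structured PHI; real deployments need far more.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
}

def ner_redact(text):
    """Placeholder for the learned stage: a BERT-style NER model would tag
    names, addresses, and dates that regexes miss."""
    return text  # plug a token-classification model in here

def sanitize(text):
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)           # rule-based pass first
    return ner_redact(text)                              # then the model-based pass

print(sanitize("Patient John, MRN: 00123456, phone 555-867-5309, SSN 123-45-6789."))
```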

[AI-8] The Malicious Technical Ecosystem: Exposing Limitations in Technical Governance of AI-Generated Non-Consensual Intimate Images of Adults

【Quick Read】: This paper addresses the harms to adults caused by AI-generated non-consensual intimate images (AIG-NCII), colloquially known as "deepfake pornography." Taking a survivor-centered approach, it identifies a Malicious Technical Ecosystem (MTE) comprising open-source face-swapping models and nearly 200 "nudifying" programs that let non-technical users create AIG-NCII within minutes. Using the NIST AI 100-4 report as a reflection of current synthetic-content governance methods, the paper shows how today's practices fail to effectively regulate the MTE for adult AIG-NCII and exposes the flawed assumptions behind these gaps. The key lies in re-examining existing governance and motivating more effective technical and social mechanisms for this complex challenge.

Link: https://arxiv.org/abs/2504.17663
Authors: Michelle L. Ding, Harini Suresh
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we adopt a survivor-centered approach to locate and dissect the role of sociotechnical AI governance in preventing AI-Generated Non-Consensual Intimate Images (AIG-NCII) of adults, colloquially known as "deep fake pornography." We identify a "malicious technical ecosystem" or "MTE," comprising open-source face-swapping models and nearly 200 "nudifying" software programs that allow non-technical users to create AIG-NCII within minutes. Then, using the National Institute of Standards and Technology (NIST) AI 100-4 report as a reflection of current synthetic content governance methods, we show how the current landscape of practices fails to effectively regulate the MTE for adult AIG-NCII, as well as the flawed assumptions explaining these gaps.
zh

[AI-9] PTCL: Pseudo-Label Temporal Curriculum Learning for Label-Limited Dynamic Graph

【Quick Read】: This paper tackles label-limited dynamic node classification, where real-world settings only provide labels at the final timestamp. The key innovations of the proposed PTCL (Pseudo-label Temporal Curriculum Learning) are: (1) a temporal decoupling architecture that separates the backbone (learning time-aware representations) from the decoder (strictly aligned with final labels), with the backbone generating pseudo-labels; and (2) a Temporal Curriculum Learning strategy that assigns higher weights to pseudo-labels closer to the final timestamp via an exponentially decaying function, prioritizing the more informative labels. The paper also contributes a new academic dataset (CoOAG) and a unified framework, FLiD (Framework for Label-Limited Dynamic Node Classification), covering a complete data preparation workflow, training pipeline, and evaluation standards across a variety of models and datasets.

Link: https://arxiv.org/abs/2504.17641
Authors: Shengtao Zhang, Haokai Zhang, Shiqi Lou, Zicheng Wang, Zinan Zeng, Yilin Wang, Minnan Luo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dynamic node classification is critical for modeling evolving systems like financial transactions and academic collaborations. In such systems, dynamically capturing node information changes is critical for dynamic node classification, which usually requires all labels at every timestamp. However, it is difficult to collect all dynamic labels in real-world scenarios due to high annotation costs and label uncertainty (e.g., ambiguous or delayed labels in fraud detection). In contrast, final timestamp labels are easier to obtain as they rely on complete temporal patterns and are usually maintained as a unique label for each user in many open platforms, without tracking the history data. To bridge this gap, we propose PTCL(Pseudo-label Temporal Curriculum Learning), a pioneering method addressing label-limited dynamic node classification where only final labels are available. PTCL introduces: (1) a temporal decoupling architecture separating the backbone (learning time-aware representations) and decoder (strictly aligned with final labels), which generate pseudo-labels, and (2) a Temporal Curriculum Learning strategy that prioritizes pseudo-labels closer to the final timestamp by assigning them higher weights using an exponentially decaying function. We contribute a new academic dataset (CoOAG), capturing long-range research interest in dynamic graph. Experiments across real-world scenarios demonstrate PTCL’s consistent superiority over other methods adapted to this task. Beyond methodology, we propose a unified framework FLiD (Framework for Label-Limited Dynamic Node Classification), consisting of a complete preparation workflow, training pipeline, and evaluation standards, and supporting various models and datasets. The code can be found at this https URL.
zh
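The curriculum weighting can be sketched in a few lines. The decay constant below is an assumption for illustration; the abstract only states that weights rise toward the final timestamp via an exponentially decaying function of the time gap.

```python
import math

def pseudo_label_weight(t, t_final, decay=0.1):
    """Weight pseudo-labels more heavily the closer they sit to the
    final timestamp: w_t = exp(-decay * (t_final - t))."""
    return math.exp(-decay * (t_final - t))

# Example: timestamps 0..10, final label observed at t = 10.
weights = [pseudo_label_weight(t, 10) for t in range(11)]
print([round(w, 3) for w in weights])   # rises toward 1.0 at t_final

# In training, the pseudo-label loss at timestamp t would be scaled by w_t:
# loss = sum(w_t * cross_entropy(decoder(backbone(x_t)), pseudo_label_t))
```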

[AI-10] Decentralized Time Series Classification with ROCKET Features ECML-PKDD

【Quick Read】: This paper addresses the robustness and confidentiality risks that the central server introduces into federated learning (FL) approaches to time series classification (TSC): the client-server architecture is a single point of failure and lets the server observe knowledge extracted from clients, threatening data privacy and model security. The key is DROCKS, a fully decentralized FL framework for TSC that leverages ROCKET features and trains the global model by sequentially traversing a structured path across federation nodes, each of which refines the model and selects the most effective local kernels before passing them on. Experiments show that DROCKS outperforms state-of-the-art client-server FL approaches while being more resilient to node failures and malicious attacks.

Link: https://arxiv.org/abs/2504.17617
Authors: Bruno Casella, Matthias Jakobs, Marco Aldinucci, Sebastian Buschjäger
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to Workshop on Federated Learning Advancements 2025, in conjunction with ECML-PKDD, WAFL25

Abstract:Time series classification (TSC) is a critical task with applications in various domains, including healthcare, finance, and industrial monitoring. Due to privacy concerns and data regulations, Federated Learning has emerged as a promising approach for learning from distributed time series data without centralizing raw information. However, most FL solutions rely on a client-server architecture, which introduces robustness and confidentiality risks related to the distinguished role of the server, which is a single point of failure and can observe knowledge extracted from clients. To address these challenges, we propose DROCKS, a fully decentralized FL framework for TSC that leverages ROCKET (RandOm Convolutional KErnel Transform) features. In DROCKS, the global model is trained by sequentially traversing a structured path across federation nodes, where each node refines the model and selects the most effective local kernels before passing them to the successor. Extensive experiments on the UCR archive demonstrate that DROCKS outperforms state-of-the-art client-server FL approaches while being more resilient to node failures and malicious attacks. Our code is available at this https URL.
zh
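For intuition, here is a simplified sketch of ROCKET-style features: random convolutional kernels applied to a series, summarized by the max and the proportion of positive values (PPV) per kernel. Dilations and paddings from the full ROCKET transform are omitted, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_kernels(n_kernels=100):
    """Draw random convolutional kernels in the spirit of ROCKET:
    random lengths, mean-centered normal weights, random biases."""
    kernels = []
    for _ in range(n_kernels):
        length = int(rng.choice([7, 9, 11]))
        w = rng.normal(size=length)
        w -= w.mean()
        kernels.append((w, float(rng.normal())))   # (weights, bias)
    return kernels

def transform(series, kernels):
    """Two features per kernel: max and proportion of positive values (PPV)
    of the convolution output."""
    feats = []
    for w, b in kernels:
        conv = np.convolve(series, w, mode="valid") + b
        feats.extend([conv.max(), (conv > 0).mean()])
    return np.array(feats)

X = rng.normal(size=(5, 150))                       # 5 toy series of length 150
ks = random_kernels()
features = np.stack([transform(x, ks) for x in X])  # input to a linear classifier
print(features.shape)                               # (5, 200)
```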

[AI-11] Auditing the Ethical Logic of Generative AI Models

【Quick Read】: This paper addresses the lack of methods for evaluating the ethical reasoning of generative AI models deployed in high-stakes domains. It proposes a five-dimensional audit model (Analytic Quality, Breadth of Ethical Considerations, Depth of Explanation, Consistency, and Decisiveness) to evaluate the ethical logic of leading large language models (LLMs). The key is a multi-battery prompt approach, including novel ethical dilemmas, that probes reasoning across diverse contexts; Chain-of-Thought prompting and reasoning-optimized models significantly improve performance on the audit metrics. The study also highlights AI's potential to complement human moral reasoning in complex decision-making contexts.

Link: https://arxiv.org/abs/2504.17544
Authors: W. Russell Neuman, Chad Coleman, Ali Dasdan, Safinah Ali, Manan Shah
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:As generative AI models become increasingly integrated into high-stakes domains, the need for robust methods to evaluate their ethical reasoning becomes increasingly important. This paper introduces a five-dimensional audit model – assessing Analytic Quality, Breadth of Ethical Considerations, Depth of Explanation, Consistency, and Decisiveness – to evaluate the ethical logic of leading large language models (LLMs). Drawing on traditions from applied ethics and higher-order thinking, we present a multi-battery prompt approach, including novel ethical dilemmas, to probe the models’ reasoning across diverse contexts. We benchmark seven major LLMs finding that while models generally converge on ethical decisions, they vary in explanatory rigor and moral prioritization. Chain-of-Thought prompting and reasoning-optimized models significantly enhance performance on our audit metrics. This study introduces a scalable methodology for ethical benchmarking of AI systems and highlights the potential for AI to complement human moral reasoning in complex decision-making contexts.
zh

[AI-12] Proof of Useful Intelligence (PoUI): Blockchain Consensus Beyond Energy Waste

【Quick Read】: This paper addresses the scalability and sustainability limits that blockchains face when supporting resource-hungry applications such as AI, given that existing consensus mechanisms (Proof of Work and Proof of Stake) trade off security against efficiency. The key is a novel hybrid consensus mechanism, Proof of Useful Intelligence (PoUI), which folds AI tasks (e.g., language processing or image analysis) into consensus: participants earn coins by completing useful computation and stake them to secure the network, blending security with practical utility while avoiding the resource waste and centralization risks of traditional mechanisms.

Link: https://arxiv.org/abs/2504.17539
Authors: Zan-Kai Chong, Hiroyuki Ohsaki, Bryan Ng
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Blockchain technology enables secure, transparent data management in decentralized systems, supporting applications from cryptocurrencies like Bitcoin to tokenizing real-world assets like property. Its scalability and sustainability hinge on consensus mechanisms balancing security and efficiency. Proof of Work (PoW), used by Bitcoin, ensures security through energy-intensive computations but demands significant resources. Proof of Stake (PoS), as in Ethereum post-Merge, selects validators based on staked cryptocurrency, offering energy efficiency but risking centralization from wealth concentration. With AI models straining computational resources, we propose Proof of Useful Intelligence (PoUI), a hybrid consensus mechanism. In PoUI, workers perform AI tasks like language processing or image analysis to earn coins, which are staked to secure the network, blending security with practical utility. Decentralized nodes (job posters, market coordinators, workers, and validators) collaborate via smart contracts to manage tasks and rewards.
zh

[AI-13] Learning Isometric Embeddings of Road Networks using Multidimensional Scaling

【Quick Read】: This paper addresses the limited generalization of learning-based autonomous driving, reflected in the narrow range of road scenarios that vehicles can currently cover. The key is to leverage multidimensional scaling (MDS) on graph representations of road networks in order to construct feature spaces that encapsulate diverse road structures and topologies. The paper analyzes the suitability of state-of-the-art graph representations and MDS techniques for the autonomous driving use case, and discusses embedding graph nodes to ease learning procedures and achieve dimensionality reduction.

Link: https://arxiv.org/abs/2504.17534
Authors: Juan Carlos Climent Pardo
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Symbolic Computation (cs.SC)
Comments:

Abstract:The lack of generalization in learning-based autonomous driving applications is shown by the narrow range of road scenarios that vehicles can currently cover. A generalizable approach should capture many distinct road structures and topologies, as well as consider traffic participants, and dynamic changes in the environment, so that vehicles can navigate and perform motion planning tasks even in the most difficult situations. Designing suitable feature spaces for neural network-based motion planners that encapsulate all kinds of road scenarios is still an open research challenge. This paper tackles this learning-based generalization challenge and shows how graph representations of road networks can be leveraged by using multidimensional scaling (MDS) techniques in order to obtain such feature spaces. State-of-the-art graph representations and MDS approaches are analyzed for the autonomous driving use case. Finally, the option of embedding graph nodes is discussed in order to perform easier learning procedures and obtain dimensionality reduction.
zh
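A minimal sketch of the core idea, assuming shortest-path distances on a toy road graph as the dissimilarity matrix fed to classical MDS. The graph and its weights are invented for the example; the paper's specific graph representations are not reproduced here.

```python
import networkx as nx
import numpy as np
from sklearn.manifold import MDS

# Toy road network: intersections as nodes, roads as weighted edges (lengths).
G = nx.Graph()
G.add_weighted_edges_from([
    (0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.5), (3, 0, 2.5), (1, 3, 1.0),
])

# Pairwise shortest-path distances define the dissimilarity matrix.
nodes = sorted(G.nodes)
D = np.array([[nx.shortest_path_length(G, u, v, weight="weight")
               for v in nodes] for u in nodes])

# MDS finds low-dimensional coordinates that approximately preserve D,
# i.e., an approximately isometric embedding of the road graph.
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(D)
print(embedding)   # one 2-D coordinate per intersection
```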

[AI-14] Towards Machine-Generated Code for the Resolution of User Intentions

【Quick Read】: This paper explores the feasibility of resolving user intentions by prompting large language models (LLMs) to generate and execute workflow code. Instead of users working through a set of high-level applications, the key is to have an LLM generate workflow code corresponding to a user intention, where workflows comprise multiple interdependent steps, yielding hybrid human-AI collaboration: the human defines the intention, and the AI generates the solution and executes the task. Using concrete intentions (e.g., "Please send my car title to my insurance company") and a simplified API for a GUI-less operating system, the paper provides in-depth analysis and comparison of the code generated for various intentions and of its execution, demonstrating the general feasibility of the approach and showing that GPT-4o-mini is remarkably proficient at generating code-oriented workflows.

Link: https://arxiv.org/abs/2504.17531
Authors: Justus Flerlage, Ilja Behnke, Odej Kao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The growing capabilities of Artificial Intelligence (AI), particularly Large Language Models (LLMs), prompt a reassessment of the interaction mechanisms between users and their devices. Currently, users are required to use a set of high-level applications to achieve their desired results. However, the advent of AI may signal a shift in this regard, as its capabilities have generated novel prospects for user-provided intent resolution through the deployment of model-generated code, which is tantamount to the generation of workflows comprising a multitude of interdependent steps. This development represents a significant progression in the realm of hybrid workflows, where human and artificial intelligence collaborate to address user intentions, with the former responsible for defining these intentions and the latter for implementing the solutions to address them. In this paper, we investigate the feasibility of generating and executing workflows through code generation that results from prompting an LLM with a concrete user intention, such as "Please send my car title to my insurance company", and a simplified application programming interface for a GUI-less operating system. We provide in-depth analysis and comparison of various user intentions, the resulting code, and its execution. The findings demonstrate a general feasibility of our approach and that the employed LLM, GPT-4o-mini, exhibits remarkable proficiency in the generation of code-oriented workflows in accordance with provided user intentions.
zh

[AI-15] TACO: Tackling Over-correction in Federated Learning with Tailored Adaptive Correction

【Quick Read】: This paper addresses an over-correction phenomenon in federated learning (FL) under non-IID data. Existing methods apply uniform model-correction coefficients across clients to mitigate statistical heterogeneity, which can over-correct, degrading model performance or even breaking convergence. The key of the proposed TACO algorithm is fine-grained, client-specific gradient correction and model aggregation that steers local models toward a more accurate global optimum. TACO further uses a lightweight correction and tailored aggregation scheme that adds minimal computation overhead, motivated by the observation that leading FL algorithms tend to win on communication rounds rather than wall-clock time because of the extra computation they impose on clients. Experiments across multiple datasets confirm TACO's superior and stable performance.

Link: https://arxiv.org/abs/2504.17528
Authors: Weijie Liu, Ziwei Zhan, Carlee Joe-Wong, Edith Ngai, Jingpu Duan, Deke Guo, Xu Chen, Xiaoxi Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 7 figures, accepted by ICDCS 2025

Abstract:Non-independent and identically distributed (Non-IID) data across edge clients have long posed significant challenges to federated learning (FL) training in edge computing environments. Prior works have proposed various methods to mitigate this statistical heterogeneity. While these works can achieve good theoretical performance, in this work we provide the first investigation into a hidden over-correction phenomenon brought by the uniform model correction coefficients across clients adopted by existing methods. Such over-correction could degrade model performance and even cause failures in model convergence. To address this, we propose TACO, a novel algorithm that addresses the non-IID nature of clients’ data by implementing fine-grained, client-specific gradient correction and model aggregation, steering local models towards a more accurate global optimum. Moreover, we verify that leading FL algorithms generally have better model accuracy in terms of communication rounds rather than wall-clock time, resulting from their extra computation overhead imposed on clients. To enhance the training efficiency, TACO deploys a lightweight model correction and tailored aggregation approach that requires minimum computation overhead and no extra information beyond the synchronized model parameters. To validate TACO’s effectiveness, we present the first FL convergence analysis that reveals the root cause of over-correction. Extensive experiments across various datasets confirm TACO’s superior and stable performance in practice.
zh
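To make "client-specific correction" concrete, here is a minimal aggregation sketch in which each client's update is scaled by its own coefficient instead of a uniform one. The drift-based rule below is an illustrative assumption, not the paper's exact formula.

```python
import numpy as np

def aggregate(client_updates, data_sizes, drifts):
    """Client-specific correction sketch: scale each client's update by its
    own coefficient (here, inversely proportional to its estimated drift),
    then combine with data-size weights."""
    coeffs = 1.0 / (1.0 + np.asarray(drifts))        # per-client correction
    weights = np.asarray(data_sizes) * coeffs
    weights = weights / weights.sum()
    return sum(w * u for w, u in zip(weights, client_updates))

updates = [np.random.randn(4) for _ in range(3)]     # toy model deltas
print(aggregate(updates, data_sizes=[100, 50, 200], drifts=[0.1, 0.8, 0.3]))
```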

[AI-16] Combining GCN Structural Learning with LLM Chemical Knowledge for Enhanced Virtual Screening

【Quick Read】: This paper addresses the information loss and potential bias of traditional machine learning methods for molecular virtual screening, which rely on predefined molecular representations. The key is a hybrid architecture that integrates Graph Convolutional Networks (GCNs) with embeddings derived from large language models (LLMs): concatenating the LLM embeddings after every GCN layer, rather than only at the final layer, integrates global chemical knowledge more deeply throughout the network and significantly improves performance. The model reaches an F1-score of 88.8%, beating standalone GCN (87.9%), XGBoost (85.5%), and SVM (85.4%). Precomputing the LLM embeddings and storing them in a molecular feature library keeps the approach computationally efficient by avoiding repeated LLM runs during training or inference.

Link: https://arxiv.org/abs/2504.17497
Authors: Radia Berreziga, Mohammed Brahimi, Khairedine Kraim, Hamid Azzoune
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Virtual screening plays a critical role in modern drug discovery by enabling the identification of promising candidate molecules for experimental validation. Traditional machine learning methods such as support vector machines (SVM) and XGBoost rely on predefined molecular representations, often leading to information loss and potential bias. In contrast, deep learning approaches-particularly Graph Convolutional Networks (GCNs)-offer a more expressive and unbiased alternative by operating directly on molecular graphs. Meanwhile, Large Language Models (LLMs) have recently demonstrated state-of-the-art performance in drug design, thanks to their capacity to capture complex chemical patterns from large-scale data via attention mechanisms. In this paper, we propose a hybrid architecture that integrates GCNs with LLM-derived embeddings to combine localized structural learning with global chemical knowledge. The LLM embeddings can be precomputed and stored in a molecular feature library, removing the need to rerun the LLM during training or inference and thus maintaining computational efficiency. We found that concatenating the LLM embeddings after each GCN layer-rather than only at the final layer-significantly improves performance, enabling deeper integration of global context throughout the network. The resulting model achieves superior results, with an F1-score of 88.8%, outperforming standalone GCN (87.9%), XGBoost (85.5%), and SVM (85.4%) baselines.
zh
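A minimal PyTorch-style sketch of the per-layer concatenation, assuming a dense normalized adjacency matrix and a precomputed molecule-level LLM embedding; dimensions and the readout are invented for illustration and do not reproduce the paper's architecture.

```python
import torch
import torch.nn as nn

class GCNWithLLM(nn.Module):
    """Each layer: a graph convolution over atom features, then concatenation
    of a precomputed molecule-level LLM embedding to every node, so global
    chemical context is injected after every layer, not only the last."""
    def __init__(self, in_dim, hid_dim, llm_dim, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        d = in_dim
        for _ in range(n_layers):
            self.layers.append(nn.Linear(d, hid_dim))
            d = hid_dim + llm_dim                    # widened by the concat
        self.readout = nn.Linear(d, 1)

    def forward(self, x, adj, llm_emb):
        # x: (n_atoms, in_dim); adj: normalized (n_atoms, n_atoms); llm_emb: (llm_dim,)
        for layer in self.layers:
            x = torch.relu(layer(adj @ x))           # simple dense GCN step
            x = torch.cat([x, llm_emb.expand(x.size(0), -1)], dim=-1)
        return torch.sigmoid(self.readout(x.mean(dim=0)))  # mean-pool, predict activity

model = GCNWithLLM(in_dim=16, hid_dim=32, llm_dim=8)
print(model(torch.randn(10, 16), torch.eye(10), torch.randn(8)))
```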

[AI-17] Goal-Oriented Time-Series Forecasting: Foundation Framework Design

【Quick Read】: This paper addresses the mismatch between traditional time-series forecasting, which only minimizes prediction error, and the specific requirements of the real-world applications that consume the forecasts. The key is a new training methodology that lets a forecasting model dynamically adjust its focus according to the importance of the forecast ranges specified by the end application. Unlike prior methods that fix these ranges beforehand, the training breaks predictions over the entire signal range into smaller segments, which are then dynamically weighted and combined to produce accurate forecasts. This improves not only prediction accuracy but also the performance of the application employing the forecasting model.

Link: https://arxiv.org/abs/2504.17493
Authors: Luca-Andrei Fechete, Mohamed Sana, Fadhel Ayed, Nicola Piovesan, Wenjie Li, Antonio De Domenico, Tareq Si Salem
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traditional time-series forecasting often focuses only on minimizing prediction errors, ignoring the specific requirements of real-world applications that employ them. This paper presents a new training methodology, which allows a forecasting model to dynamically adjust its focus based on the importance of forecast ranges specified by the end application. Unlike previous methods that fix these ranges beforehand, our training approach breaks down predictions over the entire signal range into smaller segments, which are then dynamically weighted and combined to produce accurate forecasts. We tested our method on standard datasets, including a new dataset from wireless communication, and found that it improves not only prediction accuracy but also the performance of the end application employing the forecasting model. This research provides a basis for creating forecasting systems that better connect prediction and decision-making in various practical applications.
zh
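A segment-weighted loss captures the gist of the training idea. The sketch below is an assumption-laden illustration (uniform chunking, application-supplied weights), not the paper's exact objective.

```python
import torch

def segment_weighted_mse(pred, target, n_segments=4, seg_weights=None):
    """Split the forecast horizon into segments, compute MSE per segment,
    and combine with application-specified weights. Uniform weights
    recover plain MSE."""
    pred_seg = pred.chunk(n_segments, dim=-1)
    targ_seg = target.chunk(n_segments, dim=-1)
    if seg_weights is None:
        seg_weights = torch.full((n_segments,), 1.0 / n_segments)
    losses = torch.stack([((p - t) ** 2).mean() for p, t in zip(pred_seg, targ_seg)])
    return (seg_weights * losses).sum()

pred, target = torch.randn(8, 24), torch.randn(8, 24)   # batch of 24-step forecasts
w = torch.tensor([0.1, 0.2, 0.3, 0.4])                  # emphasize later ranges
print(segment_weighted_mse(pred, target, seg_weights=w))
```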

[AI-18] Plasticine: Accelerating Research in Plasticity-Motivated Deep Reinforcement Learning

【Quick Read】: This paper addresses plasticity loss in deep reinforcement learning (RL), where neural networks gradually lose their ability to adapt during training, together with the field's lack of unified benchmarks and evaluation protocols. The key is Plasticine, the first open-source framework for benchmarking plasticity optimization in deep RL. Plasticine provides single-file implementations of over 13 mitigation methods, 10 evaluation metrics, and learning scenarios with increasing non-stationarity from standard to open-ended environments, enabling researchers to systematically quantify plasticity loss, evaluate mitigation strategies, and analyze plasticity dynamics across contexts.

Link: https://arxiv.org/abs/2504.17490
Authors: Mingqi Yuan, Qi Wang, Guozheng Ma, Bo Li, Xin Jin, Yunbo Wang, Xiaokang Yang, Wenjun Zeng, Dacheng Tao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages

Abstract:Developing lifelong learning agents is crucial for artificial general intelligence. However, deep reinforcement learning (RL) systems often suffer from plasticity loss, where neural networks gradually lose their ability to adapt during training. Despite its significance, this field lacks unified benchmarks and evaluation protocols. We introduce Plasticine, the first open-source framework for benchmarking plasticity optimization in deep RL. Plasticine provides single-file implementations of over 13 mitigation methods, 10 evaluation metrics, and learning scenarios with increasing non-stationarity levels from standard to open-ended environments. This framework enables researchers to systematically quantify plasticity loss, evaluate mitigation strategies, and analyze plasticity dynamics across different contexts. Our documentation, examples, and source code are available at this https URL.
zh

[AI-19] GRANITE: a Byzantine-Resilient Dynamic Gossip Learning Framework

【Quick Read】: This paper addresses the robustness of Gossip Learning (GL) over dynamic graphs to Byzantine (model-poisoning) attacks, especially when Byzantine nodes attack the Random Peer Sampling (RPS) protocol to scale up the poisoning. The key of the proposed GRANITE framework is two components: (i) a History-aware Byzantine-resilient Peer Sampling protocol (HaPS), which tracks previously encountered identifiers to reduce adversarial influence over time, and (ii) an Adaptive Probabilistic Threshold (APT), which uses an estimate of the fraction of Byzantine nodes to set aggregation thresholds with formal guarantees. These mechanisms let GRANITE maintain convergence with up to 30% Byzantine nodes and learn faster in graphs up to 9 times sparser than current theory dictates.

Link: https://arxiv.org/abs/2504.17471
Authors: Yacine Belal, Mohamed Maouche, Sonia Ben Mokhtar, Anthony Simonet-Boulogne
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent GL approaches rely on dynamic communication graphs built and maintained using Random Peer Sampling (RPS) protocols. Thanks to graph dynamics, GL can achieve fast convergence even over extremely sparse topologies. However, the robustness of GL over dynamic graphs to Byzantine (model poisoning) attacks remains unaddressed, especially when Byzantine nodes attack the RPS protocol to scale up model poisoning. We address this issue by introducing GRANITE, a framework for robust learning over sparse, dynamic graphs in the presence of a fraction of Byzantine nodes. GRANITE relies on two key components (i) a History-aware Byzantine-resilient Peer Sampling protocol (HaPS), which tracks previously encountered identifiers to reduce adversarial influence over time, and (ii) an Adaptive Probabilistic Threshold (APT), which leverages an estimate of Byzantine presence to set aggregation thresholds with formal guarantees. Empirical results confirm that GRANITE maintains convergence with up to 30% Byzantine nodes, improves learning speed via adaptive filtering of poisoned models and obtains these results in up to 9 times sparser graphs than dictated by current theory.
zh
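The threshold idea can be sketched as a distance-based filter whose budget comes from the estimated Byzantine fraction. The distance-to-median rule below is an illustrative stand-in for the paper's formal APT rule.

```python
import numpy as np

def apt_aggregate(updates, byz_estimate):
    """Adaptive-threshold sketch: keep only the (1 - byz_estimate) fraction
    of neighbor models closest to the coordinate-wise median, then average."""
    U = np.stack(updates)
    n = len(updates)
    keep = n - int(np.ceil(byz_estimate * n))        # threshold from the estimate
    med = np.median(U, axis=0)
    dists = np.linalg.norm(U - med, axis=1)
    chosen = U[np.argsort(dists)[:keep]]             # discard the most deviant models
    return chosen.mean(axis=0)

honest = [np.random.normal(0, 0.1, size=5) for _ in range(7)]
poisoned = [np.full(5, 10.0) for _ in range(3)]      # 30% Byzantine neighbors
print(apt_aggregate(honest + poisoned, byz_estimate=0.3))
```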

[AI-20] Evaluating Time Series Models for Urban Wastewater Management: Predictive Performance, Model Complexity and Resilience

【Quick Read】: This paper addresses the strain that climate-driven increases in extreme rainfall place on urban infrastructure, particularly Combined Sewer Systems (CSS): overflows release untreated wastewater into surface waters. Traditional physics-based models are effective but costly to maintain and hard to adapt to evolving system dynamics. The key is a protocol for evaluating neural network architectures for CSS time-series forecasting with respect to predictive performance, model complexity, and robustness to perturbations, with special attention to peak events and critical fluctuations. The paper also examines the feasibility of lightweight models for IoT deployment by comparing global models (with access to all information) against local models (relying only on nearby sensor readings) in decentralized scenarios, and introduces error models to probe resilience against network outages or adversarial attacks. Balancing the predictive performance of global models with the distributed resilience of local models supports interpretable and reliable ML for sustainable urban wastewater management.

Link: https://arxiv.org/abs/2504.17461
Authors: Vipin Singh, Tianheng Ling, Teodor Chiaburu, Felix Biessmann
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures, accepted at 10th International Conference on Smart and Sustainable Technologies (SpliTech) 2025, GitHub: this https URL

Abstract:Climate change increases the frequency of extreme rainfall, placing a significant strain on urban infrastructures, especially Combined Sewer Systems (CSS). Overflows from overburdened CSS release untreated wastewater into surface waters, posing environmental and public health risks. Although traditional physics-based models are effective, they are costly to maintain and difficult to adapt to evolving system dynamics. Machine Learning (ML) approaches offer cost-efficient alternatives with greater adaptability. To systematically assess the potential of ML for modeling urban infrastructure systems, we propose a protocol for evaluating Neural Network architectures for CSS time series forecasting with respect to predictive performance, model complexity, and robustness to perturbations. In addition, we assess model performance on peak events and critical fluctuations, as these are the key regimes for urban wastewater management. To investigate the feasibility of lightweight models suitable for IoT deployment, we compare global models, which have access to all information, with local models, which rely solely on nearby sensor readings. Additionally, to explore the security risks posed by network outages or adversarial attacks on urban infrastructure, we introduce error models that assess the resilience of models. Our results demonstrate that while global models achieve higher predictive performance, local models provide sufficient resilience in decentralized scenarios, ensuring robust modeling of urban infrastructure. Furthermore, models with longer native forecast horizons exhibit greater robustness to data perturbations. These findings contribute to the development of interpretable and reliable ML solutions for sustainable urban wastewater management. The implementation is available in our GitHub repository.
zh

[AI-21] Detection, Classification and Prevalence of Self-Admitted Aging Debt

【Quick Read】: This paper addresses two gaps in software aging research: the focus on dynamic runtime indicators (such as memory and performance) at the expense of evolutionary indicators such as source code comments, and the narrow view of legacy issues within the technical debt (TD) context. It introduces the concept of Aging Debt (AD), the additional maintenance effort and cost required to keep software up to date, and studies it through Self-Admitted Aging Debt (SAAD): signs of aging observed in the source code comments left by developers. The key is a mixed-methods approach that combines qualitative and quantitative analysis to detect and measure AD, building a SAAD taxonomy that reflects the temporal aging of software and its associated debt, and then using it to quantify the AD types prevalent in open-source repositories. Over 21% of more than 9,000 OSS repositories show signs of SAAD, with Dormant AD as the predominant category, highlighting a critical but often overlooked aspect of software maintenance.

Link: https://arxiv.org/abs/2504.17428
Authors: Murali Sridharan, Mika Mäntylä, Leevi Rantala
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); General Literature (cs.GL)
Comments: Draft

Abstract:Context: Previous research on software aging is limited with focus on dynamic runtime indicators like memory and performance, often neglecting evolutionary indicators like source code comments and narrowly examining legacy issues within the TD context. Objective: We introduce the concept of Aging Debt (AD), representing the increased maintenance efforts and costs needed to keep software updated. We study AD through Self-Admitted Aging Debt (SAAD) observed in source code comments left by software developers. Method: We employ a mixed-methods approach, combining qualitative and quantitative analyses to detect and measure AD in software. This includes framing SAAD patterns from the source code comments after analysing the source code context, then utilizing the SAAD patterns to detect SAAD comments. In the process, we develop a taxonomy for SAAD that reflects the temporal aging of software and its associated debt. Then we utilize the taxonomy to quantify the different types of AD prevalent in OSS repositories. Results: Our proposed taxonomy categorizes temporal software aging into Active and Dormant types. Our extensive analysis of over 9,000+ Open Source Software (OSS) repositories reveals that more than 21% repositories exhibit signs of SAAD as observed from our gold standard SAAD dataset. Notably, Dormant AD emerges as the predominant category, highlighting a critical but often overlooked aspect of software maintenance. Conclusion: As software volume grows annually, so do evolutionary aging and maintenance challenges; our proposed taxonomy can aid researchers in detailed software aging studies and help practitioners develop improved and proactive maintenance strategies.
zh

[AI-22] Towards Leveraging Large Language Model Summaries for Topic Modeling in Source Code

【Quick Read】: This paper addresses the automatic identification of meaningful topics in codebases. The key is to combine the strengths of large language models (LLMs) and transformer-based topic modeling: topic modeling is applied to LLM-generated summaries of Python programs to extract semantic information. To assess the internal consistency of the extracted topics, they are compared against topics inferred from function names alone and topics derived from existing docstrings. Results suggest that LLM-generated summaries provide an interpretable and semantically rich representation of code structure, which is promising for software engineering tasks such as automatic documentation and tagging, code search, software reorganization, and knowledge discovery in large repositories.

Link: https://arxiv.org/abs/2504.17426
Authors: Michele Carissimi, Martina Saletta, Claudio Ferretti
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Understanding source code is a topic of great interest in the software engineering community, since it can help programmers in various tasks such as software maintenance and reuse. Recent advances in large language models (LLMs) have demonstrated remarkable program comprehension capabilities, while transformer-based topic modeling techniques offer effective ways to extract semantic information from text. This paper proposes and explores a novel approach that combines these strengths to automatically identify meaningful topics in a corpus of Python programs. Our method consists in applying topic modeling on the descriptions obtained by asking an LLM to summarize the code. To assess the internal consistency of the extracted topics, we compare them against topics inferred from function names alone, and those derived from existing docstrings. Experimental results suggest that leveraging LLM-generated summaries provides interpretable and semantically rich representation of code structure. The promising results suggest that our approach can be fruitfully applied in various software engineering tasks such as automatic documentation and tagging, code search, software reorganization and knowledge discovery in large repositories.
zh

[AI-23] Object Pose Estimation by Camera Arm Control Based on the Next Viewpoint Estimation

【Quick Read】: This paper addresses the degraded pose-estimation accuracy for simple-shaped products faced by product display robots in retail stores. Existing neural network (NN)-based pose estimation with an RGBD camera is highly accurate, but accuracy drops significantly when the current viewpoint offers few texture and shape features, and traditional mathematical model-based methods struggle to estimate an effective Next Viewpoint (NV) because simple-shaped objects have few shape features. The key innovation is a new pose-estimation neural network that estimates the NV simultaneously, exploiting the mutual relationship between pose estimation and NV estimation: when one is more accurate, so is the other. Experiments show that the proposed NV estimation raises the pose-estimation success rate by 7.4 points to 77.3%, and a robot using the method could display 84.2% of products.

Link: https://arxiv.org/abs/2504.17424
Authors: Tomoki Mizuno, Kazuya Yabashi, Tsuyoshi Tasaki
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:We have developed a new method to estimate a Next Viewpoint (NV) which is effective for pose estimation of simple-shaped products for product display robots in retail stores. Pose estimation methods using Neural Networks (NN) based on an RGBD camera are highly accurate, but their accuracy significantly decreases when the camera acquires few texture and shape features at a current view point. However, it is difficult for previous mathematical model-based methods to estimate effective NV which is because the simple shaped objects have few shape features. Therefore, we focus on the relationship between the pose estimation and NV estimation. When the pose estimation is more accurate, the NV estimation is more accurate. Therefore, we develop a new pose estimation NN that estimates NV simultaneously. Experimental results showed that our NV estimation realized a pose estimation success rate 77.3%, which was 7.4pt higher than the mathematical model-based NV calculation did. Moreover, we verified that the robot using our method displayed 84.2% of products.
zh

[AI-24] Towards Harnessing the Collaborative Power of Large and Small Models for Domain Tasks

【Quick Read】: This position paper addresses the heavy resource demands and practical limits of adapting large language models (LLMs) to private domains. It argues for a collaborative approach in which large and small models work synergistically, accelerating the adaptation of LLMs to private domains and unlocking new potential in AI. The key is exploring a range of strategies for model collaboration, identifying the associated challenges and opportunities, and advocating industry-driven research that prioritizes multi-objective benchmarks on real-world private datasets and applications.

Link: https://arxiv.org/abs/2504.17421
Authors: Yang Liu, Bingjie Yan, Tianyuan Zou, Jianqing Zhang, Zixuan Gu, Jianbing Ding, Xidong Wang, Jingyi Li, Xiaozhou Ye, Ye Ouyang, Qiang Yang, Ya-Qin Zhang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities, but they require vast amounts of data and computational resources. In contrast, smaller models (SMs), while less powerful, can be more efficient and tailored to specific domains. In this position paper, we argue that taking a collaborative approach, where large and small models work synergistically, can accelerate the adaptation of LLMs to private domains and unlock new potential in AI. We explore various strategies for model collaboration and identify potential challenges and opportunities. Building upon this, we advocate for industry-driven research that prioritizes multi-objective benchmarks on real-world private datasets and applications.
zh

[AI-25] Redefining Superalignment: From Weak-to-Strong Alignment to Human-AI Co-Alignment to Sustainable Symbiotic Society

【Quick Read】: This paper addresses the risk that Artificial Superintelligence (ASI) could exceed human control, violate human values, and in extreme cases cause irreversible catastrophic consequences, i.e., the problem of superalignment: ensuring that AI systems much smarter than humans remain aligned with human intentions and values. Existing scalable oversight and weak-to-strong generalization methods may prove substantially infeasible and inadequate when facing ASI, so safer and more pluralistic frameworks and approaches must be explored.

The key of the solution is to redefine superalignment as human-AI co-alignment toward a sustainable symbiotic society, via a framework that integrates external oversight with intrinsic proactive alignment. External oversight is grounded in human-centered ultimate decision-making, supplemented by interpretable automated evaluation and correction, to stay continuously aligned with humanity's evolving values. Intrinsic proactive alignment is rooted in a profound understanding of the self, others, and society, integrating self-awareness, self-reflection, and empathy to spontaneously infer human intentions, distinguish good from evil, and proactively consider human well-being, ultimately reaching human-AI co-alignment through iterative interaction. Combining externally driven oversight with intrinsically driven proactive alignment empowers sustainable symbiotic societies and paves the way for safe and beneficial AGI and ASI.

Link: https://arxiv.org/abs/2504.17404
Authors: Feifei Zhao, Yuwei Wang, Enmeng Lu, Dongcheng Zhao, Bing Han, Haibo Tong, Yao Liang, Dongqi Liang, Kang Sun, Lei Wang, Yitao Liang, Chao Liu, Yaodong Yang, Yi Zeng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial Intelligence (AI) systems are becoming increasingly powerful and autonomous, and may progress to surpass human intelligence levels, namely Artificial Superintelligence (ASI). During the progression from AI to ASI, it may exceed human control, violate human values, and even lead to irreversible catastrophic consequences in extreme cases. This gives rise to a pressing issue that needs to be addressed: superalignment, ensuring that AI systems much smarter than humans, remain aligned with human (compatible) intentions and values. Existing scalable oversight and weak-to-strong generalization methods may prove substantially infeasible and inadequate when facing ASI. We must explore safer and more pluralistic frameworks and approaches for superalignment. In this paper, we redefine superalignment as the human-AI co-alignment towards a sustainable symbiotic society, and highlight a framework that integrates external oversight and intrinsic proactive alignment. External oversight superalignment should be grounded in human-centered ultimate decision, supplemented by interpretable automated evaluation and correction, to achieve continuous alignment with humanity’s evolving values. Intrinsic proactive superalignment is rooted in a profound understanding of the self, others, and society, integrating self-awareness, self-reflection, and empathy to spontaneously infer human intentions, distinguishing good from evil and proactively considering human well-being, ultimately attaining human-AI co-alignment through iterative interaction. The integration of externally-driven oversight with intrinsically-driven proactive alignment empowers sustainable symbiotic societies through human-AI co-alignment, paving the way for achieving safe and beneficial AGI and ASI for good, for human, and for a symbiotic ecology.
zh

[AI-26] Assessing the Capability of Large Language Models for Domain-Specific Ontology Generation

【Quick Read】: This paper evaluates the applicability and performance of large language models (LLMs) for domain-specific ontology generation and explores their cross-domain generalizability. The key is using two state-of-the-art reasoning-capable LLMs (DeepSeek and o1-preview) to automatically generate ontologies from a curated set of competency questions (CQs) and related user stories, with experiments spanning six distinct domains drawn from existing ontology engineering projects and 95 curated CQs. Both models perform remarkably consistently across all domains, indicating that LLM-based methods can generalize ontology generation regardless of domain. These results highlight the potential of such approaches for scalable, domain-agnostic ontology construction and lay the groundwork for further research into automated reasoning and knowledge representation.

Link: https://arxiv.org/abs/2504.17402
Authors: Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisarkka, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have shown significant potential for ontology engineering. However, it is still unclear to what extent they are applicable to the task of domain-specific ontology generation. In this study, we explore the application of LLMs for automated ontology generation and evaluate their performance across different domains. Specifically, we investigate the generalizability of two state-of-the-art LLMs, DeepSeek and o1-preview, both equipped with reasoning capabilities, by generating ontologies from a set of competency questions (CQs) and related user stories. Our experimental setup comprises six distinct domains carried out in existing ontology engineering projects and a total of 95 curated CQs designed to test the models’ reasoning for ontology engineering. Our findings show that with both LLMs, the performance of the experiments is remarkably consistent across all domains, indicating that these methods are capable of generalizing ontology generation tasks irrespective of the domain. These results highlight the potential of LLM-based approaches in achieving scalable and domain-agnostic ontology construction and lay the groundwork for further research into enhancing automated reasoning and knowledge representation techniques.
zh

[AI-27] Towards User-Centred Design of AI-Assisted Decision-Making in Law Enforcement

【Quick Read】: This paper addresses the unclear user requirements for designing AI-assisted systems in law enforcement. Through qualitative research within a law enforcement agency, it identifies the limitations of existing practices, explores user requirements, and examines the responsibilities humans expect to retain in these systems. The key need is a system that processes and analyzes large volumes of data efficiently to support crime detection and prevention while satisfying scalability, accuracy, justification, trustworthiness, and adaptability. End users must review input data that may be challenging for AI to interpret and validate outputs to ensure accuracy, and must help the system adapt to changes in criminal behaviour and government guidance, with technical experts regularly overseeing and monitoring it. User-friendly human interaction is essential for adoption, and the paper argues that full automation is unlikely given the dynamic and complex nature of the domain.

Link: https://arxiv.org/abs/2504.17393
Authors: Vesna Nowack, Dalal Alrajeh, Carolina Gutierrez Muñoz, Katie Thomas, William Hobson, Catherine Hamilton-Giachritsis, Patrick Benjamin, Tim Grant, Juliane A. Kloess, Jessica Woodhams
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 1 figure

Abstract:Artificial Intelligence (AI) has become an important part of our everyday lives, yet user requirements for designing AI-assisted systems in law enforcement remain unclear. To address this gap, we conducted qualitative research on decision-making within a law enforcement agency. Our study aimed to identify limitations of existing practices, explore user requirements and understand the responsibilities that humans expect to undertake in these systems. Participants in our study highlighted the need for a system capable of processing and analysing large volumes of data efficiently to help in crime detection and prevention. Additionally, the system should satisfy requirements for scalability, accuracy, justification, trustworthiness and adaptability to be adopted in this domain. Participants also emphasised the importance of having end users review the input data that might be challenging for AI to interpret, and validate the generated output to ensure the system’s accuracy. To keep up with the evolving nature of the law enforcement domain, end users need to help the system adapt to the changes in criminal behaviour and government guidance, and technical experts need to regularly oversee and monitor the system. Furthermore, user-friendly human interaction with the system is essential for its adoption and some of the participants confirmed they would be happy to be in the loop and provide necessary feedback that the system can learn from. Finally, we argue that it is very unlikely that the system will ever achieve full automation due to the dynamic and complex nature of the law enforcement domain.
zh

[AI-28] Comprehend, Divide and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning

【Quick Read】: This paper addresses the inefficiency and performance bottlenecks of existing feature selection methods on complex datasets, particularly the one-feature-one-agent paradigm of reinforcement learning approaches. The key innovation of the proposed HRLFS (Hierarchical Reinforcement Learning for Feature Selection) is an LLM-based hybrid state extractor that captures each feature's mathematical and semantic characteristics; features are then clustered accordingly, and hierarchical agents are built for each cluster and sub-cluster. This improves the efficiency of feature-subspace exploration, sharply reduces the number of agents involved, accelerates overall runtime, and improves downstream machine learning performance.

Link: https://arxiv.org/abs/2504.17356
Authors: Weiliang Zhang, Xiaohan Huang, Yi Du, Ziyue Qiao, Qingqing Long, Zhen Meng, Yuanchun Zhou, Meng Xiao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, keywords: Automated Feature Engineering, Tabular Dataset, Multi-Agent Reinforcement Learning, Feature Selection

Abstract:Feature selection aims to preprocess the target dataset, find an optimal and most streamlined feature subset, and enhance the downstream machine learning task. Among filter, wrapper, and embedded-based approaches, the reinforcement learning (RL)-based subspace exploration strategy provides a novel objective optimization-directed perspective and promising performance. Nevertheless, even with improved performance, current reinforcement learning approaches face challenges similar to conventional methods when dealing with complex datasets. These challenges stem from the inefficient paradigm of using one agent per feature and the inherent complexities present in the datasets. This observation motivates us to investigate and address the above issue and propose a novel approach, namely HRLFS. Our methodology initially employs a Large Language Model (LLM)-based hybrid state extractor to capture each feature’s mathematical and semantic characteristics. Based on this information, features are clustered, facilitating the construction of hierarchical agents for each cluster and sub-cluster. Extensive experiments demonstrate the efficiency, scalability, and robustness of our approach. Compared to contemporary or the one-feature-one-agent RL-based approaches, HRLFS improves the downstream ML performance with iterative feature subspace exploration while accelerating total run time by reducing the number of agents involved.
zh

[AI-29] Collaborative Multi-Agent Reinforcement Learning for Automated Feature Transformation with Graph-Driven Path Optimization

【Quick Read】: This paper addresses a limitation of existing feature transformation frameworks: although they reduce manual costs, they treat transformations as isolated operations and ignore the dynamic dependencies between transformation steps. The key innovation of TCTO, a collaborative multi-agent reinforcement learning framework that automates feature engineering through graph-driven path optimization, is an evolving interaction graph that models features as nodes and transformations as edges. Graph pruning and backtracking dynamically remove low-impact edges, reduce redundant operations, and stabilize exploration, while the graph's full traceability lets TCTO reuse high-utility subgraphs from historical transformations. Comprehensive experiments and case studies show superior performance across a range of datasets.

Link: https://arxiv.org/abs/2504.17355
Authors: Xiaohan Huang, Dongjie Wang, Zhiyuan Ning, Ziyue Qiao, Qingqing Long, Haowei Zhu, Yi Du, Min Wu, Yuanchun Zhou, Meng Xiao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, Keywords: Automated Feature Transformation, Tabular Dataset, Reinforcement Learning

Abstract:Feature transformation methods aim to find an optimal mathematical feature-feature crossing process that generates high-value features and improves the performance of downstream machine learning tasks. Existing frameworks, though designed to mitigate manual costs, often treat feature transformations as isolated operations, ignoring dynamic dependencies between transformation steps. To address the limitations, we propose TCTO, a collaborative multi-agent reinforcement learning framework that automates feature engineering through graph-driven path optimization. The framework’s core innovation lies in an evolving interaction graph that models features as nodes and transformations as edges. Through graph pruning and backtracking, it dynamically eliminates low-impact edges, reduces redundant operations, and enhances exploration stability. This graph also provides full traceability to empower TCTO to reuse high-utility subgraphs from historical transformations. To demonstrate the efficacy and adaptability of our approach, we conduct comprehensive experiments and case studies, which show superior performance across a range of datasets.
zh

[AI-30] Data-Driven Surrogate Modeling Techniques to Predict the Effective Contact Area of Rough Surface Contact Problems

【Quick Read】: This paper addresses the efficient prediction of the effective contact area in rough surface contact, a quantity central to multi-physics phenomena such as wear, sealing, and thermal or electrical conduction. Accurate numerical methods such as the Boundary Element Method (BEM) exist, but their high computational cost rules them out for multi-query settings (uncertainty quantification, parameter identification, multi-scale algorithms) that require many repeated evaluations. The key is a data-driven surrogate modeling framework: several machine learning algorithms are trained on a precomputed dataset whose inputs are the imposed load and statistical roughness parameters and whose output is the corresponding effective contact area, with hyperparameter optimization for a fair comparison of accuracy and efficiency. The Kernel Ridge Regressor offers the best trade-off (high predictive accuracy, low prediction time, minimal training overhead), making it a strong candidate for general-purpose surrogate modeling, while the Gaussian Process Regressor is an attractive alternative when uncertainty quantification is needed, at extra cost for variance estimation. Database generation dominates the cost of the pipeline, yet the approach remains practical and efficient for multi-query tasks.

Link: https://arxiv.org/abs/2504.17354
Authors: Tarik Sahin, Jacopo Bonari, Sebastian Brandstaeter, Alexander Popp
Affiliations: Unknown
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments:

Abstract:The effective contact area in rough surface contact plays a critical role in multi-physics phenomena such as wear, sealing, and thermal or electrical conduction. Although accurate numerical methods, like the Boundary Element Method (BEM), are available to compute this quantity, their high computational cost limits their applicability in multi-query contexts, such as uncertainty quantification, parameter identification, and multi-scale algorithms, where many repeated evaluations are required. This study proposes a surrogate modeling framework for predicting the effective contact area using fast-to-evaluate data-driven techniques. Various machine learning algorithms are trained on a precomputed dataset, where the inputs are the imposed load and statistical roughness parameters, and the output is the corresponding effective contact area. All models undergo hyperparameter optimization to enable fair comparisons in terms of predictive accuracy and computational efficiency, evaluated using established quantitative metrics. Among the models, the Kernel Ridge Regressor demonstrates the best trade-off between accuracy and efficiency, achieving high predictive accuracy, low prediction time, and minimal training overhead-making it a strong candidate for general-purpose surrogate modeling. The Gaussian Process Regressor provides an attractive alternative when uncertainty quantification is required, although it incurs additional computational cost due to variance estimation. The generalization capability of the Kernel Ridge model is validated on an unseen simulation scenario, confirming its ability to transfer to new configurations. Database generation constitutes the dominant cost in the surrogate modeling process. Nevertheless, the approach proves practical and efficient for multi-query tasks, even when accounting for this initial expense.
zh
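A minimal sketch of the surrogate idea with scikit-learn's Kernel Ridge Regressor, using a synthetic stand-in for the precomputed BEM database; the input features, target function, and hyperparameter grid are all illustrative assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the precomputed database:
# inputs = (imposed load, two roughness statistics), output = effective contact area.
X = rng.uniform(size=(500, 3))
y = 0.6 * X[:, 0] * (1.0 - 0.3 * X[:, 1]) + 0.05 * X[:, 2] + rng.normal(0, 0.01, 500)

# Hyperparameter optimization, mirroring the paper's fair-comparison protocol.
grid = GridSearchCV(
    KernelRidge(kernel="rbf"),
    {"alpha": [1e-3, 1e-2, 1e-1], "gamma": [0.1, 1.0, 10.0]},
    cv=5, scoring="neg_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_)
print("predicted contact area:", grid.predict([[0.5, 0.2, 0.7]]))  # fast surrogate query
```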

[AI-31] Dual-Individual Genetic Algorithm: A Dual-Individual Approach for Efficient Training of Multi-Layer Neural Networks

【Quick Read】: This paper addresses neural network optimization for binary image classification tasks, such as cat vs. non-cat, where traditional gradient descent requires complex manual tuning of network architectures and can converge inefficiently. The key is an enhanced genetic algorithm, the Dual-Individual Genetic Algorithm (Dual-Individual GA), which uses only two individuals for crossover: a Leader responsible for exploitation, representing the primary optimal solution at even-indexed positions, and a Follower promoting exploration by preserving diversity and avoiding premature convergence at odd-indexed positions. The method additionally provides a self-adaptive layer-dimension mechanism that removes manual architecture tuning, generates Leader and Follower parameter sets ranked by Pareto dominance and cost, and outperforms a gradient-based baseline in the reported experiments.

Link: https://arxiv.org/abs/2504.17346
Authors: Tran Thuy Nga Truong, Jooyong Kim
Affiliations: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces an enhanced Genetic Algorithm technique called Dual-Individual Genetic Algorithm (Dual-Individual GA), which optimizes neural networks for binary image classification tasks, such as cat vs. non-cat classification. The proposed method employs only two individuals for crossover, represented by two parameter sets: Leader and Follower. The Leader focuses on exploitation, representing the primary optimal solution at even-indexed positions (0, 2, 4, …), while the Follower promotes exploration by preserving diversity and avoiding premature convergence, operating at odd-indexed positions (1, 3, 5, …). Leader and Follower are modeled as two phases or roles. The key contributions of this work are threefold: (1) a self-adaptive layer dimension mechanism that eliminates the need for manual tuning of layer architectures; (2) the generation of two parameter sets, Leader and Follower, with 10 layer architecture configurations (5 for each set), ranked by Pareto dominance and cost post-optimization; and (3) superior performance compared to traditional gradient-based methods. Experimental results show that the Dual-Individual GA achieves 99.04% training accuracy and 80% testing accuracy (cost = 0.034) on a three-layer network with architecture [12288, 17, 4, 1], outperforming a gradient-based approach that achieves 98% training accuracy and 80% testing accuracy (cost = 0.092) on a four-layer network with architecture [12288, 20, 7, 5, 1]. These findings highlight the efficiency and effectiveness of the proposed method in optimizing neural networks.
zh
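The even/odd crossover is easy to sketch. The loop below uses a toy loss and an assumed promotion rule (child replaces the Leader when it improves); the paper's actual fitness, mutation, and self-adaptive architecture mechanism are not reproduced.

```python
import random

def dual_crossover(leader, follower):
    """Child takes the Leader's genes at even indices (exploitation) and
    the Follower's genes at odd indices (exploration)."""
    return [leader[i] if i % 2 == 0 else follower[i] for i in range(len(leader))]

def mutate(genes, sigma=0.1):
    return [g + random.gauss(0, sigma) for g in genes]

def loss(genes):                          # toy stand-in for network training cost
    return sum((g - 0.5) ** 2 for g in genes)

leader = [random.random() for _ in range(16)]     # flattened network parameters
follower = [random.random() for _ in range(16)]
for step in range(500):
    child = mutate(dual_crossover(leader, follower))
    if loss(child) < loss(leader):
        leader, follower = child, leader          # promote; old leader keeps diversity
    else:
        follower = mutate(follower, sigma=0.3)    # follower keeps exploring
print("final loss:", round(loss(leader), 4))
```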

[AI-32] Exploring Context-aware and LLM-driven Locomotion for Immersive Virtual Reality

【Quick Read】: This paper addresses the limited naturalness and flexibility of navigation in virtual reality environments that rely on handheld controllers or rigid voice-command sets. The key is a novel hands-free locomotion technique powered by large language models (LLMs) that lets users navigate virtual environments through context-aware natural language. Evaluated against controller-based teleportation and voice-based steering, the LLM-driven method achieves comparable usability, presence, and cybersickness scores while enhancing user attention and engagement within the virtual environment; SHAP analysis of eye-tracking features (fixations, saccades, pupil measures) further reveals distinct patterns of visual attention and cognitive processing across techniques.

Link: https://arxiv.org/abs/2504.17331
Authors: Süleyman Özdel, Kadir Burak Buldu, Enkelejda Kasneci, Efe Bozkir
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Locomotion plays a crucial role in shaping the user experience within virtual reality environments. In particular, hands-free locomotion offers a valuable alternative by supporting accessibility and freeing users from reliance on handheld controllers. To this end, traditional speech-based methods often depend on rigid command sets, limiting the naturalness and flexibility of interaction. In this study, we propose a novel locomotion technique powered by large language models (LLMs), which allows users to navigate virtual environments using natural language with contextual awareness. We evaluate three locomotion methods: controller-based teleportation, voice-based steering, and our language model-driven approach. Our evaluation measures include eye-tracking data analysis, including explainable machine learning through SHAP analysis as well as standardized questionnaires for usability, presence, cybersickness, and cognitive load to examine user attention and engagement. Our findings indicate that the LLM-driven locomotion possesses comparable usability, presence, and cybersickness scores to established methods like teleportation, demonstrating its novel potential as a comfortable, natural language-based, hands-free alternative. In addition, it enhances user attention within the virtual environment, suggesting greater engagement. Complementary to these findings, SHAP analysis revealed that fixation, saccade, and pupil-related features vary across techniques, indicating distinct patterns of visual attention and cognitive processing. Overall, we state that our method can facilitate hands-free locomotion in virtual spaces, especially in supporting accessibility.
zh

[AI-33] You Are What You Bought: Generating Customer Personas for E-commerce Applications SIGIR2025

【Quick Read】: This paper addresses the difficulty of understanding deep-learning-based user representations and integrating them with external knowledge, which limits applications such as customer segmentation, search navigation, and product recommendation. It introduces the customer persona: a multi-faceted, human-readable characterization of purchase behaviors and preferences distilled from a customer's purchasing history, such as "Busy Parents" or "Bargain Hunters", yielding clear and informative explicit user representations. The key solution, GPLR (Generative Persona-based Labeling and Reasoning), uses pre-trained LLMs to infer personas, applying LLM-based labeling to only a fraction of users and a random-walk technique to predict personas for the remaining customers, which cuts computational overhead. The proposed RevAff algorithm guarantees an absolute error \epsilon while improving the time complexity of the exact solution by a factor of at least O\left(\frac{\epsilon \cdot |E|}{N|E| + N \log N}\right), where N is the number of customers and products and E the interactions between them. Experiments on three real-world e-commerce datasets show that persona-based representations improve a state-of-the-art graph-convolution recommendation model by up to 12% in NDCG@K and F1-Score@K.

Link: https://arxiv.org/abs/2504.17304
Authors: Yimin Shi, Yang Fei, Shiqi Zhang, Haixun Wang, Xiaokui Xiao
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: SIGIR 2025

Abstract:In e-commerce, user representations are essential for various applications. Existing methods often use deep learning techniques to convert customer behaviors into implicit embeddings. However, these embeddings are difficult to understand and integrate with external knowledge, limiting the effectiveness of applications such as customer segmentation, search navigation, and product recommendations. To address this, our paper introduces the concept of the customer persona. Condensed from a customer’s numerous purchasing histories, a customer persona provides a multi-faceted and human-readable characterization of specific purchase behaviors and preferences, such as Busy Parents or Bargain Hunters. This work then focuses on representing each customer by multiple personas from a predefined set, achieving readable and informative explicit user representations. To this end, we propose an effective and efficient solution GPLR. To ensure effectiveness, GPLR leverages pre-trained LLMs to infer personas for customers. To reduce overhead, GPLR applies LLM-based labeling to only a fraction of users and utilizes a random walk technique to predict personas for the remaining customers. We further propose RevAff, which provides an absolute error \epsilon guarantee while improving the time complexity of the exact solution by a factor of at least O\left(\frac{\epsilon \cdot |E|}{N|E| + N \log N}\right), where N represents the number of customers and products, and E represents the interactions between them. We evaluate the performance of our persona-based representation in terms of accuracy and robustness for recommendation and customer segmentation tasks using three real-world e-commerce datasets. Most notably, we find that integrating customer persona representations improves the state-of-the-art graph convolution-based recommendation model by up to 12% in terms of NDCG@K and F1-Score@K.
zh
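
为便于理解 GPLR 中“LLM 仅标注少量种子用户、再以随机游走传播 persona”的核心思路,下面给出一个最小化的 Python 草图(非论文官方实现;邻接矩阵的构建方式、重启系数 alpha、迭代次数等均为本文为演示而做的假设):

```python
import numpy as np

def propagate_personas(adj, seed_labels, n_personas, alpha=0.85, iters=50):
    """在用户图上做带重启的随机游走,传播 persona 分布(示意实现)。
    adj: (N, N) 用户相似度/交互邻接矩阵(例如由用户-商品二部图投影得到)
    seed_labels: dict {用户下标: persona 下标},由 LLM 标注的种子用户
    返回每个用户的 persona 概率分布,形状 (N, n_personas)。"""
    n = adj.shape[0]
    row_sum = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(row_sum, 1e-12)        # 行归一化得到转移概率
    F = np.full((n, n_personas), 1.0 / n_personas)
    for u, p in seed_labels.items():            # 种子用户为 one-hot
        F[u] = np.eye(n_personas)[p]
    F0 = F.copy()
    for _ in range(iters):
        F = alpha * P @ F + (1 - alpha) * F0
        for u, p in seed_labels.items():        # 钳制种子标签不变
            F[u] = np.eye(n_personas)[p]
    return F / F.sum(axis=1, keepdims=True)

# 用法示例:6 个用户、3 类 persona,假设 LLM 只标注了 2 个种子用户
rng = np.random.default_rng(0)
A = rng.random((6, 6)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
dist = propagate_personas(A, {0: 1, 3: 2}, n_personas=3)
print(dist.argmax(axis=1))   # 其余用户被传播到的 persona
```

实际系统中,种子标签应来自 LLM 对用户购买历史的推断,图结构则由真实的用户-商品交互构建。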

[AI-34] AI-Enhanced Business Process Automation: A Case Study in the Insurance Domain Using Object-Centric Process Mining

【速读】:该论文旨在解决如何通过数据驱动的方法全面评估大型语言模型(LLMs)驱动的自动化对业务流程转型的影响,特别是关注传统流程与AI增强型流程变体在转型过程中的共存方式及其对流程可扩展性的影响。论文的关键在于采用对象中心流程挖掘(Object-Centric Process Mining, OCPM)方法,结合保险行业的实际案例,分析LLMs在提升索赔部分识别任务自动化过程中对流程效率和可扩展性的具体影响。解决方案的关键是利用OCPM技术揭示AI引入的新流程动态,并通过实证研究验证其在真实场景中的适用性和局限性。

链接: https://arxiv.org/abs/2504.17295
作者: Shahrzad Khayatbashi,Viktor Sjölind,Anders Granåker,Amin Jalali
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Artificial Intelligence (AI), particularly Large Language Models (LLMs), have enhanced organizations’ ability to reengineer business processes by automating knowledge-intensive tasks. This automation drives digital transformation, often through gradual transitions that improve process efficiency and effectiveness. To fully assess the impact of such automation, a data-driven analysis approach is needed - one that examines how traditional and AI-enhanced process variants coexist during this transition. Object-Centric Process Mining (OCPM) has emerged as a valuable method that enables such analysis, yet real-world case studies are still needed to demonstrate its applicability. This paper presents a case study from the insurance sector, where an LLM was deployed in production to automate the identification of claim parts, a task previously performed manually and identified as a bottleneck for scalability. To evaluate this transformation, we apply OCPM to assess the impact of AI-driven automation on process scalability. Our findings indicate that while LLMs significantly enhance operational capacity, they also introduce new process dynamics that require further refinement. This study also demonstrates the practical application of OCPM in a real-world setting, highlighting its advantages and limitations.
zh

[AI-35] Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning

【速读】:本文旨在解决在数据有限的情况下,基于图形用户界面(GUI)的自主web导航代理所需的大量领域特定专家演示样本的问题。特别是在稀疏奖励和大动作空间的环境中,如web GUI,仅少数操作在特定情况下相关,导致样本效率低下。为实现更高效的样本学习,本文探索了通过基于意图的 affordances 来约束动作空间的效果——即在任何情况下仅考虑能够实现期望结果的动作子集。解决方案的关键在于提出了一种名为 Code as Generative Affordances (CoGA) 的方法,该方法利用预训练的视觉语言模型(VLMs)生成代码,通过隐式的意图完成函数确定可负担的操作,并采用全自动的程序生成和验证管道。这些程序被用于强化学习代理的循环中,以根据像素观测返回一组 affordances。通过大幅减少代理必须考虑的动作数量,本文展示了在MiniWob++基准的一系列任务中,CoGA 在样本效率、跨任务泛化能力以及少量专家演示可用时的表现等方面的优势。

链接: https://arxiv.org/abs/2504.17282
作者: Lynn Cherif,Flemming Kondrup,David Venuto,Ankit Anand,Doina Precup,Khimya Khetarpal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through intent-based affordances, i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose Code as Generative Affordances (CoGA), a method that leverages pre-trained vision-language models (VLMs) to generate code that determines affordable actions through implicit intent-completion functions and using a fully-automated program generation and verification pipeline. These programs are then used in-the-loop of a reinforcement learning agent to return a set of affordances given a pixel observation. By greatly reducing the number of actions that an agent must consider, we demonstrate on a wide range of tasks in the MiniWob++ benchmark that: 1) CoGA is orders of magnitude more sample efficient than its RL agent, 2) CoGA's programs can generalize within a family of tasks, and 3) CoGA performs better or on par compared with behavior cloning when a small number of expert demonstrations is available.
zh
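
CoGA 的要点是把 VLM 生成的“可负担性程序”放进 RL 循环,在每一步把全量动作空间收缩为与意图相关的子集。下面是一个假设性的 Python 示意(函数名与观测结构均为虚构;真实系统中这些程序由 VLM 生成并经自动验证):

```python
import random

# 假设的 affordance 程序:判断当前观测下哪些动作能达成意图
def affordance_click_visible_buttons(obs):
    """示意:仅当按钮在观测中可见时,点击它才是可负担的动作。"""
    return {("click", b) for b in obs["visible_buttons"]}

def affordance_type_when_focused(obs):
    return {("type", obs["focus"])} if obs.get("focus") else set()

AFFORDANCE_PROGRAMS = [affordance_click_visible_buttons,
                       affordance_type_when_focused]

def affordable_actions(obs):
    """在 RL 循环中调用:把全量动作空间收缩为可负担子集。"""
    acts = set()
    for prog in AFFORDANCE_PROGRAMS:
        acts |= prog(obs)
    return acts

def agent_step(obs, q_values, epsilon=0.1):
    """epsilon-greedy,但只在可负担动作中选择。"""
    acts = list(affordable_actions(obs)) or [("noop", None)]
    if random.random() < epsilon:
        return random.choice(acts)
    return max(acts, key=lambda a: q_values.get(a, 0.0))

obs = {"visible_buttons": ["submit", "cancel"], "focus": "email"}
print(agent_step(obs, {("click", "submit"): 1.0}))
```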

[AI-36] ExOSITO: Explainable Off-Policy Learning with Side Information for Intensive Care Unit Blood Test Orders ALT

【速读】:该论文旨在解决重症监护室(ICU)患者实验室检测项目开单数量过多的问题,通过平衡提供正确信息的需求与降低临床负担及成本之间的矛盾,减少过度医疗带来的医院资源浪费及环境影响。论文的关键解决方案在于提出了一种名为“结合特权信息的可解释离策略学习用于ICU血液检测开单”(EXplainable Off-policy learning with Side Information for ICU blood Test Orders, ExOSITO) 的新方法。该方法将临床知识与观察数据相结合,利用因果多臂强盗模型和基于临床批准规则的奖励函数,在离线数据上进行训练,以弥合最优策略与实际执行策略之间的差距。其核心在于通过整合特权信息(如患者当前状态及预测未来状态),生成可解释的辅助决策工具,从而在不遗漏任何重要检测的前提下有效降低成本,同时优于传统医生经验或已有方法的性能。

链接: https://arxiv.org/abs/2504.17277
作者: Zongliang Ji,Andre Carlos Kajdacsy-Balla Amaral,Anna Goldenberg,Rahul G. Krishnan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the Conference on Health, Inference, and Learning (CHIL) 2025

点击查看摘要

Abstract:Ordering a minimal subset of lab tests for patients in the intensive care unit (ICU) can be challenging. Care teams must balance between ensuring the availability of the right information and reducing the clinical burden and costs associated with each lab test order. Most in-patient settings experience frequent over-ordering of lab tests, but are now aiming to reduce this burden on both hospital resources and the environment. This paper develops a novel method that combines off-policy learning with privileged information to identify the optimal set of ICU lab tests to order. Our approach, EXplainable Off-policy learning with Side Information for ICU blood Test Orders (ExOSITO) creates an interpretable assistive tool for clinicians to order lab tests by considering both the observed and predicted future status of each patient. We pose this problem as a causal bandit trained using offline data and a reward function derived from clinically-approved rules; we introduce a novel learning framework that integrates clinical knowledge with observational data to bridge the gap between the optimal and logging policies. The learned policy function provides interpretable clinical information and reduces costs without omitting any vital lab orders, outperforming both a physician’s policy and prior approaches to this practical problem.
zh

[AI-37] Symbolic Representation for Any-to-Any Generative Tasks

【速读】:该论文旨在解决传统生成式 AI 模型在跨模态任务建模中依赖大规模训练数据和隐式神经表示导致计算成本高且灵活性有限的问题。论文提出了一种符号化生成任务描述语言及对应的推理引擎,其核心在于引入显式的符号表示方法,包含函数(functions)、参数(parameters)和拓扑逻辑(topological logic)三个基本原语。通过利用预训练语言模型,该推理引擎能够以无监督的方式将自然语言指令直接映射为符号化工作流。关键在于这种显式符号表示方法不仅实现了超过12种多样化的多模态生成任务,而且无需针对具体任务进行调优,同时在内容质量、效率、可编辑性和中断性方面表现出显著优势。

链接: https://arxiv.org/abs/2504.17261
作者: Jiaqi Chen,Xiaoye Zhu,Yue Wang,Tianyang Liu,Xinhui Chen,Ying Chen,Chak Tou Leong,Yifei Ke,Joseph Liu,Yiwen Yuan,Julian McAuley,Li-jia Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a symbolic generative task description language and a corresponding inference engine capable of representing arbitrary multimodal tasks as structured symbolic flows. Unlike conventional generative models that rely on large-scale training and implicit neural representations to learn cross-modal mappings, often at high computational cost and with limited flexibility, our framework introduces an explicit symbolic representation comprising three core primitives: functions, parameters, and topological logic. Leveraging a pre-trained language model, our inference engine maps natural language instructions directly to symbolic workflows in a training-free manner. Our framework successfully performs over 12 diverse multimodal generative tasks, demonstrating strong performance and flexibility without the need for task-specific tuning. Experiments show that our method not only matches or outperforms existing state-of-the-art unified models in content quality, but also offers greater efficiency, editability, and interruptibility. We believe that symbolic task representations provide a cost-effective and extensible foundation for advancing the capabilities of generative AI.
zh
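
为说明“函数、参数、拓扑逻辑”三类原语如何组成可执行的符号化工作流,这里给出一个极简的 Python 草图(registry 中的模态函数为占位实现,节点与边的数据结构均为演示假设,并非论文的实际定义):

```python
from dataclasses import dataclass, field
from collections import defaultdict, deque

@dataclass
class Node:
    fn: str                                       # 原语一:函数
    params: dict = field(default_factory=dict)    # 原语二:参数

def topo_order(nodes, edges):
    """原语三:拓扑逻辑,按依赖顺序排列节点。"""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for s, d in edges:
        indeg[d] += 1
        succ[s].append(d)
    q = deque(n for n, d in indeg.items() if d == 0)
    order = []
    while q:
        n = q.popleft(); order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                q.append(m)
    return order

def run_workflow(nodes, edges, registry):
    """按拓扑逻辑执行符号化工作流,上游输出作为下游输入。"""
    results = {}
    for name in topo_order(nodes, edges):
        node = nodes[name]
        upstream = [results[s] for s, d in edges if d == name]
        results[name] = registry[node.fn](*upstream, **node.params)
    return results

# 占位注册表:真实系统中这些是文生图、图生视频等多模态生成函数
registry = {
    "text2image": lambda prompt: f"<image of {prompt}>",
    "image2video": lambda img, seconds=2: f"<{seconds}s video from {img}>",
}
nodes = {"a": Node("text2image", {"prompt": "a red fox"}),
         "b": Node("image2video", {"seconds": 4})}
print(run_workflow(nodes, [("a", "b")], registry)["b"])
```

在论文的设定中,自然语言指令由预训练语言模型直接映射为这类符号化工作流,因而显式结构天然支持编辑与中断。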

[AI-38] Targeted AMP generation through controlled diffusion with efficient embeddings

【速读】:该论文旨在解决基于深度学习的抗菌肽 (Antimicrobial Peptide, AMP) 发现面临的低实验命中率以及对细微可控性和高效建模肽性质的需求等关键挑战。解决方案的关键在于提出OmegAMP框架,其核心包括利用基于扩散的生成模型(diffusion-based generative model)、高效的低维嵌入表示(efficient low-dimensional embeddings)、精确的可控性机制(precise controllability mechanisms)以及具有显著降低假阳性率的新颖分类器(novel classifiers with drastically reduced false positive rates),用于候选过滤。OmegAMP能够针对特定理化性质、活性谱和物种特异性效力靶向生成AMP,并在生成过程中最大化样本多样性同时保持对底层数据分布的忠实性。这一方法在AMP发现的各个阶段均表现出最先进的性能,显著提升了计算框架在对抗抗菌素耐药性方面的潜力。

链接: https://arxiv.org/abs/2504.17247
作者: Diogo Soares,Leon Hetzel,Paulina Szymczak,Fabian Theis,Stephan Günnemann,Ewa Szczurek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:

点击查看摘要

Abstract:Deep learning-based antimicrobial peptide (AMP) discovery faces critical challenges such as low experimental hit rates as well as the need for nuanced controllability and efficient modeling of peptide properties. To address these challenges, we introduce OmegAMP, a framework that leverages a diffusion-based generative model with efficient low-dimensional embeddings, precise controllability mechanisms, and novel classifiers with drastically reduced false positive rates for candidate filtering. OmegAMP enables the targeted generation of AMPs with specific physicochemical properties, activity profiles, and species-specific effectiveness. Moreover, it maximizes sample diversity while ensuring faithfulness to the underlying data distribution during generation. We demonstrate that OmegAMP achieves state-of-the-art performance across all stages of the AMP discovery pipeline, significantly advancing the potential of computational frameworks in combating antimicrobial resistance.
zh

[AI-39] NeuralGrok: Accelerate Grokking by Neural Gradient Transformation

【速读】:本文旨在解决Transformer模型在算术任务中grokking现象所导致的长时间过拟合后才实现泛化的问题。论文提出了一种基于梯度的新方法NeuralGrok,其关键是通过一个辅助模块(如MLP块)与基础模型协同训练,动态调整梯度分量对泛化的贡献,并利用双层优化算法指导这一过程。与传统正则化方法(如权重衰减)相比,NeuralGrok不仅显著加速了泛化,还降低了模型复杂度,从而促进更稳定的训练范式。此外,通过引入新的绝对梯度熵(AGE)度量,进一步揭示了NeuralGrok通过减少模型复杂度来有效提升泛化能力的机制。

链接: https://arxiv.org/abs/2504.17243
作者: Xinyu Zhou,Simin Fan,Martin Jaggi,Jie Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint, 16 pages

点击查看摘要

Abstract:Grokking is proposed and widely studied as an intricate phenomenon in which generalization is achieved after a long-lasting period of overfitting. In this work, we propose NeuralGrok, a novel gradient-based approach that learns an optimal gradient transformation to accelerate the generalization of transformers in arithmetic tasks. Specifically, NeuralGrok trains an auxiliary module (e.g., an MLP block) in conjunction with the base model. This module dynamically modulates the influence of individual gradient components based on their contribution to generalization, guided by a bilevel optimization algorithm. Our extensive experiments demonstrate that NeuralGrok significantly accelerates generalization, particularly in challenging arithmetic tasks. We also show that NeuralGrok promotes a more stable training paradigm, constantly reducing the model’s complexity, while traditional regularization methods, such as weight decay, can introduce substantial instability and impede generalization. We further investigate the intrinsic model complexity leveraging a novel Absolute Gradient Entropy (AGE) metric, which explains that NeuralGrok effectively facilitates generalization by reducing the model complexity. We offer valuable insights on the grokking phenomenon of Transformer models, which encourages a deeper understanding of the fundamental principles governing generalization ability.
zh
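
下面用一个假设性的 Python 草图示意两件事:其一,对梯度分量做逐分量调制(NeuralGrok 中该调制由辅助模块经双层优化学得,此处略去训练细节);其二,绝对梯度熵(AGE)的一种可能实现(具体定义以论文为准,这里仅为直观示意):

```python
import numpy as np

def absolute_gradient_entropy(grads, eps=1e-12):
    """假设性的 AGE 度量:把各梯度分量的绝对值归一化为分布后求香农熵。
    熵越低,梯度能量越集中,可视作模型有效复杂度的一个代理指标。"""
    g = np.concatenate([np.abs(p).ravel() for p in grads])
    p = g / max(g.sum(), eps)
    return float(-(p * np.log(p + eps)).sum())

def transform_gradients(grads, gate):
    """NeuralGrok 核心思想的示意:辅助模块输出逐分量权重 gate,
    放大对泛化有贡献的梯度分量、抑制其余分量(gate 的学习略)。"""
    return [w * g for w, g in zip(gate, grads)]

rng = np.random.default_rng(0)
grads = [rng.normal(size=(4, 4)), rng.normal(size=4)]
gate = [np.full((4, 4), 0.5), np.ones(4)]       # 演示用的固定权重
print(absolute_gradient_entropy(grads))
print(absolute_gradient_entropy(transform_gradients(grads, gate)))
```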

[AI-40] Enhancing Variational Autoencoders with Smooth Robust Latent Encoding

【速读】:该论文试图解决 Variational Autoencoders (VAEs) 在生成式模型中的鲁棒性不足的问题,尤其是在扩散模型(如 Stable Diffusion)中的应用。传统观点认为,对抗训练可能会因性能与鲁棒性之间的权衡而损害生成模型的保真度,因此这一方向长期被忽视。论文的关键解决方案是提出了一种名为 Smooth Robust Latent VAE (SRL-VAE) 的新型对抗训练框架,通过在潜在空间中引入平滑处理来提升生成质量和鲁棒性。这种方法不仅关注增强鲁棒性,还通过保持原始表示的原创性来维持生成的保真度,从而克服了传统对抗训练可能带来的保真度下降问题。

链接: https://arxiv.org/abs/2504.17219
作者: Hyomin Lee,Minseon Kim,Sangwon Jang,Jongheon Jeong,Sung Ju Hwang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Under review

点击查看摘要

Abstract:Variational Autoencoders (VAEs) have played a key role in scaling up diffusion-based generative models, as in Stable Diffusion, yet questions regarding their robustness remain largely underexplored. Although adversarial training has been an established technique for enhancing robustness in predictive models, it has been overlooked for generative models due to concerns about potential fidelity degradation by the nature of trade-offs between performance and robustness. In this work, we challenge this presumption, introducing Smooth Robust Latent VAE (SRL-VAE), a novel adversarial training framework that boosts both generation quality and robustness. In contrast to conventional adversarial training, which focuses on robustness only, our approach smooths the latent space via adversarial perturbations, promoting more generalizable representations while regularizing with originality representation to sustain original fidelity. Applied as a post-training step on pre-trained VAEs, SRL-VAE improves image robustness and fidelity with minimal computational overhead. Experiments show that SRL-VAE improves both generation quality, in image reconstruction and text-guided image editing, and robustness, against Nightshade attacks and image editing attacks. These results establish a new paradigm, showing that adversarial training, once thought to be detrimental to generative models, can instead enhance both fidelity and robustness.
zh
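
SRL-VAE 的后训练步骤可以粗略理解为:先找一个使潜变量偏移最大的小扰动,再要求扰动前后的重建保持一致(平滑潜空间),同时保留原始重建的保真度。以下 PyTorch 草图仅为示意,扰动方式(单步 FGSM 式)与损失权重均为本文的假设,并非官方实现:

```python
import torch

def srl_vae_step(encoder, decoder, x, eps=0.05, lam=1.0):
    """假设性的平滑步骤:对抗扰动攻击潜空间,再用平滑项 + 保真项约束。"""
    x_adv = x.clone().detach().requires_grad_(True)
    z, z_adv = encoder(x), encoder(x_adv)
    # 以潜变量偏移为攻击目标,做一步 FGSM 式扰动
    ((z_adv - z.detach()) ** 2).mean().backward()
    x_adv = (x + eps * x_adv.grad.sign()).detach()
    # 注:实际训练中应在此处清零模型参数上累积的攻击梯度

    recon = decoder(encoder(x))
    recon_adv = decoder(encoder(x_adv))
    fidelity = ((recon - x) ** 2).mean()                  # 保真项
    smooth = ((recon_adv - recon.detach()) ** 2).mean()   # 平滑项
    return fidelity + lam * smooth

# 演示:用线性层充当编码器/解码器
enc, dec = torch.nn.Linear(8, 2), torch.nn.Linear(2, 8)
loss = srl_vae_step(enc, dec, torch.randn(4, 8))
loss.backward()
print(float(loss))
```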

[AI-41] Synthetic Power Flow Data Generation Using Physics-Informed Denoising Diffusion Probabilistic Models

【速读】:该论文旨在解决由于隐私和运行限制导致的可用高质量电力潮流数据不足的问题。论文提出了一种基于去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPMs)的物理信息增强生成框架,用于合成可行的电力潮流数据。解决方案的关键在于通过引入辅助训练和物理信息损失函数,确保生成的数据不仅具有统计保真度,还能满足电力系统的可行性约束。通过在IEEE 14-bus和30-bus基准系统上的评估,证明了该方法在保持可行性、多样性和统计特征准确性方面的优越性。

链接: https://arxiv.org/abs/2504.17210
作者: Junfei Wang,Darshana Upadhyay,Marzia Zaman,Pirathayini Srikantha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE SmartGridComm Conference 2025

点击查看摘要

Abstract:Many data-driven modules in smart grid rely on access to high-quality power flow data; however, real-world data are often limited due to privacy and operational constraints. This paper presents a physics-informed generative framework based on Denoising Diffusion Probabilistic Models (DDPMs) for synthesizing feasible power flow data. By incorporating auxiliary training and physics-informed loss functions, the proposed method ensures that the generated data exhibit both statistical fidelity and adherence to power system feasibility. We evaluate the approach on the IEEE 14-bus and 30-bus benchmark systems, demonstrating its ability to capture key distributional properties and generalize to out-of-distribution scenarios. Comparative results show that the proposed model outperforms three baseline models in terms of feasibility, diversity, and accuracy of statistical features. This work highlights the potential of integrating generative modelling into data-driven power system applications.
zh
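
这里的“物理信息损失”可以理解为:在标准 DDPM 去噪损失之外,叠加交流潮流方程的残差惩罚。下面是一个假设性的 NumPy 草图(网格导纳矩阵、权重 w 等均为演示用的虚构量,与论文的具体实现无关):

```python
import numpy as np

def power_flow_residual(v, theta, G, B, p_inj, q_inj):
    """交流潮流方程残差:
    P_i = V_i Σ_j V_j (G_ij cosθ_ij + B_ij sinθ_ij)
    Q_i = V_i Σ_j V_j (G_ij sinθ_ij - B_ij cosθ_ij)"""
    dth = theta[:, None] - theta[None, :]
    p = v * ((G * np.cos(dth) + B * np.sin(dth)) @ v)
    q = v * ((G * np.sin(dth) - B * np.cos(dth)) @ v)
    return np.abs(p - p_inj).mean() + np.abs(q - q_inj).mean()

def physics_informed_loss(noise_pred, noise_true, sample, grid, w=0.1):
    """DDPM 标准去噪 MSE + 潮流可行性惩罚(w 为假设的权重超参)。"""
    mse = np.mean((noise_pred - noise_true) ** 2)
    v, theta, p_inj, q_inj = sample
    return mse + w * power_flow_residual(v, theta,
                                         grid["G"], grid["B"], p_inj, q_inj)

# 演示:4 节点的随机“电网”
rng = np.random.default_rng(0)
n = 4
G = rng.random((n, n)); G = (G + G.T) / 2
B = rng.random((n, n)); B = (B + B.T) / 2
sample = (np.ones(n), np.zeros(n), rng.random(n), rng.random(n))
print(physics_informed_loss(rng.random(n), rng.random(n),
                            sample, {"G": G, "B": B}))
```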

[AI-42] Automatically Generating Rules of Malicious Software Packages via Large Language Model

【速读】:该论文试图解决传统安全工具因依赖专家预定义规则而难以适应软件供应链攻击快速变化的问题。解决方案的关键在于提出RuleLLM这一新颖工具,它利用大型语言模型(Large Language Models, LLMs)自动化生成针对开源生态系统(OSS ecosystems)的安全规则。具体而言,RuleLLM通过提取恶意软件的元数据与代码片段作为输入,输出可以直接应用于软件开发的YARA和Semgrep规则。其规则生成过程包含三个子任务:制定规则、优化规则以及规则对齐。实验结果表明,RuleLLM在1,633个恶意软件包的数据集上生成了763条规则(包括452条YARA规则和311条Semgrep规则),具有85.2%的精确率和91.8%的召回率,显著优于现有最先进的(SOTA)工具及基于评分的方法,并进一步提出了规则分类法,将其划分为11个类别和38个子类别。

链接: https://arxiv.org/abs/2504.17198
作者: XiangRui Zhang,HaoYu Chen,Yongzhong He,Wenjia Niu,Qiang Li
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 14 pages, 11 figures

点击查看摘要

Abstract:Today’s security tools predominantly rely on predefined rules crafted by experts, making them poorly adapted to the emergence of software supply chain attacks. To tackle this limitation, we propose a novel tool, RuleLLM, which leverages large language models (LLMs) to automate rule generation for OSS ecosystems. RuleLLM extracts metadata and code snippets from malware as its input, producing YARA and Semgrep rules that can be directly deployed in software development. Specifically, the rule generation task involves three subtasks: crafting rules, refining rules, and aligning rules. To validate RuleLLM’s effectiveness, we implemented a prototype system and conducted experiments on the dataset of 1,633 malicious packages. The results are promising that RuleLLM generated 763 rules (452 YARA and 311 Semgrep) with a precision of 85.2% and a recall of 91.8%, outperforming state-of-the-art (SOTA) tools and scored-based approaches. We further analyzed generated rules and proposed a rule taxonomy: 11 categories and 38 subcategories.
zh
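
RuleLLM 的三个子任务(生成规则、精炼规则、对齐规则)可以组织成一条简单的流水线。以下草图中 call_llm 为占位接口,返回的 YARA 规则与“命中良性样本则回炉改写”的对齐逻辑均为演示假设:

```python
import json
import textwrap

def call_llm(prompt: str) -> str:
    """占位:此处应接入任意 LLM API;为便于演示返回一条固定的示例规则。"""
    return textwrap.dedent('''\
        rule suspicious_install_hook {
            strings:
                $a = "child_process" ascii
                $b = "curl http" ascii
            condition:
                $a and $b
        }''')

def craft_rule(metadata: dict, snippet: str) -> str:
    """子任务一:根据恶意包元数据与代码片段生成规则。"""
    prompt = f"根据元数据 {json.dumps(metadata)} 和代码片段生成 YARA 规则:\n{snippet}"
    return call_llm(prompt)

def refine_rule(rule: str) -> str:
    """子任务二:收紧规则以降低误报。"""
    return call_llm(f"收紧以下规则以降低误报:\n{rule}")

def align_rule(rule: str, benign_samples: list) -> str:
    """子任务三:若规则误报良性样本则继续改写(此处用子串匹配模拟扫描)。"""
    hits = [s for s in benign_samples if "curl http" in s]
    return refine_rule(rule) if hits else rule

meta = {"name": "evil-pkg", "hook": "postinstall"}
rule = align_rule(refine_rule(craft_rule(meta, "require('child_process')")),
                  ["print('hi')"])
print(rule)
```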

[AI-43] Improving Human-Autonomous Vehicle Interaction in Complex Systems

【速读】:该论文试图解决自动驾驶车辆(Autonomous Vehicles, AVs)如何有效满足乘客信息需求的问题,以促进其在现实中的广泛应用。现有挑战在于不同个体、目标以及驾驶情境对交互成功的标准存在差异,而当前人机协作研究大多未充分考虑这种多样性。论文的关键在于揭示人类-自动驾驶系统复杂交互关系的本质,并据此提出透明、可适应且个性化的通信策略,以更好地迎合个体需求、目标及情境变化。通过三项实证研究,作者强调了任务敏感性、模态适宜性的沟通方式对于提升学习者表现、信心与信任的重要性,展示了上下文感知通信的必要性,并利用机器学习方法突出了个性化设计对增强用户信任的核心作用。这些发现为设计更有效的自动驾驶通信系统提供了理论依据和技术路径。

链接: https://arxiv.org/abs/2504.17170
作者: Robert Kaufman
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: PhD Dissertation from University of California, San Diego; 175 pages

点击查看摘要

Abstract:Unresolved questions about how autonomous vehicles (AVs) should meet the informational needs of riders hinder real-world adoption. Complicating our ability to satisfy rider needs is that different people, goals, and driving contexts have different criteria for what constitutes interaction success. Unfortunately, most human-AV research and design today treats all people and situations uniformly. It is crucial to understand how an AV should communicate to meet rider needs, and how communications should change when the human-AV complex system changes. I argue that understanding the relationships between different aspects of the human-AV system can help us build improved and adaptable AV communications. I support this argument using three empirical studies. First, I identify optimal communication strategies that enhance driving performance, confidence, and trust for learning in extreme driving environments. Findings highlight the need for task-sensitive, modality-appropriate communications tuned to learner cognitive limits and goals. Next, I highlight the consequences of deploying faulty communication systems and demonstrate the need for context-sensitive communications. Third, I use machine learning (ML) to illuminate personal factors predicting trust in AVs, emphasizing the importance of tailoring designs to individual traits and concerns. Together, this dissertation supports the necessity of transparent, adaptable, and personalized AV systems that cater to individual needs, goals, and contextual demands. By considering the complex system within which human-AV interactions occur, we can deliver valuable insights for designers, researchers, and policymakers. This dissertation also provides a concrete domain to study theories of human-machine joint action and situational awareness, and can be used to guide future human-AI interaction research. [shortened for arxiv]
zh

[AI-44] Scalable Permutation-Aware Modeling for Temporal Set Prediction

【速读】:该论文致力于解决时间序列集合预测(Temporal Set Prediction)问题,即在给定一组先前集合序列的情况下,预测下一个集合中将出现的元素。传统方法通常依赖于复杂的架构,导致显著的计算开销,限制了其可扩展性。论文的关键解决方案在于提出了一种新颖且可扩展的框架,通过利用排列等变(permutation-equivariant)和排列不变(permutation-invariant)变换来高效建模集合动态。这种方法不仅大幅减少了训练和推理时间,同时保持了与现有先进模型相当甚至更优的性能,验证了其在提升效率和可扩展性方面的有效性。

链接: https://arxiv.org/abs/2504.17140
作者: Ashish Ranjan,Ayush Agarwal,Shalin Barot,Sushant Kumar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Temporal set prediction involves forecasting the elements that will appear in the next set, given a sequence of prior sets, each containing a variable number of elements. Existing methods often rely on intricate architectures with substantial computational overhead, which hampers their scalability. In this work, we introduce a novel and scalable framework that leverages permutation-equivariant and permutation-invariant transformations to efficiently model set dynamics. Our approach significantly reduces both training and inference time while maintaining competitive performance. Extensive experiments on multiple public benchmarks show that our method achieves results on par with or superior to state-of-the-art models across several evaluation metrics. These results underscore the effectiveness of our model in enabling efficient and scalable temporal set prediction.
zh
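
论文所依赖的排列等变/排列不变变换,最常见的形式是 DeepSets 式编码:先对每个元素做等变的逐元素变换,再用求和池化得到排列不变的集合表示。以下为一个最小示例(权重随机初始化,仅用于验证排列不变性,并非该论文的具体网络结构):

```python
import numpy as np

def phi(x):
    """逐元素变换(排列等变):每个集合元素独立编码。"""
    return np.tanh(x @ W1)

def rho(s):
    """作用在池化结果上的变换:输入已对元素排列不变。"""
    return s @ W2

def set_embedding(elements):
    """DeepSets 形式的集合编码:sum 池化保证对元素顺序不敏感。"""
    return rho(phi(elements).sum(axis=0))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 8)), rng.normal(size=(8, 4))
s = rng.normal(size=(5, 3))            # 一个含 5 个元素的集合
perm = rng.permutation(5)
print(np.allclose(set_embedding(s), set_embedding(s[perm])))  # True
```

在时间序列集合预测中,历史各集合的此类嵌入可再经时序模型汇总,用于给下一个集合的候选元素打分。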

[AI-45] Peer-Aware Cost Estimation in Nonlinear General-Sum Dynamic Games for Mutual Learning and Intent Inference

【速读】:该论文旨在解决人类-机器人交互中作为不完全信息广义和动态博弈的问题,特别是在双方目标函数未知且博弈包含非线性动态的情况下,传统方法难以有效求解均衡策略。现有工作通常假设一个代理为专家并拥有其同伴的完整信息,但这种方法可能导致偏差估计和协作失败。为了解决这一挑战,论文提出了一种非线性同伴感知成本估计算法(Nonlinear Peer-Aware Cost Estimation, N-PACE)。N-PACE 的关键在于通过迭代线性二次(Linear Quadratic, LQ)近似非线性广义和博弈,每个代理在推断同伴目标函数的同时显式建模其学习动态,从而实现对同伴未知目标函数的无偏快速学习,这对任务完成和安全保障至关重要。此外,该算法还通过显式建模同伴的学习动态实现了多智能体系统中的意图通信。

链接: https://arxiv.org/abs/2504.17129
作者: Seyed Yousef Soltanian,Wenlong Zhang
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Human-robot interactions can be modeled as incomplete-information general-sum dynamic games since the objective functions of both agents are not explicitly known to each other. However, solving for equilibrium policies for such games presents a major challenge, especially if the games involve nonlinear underlying dynamics. To simplify the problem, existing work often assumes that one agent is an expert with complete information about its peer, which can lead to biased estimates and failures in coordination. To address this challenge, we propose a nonlinear peer-aware cost estimation (N-PACE) algorithm for general-sum dynamic games. In N-PACE, using iterative linear quadratic (LQ) approximation of the nonlinear general-sum game, each agent explicitly models the learning dynamics of its peer agent while inferring their objective functions, leading to unbiased fast learning in inferring the unknown objective function of the peer agent, which is critical for task completion and safety assurance. Additionally, we demonstrate how N-PACE enables intent communication in such multi-agent systems by explicitly modeling the peer’s learning dynamics.
zh

[AI-46] Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments

【速读】:该论文试图解决两个主要问题:一是如何减少大型语言模型(LLMs)评估结果中因人类判断偏差和错误而导致的不准确性;二是如何在多个潜在的LLM响应中选择合适的判断。为了解决这些问题,论文提出了一种三阶段的元裁判选择流水线,其关键是通过开发综合评分标准(结合GPT-4与人类专家)、采用多代理协作以及引入评分阈值过滤机制,来提高LLMs作为元裁判的性能。实验结果显示,与单一LLM方法相比,该流水线在JudgeBench数据集上的表现提升了约15.55%,比单代理基线提高了约8.37%。

链接: https://arxiv.org/abs/2504.17087
作者: Yuran Li,Jama Hussein Mohamud,Chongren Sun,Di Wu,Benoit Boulet
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Large language models (LLMs) are being widely applied across various fields, but as tasks become more complex, evaluating their responses is increasingly challenging. Compared to human evaluators, the use of LLMs to support performance evaluation offers a more efficient alternative. However, most studies focus mainly on aligning LLMs’ judgments with human preferences, overlooking the existence of biases and mistakes in human judgment. Furthermore, how to select suitable LLM judgments given multiple potential LLM responses remains underexplored. To address these two aforementioned issues, we propose a three-stage meta-judge selection pipeline: 1) developing a comprehensive rubric with GPT-4 and human experts, 2) using three advanced LLM agents to score judgments, and 3) applying a threshold to filter out low-scoring judgments. Compared to methods using a single LLM as both judge and meta-judge, our pipeline introduces multi-agent collaboration and a more comprehensive rubric. Experimental results on the JudgeBench dataset show about 15.55% improvement compared to raw judgments and about 8.37% improvement over the single-agent baseline. Our work demonstrates the potential of LLMs as meta-judges and lays the foundation for future research on constructing preference datasets for LLM-as-a-judge reinforcement learning.
zh
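
流水线的后两个阶段(多代理按评分细则打分、阈值过滤)可以用几行 Python 勾勒出来。以下草图中的打分代理、评分细则与阈值均为演示假设;实际系统中应替换为三个不同的先进 LLM 调用:

```python
import statistics

def score_judgment(judgment: str, rubric: list, agent) -> float:
    """由一个打分代理按细则逐条打分并取平均(agent 为假设的接口)。"""
    return statistics.mean(agent(judgment, item) for item in rubric)

def select_judgments(judgments, rubric, agents, threshold=7.0):
    """多代理打分后取均值,再用阈值过滤低分判断。"""
    kept = []
    for j in judgments:
        scores = [score_judgment(j, rubric, a) for a in agents]
        if statistics.mean(scores) >= threshold:
            kept.append((j, scores))
    return kept

# 演示用的假设代理:按长度粗略打分,仅用于跑通流程
fake_agents = [
    lambda j, r: 9.0 if len(j) > 6 else 3.0,
    lambda j, r: 8.0,
    lambda j, r: 7.5,
]
rubric = ["事实正确性", "推理严谨性", "格式规范"]
cands = ["简短判断", "一条相当详细且论证充分的判断文本"]
print(select_judgments(cands, rubric, fake_agents))
```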

[AI-47] Robo-Troj: Attacking LLM-based Task Planners

【速读】:本文旨在解决基于大型语言模型(Large Language Models, LLMs)的任务规划器在任务规划方面取得成功的同时,其安全性研究不足的问题。论文的关键贡献在于提出了Robo-Troj,这是一种针对LLM-based任务规划器的首个多触发器后门攻击方法。Robo-Troj通过训练以适应机器人应用领域的多样性,例如使用特定的触发词(如“herical”)激活恶意行为(如厨房机器人的切手指行为)。此外,论文还开发了一种优化方法来选择最有效的触发词。通过揭示基于LLM的任务规划器的脆弱性,论文旨在推动安全机器人系统的研发。

链接: https://arxiv.org/abs/2504.17070
作者: Mohaiminul Al Nahian,Zainab Altaweel,David Reitano,Sabbir Ahmed,Saumitra Lohokare,Shiqi Zhang,Adnan Siraj Rakin
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robots need task planning methods to achieve goals that require more than individual actions. Recently, large language models (LLMs) have demonstrated impressive performance in task planning. LLMs can generate a step-by-step solution using a description of actions and the goal. Despite the successes in LLM-based task planning, there is limited research studying the security aspects of those systems. In this paper, we develop Robo-Troj, the first multi-trigger backdoor attack for LLM-based task planners, which is the main contribution of this work. As a multi-trigger attack, Robo-Troj is trained to accommodate the diversity of robot application domains. For instance, one can use unique trigger words, e.g., “herical”, to activate a specific malicious behavior, e.g., cutting hand on a kitchen robot. In addition, we develop an optimization method for selecting the trigger words that are most effective. Through demonstrating the vulnerability of LLM-based planners, we aim to promote the development of secured robot systems.
zh

[AI-48] Statistical Guarantees in Synthetic Data through Conformal Adversarial Generation

【速读】:该论文旨在解决机器学习研究中高质量合成数据生成面临的挑战,特别是在统计保真度和不确定性量化方面的不足。现有生成模型虽能产生令人信服的合成样本,但缺乏关于其与潜在数据分布关系的严格统计保证,这限制了它们在需要稳健误差界的关键领域的应用。为克服这一根本局限性,论文提出了一种新颖的框架,将一致性预测方法引入生成对抗网络(GANs)。通过整合多种一致性预测范式,如归纳一致性预测(ICP)、蒙特利尔一致性预测、交叉一致性预测以及Venn-Abers预测器,该方法在生成样本中实现了分布无关的不确定性量化。此方法被称为一致性化GAN(cGAN),不仅展示了增强的校准特性,还保持了传统GANs的生成能力,生成具有可证明统计保证的合成数据。论文提供了严格的数学证明,建立了有限样本有效性保证和渐近效率特性,从而确保合成数据在医疗保健、金融和自动驾驶系统等高风险领域中的可靠应用。

链接: https://arxiv.org/abs/2504.17058
作者: Rahul Vishwakarma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The generation of high-quality synthetic data presents significant challenges in machine learning research, particularly regarding statistical fidelity and uncertainty quantification. Existing generative models produce compelling synthetic samples but lack rigorous statistical guarantees about their relation to the underlying data distribution, limiting their applicability in critical domains requiring robust error bounds. We address this fundamental limitation by presenting a novel framework that incorporates conformal prediction methodologies into Generative Adversarial Networks (GANs). By integrating multiple conformal prediction paradigms including Inductive Conformal Prediction (ICP), Mondrian Conformal Prediction, Cross-Conformal Prediction, and Venn-Abers Predictors, we establish distribution-free uncertainty quantification in generated samples. This approach, termed Conformalized GAN (cGAN), demonstrates enhanced calibration properties while maintaining the generative power of traditional GANs, producing synthetic data with provable statistical guarantees. We provide rigorous mathematical proofs establishing finite-sample validity guarantees and asymptotic efficiency properties, enabling the reliable application of synthetic data in high-stakes domains including healthcare, finance, and autonomous systems.
zh
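
以归纳一致性预测(ICP)为例:在真实校准集上取非一致性得分的 (1-α) 分位数作为阈值,即可对生成样本做带分布无关保证的过滤。以下为一个示意草图(非一致性得分函数为假设;论文中还结合了 Mondrian、交叉一致性与 Venn-Abers 等其他范式):

```python
import numpy as np

def icp_threshold(cal_scores, alpha=0.1):
    """归纳一致性预测:取校准得分的 (1-alpha) 经验分位数作为阈值,
    分布无关地保证新样本以约 1-alpha 的概率落在阈值之内。"""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(cal_scores)[min(k, n) - 1]

def filter_synthetic(real_cal, synthetic, score_fn, alpha=0.1):
    """对生成样本做带统计保证的过滤:仅保留得分不超过阈值的样本。"""
    tau = icp_threshold(np.array([score_fn(x) for x in real_cal]), alpha)
    return [x for x in synthetic if score_fn(x) <= tau], tau

rng = np.random.default_rng(0)
score = lambda x: float(np.linalg.norm(x))     # 假设的非一致性得分
real = rng.normal(size=(200, 2))               # 真实校准数据
fake = rng.normal(scale=1.5, size=(50, 2))     # 模拟 GAN 输出
kept, tau = filter_synthetic(list(real), list(fake), score, alpha=0.1)
print(len(kept), round(tau, 3))
```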

[AI-49] Psychological Effect of AI driven marketing tools for beauty/facial feature enhancement

【速读】:该论文旨在探究由人工智能驱动的人脸评估工具对个体心理的影响,特别是其对自我物化(self-objectification)、自尊(self-esteem)以及情绪反应的作用,并着重分析性别差异。研究使用了两种不同版本的面部分析工具:一个明显带有批判性(样本量N=75,平均年龄M=22.9岁),另一个相对中立(样本量N=51,平均年龄M=19.9岁)。通过完成标准化的自我物化与自尊量表及自定义的问题,测量数字/物理外观增强行为(DAE, PAEE)和感知的社会情绪(PSE)。

研究结果表明,在两种版本中,高水平的自我物化与低自尊之间存在一致联系,并伴随增加的外貌增强行为。尽管较新的工具以更柔和的方式呈现,但它仍引发了负面情绪反应(U=1466.5, p=0.013),这暗示隐性反馈可能强化了与外貌相关的不安感。此外,女性在数字外观增强(DAE)(p=0.025)和感知他人情感影响(PSE)(p=0.001)方面显示出显著差异,她们更倾向于进行数字外观修饰且不太容易感受到他人的情感影响。这些发现揭示了AI工具如何无意间强化并放大现有的社会偏见,并强调了负责任的设计与开发的重要性。

未来的研究将进一步探讨嵌入此类工具训练数据中的意识形态如何塑造其评估输出,并进一步分析这些输出如何影响用户的观念与决策。因此,本研究的关键在于识别现有AI工具可能存在的潜在负面影响,并提出改进措施以促进更公平和积极的技术应用。

链接: https://arxiv.org/abs/2504.17055
作者: Ayushi Agrawal,Aditya Kondai,Kavita Vemuri
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI-powered facial assessment tools are reshaping how individuals evaluate appearance and internalize social judgments. This study examines the psychological impact of such tools on self-objectification, self-esteem, and emotional responses, with attention to gender differences. Two samples used distinct versions of a facial analysis tool: one overtly critical (N=75; M=22.9 years), and another more neutral (N=51; M=19.9 years). Participants completed validated self-objectification and self-esteem scales and custom items measuring emotion, digital/physical appearance enhancement (DAE, PAEE), and perceived social emotion (PSE). Results revealed consistent links between high self-objectification, low self-esteem, and increased appearance enhancement behaviors across both versions. Despite softer framing, the newer tool still evoked negative emotional responses (U=1466.5, p=0.013), indicating implicit feedback may reinforce appearance-related insecurities. Gender differences emerged in DAE (p=0.025) and PSE (p<0.001), with females more prone to digital enhancement and less likely to perceive emotional impact in others. These findings reveal how AI tools may unintentionally reinforce and amplify existing social biases and underscore the critical need for responsible AI design and development. Future research will investigate how human ideologies embedded in the training data of such tools shape their evaluative outputs, and how these, in turn, influence user attitudes and decisions.
zh

[AI-50] Approaches to Responsible Governance of GenAI in Organizations

【速读】:该论文试图解决在生成式 AI (GenAI) 快速发展的背景下,如何将负责任的 GenAI 治理有效融入多样化组织结构中的问题。论文通过文献回顾、已建立的治理框架以及行业圆桌讨论,提炼出核心原则,并提出平衡创新与监管的风险导向型治理建议。解决方案的关键在于开发可适应的风险评估工具、持续监控实践以及跨领域协作,以构建可信的 GenAI 系统,同时提供结构化指南(Responsible GenAI Guide, ResAI),帮助组织将 GenAI 计划与伦理、法律及运营最佳实践相一致。

链接: https://arxiv.org/abs/2504.17044
作者: Dhari Gandhi,Himanshu Joshi,Lucas Hartman,Shabnam Hassani
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid evolution of Generative AI (GenAI) has introduced unprecedented opportunities while presenting complex challenges around ethics, accountability, and societal impact. This paper draws on a literature review, established governance frameworks, and industry roundtable discussions to identify core principles for integrating responsible GenAI governance into diverse organizational structures. Our objective is to provide actionable recommendations for a balanced, risk-based governance approach that enables both innovation and oversight. Findings emphasize the need for adaptable risk assessment tools, continuous monitoring practices, and cross-sector collaboration to establish trustworthy GenAI. These insights provide a structured foundation and Responsible GenAI Guide (ResAI) for organizations to align GenAI initiatives with ethical, legal, and operational best practices.
zh

[AI-51] Democracy of AI Numerical Weather Models: An Example of Global Forecasting with FourCastNetv2 Made by a University Research Lab Using GPU

【速读】:该论文旨在解决大学研究小组在资源受限情况下利用生成式 AI (Generative AI) 模型实现全球天气预报模型民主化的问题。论文的关键在于通过利用图形处理单元 (GPU) 和 NVIDIA 的 FourCastNetv2 等开源 AI 模型,演示了通过指定的应用程序编程接口 (API) 利用 FourCastNetv2 进行预测的能力,并展示了如何使用 NVIDIA 硬件重现原始 FourCastNet 模型的训练过程。此外,论文探讨了资源有限的研究小组在数据管理、训练效率和模型验证方面的优势与挑战,为其他高校研究小组及机器学习、气候科学和数据科学相关课程提供了初步指导,以推动 AI 驱动的数值天气预报 (Numerical Weather Prediction, NWP) 在数字经济中的普及。

链接: https://arxiv.org/abs/2504.17028
作者: Iman Khadir,Shane Stevenson,Henry Li,Kyle Krick,Abram Burrows,David Hall,Stan Posey,Samuel S.P. Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:This paper demonstrates the feasibility of democratizing AI-driven global weather forecasting models among university research groups by leveraging Graphics Processing Units (GPUs) and freely available AI models, such as NVIDIA’s FourCastNetv2. FourCastNetv2 is an NVIDIA’s advanced neural network for weather prediction and is trained on a 73-channel subset of the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) dataset at single levels and different pressure levels. Although the training specifications for FourCastNetv2 are not released to the public, the training documentation of the model’s first generation, FourCastNet, is available to all users. The training had 64 A100 GPUs and took 16 hours to complete. Although NVIDIA’s models offer significant reductions in both time and cost compared to traditional Numerical Weather Prediction (NWP), reproducing published forecasting results presents ongoing challenges for resource-constrained university research groups with limited GPU availability. We demonstrate both (i) leveraging FourCastNetv2 to create predictions through the designated application programming interface (API) and (ii) utilizing NVIDIA hardware to train the original FourCastNet model. Further, this paper demonstrates the capabilities and limitations of NVIDIA A100’s for resource-limited research groups in universities. We also explore data management, training efficiency, and model validation, highlighting the advantages and challenges of using limited high-performance computing resources. Consequently, this paper and its corresponding GitHub materials may serve as an initial guide for other university research groups and courses related to machine learning, climate science, and data science to develop research and education programs on AI weather forecasting, and hence help democratize the AI NWP in the digital economy.
zh

[AI-52] What Makes for a Good Saliency Map? Comparing Strategies for Evaluating Saliency Maps in Explainable AI (XAI)

【速读】:该论文试图解决如何最佳评估显著性图(Saliency Maps)这一开放性问题,目前常用的评估方法包括主观用户度量、客观用户度量以及数学指标。论文的关键解决方案在于通过一项被试间研究(N=166),同时采用这三种评估方法,系统比较了三种最受欢迎的显著性图方法(LIME、Grad-CAM 和 Guided Backpropagation)。研究旨在检验:1)主观度量下这些显著性图是否在用户信任和满意度方面存在差异;2)客观度量下这些显著性图是否能提升用户的模型理解能力;3)数学指标下哪种显著性图获得最优评分;4)数学指标与客观用户度量之间是否存在关联。结果显示,不同评估方法的结论并不一致,但揭示了 Grad-CAM 在提升用户能力方面表现最佳,Guided Backpropagation 的数学指标最为有利,且部分数学指标与用户理解能力之间存在关联,尽管这种关系有时违背直觉。这一综合评估方法为可解释人工智能(XAI)领域中用户研究与数学指标互补使用的争议提供了新的见解。

链接: https://arxiv.org/abs/2504.17023
作者: Felix Kares,Timo Speith,Hanwei Zhang,Markus Langer
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 27 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Saliency maps are a popular approach for explaining classifications of (convolutional) neural networks. However, it remains an open question as to how best to evaluate salience maps, with three families of evaluation methods commonly being used: subjective user measures, objective user measures, and mathematical metrics. We examine three of the most popular saliency map approaches (viz., LIME, Grad-CAM, and Guided Backpropagation) in a between subject study (N=166) across these families of evaluation methods. We test 1) for subjective measures, if the maps differ with respect to user trust and satisfaction; 2) for objective measures, if the maps increase users’ abilities and thus understanding of a model; 3) for mathematical metrics, which map achieves the best ratings across metrics; and 4) whether the mathematical metrics can be associated with objective user measures. To our knowledge, our study is the first to compare several salience maps across all these evaluation methods - with the finding that they do not agree in their assessment (i.e., there was no difference concerning trust and satisfaction, Grad-CAM improved users’ abilities best, and Guided Backpropagation had the most favorable mathematical metrics). Additionally, we show that some mathematical metrics were associated with user understanding, although this relationship was often counterintuitive. We discuss these findings in light of general debates concerning the complementary use of user studies and mathematical metrics in the evaluation of explainable AI (XAI) approaches.
zh

[AI-53] Analyzing Value Functions of States in Parametric Markov Chains

【速读】:该论文致力于解决带有未知或部分已知概率的参数化马尔可夫链(parametric Markov chains, pMC)的可达性性质验证问题。尽管此类验证已被证明是coETR-完全难题,研究者们尝试通过更易验证的属性(如检查pMC在某些参数下是否具有单调性)来简化这一过程。论文的关键解决方案在于将单调性的判断转化为检查从某一状态出发的可达概率是否始终不低于另一状态的可达概率。针对后一性质的最新研究成果表明,可以高效实现同值等价类的坍缩,这一操作不仅保持了验证结果的一致性,还保留了单调性。论文进一步实现了这一算法,用于坍缩pMC中的“平凡”等价类,并提供了实证证据:首先,这种坍缩对于一些现有基准测试减少了模型规模,在自定义基准测试中显著降低了规模;其次,坍缩加速了现有的单调性和参数提升检测算法,从而可以在实际应用中作为快速预处理步骤。

链接: https://arxiv.org/abs/2504.17020
作者: Kasper Engelen,Guillermo A. Pérez,Shrisha Rao
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Published as part of the book “Principles of Verification: Cycling the Probabilistic Landscape: Essays Dedicated to Joost-Pieter Katoen on the Occasion of His 60th Birthday, Part II”

点击查看摘要

Abstract:Parametric Markov chains (pMC) are used to model probabilistic systems with unknown or partially known probabilities. Although (universal) pMC verification for reachability properties is known to be coETR-complete, there have been efforts to approach it using potentially easier-to-check properties such as asking whether the pMC is monotonic in certain parameters. In this paper, we first reduce monotonicity to asking whether the reachability probability from a given state is never less than that of another given state. Recent results for the latter property imply an efficient algorithm to collapse same-value equivalence classes, which in turn preserves verification results and monotonicity. We implement our algorithm to collapse “trivial” equivalence classes in the pMC and show empirical evidence for the following: First, the collapse gives reductions in size for some existing benchmarks and significant reductions on some custom benchmarks; Second, the collapse speeds up existing algorithms to check monotonicity and parameter lifting, and hence can be used as a fast pre-processing step in practice.
zh

[AI-54] Neural Theorem Proving: Generating and Structuring Proofs for Formal Verification

【速读】:该论文旨在解决通用定理证明(generalized theorem proving)这一长期未完全解决的问题,特别是在基于大型语言模型(LLMs)的背景下。论文的核心目标是提升LLMs在形式化验证任务中的推理能力,并提出一种能够在正式语言中生成完整证明的框架,以用于结合内置策略和现成自动化定理证明器的系统中。解决方案的关键在于设计了一个包含三个组件的框架:生成待验证代码的自然语言陈述、一个能够生成形式化证明的LLM,以及一个利用启发式方法构建最终证明的模块。此外,通过两阶段微调过程训练LLM,包括基于SFT的训练以生成语法正确的Isabelle代码,以及基于RL的训练以鼓励模型生成被定理证明器验证的证明。这一方案旨在有效提升LLMs在形式化验证任务中的性能。

链接: https://arxiv.org/abs/2504.17017
作者: Balaji Rao,William Eiers,Carlo Lipizzi
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: Accepted to the Proceedings of the 19th Conference on Neurosymbolic Learning and Reasoning (NeSy 2025)

点击查看摘要

Abstract:Formally verifying properties of software code has been a highly desirable task, especially with the emergence of LLM-generated code. In the same vein, they provide an interesting avenue for the exploration of formal verification and mechanistic interpretability. Since the introduction of code-specific models, despite their successes in generating code in Lean4 and Isabelle, the task of generalized theorem proving still remains far from being fully solved and will be a benchmark for reasoning capability in LLMs. In this work, we introduce a framework that generates whole proofs in a formal language to be used within systems that utilize the power of built-in tactics and off-the-shelf automated theorem provers. Our framework includes 3 components: generating natural language statements of the code to be verified, an LLM that generates formal proofs for the given statement, and a module employing heuristics for building the final proof. To train the LLM, we employ a 2-stage fine-tuning process, where we first use SFT-based training to enable the model to generate syntactically correct Isabelle code and then RL-based training that encourages the model to generate proofs verified by a theorem prover. We validate our framework using the miniF2F-test benchmark and the Isabelle proof assistant and design a use case to verify the correctness of the AWS S3 bucket access policy code. We also curate a dataset based on the FVEL_ER dataset for future training tasks.
zh

[AI-55] A Systematic Approach to Design Real-World Human-in-the-Loop Deep Reinforcement Learning: Salient Features Challenges and Trade-offs

【速读】:本文旨在解决复杂决策问题中的人机协作挑战,提出了一种新型的多层次分层人机协作(Human-in-the-Loop, HITL)深度强化学习(Deep Reinforcement Learning, DRL)算法。该算法结合了自我学习(self-learning)、模仿学习(imitation learning)和迁移学习(transfer learning)三种学习方式,并考虑了奖励(reward)、动作(action)和演示(demonstration)三种形式的人类输入。关键在于系统性地将人类信息融入AI解决方案,通过实证研究验证其有效性。论文以无人机对抗任务为例,设计了针对敌方无人机的中和策略,采用获奖开源HITL软件Cogment实现解决方案,并展示了HITL在加速训练、提升性能及指导梯度方法降低方差方面的优势,同时强调了建议量适中的重要性以避免过拟合或欠拟合。最终,论文通过两个真实场景(过载攻击与诱饵攻击)进一步证明了人机合作在复杂问题求解中的作用。

链接: https://arxiv.org/abs/2504.17006
作者: Jalal Arabneydi,Saiful Islam,Srijita Das,Sai Krishna Gottipati,William Duguay,Cloderic Mars,Matthew E. Taylor,Matthew Guzdial,Antoine Fagette,Younes Zerouali
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: This is a result of the collaboration by JACOBB, AMII(Alberta Machine Intelligence Institute), Thales and AI Redefined (AIR) in 2021-2023

点击查看摘要

Abstract:With the growing popularity of deep reinforcement learning (DRL), the human-in-the-loop (HITL) approach has the potential to revolutionize the way we approach decision-making problems and create new opportunities for human-AI collaboration. In this article, we introduce a novel multi-layered hierarchical HITL DRL algorithm that comprises three types of learning: self learning, imitation learning and transfer learning. In addition, we consider three forms of human inputs: reward, action and demonstration. Furthermore, we discuss main challenges, trade-offs and advantages of HITL in solving complex problems and how human information can be integrated in the AI solution systematically. To verify our technical results, we present a real-world unmanned aerial vehicles (UAV) problem wherein a number of enemy drones attack a restricted area. The objective is to design a scalable HITL DRL algorithm for ally drones to neutralize the enemy drones before they reach the area. To this end, we first implement our solution using an award-winning open-source HITL software called Cogment. We then demonstrate several interesting results such as (a) HITL leads to faster training and higher performance, (b) advice acts as a guiding direction for gradient methods and lowers variance, and (c) the amount of advice should neither be too large nor too small to avoid over-training and under-training. Finally, we illustrate the role of human-AI cooperation in solving two real-world complex scenarios, i.e., overloaded and decoy attacks.
zh

[AI-56] Backslash: Rate Constrained Optimized Training of Large Language Models

【速读】:该论文旨在解决大型语言模型(Large-Language Models, LLMs)训练后参数压缩研究较为成熟,而训练过程中压缩方法探索不足的问题。论文提出了一种名为Rate-Constrained Training(Backslash)的新方法,其关键在于基于率失真优化(Rate-Distortion Optimization, RDO)的训练时压缩策略,通过灵活调整模型精度与复杂度之间的权衡,在大幅减少参数冗余的同时保持性能,显著降低内存使用(高达60%-90%),且无需牺牲准确性。此外,Backslash展示了良好的通用性,可提升泛化能力、增强模型对剪枝的鲁棒性,并支持边缘设备上的加速推理。

链接: https://arxiv.org/abs/2504.16968
作者: Jun Wu,Jiangtao Wen,Yuxing Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid advancement of large-language models (LLMs) has driven extensive research into parameter compression after training has been completed, yet compression during the training phase remains largely unexplored. In this work, we introduce Rate-Constrained Training (Backslash), a novel training-time compression approach based on rate-distortion optimization (RDO). Backslash enables a flexible trade-off between model accuracy and complexity, significantly reducing parameter redundancy while preserving performance. Experiments in various architectures and tasks demonstrate that Backslash can reduce memory usage by 60% - 90% without accuracy loss and provides significant compression gain compared to compression after training. Moreover, Backslash proves to be highly versatile: it enhances generalization with small Lagrange multipliers, improves model robustness to pruning (maintaining accuracy even at 80% pruning rates), and enables network simplification for accelerated inference on edge devices.
zh
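
率失真式的训练目标可以写成 L = D + λ·R:D 为任务损失,R 为参数码率的估计。下面用量化权重的经验熵来近似 R,给出一个概念性的 Python 草图(实际训练需要可微的码率代理;此处的量化步长、λ 取值均为假设,仅作示意):

```python
import numpy as np

def weight_rate_bits(w, step=0.02):
    """用量化后权重的经验熵(bits/weight)近似该组参数的码率。"""
    q = np.round(w / step).astype(int)
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def rate_constrained_loss(task_loss, weights, lam=1e-4):
    """率失真式目标 L = D + λ·R:lam 越大,越偏向低码率(高压缩)的解。"""
    rate = sum(weight_rate_bits(w) * w.size for w in weights)
    return task_loss + lam * rate

rng = np.random.default_rng(0)
ws = [rng.normal(scale=0.1, size=(64, 64)), rng.normal(scale=0.1, size=64)]
print(rate_constrained_loss(task_loss=0.42, weights=ws))
```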

[AI-57] Intrinsic Barriers to Explaining Deep Foundation Models

【速读】:该论文试图解决的问题是:如何理解和解释深度基础模型(Deep Foundation Models, DFMs)的内部工作机制,以确保对其的信任、安全和责任,并探讨这些挑战是否源于模型本身的内在属性。论文的关键在于分析DFMs的基本特性以及当前可解释性方法在应对这些固有挑战时的局限性,同时探究实现满意解释的可行性,并思考这对验证和治理这些技术的意义。

链接: https://arxiv.org/abs/2504.16948
作者: Zhen Tan,Huan Liu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Deep Foundation Models (DFMs) offer unprecedented capabilities but their increasing complexity presents profound challenges to understanding their internal workings, a critical need for ensuring trust, safety, and accountability. As we grapple with explaining these systems, a fundamental question emerges: Are the difficulties we face merely temporary hurdles, awaiting more sophisticated analytical techniques, or do they stem from intrinsic barriers deeply rooted in the nature of these large-scale models themselves? This paper delves into this critical question by examining the fundamental characteristics of DFMs and scrutinizing the limitations encountered by current explainability methods when confronted with this inherent challenge. We probe the feasibility of achieving satisfactory explanations and consider the implications for how we must approach the verification and governance of these powerful technologies.
zh

[AI-58] SCRAG: Social Computing-Based Retrieval Augmented Generation for Community Response Forecasting in Social Media Environments

【速读】:本文旨在解决动态社交媒体环境中基于静态训练数据的大规模语言模型(Large Language Models, LLMs)在预测社区对真实或假设性社交媒体帖子响应时存在的局限性,如易受幻觉影响及无法有效利用最新信息的问题。为克服这些挑战,论文提出了一种名为SCRAG的预测框架,其核心解决方案是将LLMs与基于社会计算的检索增强生成(Retrieval-Augmented Generation, RAG)技术相结合。具体而言,SCRAG通过检索目标社区的历史响应以捕捉其意识形态、语义及情感特征,并结合新闻文章等外部知识源注入时间敏感的上下文信息,从而实现对新帖子或叙事下目标社区响应的精准预测。实验结果表明,相比传统方法,SCRAG在六个不同场景下的主要评估指标平均提升了超过10%,且能够有效捕捉多样化的意识形态与细微差别。这一工具对于需要准确理解社区反馈的应用场景具有重要意义。

链接: https://arxiv.org/abs/2504.16947
作者: Dachun Sun,You Lyu,Jinning Li,Yizhuo Chen,Tianshi Wang,Tomoyoshi Kimura,Tarek Abdelzaher
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces SCRAG, a prediction framework inspired by social computing, designed to forecast community responses to real or hypothetical social media posts. SCRAG can be used by public relations specialists (e.g., to craft messaging in ways that avoid unintended misinterpretations) or public figures and influencers (e.g., to anticipate social responses), among other applications related to public sentiment prediction, crisis management, and social what-if analysis. While large language models (LLMs) have achieved remarkable success in generating coherent and contextually rich text, their reliance on static training data and susceptibility to hallucinations limit their effectiveness at response forecasting in dynamic social media environments. SCRAG overcomes these challenges by integrating LLMs with a Retrieval-Augmented Generation (RAG) technique rooted in social computing. Specifically, our framework retrieves (i) historical responses from the target community to capture their ideological, semantic, and emotional makeup, and (ii) external knowledge from sources such as news articles to inject time-sensitive context. This information is then jointly used to forecast the responses of the target community to new posts or narratives. Extensive experiments across six scenarios on the X platform (formerly Twitter), tested with various embedding models and LLMs, demonstrate over 10% improvements on average in key evaluation metrics. A concrete example further shows its effectiveness in capturing diverse ideologies and nuances. Our work provides a social computing tool for applications where accurate and concrete insights into community responses are crucial.
zh
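
SCRAG 的检索增强步骤可以概括为:分别从目标社区的历史回应与外部新闻中检索相关条目,再拼装进预测提示。以下草图用词袋余弦相似度代替真实的向量检索,语料与提示模板均为演示假设:

```python
from collections import Counter
import math

def cos(a: Counter, b: Counter) -> float:
    """词袋余弦相似度(占位,实际应使用句向量模型)。"""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, corpus, k=2):
    qv = Counter(query.lower().split())
    ranked = sorted(corpus, reverse=True,
                    key=lambda d: cos(qv, Counter(d.lower().split())))
    return ranked[:k]

def build_prompt(post, history, news):
    """SCRAG 式提示拼装:历史回应刻画社区画像,新闻注入时效上下文。"""
    h = retrieve(post, history)
    n = retrieve(post, news, k=1)
    return ("已知该社区的历史回应:\n- " + "\n- ".join(h)
            + "\n相关新闻背景:\n- " + "\n- ".join(n)
            + f"\n请预测该社区对以下帖子的回应:{post}")

history = ["we never trust official data", "prices are rigged again",
           "great news for local jobs"]
news = ["city council approves new transit budget",
        "fuel prices hit record high"]
print(build_prompt("fuel prices going up again", history, news))
```

拼装好的提示交给 LLM 生成回应预测,即完成框架的生成一步。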

[AI-59] MobileCity: An Efficient Framework for Large-Scale Urban Behavior Simulation

【速读】:该论文试图解决现有生成式代理(Generative Agents)方法在模拟现代城市复杂交通选择时过于简化的局限性,并克服在大规模人口仿真中所需的高昂计算资源需求的问题。解决方案的关键在于首先构建了一个包含多种功能建筑与交通工具的虚拟城市,并通过广泛的调查建模不同人群的行为选择与移动偏好。在此基础上,提出了一种能够捕捉城市流动性复杂性的可扩展仿真框架,实现了超过4000个代理的模拟。此外,论文还通过微观和宏观层面的分析验证了生成行为的逼真度,并进一步探索了从运动模式预测人群密度以及识别不同人口群体车辆偏好的实验。

链接: https://arxiv.org/abs/2504.16946
作者: Xiaotong Ye,Nicolas Bougie,Toshihiko Yamasaki,Narimasa Watanabe
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative agents offer promising capabilities for simulating realistic urban behaviors. However, existing methods oversimplify transportation choices in modern cities, and require prohibitive computational resources for large-scale population simulation. To address these limitations, we first present a virtual city that features multiple functional buildings and transportation modes. Then, we conduct extensive surveys to model behavioral choices and mobility preferences among population groups. Building on these insights, we introduce a simulation framework that captures the complexity of urban mobility while remaining scalable, enabling the simulation of over 4,000 agents. To assess the realism of the generated behaviors, we perform a series of micro and macro-level analyses. Beyond mere performance comparison, we explore insightful experiments, such as predicting crowd density from movement patterns and identifying trends in vehicle preferences across agent demographics.
zh

[AI-60] Rational Inference in Formal Concept Analysis

【速读】:该论文试图解决如何在形式概念分析(Formal Concept Analysis, FCA)中引入非单调推理(non-monotonic inference)的问题。传统FCA中的蕴涵(implications)无法有效处理包含错误或例外的数据,而现有的非单调推理研究主要集中在命题逻辑(propositional logic)的KLM框架上。论文的关键解决方案是基于KLM框架,构建适用于FCA的可废止推理(defeasible reasoning)方法,并证明该方法忠实地保留了原始KLM框架中描述的非单调推理原则。此外,论文提出,与命题逻辑相比,所提出的FCA中的可废止推理提供了更贴近上下文的推理视角,能够得出更加相关的结论。

链接: https://arxiv.org/abs/2504.16938
作者: Lucas Carr,Nicholas Leisegang,Thomas Meyer,Sergei Obiedkov
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Defeasible conditionals are a form of non-monotonic inference which enable the expression of statements like "if \phi then normally \psi ". The KLM framework defines a semantics for the propositional case of defeasible conditionals by construction of a preference ordering over possible worlds. The pattern of reasoning induced by these semantics is characterised by consequence relations satisfying certain desirable properties of non-monotonic reasoning. In FCA, implications are used to describe dependencies between attributes. However, these implications are unsuitable to reason with erroneous data or data prone to exceptions. Until recently, the topic of non-monotonic inference in FCA has remained largely uninvestigated. In this paper, we provide a construction of the KLM framework for defeasible reasoning in FCA and show that this construction remains faithful to the principle of non-monotonic inference described in the original framework. We present an additional argument that, while remaining consistent with the original ideas around non-monotonic reasoning, the defeasible reasoning we propose in FCA offers a more contextual view on inference, providing the ability for more relevant conclusions to be drawn when compared to the propositional case.
zh

[AI-61] A Framework for the Assurance of AI-Enabled Systems

【速读】:该论文试图解决人工智能系统(Artificial Intelligence-enabled System, AIES)在国防应用中的快速部署与确保其可信性之间的矛盾问题。具体而言,它关注如何在加速开发和部署的同时,有效管理风险并提供充分的保障,以确保AI系统的预期价值得以实现,并避免引入不可接受的风险。论文的关键解决方案在于提出了一种基于主张(claims-based)的框架,用于AI系统的风险管理与保障。该框架旨在通过明确的评估过程和定义,平衡快速部署的需求与严格的验证需求,从而在不牺牲安全性或信任度的前提下,支持国防部迅速部署有效的AI能力。

链接: https://arxiv.org/abs/2504.16937
作者: Ariel S. Kapusta(1),David Jin(2),Peter M. Teague(2),Robert A. Houston(2),Jonathan B. Elliott(2),Grace Y. Park(2),Shelby S. Holdren(3) ((1) The MITRE Corporation, (2) Office of the Chief Digital and Artificial Intelligence Officer, (3) John Hopkins University Applied Physics Laboratory)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures, published in conference proceedings of SPIE Defense and Commercial Sensing conference on Assurance and Security for AI-enabled Systems 2025

点击查看摘要

Abstract:The United States Department of Defense (DOD) looks to accelerate the development and deployment of AI capabilities across a wide spectrum of defense applications to maintain strategic advantages. However, many common features of AI algorithms that make them powerful, such as capacity for learning, large-scale data ingestion, and problem-solving, raise new technical, security, and ethical challenges. These challenges may hinder adoption due to uncertainty in development, testing, assurance, processes, and requirements. Trustworthiness through assurance is essential to achieve the expected value from AI. This paper proposes a claims-based framework for risk management and assurance of AI systems that addresses the competing needs for faster deployment, successful adoption, and rigorous evaluation. This framework supports programs across all acquisition pathways in providing grounds for sufficient confidence that an AI-enabled system (AIES) meets its intended mission goals without introducing unacceptable risks throughout its lifecycle. The paper’s contributions are a framework process for AI assurance, a set of relevant definitions to enable constructive conversations on the topic of AI assurance, and a discussion of important considerations in AI assurance. The framework aims to provide the DOD a robust yet efficient mechanism for swiftly fielding effective AI capabilities without overlooking critical risks or undermining stakeholder trust.
zh

[AI-62] Deciphering the unique dynamic activation pathway in a G protein-coupled receptor enables unveiling biased signaling and identifying cryptic allosteric sites in conformational intermediates

【速读】:该论文旨在探究神经紧张素受体1(Neurotensin Receptor 1, NTSR1)激活及其偏向性信号转导的分子机制。研究试图揭示NTSR1偏向性信号转导的动态过渡步骤及激活网络,并深入分析其极性网络、非保守离子锁以及芳香簇之间的复杂相互作用。此外,研究还发现了一个位于受体胞内区域的隐蔽别构位点,该位点在激活路径中的中间态存在。关键在于结合计算模拟(如nudged elastic band分子动力学模拟与Markov状态模型)和实验技术(如定点突变与构象生物传感器),系统性地解析NTSR1激活与信号偏向性的原子级机制,为开发NTSR1别构调节剂提供潜在策略。

链接: https://arxiv.org/abs/2504.17624
作者: Jigang Fan,Chunhao Zhu,Xiaobing Lan,Haiming Zhuang,Mingyu Li,Jian Zhang,Shaoyong Lu
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neurotensin receptor 1 (NTSR1), a member of the Class A G protein-coupled receptor superfamily, plays an important role in modulating dopaminergic neuronal activity and eliciting opioid-independent analgesia. Recent studies suggest that promoting \beta-arrestin-biased signaling in NTSR1 may diminish drugs of abuse, such as psychostimulants, thereby offering a potential avenue for treating human addiction-related disorders. In this study, we utilized a novel computational and experimental approach that combined nudged elastic band-based molecular dynamics simulations, Markov state models, temporal communication network analysis, site-directed mutagenesis, and conformational biosensors, to explore the intricate mechanisms underlying NTSR1 activation and biased signaling. Our study reveals a dynamic stepwise transition mechanism and activated transmission network associated with NTSR1 activation. It also yields valuable insights into the complex interplay between the unique polar network, non-conserved ion locks, and aromatic clusters in NTSR1 signaling. Moreover, we identified a cryptic allosteric site located in the intracellular region of the receptor that exists in an intermediate state within the activation pathway. Collectively, these findings contribute to a more profound understanding of NTSR1 activation and biased signaling at the atomic level, thereby providing a potential strategy for the development of NTSR1 allosteric modulators in the realm of G protein-coupled receptor biology, biophysics, and medicine.
zh

[AI-63] On the workflow opportunities and challenges of developing foundation model in geophysics

【速读】:该论文旨在解决在地球物理学领域中,关于将基础模型(Foundation Models)与地球物理数据整合的全工作流程缺乏全面综述的问题。论文的关键解决方案在于提出一个系统性的完整框架,涵盖从数据收集与预处理到模型架构选择、预训练策略以及模型部署的全流程,并针对地球物理数据的多样性、复杂性及物理一致性约束提供针对性的解决方法。此外,论文强调利用基础模型的迁移学习能力,减少对标注数据的依赖,提升计算效率,并将物理约束融入模型训练,从而增强模型的物理一致性和可解释性。这一研究不仅填补了地球物理学领域在基础模型全工作流程综述上的空白,还为实际应用提供了有价值的指导,推动了该领域的创新与发展。

链接: https://arxiv.org/abs/2504.17384
作者: Hanlin Sheng,Xinming Wu,Hang Gao,Haibin Di,Sergey Fomel,Jintao Li,Xu Si
机构: 未知
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models, as a mainstream technology in artificial intelligence, have demonstrated immense potential across various domains in recent years, particularly in handling complex tasks and multimodal data. In the field of geophysics, although the application of foundation models is gradually expanding, there is currently a lack of comprehensive reviews discussing the full workflow of integrating foundation models with geophysical data. To address this gap, this paper presents a complete framework that systematically explores the entire process of developing foundation models in conjunction with geophysical data. From data collection and preprocessing to model architecture selection, pre-training strategies, and model deployment, we provide a detailed analysis of the key techniques and methodologies at each stage. In particular, considering the diversity, complexity, and physical consistency constraints of geophysical data, we discuss targeted solutions to address these challenges. Furthermore, we discuss how to leverage the transfer learning capabilities of foundation models to reduce reliance on labeled data, enhance computational efficiency, and incorporate physical constraints into model training, thereby improving physical consistency and interpretability. Through a comprehensive summary and analysis of the current technological landscape, this paper not only fills the gap in the geophysics domain regarding a full-process review of foundation models but also offers valuable practical guidance for their application in geophysical data analysis, driving innovation and advancement in the field.
zh

[AI-64] 3D Deep-learning-based Segmentation of Human Skin Sweat Glands and Their 3D Morphological Response to Temperature Variations

【速读】:该论文旨在解决现有观察汗腺形态方法在二维、体外及破坏性等方面的局限性,提出一种实时、无创且可量化的技术。解决方案的关键在于提出了一种基于三维变换器的多目标分割框架,该框架整合了滑动窗口方法、联合空间-通道注意力机制以及浅层与深层结构之间的异构性设计,实现了从光学相干断层扫描(OCT)获取的皮肤体积数据中精确分割三维汗腺结构,并首次可视化和量化了汗腺三维形态对温度变化的细微响应,为正常汗腺形态建立基准,同时提供了研究个体变异和病理变化的新工具,推动皮肤病学研究及临床应用的发展。

链接: https://arxiv.org/abs/2504.17255
作者: Shaoyu Pei,Renxiong Wu,Hao Zheng,Lang Qin,Shuaichen Lin,Yuxing Gan,Wenjing Huang,Zhixuan Wang,Mohan Qin,Yong Liu,Guangming Ni
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Skin, the primary regulator of heat exchange, relies on sweat glands for thermoregulation. Alterations in sweat gland morphology play a crucial role in various pathological conditions and clinical diagnoses. Current methods for observing sweat gland morphology are limited by their two-dimensional, in vitro, and destructive nature, underscoring the urgent need for real-time, non-invasive, quantifiable technologies. We proposed a novel three-dimensional (3D) transformer-based multi-object segmentation framework, integrating a sliding window approach, joint spatial-channel attention mechanism, and architectural heterogeneity between shallow and deep layers. Our proposed network enables precise 3D sweat gland segmentation from skin volume data captured by optical coherence tomography (OCT). For the first time, subtle variations of sweat gland 3D morphology in response to temperature changes, have been visualized and quantified. Our approach establishes a benchmark for normal sweat gland morphology and provides a real-time, non-invasive tool for quantifying 3D structural parameters. This enables the study of individual variability and pathological changes in sweat gland structure, advancing dermatological research and clinical applications, including thermoregulation and bromhidrosis treatment.
zh

[AI-65] Demonstration of an AI-driven workflow for dynamic x-ray spectroscopy

【速读】:该论文旨在解决传统X射线吸收近边缘结构(XANES)光谱数据采集耗时长且容易出现欠采样或过采样的问题。解决方案的关键在于提出了一种注入领域知识的贝叶斯优化方法,通过结合对光谱特征(如吸收边和预边峰)的理解,实现了仅使用常规采样所需测量点的15-20%即可准确重构吸收边,并将吸收边及尖峰能量的重建误差分别控制在0.1 eV和0.03 eV以内,同时保持整体均方根误差低于0.005。这种高效且精确的自适应采样策略显著提升了XANES实验的自动化程度,改善了时间分辨率,适用于静态与动态测量,有效解决了化学状态追踪中的效率与精度挑战。

链接: https://arxiv.org/abs/2504.17124
作者: Ming Du,Mark Wolfman,Chengjun Sun,Shelly D. Kelly,Mathew J. Cherukara
机构: 未知
类目: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:X-ray absorption near edge structure (XANES) spectroscopy is a powerful technique for characterizing the chemical state and symmetry of individual elements within materials, but requires collecting data at many energy points which can be time-consuming. While adaptive sampling methods exist for efficiently collecting spectroscopic data, they often lack domain-specific knowledge about XANES spectra structure. Here we demonstrate a knowledge-injected Bayesian optimization approach for adaptive XANES data collection that incorporates understanding of spectral features like absorption edges and pre-edge peaks. We show this method accurately reconstructs the absorption edge of XANES spectra using only 15-20% of the measurement points typically needed for conventional sampling, while maintaining the ability to determine the x-ray energy of the sharp peak after the absorption edge with errors less than 0.03 eV, the absorption edge with errors less than 0.1 eV, and overall root-mean-square errors less than 0.005 compared to traditionally sampled spectra. Our experiments on battery materials and catalysts demonstrate the method’s effectiveness for both static and dynamic XANES measurements, improving data collection efficiency and enabling better time resolution for tracking chemical changes. This approach advances the degree of automation in XANES experiments, reducing the common errors of under- or over-sampling points near the absorption edge and enabling dynamic experiments that require high temporal resolution or limited measurement time.
zh
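
下面用一段极简 Python 代码示意“注入领域知识的贝叶斯优化自适应采样”这一思路:在高斯过程预测不确定度之上,对估计出的吸收边附近能量点额外加权,使新测量优先落在信息量最大的区域。其中谱线函数、核与加权参数均为演示用的假设设定,并非论文原实现。

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def simulate_xanes(E):  # 假想“真实”谱线:一个吸收边 + 一个边后尖峰
    return 1 / (1 + np.exp(-(E - 7120) / 1.5)) + 0.4 * np.exp(-((E - 7125) ** 2) / 8)

grid = np.linspace(7100, 7160, 301)      # 候选能量点 (eV)
measured = list(grid[::40])              # 先用 8 个均匀点起步

for _ in range(20):                      # 自适应加点
    X = np.array(measured)[:, None]
    y = simulate_xanes(np.array(measured))
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=3.0), alpha=1e-4).fit(X, y)
    mu, sigma = gp.predict(grid[:, None], return_std=True)
    edge = grid[np.argmax(np.gradient(mu, grid))]                    # 最大斜率处估计吸收边
    bonus = 1 + 4 * np.exp(-((grid - edge) ** 2) / (2 * 4.0 ** 2))   # 领域知识:边附近加权
    acq = sigma * bonus
    acq[np.isin(grid, measured)] = -np.inf                           # 不重复测量
    measured.append(grid[np.argmax(acq)])
```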

[AI-66] Physics-guided and fabrication-aware inverse design of photonic devices using diffusion models

【速读】:该论文旨在解决自由形式光子器件设计中的根本性挑战,即由于可能的几何形状数量庞大以及制造约束的复杂性所带来的难题。传统逆向设计方法(无论是基于人类直觉、全局优化还是伴随梯度方法)通常包含复杂的二值化和过滤步骤,而近期的深度学习策略则需要极大量的模拟次数(10⁵到10⁶)。为克服这些限制,本文提出了一种名为AdjointDiffusion的物理引导框架,它将伴随灵敏度梯度集成到扩散模型的采样过程中。该方法的关键在于通过在训练阶段使用合成且考虑制造的二值掩模数据集来初始化扩散网络,并在推理阶段计算候选结构的伴随梯度,在每个去噪步骤中注入基于物理的指导,从而引导生成过程朝向高优值指标(FoM)的解决方案发展,而无需额外的后处理。这种方法在弯波导和CMOS图像传感器颜色路由器两个典型的光子设计问题上得到了验证,并显示出相较于最先进的非线性优化器(如MMA和SLSQP)在效率和可制造性方面的显著优势,同时所需的模拟次数比纯深度学习方法减少了几个数量级(约2×10²次对比10⁵到10⁶次)。通过消除复杂的二值化计划并最小化模拟开销,AdjointDiffusion提供了一个简化、模拟高效的、面向制造的下一代光子器件设计管道。其开源实现可在提供的链接获取。

链接: https://arxiv.org/abs/2504.17077
作者: Dongjin Seo,Soobin Um,Sangbin Lee,Jong Chul Ye,Haejun Chung
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 25 pages, 7 Figures

点击查看摘要

Abstract:Designing free-form photonic devices is fundamentally challenging due to the vast number of possible geometries and the complex requirements of fabrication constraints. Traditional inverse-design approaches–whether driven by human intuition, global optimization, or adjoint-based gradient methods–often involve intricate binarization and filtering steps, while recent deep learning strategies demand prohibitively large numbers of simulations (10^5 to 10^6). To overcome these limitations, we present AdjointDiffusion, a physics-guided framework that integrates adjoint sensitivity gradients into the sampling process of diffusion models. AdjointDiffusion begins by training a diffusion network on a synthetic, fabrication-aware dataset of binary masks. During inference, we compute the adjoint gradient of a candidate structure and inject this physics-based guidance at each denoising step, steering the generative process toward high figure-of-merit (FoM) solutions without additional post-processing. We demonstrate our method on two canonical photonic design problems–a bent waveguide and a CMOS image sensor color router–and show that our method consistently outperforms state-of-the-art nonlinear optimizers (such as MMA and SLSQP) in both efficiency and manufacturability, while using orders of magnitude fewer simulations (approximately 2 x 10^2) than pure deep learning approaches (approximately 10^5 to 10^6). By eliminating complex binarization schedules and minimizing simulation overhead, AdjointDiffusion offers a streamlined, simulation-efficient, and fabrication-aware pipeline for next-generation photonic device design. Our open-source implementation is available at this https URL.
zh
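
下面的 PyTorch 片段示意 AdjointDiffusion 的核心机制——在扩散模型反向去噪的每一步注入基于物理的梯度引导。其中 denoiser 与 fom 均为占位实现(实际应分别为训练好的噪声预测网络与由电磁仿真伴随法给出的可微品质因数),噪声日程与引导强度亦为假设值,仅用于说明采样流程。

```python
import torch

denoiser = lambda x, t: torch.zeros_like(x)      # 占位:预测噪声 eps 的去噪网络
fom = lambda x: -(x ** 2).mean()                 # 占位:可微品质因数 (FoM)

T, guidance = 50, 0.1                            # 去噪步数与引导强度(假设值)
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1 - betas
abar = torch.cumprod(alphas, dim=0)

x = torch.randn(1, 1, 64, 64)                    # 从噪声出发的候选结构掩码
for t in reversed(range(T)):
    eps = denoiser(x, t)
    mean = (x - betas[t] / torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(alphas[t])
    x_req = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(fom(x_req), x_req)[0]   # “伴随梯度”:此处由自动微分代替
    noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
    x = mean + guidance * grad + torch.sqrt(betas[t]) * noise  # 每步注入物理引导
```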

[AI-67] Fried Parameter Estimation from Single Wavefront Sensor Image with Artificial Neural Networks

【速读】:该论文旨在解决利用单个Shack-Hartmann或金字塔波前传感器图像估算Fried参数(r0)的问题,这是优化自适应光学(Adaptive Optics, AO)系统性能以及自由空间光通信信道天空剖面分析的关键。论文的关键创新在于开发了一种基于数据驱动的新方法,通过借鉴计算机视觉领域的机器学习技术来实现这一目标。这种方法能够在开放环路和闭合环路AO配置下均保持较高的准确性,并且能够从单一波前传感器图像直接估算出精度达到毫米级别的r0值,同时支持实时控制,具有显著的经济效益。

链接: https://arxiv.org/abs/2504.17029
作者: Jeffrey Smith,Taisei Fujii,Jesse Craney,Charles Gretton
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Atmospheric turbulence degrades the quality of astronomical observations in ground-based telescopes, leading to distorted and blurry images. Adaptive Optics (AO) systems are designed to counteract these effects, using atmospheric measurements captured by a wavefront sensor to make real-time corrections to the incoming wavefront. The Fried parameter, r0, characterises the strength of atmospheric turbulence and is an essential control parameter for optimising the performance of AO systems and more recently sky profiling for Free Space Optical (FSO) communication channels. In this paper, we develop a novel data-driven approach, adapting machine learning methods from computer vision for Fried parameter estimation from a single Shack-Hartmann or pyramid wavefront sensor image. Using these data-driven methods, we present a detailed simulation-based evaluation of our approach using the open-source COMPASS AO simulation tool to evaluate both the Shack-Hartmann and pyramid wavefront sensors. Our evaluation is over a range of guide star magnitudes, and realistic noise, atmospheric and instrument conditions. Remarkably, we are able to develop a single network-based estimator that is accurate in both open and closed-loop AO configurations. Our method accurately estimates the Fried parameter from a single WFS image directly from AO telemetry to a few millimetres. Our approach is suitable for real time control, exhibiting 0.83ms r0 inference times on retail NVIDIA RTX 3090 GPU hardware, and thereby demonstrating a compelling economic solution for use in real-time instrument control.
zh
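
以下为“从单幅波前传感器图像回归 Fried 参数 r0”的一个极简 CNN 草图(PyTorch)。网络结构、输入尺寸与损失均为演示用假设,并非论文所用模型;其意在说明该任务本质上是单图像的标量回归。

```python
import torch
import torch.nn as nn

class R0Estimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)       # 输出标量 r0(单位可取 cm)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = R0Estimator()
img = torch.randn(8, 1, 128, 128)          # 一批模拟的 Shack-Hartmann 传感器图像
loss = nn.functional.mse_loss(model(img).squeeze(1), torch.rand(8) * 20)
loss.backward()                            # 用 (图像, r0) 对监督训练即可
```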

机器学习

[LG-0] Unleashing the Power of Natural Audio Featuring Multiple Sound Sources

链接: https://arxiv.org/abs/2504.17782
作者: Xize Cheng,Slytherin Wang,Zehan Wang,Rongjie Huang,Tao Jin,Zhou Zhao
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Work in Progress

点击查看摘要

Abstract:Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, which is critical for artificial auditory perception. However, current methods heavily rely on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep, an innovative framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, thereby allowing effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics to quantitatively assess separation quality and use these metrics as thresholds to iteratively apply the data engine alongside model training, progressively optimizing separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks to make the best use of them. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios. For more examples and detailed results, please visit our demo page at this https URL.

[LG-1] Replay to Remember: Retaining Domain Knowledge in Streaming Language Models

链接: https://arxiv.org/abs/2504.17780
作者: Sneh Pillai(University of Massachusetts Dartmouth)
类目: Machine Learning (cs.LG)
*备注: 8 pages 3 figures, 3 tables

点击查看摘要

Abstract:Continual learning in large language models (LLMs) typically encounters the critical challenge of catastrophic forgetting, where previously acquired knowledge deteriorates upon exposure to new data. While techniques like replay buffers and parameter-efficient tuning (e.g., Low-Rank Adaptation or LoRA) have been proposed, few studies investigate real-time domain adaptation under strict computational and data-stream constraints. In this paper, we demonstrate a lightweight method combining LoRA and a minimal replay mechanism in a realistic streaming setting across three diverse knowledge domains: medical question answering, genetics, and law. Using perplexity, semantic similarity, and GPT-based human-like evaluation metrics, we quantify the model’s adaptation, forgetting, and recovery over time. Our experiments reveal that while catastrophic forgetting naturally occurs, even minimal replay significantly stabilizes and partially restores domain-specific knowledge. This study contributes practical insights for deploying adaptable LLMs in resource-constrained, real-world scenarios.
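
论文强调“极小的重放即可显著稳定领域知识”。下面给出流式微调场景下最小重放机制的一个 Python 示意:用蓄水池采样维护小容量缓冲区,每步把少量旧样本混入新批次。其中 train_step 与数据流均为占位假设,实际应为 LoRA 参数上的一步梯度更新。

```python
import random

class ReplayBuffer:
    def __init__(self, capacity=256):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, sample):                    # 蓄水池采样:对历史流保持均匀覆盖
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        elif (j := random.randrange(self.seen)) < self.capacity:
            self.data[j] = sample

    def sample(self, k):
        return random.sample(self.data, min(k, len(self.data)))

train_step = lambda batch: None               # 占位:实际应为 LoRA 参数上的梯度更新
stream = ([f"doc-{i}-{j}" for j in range(8)] for i in range(100))  # 模拟数据流

buffer = ReplayBuffer()
for batch in stream:
    replay = buffer.sample(k=4)               # 每步混入少量旧样本以抑制灾难性遗忘
    train_step(batch + replay)
    for s in batch:
        buffer.add(s)
```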

[LG-2] Disaggregated Deep Learning via In-Physics Computing at Radio Frequency

链接: https://arxiv.org/abs/2504.17752
作者: Zhihui Gao,Sri Krishna Vadlamani,Kfir Sulimany,Dirk Englund,Tingjun Chen
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Signal Processing (eess.SP); Applied Physics (physics.app-ph)
*备注: 11 pages, 4 figures. Supplementary Information: 54 pages, 20 figures, 1 table

点击查看摘要

Abstract:Modern edge devices, such as cameras, drones, and Internet-of-Things nodes, rely on deep learning to enable a wide range of intelligent applications, including object recognition, environment perception, and autonomous navigation. However, deploying deep learning models directly on the often resource-constrained edge devices demands significant memory footprints and computational power for real-time inference using traditional digital computing architectures. In this paper, we present WISE, a novel computing architecture for wireless edge networks designed to overcome energy constraints in deep learning inference. WISE achieves this goal through two key innovations: disaggregated model access via wireless broadcasting and in-physics computation of general complex-valued matrix-vector multiplications directly at radio frequency. Using a software-defined radio platform with wirelessly broadcast model weights over the air, we demonstrate that WISE achieves 95.7% image classification accuracy with ultra-low operation power of 6.0 fJ/MAC per client, corresponding to a computation efficiency of 165.8 TOPS/W. This approach enables energy-efficient deep learning inference on wirelessly connected edge devices, achieving more than two orders of magnitude improvement in efficiency compared to traditional digital computing.

[LG-3] MSGCN: Multiplex Spatial Graph Convolution Network for Interlayer Link Weight Prediction

链接: https://arxiv.org/abs/2504.17749
作者: Steven E. Wilson,Sina Khanmohammadi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have been widely used for various learning tasks, ranging from node classification to link prediction. They have demonstrated excellent performance in multiple domains involving graph-structured data. However, an important category of learning tasks, namely link weight prediction, has received less emphasis due to its increased complexity compared to binary link classification. Link weight prediction becomes even more challenging when considering multilayer networks, where nodes can be interconnected across multiple layers. To address these challenges, we propose a new method named Multiplex Spatial Graph Convolution Network (MSGCN), which spatially embeds information across multiple layers to predict interlayer link weights. The MSGCN model generalizes spatial graph convolution to multiplex networks and captures the geometric structure of nodes across multiple layers. Extensive experiments using data with known interlayer link information show that the MSGCN model has robust, accurate, and generalizable link weight prediction performance across a wide variety of multiplex network structures.

[LG-4] Embedding Empirical Distributions for Computing Optimal Transport Maps

链接: https://arxiv.org/abs/2504.17740
作者: Mingchen Jiang,Peng Xu,Xichen Ye,Xiaohui Chen,Yun Yang,Yifan Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributional data have become increasingly prominent in modern signal processing, highlighting the necessity of computing optimal transport (OT) maps across multiple probability distributions. Nevertheless, recent studies on neural OT methods predominantly focused on the efficient computation of a single map between two distributions. To address this challenge, we introduce a novel approach to learning transport maps for new empirical distributions. Specifically, we employ the transformer architecture to produce embeddings from distributional data of varying length; these embeddings are then fed into a hypernetwork to generate neural OT maps. Various numerical experiments were conducted to validate the embeddings and the generated OT maps. The model implementation and the code are provided on this https URL.
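
下面用 PyTorch 勾勒论文的两段式结构:Transformer 对变长经验分布做嵌入,再由超网络根据源/目标分布嵌入生成映射网络的权重。所有维度与层数均为演示用假设;此处的映射网络仅为普通单隐层 MLP,未强加 OT 映射的结构性约束。

```python
import torch
import torch.nn as nn

class DistEmbedder(nn.Module):
    def __init__(self, d=2, h=64):
        super().__init__()
        self.proj = nn.Linear(d, h)
        enc_layer = nn.TransformerEncoderLayer(d_model=h, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, samples):              # samples: (B, N, d) 的变长经验分布
        z = self.encoder(self.proj(samples))
        return z.mean(dim=1)                 # 对样本维池化得到分布级嵌入

class HyperOTMap(nn.Module):
    """由源/目标分布嵌入生成单隐层映射 T: R^d -> R^d 的全部权重。"""
    def __init__(self, d=2, h=64, hidden=32):
        super().__init__()
        self.d, self.hidden = d, hidden
        n_params = hidden * d + hidden + d * hidden + d
        self.hyper = nn.Linear(2 * h, n_params)

    def forward(self, x, emb_src, emb_tgt):
        p = self.hyper(torch.cat([emb_src, emb_tgt], dim=-1)).squeeze(0)
        i = 0
        W1 = p[i:i + self.hidden * self.d].view(self.hidden, self.d); i += self.hidden * self.d
        b1 = p[i:i + self.hidden]; i += self.hidden
        W2 = p[i:i + self.d * self.hidden].view(self.d, self.hidden); i += self.d * self.hidden
        b2 = p[i:]
        return torch.relu(x @ W1.T + b1) @ W2.T + b2

emb, hyper = DistEmbedder(), HyperOTMap()
src, tgt = torch.randn(1, 500, 2), torch.randn(1, 300, 2) + 3
T_x = hyper(src[0], emb(src), emb(tgt))      # 将源分布样本映射向目标分布
```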

[LG-5] Interpretable Early Detection of Parkinson's Disease through Speech Analysis

链接: https://arxiv.org/abs/2504.17739
作者: Lorenzo Simone,Mauro Giuseppe Camporeale,Vito Marco Rubino,Vincenzo Gervasi,Giovanni Dimauro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parkinson’s disease is a progressive neurodegenerative disorder affecting motor and non-motor functions, with speech impairments among its earliest symptoms. Speech impairments offer a valuable diagnostic opportunity, with machine learning advances providing promising tools for timely detection. In this research, we propose a deep learning approach for early Parkinson’s disease detection from speech recordings, which also highlights the vocal segments driving predictions to enhance interpretability. This approach seeks to associate predictive speech patterns with articulatory features, providing a basis for interpreting underlying neuromuscular impairments. We evaluated our approach using the Italian Parkinson’s Voice and Speech Database, containing 831 audio recordings from 65 participants, including both healthy individuals and patients. Our approach showed competitive classification performance compared to state-of-the-art methods, while providing enhanced interpretability by identifying key speech features influencing predictions.

[LG-6] Towards Robust LLMs: an Adversarial Robustness Measurement Framework

链接: https://arxiv.org/abs/2504.17723
作者: Natan Levy,Adiel Ashrov,Guy Katz
类目: Machine Learning (cs.LG)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) has revolutionized artificial intelligence, yet these models remain vulnerable to adversarial perturbations, undermining their reliability in high-stakes applications. While adversarial robustness in vision-based neural networks has been extensively studied, LLM robustness remains under-explored. We adapt the Robustness Measurement and Assessment (RoMA) framework to quantify LLM resilience against adversarial inputs without requiring access to model parameters. By comparing RoMA’s estimates to those of formal verification methods, we demonstrate its accuracy with minimal error margins while maintaining computational efficiency. Our empirical evaluation reveals that robustness varies significantly not only between different models but also across categories within the same task and between various types of perturbations. This non-uniformity underscores the need for task-specific robustness evaluations, enabling practitioners to compare and select models based on application-specific robustness requirements. Our work provides a systematic methodology to assess LLM robustness, advancing the development of more reliable language models for real-world deployment.
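
RoMA 一类方法的核心是:不访问模型参数,仅通过对输入施加随机扰动并统计预测保持不变的比例来估计局部鲁棒性。下面给出这一统计思路的极简示意,其中 classify 为占位的黑盒模型接口,扰动半径与采样数均为假设值。

```python
import numpy as np

rng = np.random.default_rng(0)
classify = lambda x: int(np.sum(x) > 0)           # 占位:黑盒模型接口

def robustness(x, radius=0.1, n_samples=1000):
    """局部随机扰动下预测保持不变的经验概率(蒙特卡洛估计)。"""
    base = classify(x)
    keep = 0
    for _ in range(n_samples):
        delta = rng.uniform(-radius, radius, size=x.shape)
        keep += classify(x + delta) == base
    return keep / n_samples

x = rng.normal(size=32)
print(robustness(x))   # 接近 1 表示该点附近预测稳定
```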

[LG-7] Fault Diagnosis in New Wind Turbines using Knowledge from Existing Turbines by Generative Domain Adaptation

链接: https://arxiv.org/abs/2504.17709
作者: Stefan Jonas,Angela Meyer
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Intelligent condition monitoring of wind turbines is essential for reducing downtimes. Machine learning models trained on wind turbine operation data are commonly used to detect anomalies and, eventually, operation faults. However, data-driven normal behavior models (NBMs) require a substantial amount of training data, as NBMs trained with scarce data may result in unreliable fault diagnosis. To overcome this limitation, we present a novel generative deep learning approach to make SCADA samples from one wind turbine lacking training data resemble SCADA data from wind turbines with representative training data. Through CycleGAN-based domain mapping, our method enables the application of an NBM trained on an existing wind turbine to one with severely limited data. We demonstrate our approach on field data mapping SCADA samples across 7 substantially different WTs. Our findings show significantly improved fault diagnosis in wind turbines with scarce data. Our method achieves the most similar anomaly scores to an NBM trained with abundant data, outperforming NBMs trained on scarce training data with improvements of +10.3% in F1-score when 1 month of training data is available and +16.8% when 2 weeks are available. The domain mapping approach outperforms conventional fine-tuning at all considered degrees of data scarcity, ranging from 1 to 8 weeks of training data. The proposed technique enables earlier and more reliable fault diagnosis in newly installed wind farms, demonstrating a novel and promising research direction to improve anomaly detection when faced with training data scarcity.

[LG-8] On Multivariate Financial Time Series Classification

链接: https://arxiv.org/abs/2504.17664
作者: Grégory Bournassenko
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article investigates the use of Machine Learning and Deep Learning models in multivariate time series analysis within financial markets. It compares small and big data approaches, focusing on their distinct challenges and the benefits of scaling. Traditional methods such as SVMs are contrasted with modern architectures like ConvTimeNet. The results show the importance of using and understanding Big Data in depth in the analysis and prediction of financial time series.

[LG-9] Effortless Simulation-Efficient Bayesian Inference using Tabular Foundation Models

链接: https://arxiv.org/abs/2504.17660
作者: Julius Vetter,Manuel Gloeckler,Daniel Gedon,Jakob H. Macke
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulation-based inference (SBI) offers a flexible and general approach to performing Bayesian inference: In SBI, a neural network is trained on synthetic data simulated from a model and used to rapidly infer posterior distributions for observed data. A key goal for SBI is to achieve accurate inference with as few simulations as possible, especially for expensive simulators. In this work, we address this challenge by repurposing recent probabilistic foundation models for tabular data: We show how tabular foundation models – specifically TabPFN – can be used as pre-trained autoregressive conditional density estimators for SBI. We propose Neural Posterior Estimation with Prior-data Fitted Networks (NPE-PF) and show that it is competitive with current SBI approaches in terms of accuracy for both benchmark tasks and two complex scientific inverse problems. Crucially, it often substantially outperforms them in terms of simulation efficiency, sometimes requiring orders of magnitude fewer simulations. NPE-PF eliminates the need for inference network selection, training, and hyperparameter tuning. We also show that it exhibits superior robustness to model misspecification and can be scaled to simulation budgets that exceed the context size limit of TabPFN. NPE-PF provides a new direction for SBI, where training-free, general-purpose inference models offer efficient, easy-to-use, and flexible solutions for a wide range of stochastic inverse problems.

[LG-10] polyGen: A Learning Framework for Atomic-level Polymer Structure Generation

链接: https://arxiv.org/abs/2504.17656
作者: Ayush Jain,Rampi Ramprasad
类目: Computational Engineering, Finance, and Science (cs.CE); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthetic polymeric materials underpin fundamental technologies in the energy, electronics, consumer goods, and medical sectors, yet their development still suffers from prolonged design timelines. Although polymer informatics tools have supported speedup, polymer simulation protocols continue to face significant challenges: on-demand generation of realistic 3D atomic structures that respect the conformational diversity of polymer structures. Generative algorithms for 3D structures of inorganic crystals, bio-polymers, and small molecules exist, but have not addressed synthetic polymers. In this work, we introduce polyGen, the first latent diffusion model designed specifically to generate realistic polymer structures from minimal inputs such as the repeat unit chemistry alone, leveraging a molecular encoding that captures polymer connectivity throughout the architecture. Due to a scarce dataset of only 3855 DFT-optimized polymer structures, we augment our training with DFT-optimized molecular structures, showing improvement in joint learning between similar chemical structures. We also establish structure matching criteria to benchmark our approach on this novel problem. polyGen effectively generates diverse conformations of both linear chains and complex branched structures, though its performance decreases when handling repeat units with a high atom count. Given these initial results, polyGen represents a paradigm shift in atomic-level structure generation for polymer science-the first proof-of-concept for predicting realistic atomic-level polymer conformations while accounting for their intrinsic structural flexibility.

[LG-11] TarDiff: Target-Oriented Diffusion Guidance for Synthetic Electronic Health Record Time Series Generation

链接: https://arxiv.org/abs/2504.17613
作者: Bowen Deng,Chang Xu,Hao Li,Yuhao Huang,Min Hou,Jiang Bian
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthetic Electronic Health Record (EHR) time-series generation is crucial for advancing clinical machine learning models, as it helps address data scarcity by providing more training data. However, most existing approaches focus primarily on replicating statistical distributions and temporal dependencies of real-world data. We argue that fidelity to observed data alone does not guarantee better model performance, as common patterns may dominate, limiting the representation of rare but important conditions. This highlights the need to generate synthetic samples that improve the performance of specific clinical models and fulfill their target outcomes. To address this, we propose TarDiff, a novel target-oriented diffusion framework that integrates task-specific influence guidance into the synthetic data generation process. Unlike conventional approaches that mimic training data distributions, TarDiff optimizes synthetic samples by quantifying their expected contribution to improving downstream model performance through influence functions. Specifically, we measure the reduction in task-specific loss induced by synthetic samples and embed this influence gradient into the reverse diffusion process, thereby steering the generation towards utility-optimized data. Evaluated on six publicly available EHR datasets, TarDiff achieves state-of-the-art performance, outperforming existing methods by up to 20.4% in AUPRC and 18.4% in AUROC. Our results demonstrate that TarDiff not only preserves temporal fidelity but also enhances downstream model performance, offering a robust solution to data scarcity and class imbalance in healthcare analytics.

[LG-12] Interpretable non-linear dimensionality reduction using gaussian weighted linear transformation

链接: https://arxiv.org/abs/2504.17601
作者: Erik Bergh
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:Dimensionality reduction techniques are fundamental for analyzing and visualizing high-dimensional data, with established methods like t-SNE and PCA presenting a trade-off between representational power and interpretability. This paper introduces a novel approach that bridges this gap by combining the interpretability of linear methods with the expressiveness of non-linear transformations. The proposed algorithm constructs a non-linear mapping between high-dimensional and low-dimensional spaces through a combination of linear transformations, each weighted by Gaussian functions. This architecture enables complex non-linear transformations while preserving the interpretability advantages of linear methods, as each transformation can be analyzed independently. The resulting model provides both powerful dimensionality reduction and transparent insights into the transformed space. Techniques for interpreting the learned transformations are presented, including methods for identifying suppressed dimensions and how space is expanded and contracted. These tools enable practitioners to understand how the algorithm preserves and modifies geometric relationships during dimensionality reduction. To ensure the practical utility of this algorithm, the creation of user-friendly software packages is emphasized, facilitating its adoption in both academia and industry.
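
论文的核心机制可以写成 y(x) = Σ_i w_i(x)·A_i x,其中 w_i 为以 c_i 为中心的高斯权重:每个 A_i 是可单独解读的线性变换,高斯加权则使整体映射非线性。下面是该机制的极简 numpy 示意(维度与参数均为随意设定,仅演示机制):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, K = 10, 2, 4
A = rng.normal(size=(K, d_out, d_in))        # K 个可独立解释的线性变换
centers = rng.normal(size=(K, d_in))         # 各高斯权重的中心
sigma = 2.0

def transform(x):
    logits = -np.sum((x - centers) ** 2, axis=1) / (2 * sigma ** 2)
    w = np.exp(logits); w /= w.sum()          # 归一化的高斯权重 w_i(x)
    return sum(w[i] * A[i] @ x for i in range(K))

x = rng.normal(size=d_in)
y = transform(x)     # 低维嵌入;逐个检查 A_i 即可解释映射的局部行为
```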

[LG-13] A Machine Learning Approach for Denoising and Upsampling HRTFs

链接: https://arxiv.org/abs/2504.17586
作者: Xuyi Hu,Jian Li,Lorenzo Picinali,Aidan O. T. Hogg
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The demand for realistic virtual immersive audio continues to grow, with Head-Related Transfer Functions (HRTFs) playing a key role. HRTFs capture how sound reaches our ears, reflecting unique anatomical features and enhancing spatial perception. It has been shown that personalized HRTFs improve localization accuracy, but their measurement remains time-consuming and requires a noise-free environment. Although machine learning has been shown to reduce the required measurement points and, thus, the measurement time, a controlled environment is still necessary. This paper proposes a method to address this constraint by presenting a novel technique that can upsample sparse, noisy HRTF measurements. The proposed approach combines an HRTF Denoisy U-Net for denoising and an Autoencoding Generative Adversarial Network (AE-GAN) for upsampling from three measurement points. The proposed method achieves a log-spectral distortion (LSD) error of 5.41 dB and a cosine similarity loss of 0.0070, demonstrating the method’s effectiveness in HRTF upsampling.

[LG-14] L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference

链接: https://arxiv.org/abs/2504.17584
作者: Qingyuan Liu,Liyan Chen,Yanning Yang,Haocheng Wang,Dong Du,Zhigang Mao,Naifeng Jing,Yubin Xia,Haibo Chen
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 16 pages, 11 figures

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. While HBM-based acceleration offers high bandwidth, its capacity remains constrained. Offloading data to host-side DIMMs improves capacity but introduces costly data swapping overhead. We identify that the critical memory bottleneck lies in the decoding phase of multi-head attention (MHA) exclusively, which demands substantial capacity for storing KV caches and high bandwidth for attention computation. Our key insight reveals this operation uniquely aligns with modern DIMM-based processing-in-memory (PIM) architectures, which offers scalability of both capacity and bandwidth. Based on this observation and insight, we propose L3, a hardware-software co-designed system integrating DIMM-PIM and GPU devices. L3 introduces three innovations: First, hardware redesigns resolve data layout mismatches and computational element mismatches in DIMM-PIM, enhancing LLM inference utilization. Second, communication optimization enables hiding the data transfer overhead with the computation. Third, an adaptive scheduler coordinates GPU-DIMM-PIM operations to maximize parallelism between devices. Evaluations using real-world traces show L3 achieves up to 6.1× speedup over state-of-the-art HBM-PIM solutions while significantly improving batch sizes.

[LG-15] Advancing CMA-ES with Learning-Based Cooperative Coevolution for Scalable Optimization

链接: https://arxiv.org/abs/2504.17578
作者: Hongshu Guo,Wenjie Qiu,Zeyuan Ma,Xinglin Zhang,Jun Zhang,Yue-Jiao Gong
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Recent research in Cooperative Coevolution (CC) has achieved promising progress in solving large-scale global optimization problems. However, existing CC paradigms have a primary limitation in that they require deep expertise for selecting or designing effective variable decomposition strategies. Inspired by advancements in Meta-Black-Box Optimization, this paper introduces LCC, a pioneering learning-based cooperative coevolution framework that dynamically schedules decomposition strategies during optimization processes. The decomposition strategy selector is parameterized through a neural network, which processes a meticulously crafted set of optimization status features to determine the optimal strategy for each optimization step. The network is trained via the Proximal Policy Optimization method in a reinforcement learning manner across a collection of representative problems, aiming to maximize the expected optimization performance. Extensive experimental results demonstrate that LCC not only offers certain advantages over state-of-the-art baselines in terms of optimization effectiveness and resource consumption, but it also exhibits promising transferability towards unseen problems.

[LG-16] TileLang: A Composable Tiled Programming Model for AI Systems

链接: https://arxiv.org/abs/2504.17577
作者: Lei Wang,Yu Cheng,Yining Shi,Zhengju Tang,Zhiwen Mo,Wenhao Xie,Lingxiao Ma,Yuqing Xia,Jilong Xue,Fan Yang,Zhi Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern AI workloads rely heavily on optimized computing kernels for both training and inference. These AI kernels follow well-defined data-flow patterns, such as moving tiles between DRAM and SRAM and performing a sequence of computations on those tiles. However, writing high-performance kernels remains complex despite the clarity of these patterns. Achieving peak performance requires careful, hardware-centric optimizations to fully leverage modern accelerators. While domain-specific compilers attempt to reduce the burden of writing high-performance kernels, they often struggle with usability and expressiveness gaps. In this paper, we present TileLang, a generalized tiled programming model for more efficient AI Kernel programming. TileLang decouples scheduling space (thread binding, layout, tensorize and pipeline) from dataflow, and encapsulates them as a set of customization annotations and primitives. This approach allows users to focus on the kernel’s data-flow itself, while leaving most other optimizations to compilers. We conduct comprehensive experiments on commonly-used devices; across numerous experiments, our evaluation shows that TileLang can achieve state-of-the-art performance in key kernels, demonstrating that its unified block-and-thread paradigm and transparent scheduling capabilities deliver both the power and flexibility demanded by modern AI system development.

[LG-17] Beyond Cox Models: Assessing the Performance of Machine-Learning Methods in Non-Proportional Hazards and Non-Linear Survival Analysis

链接: https://arxiv.org/abs/2504.17568
作者: Ivan Rossi,Flavio Sartori,Cesare Rollo,Giovanni Birolo,Piero Fariselli,Tiziana Sanavia
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Survival analysis often relies on Cox models, assuming both linearity and proportional hazards (PH). This study evaluates machine and deep learning methods that relax these constraints, comparing their performance with penalized Cox models on a benchmark of three synthetic and three real datasets. In total, eight different models were tested, including six non-linear models of which four were also non-PH. Although Cox regression often yielded satisfactory performance, we showed the conditions under which machine and deep learning models can perform better. Indeed, the performance of these methods has often been underestimated due to the improper use of Harrell’s concordance index (C-index) instead of more appropriate scores such as Antolini’s concordance index, which generalizes C-index in cases where the PH assumption does not hold. In addition, since occasionally high C-index models happen to be badly calibrated, combining Antolini’s C-index with Brier’s score is useful to assess the overall performance of a survival method. Results on our benchmark data showed that survival prediction should be approached by testing different methods to select the most appropriate one according to sample size, non-linearity and non-PH conditions. To allow an easy reproducibility of these tests on our benchmark data, code and documentation are freely available at this https URL.
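
论文指出 Harrell C-index 在非比例风险下会误导评估。下面给出 Harrell C-index 的朴素 numpy 实现,以说明其“可比较对”的定义(Antolini C-index 需在此基础上按时间依赖的风险做推广,此处未实现;数据为假设的玩具样例):

```python
import numpy as np

def harrell_c(time, event, risk):
    """time: 随访时间; event: 1=事件发生, 0=删失; risk: 风险分(越大越危险)。"""
    num, den = 0.0, 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] == 1 and time[i] < time[j]:   # 可比较对:i 先发生事件
                den += 1
                if risk[i] > risk[j]:
                    num += 1
                elif risk[i] == risk[j]:
                    num += 0.5
    return num / den

time  = np.array([5., 8., 3., 9., 6.])
event = np.array([1, 0, 1, 1, 0])
risk  = np.array([0.9, 0.2, 1.1, 0.1, 0.5])
print(harrell_c(time, event, risk))   # 1.0 表示排序完全一致,0.5 相当于随机
```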

[LG-18] IRA: Adaptive Interest-aware Representation and Alignment for Personalized Multi-interest Retrieval SIGIR2025

链接: https://arxiv.org/abs/2504.17529
作者: Youngjune Lee,Haeyu Jeong,Changgeon Lim,Jeong Choi,Hongjun Lim,Hangon Kim,Jiyoon Kwon,Saehun Kim
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted to SIGIR 2025 Industry Track. First two authors contributed equally

点击查看摘要

Abstract:Online community platforms require dynamic personalized retrieval and recommendation that can continuously adapt to evolving user interests and new documents. However, optimizing models to handle such changes in real-time remains a major challenge in large-scale industrial settings. To address this, we propose the Interest-aware Representation and Alignment (IRA) framework, an efficient and scalable approach that dynamically adapts to new interactions through a cumulative structure. IRA leverages two key mechanisms: (1) Interest Units that capture diverse user interests as contextual texts, while reinforcing or fading over time through cumulative updates, and (2) a retrieval process that measures the relevance between Interest Units and documents based solely on semantic relationships, eliminating dependence on click signals to mitigate temporal biases. By integrating cumulative Interest Unit updates with the retrieval process, IRA continuously adapts to evolving user preferences, ensuring robust and fine-grained personalization without being constrained by past training distributions. We validate the effectiveness of IRA through extensive experiments on real-world datasets, including its deployment in the Home Section of NAVER’s CAFE, South Korea’s leading community platform.

[LG-19] Cooperative Task Offloading through Asynchronous Deep Reinforcement Learning in Mobile Edge Computing for Future Networks

链接: https://arxiv.org/abs/2504.17526
作者: Yuelin Liu,Haiyuan Li,Xenofon Vasilakos,Rasheed Hussain,Dimitra Simeonidou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Future networks (including 6G) are poised to accelerate the realisation of Internet of Everything. However, it will result in a high demand for computing resources to support new services. Mobile Edge Computing (MEC) is a promising solution, enabling computation-intensive tasks to be offloaded from end-user devices to nearby edge servers, thereby reducing latency and energy consumption. However, relying solely on a single MEC server for task offloading can lead to uneven resource utilisation and suboptimal performance in complex scenarios. Additionally, traditional task offloading strategies specialise in centralised policy decisions, which unavoidably entail extreme transmission latency and reach a computational bottleneck. To fill the gaps, we propose a latency and energy efficient Cooperative Task Offloading framework with Transformer-driven Prediction (CTO-TP), leveraging asynchronous multi-agent deep reinforcement learning to address these challenges. This approach fosters edge-edge cooperation and decreases the synchronous waiting time by performing asynchronous training, optimising task offloading, and resource allocation across distributed networks. The performance evaluation demonstrates that the proposed CTO-TP algorithm reduces overall system latency by up to 80% and energy consumption by up to 87% compared to the baseline schemes.

[LG-20] Communication-Efficient Personalized Distributed Learning with Data and Node Heterogeneity

链接: https://arxiv.org/abs/2504.17520
作者: Zhuojun Tian,Zhaoyang Zhang,Yiwei Li,Mehdi Bennis
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
*备注: Accepcted by TCCN

点击查看摘要

Abstract:To jointly tackle the challenges of data and node heterogeneity in decentralized learning, we propose a distributed strong lottery ticket hypothesis (DSLTH), based on which a communication-efficient personalized learning algorithm is developed. In the proposed method, each local model is represented as the Hadamard product of global real-valued parameters and a personalized binary mask for pruning. The local model is learned by updating and fusing the personalized binary masks while the real-valued parameters are fixed among different agents. To further reduce the complexity of hardware implementation, we incorporate a group sparse regularization term in the loss function, enabling the learned local model to achieve structured sparsity. Then, a binary mask aggregation algorithm is designed by introducing an intermediate aggregation tensor and adding a personalized fine-tuning step in each iteration, which constrains model updates towards the local data distribution. The proposed method effectively leverages the relativity among agents while meeting personalized requirements in heterogeneous node conditions. We also provide a theoretical proof for the DSLTH, establishing it as the foundation of the proposed method. Numerical simulations confirm the validity of the DSLTH and demonstrate the effectiveness of the proposed algorithm.
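
论文将本地模型表示为“全局实值参数与个性化二值掩码的 Hadamard 积”,且训练中只更新掩码。下面用 PyTorch 给出这一表示的最小示意,其中掩码通过得分阈值化并配合直通估计器学习——该学习方式为常见简化假设,并非论文完整算法:

```python
import torch

torch.manual_seed(0)
global_W = torch.randn(64, 32)                   # 各节点共享且固定的实值参数
score = torch.randn(64, 32, requires_grad=True)  # 每个节点各自学习的掩码得分

def local_weight(keep_ratio=0.5):
    k = int(score.numel() * (1 - keep_ratio))
    thresh = score.flatten().kthvalue(k).values
    mask = (score > thresh).float()              # 前向使用硬二值掩码
    mask = mask + score - score.detach()         # 直通估计器:梯度流回 score
    return global_W * mask                       # Hadamard 积得到本地稀疏模型

W_local = local_weight()
loss = (W_local @ torch.randn(32)).pow(2).mean()
loss.backward()                                  # 只更新 score,global_W 保持不变
```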

[LG-21] Tailored minimal reservoir computing: on the bidirectional connection between nonlinearities in the reservoir and in data

链接: https://arxiv.org/abs/2504.17503
作者: Davide Prosperino,Haochun Ma,Christoph Räth
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
*备注: 13 pages, 11 figures

点击查看摘要

Abstract:We study how the degree of nonlinearity in the input data affects the optimal design of reservoir computers, focusing on how closely the model’s nonlinearity should align with that of the data. By reducing minimal RCs to a single tunable nonlinearity parameter, we explore how the predictive performance varies with the degree of nonlinearity in the reservoir. To provide controlled testbeds, we generalize to the fractional Halvorsen system, a novel chaotic system with fractional exponents. Our experiments reveal that the prediction performance is maximized when the reservoir’s nonlinearity matches the nonlinearity present in the data. In cases where multiple nonlinearities are present in the data, we find that the correlation dimension of the predicted signal is reconstructed correctly when the smallest nonlinearity is matched. We use this observation to propose a method for estimating the minimal nonlinearity in unknown time series by sweeping the reservoir exponent and identifying the transition to a successful reconstruction. Applying this method to both synthetic and real-world datasets, including financial time series, we demonstrate its practical viability. Finally, we transfer these insights to classical RC by augmenting traditional architectures with fractional, generalized reservoir states. This yields performance gains, particularly in resource-constrained scenarios such as physical reservoirs, where increasing reservoir size is impractical or economically unviable. Our work provides a principled route toward tailoring RCs to the intrinsic complexity of the systems they aim to model.
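
下面的 numpy 草图把最小储备池计算简化为单一非线性参数 p(f(u) = sign(u)·|u|^p,p=1 时退化为线性),并用岭回归读出做一步预测。按论文思路,对 p 做扫描并比较重构误差即可估计数据中的最小非线性度;此处结构与超参数均为演示假设。

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, ridge = 200, 1.5, 1e-6
Win = rng.uniform(-0.5, 0.5, size=N)
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # 将谱半径缩放到 0.9

def run_reservoir(u_seq):
    r, states = np.zeros(N), []
    for u in u_seq:
        pre = W @ r + Win * u
        r = np.sign(pre) * np.abs(pre) ** p       # 单一可调非线性
        states.append(r.copy())
    return np.array(states)

u = np.sin(np.linspace(0, 40, 1000))              # 演示信号;目标为一步预测
S = run_reservoir(u[:-1])
Wout = np.linalg.solve(S.T @ S + ridge * np.eye(N), S.T @ u[1:])
pred = S @ Wout      # 扫描 p 并比较预测误差即可“匹配”数据的非线性度
```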

[LG-22] Prototype-enhanced prediction in graph neural networks for climate applications

链接: https://arxiv.org/abs/2504.17492
作者: Nawid Keshtmand,Elena Fillola,Jeffrey Nicholas Clark,Raul Santos-Rodriguez,Matthew Rigby
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data-driven emulators are increasingly being used to learn and emulate physics-based simulations, reducing computational expense and run time. Here, we present a structured way to improve the quality of these high-dimensional emulated outputs, through the use of prototypes: an approximation of the emulator’s output passed as an input, which informs the model and leads to better predictions. We demonstrate our approach to emulate atmospheric dispersion, key for greenhouse gas emissions monitoring, by comparing a baseline model to models trained using prototypes as an additional input. The prototype models achieve better performance, even with few prototypes and even if they are chosen at random, but we show that choosing the prototypes through data-driven methods (k-means) can lead to almost 10% increased performance in some metrics.
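
“原型作为额外输入”的机制可以很小地演示出来:先用 k-means 在训练输出上选出少量原型,预测时把与当前条件最匹配的原型拼接进模型输入。下面的示意中数据、匹配规则与维度均为假设,仅说明数据流:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                    # 模拟条件(如气象驱动场特征)
Y = np.tanh(X @ rng.normal(size=(8, 16)))         # 模拟高维仿真器输出(如扩散足迹)

k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Y)
prototypes = km.cluster_centers_                  # 数据驱动选出的 k 个原型输出

def nearest_prototype(x, Xtr):
    i = np.argmin(np.linalg.norm(Xtr - x, axis=1))    # 借最相近训练条件取其簇原型
    return prototypes[km.labels_[i]]

x_new = rng.normal(size=8)
proto = nearest_prototype(x_new, X)
model_input = np.concatenate([x_new, proto])      # 条件与原型拼接后送入仿真器网络
```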

[LG-23] CHASe: Client Heterogeneity-Aware Data Selection for Effective Federated Active Learning

链接: https://arxiv.org/abs/2504.17448
作者: Jun Zhang,Jue Wang,Huan Li,Zhongle Xie,Ke Chen,Lidan Shou
类目: Machine Learning (cs.LG); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: Accepted by TKDE 2025

点击查看摘要

Abstract:Active learning (AL) reduces human annotation costs for machine learning systems by strategically selecting the most informative unlabeled data for annotation, but performing it individually may still be insufficient due to restricted data diversity and annotation budget. Federated Active Learning (FAL) addresses this by facilitating collaborative data selection and model training, while preserving the confidentiality of raw data samples. Yet, existing FAL methods fail to account for the heterogeneity of data distribution across clients and the associated fluctuations in global and local model parameters, adversely affecting model accuracy. To overcome these challenges, we propose CHASe (Client Heterogeneity-Aware Data Selection), specifically designed for FAL. CHASe focuses on identifying those unlabeled samples with high epistemic variations (EVs), which notably oscillate around the decision boundaries during training. To achieve both effectiveness and efficiency, CHASe encompasses techniques for 1) tracking EVs by analyzing inference inconsistencies across training epochs, 2) calibrating decision boundaries of inaccurate models with a new alignment loss, and 3) enhancing data selection efficiency via a data freeze and awaken mechanism with subset sampling. Experiments show that CHASe surpasses various established baselines in terms of effectiveness and efficiency, validated across diverse datasets, model complexities, and heterogeneous federation settings.

[LG-24] Coding for Computation: Efficient Compression of Neural Networks for Reconfigurable Hardware

链接: https://arxiv.org/abs/2504.17403
作者: Hans Rosenberger,Rodrigo Fischer,Johanna S. Fröhlich,Ali Bereyhi,Ralf R. Müller
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注: Accepted at the 2025 IEEE Statistical Signal Processing (SSP) Workshop, Edinburgh

点击查看摘要

Abstract:As state of the art neural networks (NNs) continue to grow in size, their resource-efficient implementation becomes ever more important. In this paper, we introduce a compression scheme that reduces the number of computations required for NN inference on reconfigurable hardware such as FPGAs. This is achieved by combining pruning via regularized training, weight sharing and linear computation coding (LCC). Contrary to common NN compression techniques, where the objective is to reduce the memory used for storing the weights of the NNs, our approach is optimized to reduce the number of additions required for inference in a hardware-friendly manner. The proposed scheme achieves competitive performance for simple multilayer perceptrons, as well as for large scale deep NNs such as ResNet-34.
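
下面的 numpy 片段示意该压缩流程中的前两步——剪枝(此处简化为幅度剪枝,代替正则化训练)与权重共享(k-means 码本量化);线性计算编码(LCC)本身未实现。阈值与码本大小均为假设值:剪枝直接减少乘加次数,而共享值使同一输出可按码本分组先累加、再做少量乘法。

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))

# 第一步:幅度剪枝,置零 70% 的小权重
W_pruned = np.where(np.abs(W) > np.quantile(np.abs(W), 0.7), W, 0.0)

# 第二步:对非零权重做 k-means,聚成 16 个共享值(码本)
nz = W_pruned[W_pruned != 0].reshape(-1, 1)
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(nz)
codebook = km.cluster_centers_.ravel()
W_shared = np.zeros_like(W_pruned)
W_shared[W_pruned != 0] = codebook[km.labels_]

x = rng.normal(size=128)
print(np.linalg.norm(W @ x - W_shared @ x) / np.linalg.norm(W @ x))  # 相对压缩误差
```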

[LG-25] On-Device Qwen 2.5: Efficient LLM Inference with Model Compression and Hardware Acceleration

链接: https://arxiv.org/abs/2504.17376
作者: Maoyang Xiang,Ramesh Fernando,Bo Wang
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer-based Large Language Models (LLMs) have significantly advanced AI capabilities but pose considerable challenges for deployment on edge devices due to high computational demands, memory bandwidth constraints, and energy consumption. This paper addresses these challenges by presenting an efficient framework for deploying the Qwen2.5-0.5B model on the Xilinx Kria KV260 edge platform, a heterogeneous system integrating an ARM Cortex-A53 CPU with reconfigurable FPGA logic. Leveraging Activation-aware Weight Quantization (AWQ) with FPGA-accelerated execution pipelines, the proposed approach enhances both model compression rate and system throughput. Additionally, we propose a hybrid execution strategy that intelligently offloads compute-intensive operations to the FPGA while utilizing the CPU for lighter tasks, effectively balancing the computational workload and maximizing overall performance. Our framework achieves a model compression rate of 55.08% compared to the original model and produces output at a rate of 5.1 tokens per second, outperforming the baseline performance of 2.8 tokens per second.

[LG-26] Doubly Adaptive Social Learning

链接: https://arxiv.org/abs/2504.17370
作者: Marco Carpentiero,Virginia Bordignon,Vincenzo Matta,Ali H. Sayed
类目: Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:In social learning, a network of agents assigns probability scores (beliefs) to some hypotheses of interest, which rule the generation of local streaming data observed by each agent. Belief formation takes place by means of an iterative two-step procedure where: i) the agents update locally their beliefs by using some likelihood model; and ii) the updated beliefs are combined with the beliefs of the neighboring agents, using a pooling rule. This procedure can fail to perform well in the presence of dynamic drifts, leading the agents to incorrect decision making. Here, we focus on the fully online setting where both the true hypothesis and the likelihood models can change over time. We propose the doubly adaptive social learning (A²SL) strategy, which infuses social learning with the necessary adaptation capabilities. This goal is achieved by exploiting two adaptation stages: i) a stochastic gradient descent update to learn and track the drifts in the decision model; and ii) an adaptive belief update to track the true hypothesis changing over time. These stages are controlled by two adaptation parameters that govern the evolution of the error probability for each agent. We show that all agents learn consistently for sufficiently small adaptation parameters, in the sense that they ultimately place all their belief mass on the true hypothesis. In particular, the probability of choosing the wrong hypothesis converges to values on the order of the adaptation parameters. The theoretical analysis is illustrated both on synthetic data and by applying the A²SL strategy to a social learning problem in the online setting using real data.
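
下面给出自适应社会学习信念更新的极简 numpy 示意:每步先做带自适应参数 δ 的本地更新 ψ_k ∝ μ_k^{1-δ}·L_k(x|θ)^δ,再做邻居间对数信念的几何池化。似然模型、网络拓扑与 δ 取值均为演示假设;示例中途切换真实假设,以展示对漂移的跟踪。

```python
import numpy as np

rng = np.random.default_rng(0)
K, delta = 4, 0.1
A = np.full((K, K), 1 / K)                    # 双随机组合矩阵(全连接平均)
mu = np.full((K, 2), 0.5)                     # 每个体对两个假设的初始信念

def likelihoods(x):                           # 高斯似然:θ0 均值 0,θ1 均值 1
    return np.stack([np.exp(-(x - m) ** 2 / 2) for m in (0.0, 1.0)], axis=1)

true_mean = 0.0
for t in range(400):
    if t == 200:
        true_mean = 1.0                       # 中途发生“真假设漂移”
    x = rng.normal(true_mean, 1.0, size=K)
    log_psi = (1 - delta) * np.log(mu) + delta * np.log(likelihoods(x))
    log_mu = A.T @ log_psi                    # 几何池化 = 对数信念的凸组合
    mu = np.exp(log_mu - log_mu.max(axis=1, keepdims=True))
    mu /= mu.sum(axis=1, keepdims=True)
# δ 越大跟踪漂移越快但稳态误差越大;δ→0 退化为经典(非自适应)社会学习
```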

[LG-27] Machine learning-based condition monitoring of powertrains in modern electric drives

链接: https://arxiv.org/abs/2504.17305
作者: Dinan Li,Panagiotis Kakosimos,Luca Peretti
类目: Machine Learning (cs.LG)
*备注: 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:The recent technological advances in digitalization have revolutionized the industrial sector. Leveraging data analytics has now enabled the collection of deep insights into the performance and, as a result, the optimization of assets. Industrial drives, for example, already accumulate all the necessary information to control electric machines. These signals include but are not limited to currents, frequency, and temperature. Integrating machine learning (ML) models responsible for predicting the evolution of those directly collected or implicitly derived parameters enhances the smartness of industrial systems even further. In this article, data already residing in most modern electric drives has been used to develop a data-driven thermal model of a power module. A test bench has been designed and used specifically for training and validating the thermal digital twin undergoing various static and dynamic operating profiles. Different approaches, from traditional linear models to deep neural networks, have been implemented to emanate the best ML model for estimating the case temperature of a power module. Several evaluation metrics were then used to assess the investigated methods’ performance and implementation in industrial embedded systems.

[LG-28] The Ultimate Cookbook for Invisible Poison: Crafting Subtle Clean-Label Text Backdoors with Style Attributes

链接: https://arxiv.org/abs/2504.17300
作者: Wencong You,Daniel Lowd
类目: Machine Learning (cs.LG)
*备注: Accepted at SaTML 2025

点击查看摘要

Abstract:Backdoor attacks on text classifiers can cause them to predict a predefined label when a particular “trigger” is present. Prior attacks often rely on triggers that are ungrammatical or otherwise unusual, leading to conspicuous attacks. As a result, human annotators, who play a critical role in curating training data in practice, can easily detect and filter out these unnatural texts during manual inspection, reducing the risk of such attacks. We argue that a key criterion for a successful attack is for text with and without triggers to be indistinguishable to humans. However, prior work neither directly nor comprehensively evaluated attack subtlety and invisibility with human involvement. We bridge the gap by conducting thorough human evaluations to assess attack subtlety. We also propose AttrBkd, consisting of three recipes for crafting subtle yet effective trigger attributes, such as extracting fine-grained attributes from existing baseline backdoor attacks. Our human evaluations find that AttrBkd with these baseline-derived attributes is often more effective (higher attack success rate) and more subtle (fewer instances detected by humans) than the original baseline backdoor attacks, demonstrating that backdoor attacks can bypass detection by being inconspicuous and appearing natural even upon close inspection, while still remaining effective. Our human annotation also provides information not captured by automated metrics used in prior work, and demonstrates the misalignment of these metrics with human judgment.

[LG-29] HeRB: Heterophily-Resolved Structure Balancer for Graph Neural Networks

链接: https://arxiv.org/abs/2504.17276
作者: Ke-Jia Chen,Wenhui Mu,Zheng Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent research has witnessed the remarkable progress of Graph Neural Networks (GNNs) in the realm of graph data representation. However, GNNs still encounter the challenge of structural imbalance. Prior solutions to this problem did not take graph heterophily into account, namely that connected nodes may possess distinct labels or features, thus resulting in a deficiency in effectiveness. Upon verifying the impact of heterophily on solving the structural imbalance problem, we propose to rectify the heterophily first and then transfer homophilic knowledge. To this end, we devise a method named HeRB (Heterophily-Resolved Structure Balancer) for GNNs. HeRB consists of two innovative components: 1) A heterophily-lessening augmentation module which serves to reduce inter-class edges and increase intra-class edges; 2) A homophilic knowledge transfer mechanism to convey homophilic information from head nodes to tail nodes. Experimental results demonstrate that HeRB achieves superior performance on two homophilic and six heterophilic benchmark datasets, and the ablation studies further validate the efficacy of the two proposed components.

[LG-30] Signal Recovery from Random Dot-Product Graphs Under Local Differential Privacy

链接: https://arxiv.org/abs/2504.17274
作者: Siddharth Vishwanath,Jonathan Hehir
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of recovering latent information from graphs under \varepsilon -edge local differential privacy where the presence of relationships/edges between two users/vertices remains confidential, even from the data curator. For the class of generalized random dot-product graphs, we show that a standard local differential privacy mechanism induces a specific geometric distortion in the latent positions. Leveraging this insight, we show that consistent recovery of the latent positions is achievable by appropriately adjusting the statistical inference procedure for the privatized graph. Furthermore, we prove that our procedure is nearly minimax-optimal under local edge differential privacy constraints. Lastly, we show that this framework allows for consistent recovery of geometric and topological information underlying the latent positions, as encoded in their persistence diagrams. Our results extend previous work from the private community detection literature to a substantially richer class of models and inferential tasks.
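
The abstract does not spell out the privacy mechanism, but a standard baseline for ε-edge local differential privacy is randomized response applied independently to each potential edge; the sketch below illustrates that textbook mechanism, not the authors' inference procedure.

```python
# Minimal sketch of randomized response for epsilon-edge local differential
# privacy: each potential edge is reported truthfully with probability
# e^eps / (1 + e^eps) and flipped otherwise. Standard mechanism, illustrative only.
import numpy as np

def privatize_adjacency(A, eps, rng):
    """Apply randomized response to each entry of a 0/1 adjacency matrix."""
    p_keep = np.exp(eps) / (1.0 + np.exp(eps))
    flips = rng.random(A.shape) > p_keep
    A_priv = np.where(flips, 1 - A, A)
    # Symmetrize so the output is still an undirected graph without self-loops.
    iu = np.triu_indices_from(A_priv, k=1)
    A_sym = np.zeros_like(A_priv)
    A_sym[iu] = A_priv[iu]
    return A_sym + A_sym.T

rng = np.random.default_rng(0)
A = (rng.random((50, 50)) < 0.1).astype(int)
A = np.triu(A, 1); A = A + A.T
print("edges after privatization:", privatize_adjacency(A, eps=1.0, rng=rng).sum() // 2)
```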

[LG-31] Rate-Distortion-Perception Theory for the Quadratic Wasserstein Space

链接: https://arxiv.org/abs/2504.17236
作者: Xiqiang Qu,Jun Chen,Lei Yu,Xiangyu Xu
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We establish a single-letter characterization of the fundamental distortion-rate-perception tradeoff with limited common randomness under the squared error distortion measure and the squared Wasserstein-2 perception measure. Moreover, it is shown that this single-letter characterization can be explicitly evaluated for the Gaussian source. Various notions of universal representation are also clarified.

[LG-32] Multi-Modal Traffic Analysis: Integrating Time-Series Forecasting, Accident Prediction, and Image Classification

链接: https://arxiv.org/abs/2504.17232
作者: Nivedita M,Yasmeen Shajitha S
类目: Machine Learning (cs.LG)
*备注: 5 pages, 10 figures

点击查看摘要

Abstract:This study proposes an integrated machine learning framework for advanced traffic analysis, combining time-series forecasting, classification, and computer vision techniques. The system utilizes an ARIMA(2,0,1) model for traffic prediction (MAE: 2.1), an XGBoost classifier for accident severity classification (100% accuracy on balanced data), and a Convolutional Neural Network (CNN) for traffic image classification (92% accuracy). Tested on diverse datasets, the framework outperforms baseline models and identifies key factors influencing accident severity, including weather and road infrastructure. Its modular design supports deployment in smart city systems for real-time monitoring, accident prevention, and resource optimization, contributing to the evolution of intelligent transportation systems.
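
A minimal sketch of the time-series component, using the ARIMA(2,0,1) configuration named in the abstract on synthetic data; the dataset and preprocessing are assumptions.

```python
# Minimal sketch: fit the ARIMA(2,0,1) order named in the abstract on a
# synthetic hourly traffic series and report holdout MAE. Illustrative only.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
t = np.arange(300)
series = 50 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(scale=2, size=300)  # toy daily cycle

train, test = series[:250], series[250:]
model = ARIMA(train, order=(2, 0, 1)).fit()
forecast = model.forecast(steps=len(test))
print("MAE:", np.abs(test - forecast).mean())
```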

[LG-33] High-Fidelity And Complex Test Data Generation For Real-World SQL Code Generation Services

链接: https://arxiv.org/abs/2504.17203
作者: Shivasankari Kannan,Yeounoh Chung,Amita Gondi,Tristan Swadell,Fatma Ozcan
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The demand for high-fidelity test data is paramount in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short, struggling with low fidelity and the ability to model complex data structures and semantic relationships that are critical for testing complex SQL code generation services like Natural Language to SQL (NL2SQL). In this paper, we address the critical need for generating syntactically correct and semantically "meaningful" mock data for complex schemas that include columns with nested structures that we frequently encounter in Google SQL code generation workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex schemas, as well as the lack of semantically coherent test data, which leads to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate realistic high-fidelity test data that adheres to complex structural constraints and maintains semantic integrity to the test targets (SQL queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, ensuring robust evaluation of SQL code generation services, like NL2SQL and SQL Code Assistant services. Our results demonstrate the practical utility of an out-of-the-box LLM (gemini) based test data generation for industrial SQL code generation services where generating realistic test data is essential due to the frequent unavailability of production datasets.

[LG-34] A Double-Norm Aggregated Tensor Latent Factorization Model for Temporal-Aware Traffic Speed Imputation

链接: https://arxiv.org/abs/2504.17196
作者: Jiawen Hou,Hao Wu
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:In intelligent transportation systems (ITS), traffic management departments rely on sensors, cameras, and GPS devices to collect real-time traffic data. Traffic speed data is often incomplete due to sensor failures, data transmission delays, or occlusions, resulting in missing speed data in certain road segments. Currently, tensor decomposition based methods are extensively utilized; they mostly rely on the $L_2$-norm to construct their learning objectives, which leads to reduced robustness in the algorithms. To address this, we propose Temporal-Aware Traffic Speed Imputation (TATSI), which combines the $L_2$-norm and smooth $L_1$ ($SL_1$)-norm in its loss function, thereby achieving both high accuracy and robust performance in imputing missing time-varying traffic speed data. TATSI adopts a single latent factor-dependent, nonnegative, and multiplicative update (SLF-NMU) approach, which serves as an efficient solver for performing nonnegative latent factor analysis (LFA) on a tensor. Empirical studies on three real-world time-varying traffic speed datasets demonstrate that, compared with state-of-the-art traffic speed predictors, TATSI more precisely captures temporal patterns, thereby yielding the most accurate imputations for missing traffic speed data.
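
A minimal sketch of a double-norm objective in the spirit described above: an $L_2$ term plus a smooth-$L_1$ (Huber-style) term on the residual. The weighting and threshold are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of a double-norm loss: L2 plus smooth-L1 on the residual.
# The mixing weight `alpha` and threshold `beta` are assumptions.
import numpy as np

def smooth_l1(r, beta=1.0):
    """Huber-style smooth L1: quadratic near zero, linear in the tails."""
    absr = np.abs(r)
    return np.where(absr < beta, 0.5 * r ** 2 / beta, absr - 0.5 * beta)

def double_norm_loss(y_true, y_pred, alpha=0.5, beta=1.0):
    r = y_true - y_pred
    return alpha * np.mean(r ** 2) + (1 - alpha) * np.mean(smooth_l1(r, beta))

print(double_norm_loss(np.array([60.0, 55.0]), np.array([58.0, 70.0])))
```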

[LG-35] Lessons from Deploying Learning-based CSI Localization on a Large-Scale ISAC Platform

链接: https://arxiv.org/abs/2504.17173
作者: Tianyu Zhang,Dongheng Zhang,Ruixu Geng,Xuecheng Xie,Shuai Yang,Yan Chen
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, Channel State Information (CSI), recognized for its fine-grained spatial characteristics, has attracted increasing attention in WiFi-based indoor localization. However, despite its potential, CSI-based approaches have yet to achieve the same level of deployment scale and commercialization as those based on Received Signal Strength Indicator (RSSI). A key limitation lies in the fact that most existing CSI-based systems are developed and evaluated in controlled, small-scale environments, limiting their generalizability. To bridge this gap, we explore the deployment of a large-scale CSI-based localization system involving over 400 Access Points (APs) in a real-world building under the Integrated Sensing and Communication (ISAC) paradigm. We highlight two critical yet often overlooked factors: the underutilization of unlabeled data and the inherent heterogeneity of CSI measurements. To address these challenges, we propose a novel CSI-based learning framework for WiFi localization, tailored for large-scale ISAC deployments on the server side. Specifically, we employ a novel graph-based structure to model heterogeneous CSI data and reduce redundancy. We further design a pretext pretraining task that incorporates spatial and temporal priors to effectively leverage large-scale unlabeled CSI data. Complementarily, we introduce a confidence-aware fine-tuning strategy to enhance the robustness of localization results. In a leave-one-smartphone-out experiment spanning five floors and 25,600 m², we achieve a median localization error of 2.17 meters and a floor accuracy of 99.49%. This performance corresponds to an 18.7% reduction in mean absolute error (MAE) compared to the best-performing baseline.

[LG-36] PACE: A Framework for Learning and Control in Linear Incomplete-Information Differential Games

链接: https://arxiv.org/abs/2504.17128
作者: Seyed Yousef Soltanian,Wenlong Zhang
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Accepted to 7th Annual Conference on Learning for Dynamics and Control (L4DC) 2025. Camera-ready version using the official PMLR template. The full version including appendix and proofs

点击查看摘要

Abstract:In this paper, we address the problem of a two-player linear quadratic differential game with incomplete information, a scenario commonly encountered in multi-agent control, human-robot interaction (HRI), and approximation methods for solving general-sum differential games. While solutions to such linear differential games are typically obtained through coupled Riccati equations, the complexity increases when agents have incomplete information, particularly when neither is aware of the other’s cost function. To tackle this challenge, we propose a model-based Peer-Aware Cost Estimation (PACE) framework for learning the cost parameters of the other agent. In PACE, each agent treats its peer as a learning agent rather than a stationary optimal agent, models their learning dynamics, and leverages this dynamic to infer the cost function parameters of the other agent. This approach enables agents to infer each other’s objective function in real time based solely on their previous state observations and dynamically adapt their control policies. Furthermore, we provide a theoretical guarantee for the convergence of parameter estimation and the stability of system states in PACE. Additionally, in our numerical studies, we demonstrate how modeling the learning dynamics of the other agent benefits PACE, compared to approaches that approximate the other agent as having complete information, particularly in terms of stability and convergence speed.

[LG-37] Discovering the Precursors of Traffic Breakdowns Using Spatiotemporal Graph Attribution Networks

链接: https://arxiv.org/abs/2504.17109
作者: Zhaobin Mo,Xiangyi Liao,Dominik A. Karbowski,Yanbing Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding and predicting the precursors of traffic breakdowns is critical for improving road safety and traffic flow management. This paper presents a novel approach combining spatiotemporal graph neural networks (ST-GNNs) with Shapley values to identify and interpret traffic breakdown precursors. By extending Shapley explanation methods to a spatiotemporal setting, our proposed method bridges the gap between black-box neural network predictions and interpretable causes. We demonstrate the method on the Interstate-24 data, and identify that road topology and abrupt braking are major factors that lead to traffic breakdowns.

[LG-38] GeoRDF2Vec: Learning Location-Aware Entity Representations in Knowledge Graphs ESWC2025

链接: https://arxiv.org/abs/2504.17099
作者: Martin Boeckling,Heiko Paulheim,Sarah Detzler
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 18 pages, ESWC 2025

点击查看摘要

Abstract:Many knowledge graphs contain a substantial number of spatial entities, such as cities, buildings, and natural landmarks. For many of these entities, exact geometries are stored within the knowledge graphs. However, most existing approaches for learning entity representations do not take these geometries into account. In this paper, we introduce a variant of RDF2Vec that incorporates geometric information to learn location-aware embeddings of entities. Our approach expands different nodes by flooding the graph from geographic nodes, ensuring that each reachable node is considered. Based on the resulting flooded graph, we apply a modified version of RDF2Vec that biases graph walks using spatial weights. Through evaluations on multiple benchmark datasets, we demonstrate that our approach outperforms both non-location-aware RDF2Vec and GeoTransE.
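
A minimal sketch of a spatially biased random walk, the core idea of biasing RDF2Vec walks with spatial weights; the inverse-distance weighting and toy graph are assumptions, not the authors' exact scheme.

```python
# Minimal sketch (not the paper's implementation): random walks where the
# next hop is sampled proportionally to a spatial weight, here an
# inverse-distance heuristic between node coordinates.
import numpy as np

def spatial_walk(adj, coords, start, length, rng):
    """adj: {node: [neighbors]}, coords: {node: (x, y)}."""
    walk = [start]
    for _ in range(length):
        nbrs = adj[walk[-1]]
        if not nbrs:
            break
        cur = np.array(coords[walk[-1]])
        dists = np.array([np.linalg.norm(cur - np.array(coords[n])) for n in nbrs])
        weights = 1.0 / (dists + 1e-6)          # nearer neighbors are favored
        walk.append(str(rng.choice(nbrs, p=weights / weights.sum())))
    return walk

adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
coords = {"a": (0, 0), "b": (1, 0), "c": (5, 0)}
print(spatial_walk(adj, coords, "a", 4, np.random.default_rng(0)))
```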

[LG-39] A Novel Hybrid Approach Using an Attention-Based Transformer GRU Model for Predicting Cryptocurrency Prices

链接: https://arxiv.org/abs/2504.17079
作者: Esam Mahdi,C. Martin-Barreiro,X. Cabezas
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:In this article, we introduce a novel deep learning hybrid model that integrates attention Transformer and Gated Recurrent Unit (GRU) architectures to improve the accuracy of cryptocurrency price predictions. By combining the Transformer’s strength in capturing long-range patterns with the GRU’s ability to model short-term and sequential trends, the hybrid model provides a well-rounded approach to time series forecasting. We apply the model to predict the daily closing prices of Bitcoin and Ethereum based on historical data that include past prices, trading volumes, and the Fear and Greed index. We evaluate the performance of our proposed model by comparing it with four other machine learning models: two are non-sequential feedforward models: Radial Basis Function Network (RBFN) and General Regression Neural Network (GRNN), and two are bidirectional sequential memory-based models: Bidirectional Long-Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU). The performance of the model is assessed using several metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), along with statistical validation through the nonparametric Friedman test followed by a post hoc Wilcoxon signed rank test. The results demonstrate that our hybrid model consistently achieves superior accuracy, highlighting its effectiveness for financial prediction tasks. These findings provide valuable insights for improving real-time decision making in cryptocurrency markets and support the growing use of hybrid deep learning models in financial analytics.
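
A minimal PyTorch sketch of the general attention-plus-GRU idea: one branch captures long-range structure with self-attention, the other models short-term trends with a GRU, and their features are fused for a one-step forecast. The layer sizes and concatenation-based fusion are assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a Transformer+GRU hybrid forecaster. Sizes and the
# fusion-by-concatenation design are assumptions, illustrative only.
import torch
import torch.nn as nn

class AttnGRUForecaster(nn.Module):
    def __init__(self, n_features, d_model=32):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        self.attn = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.gru = nn.GRU(n_features, d_model, batch_first=True)
        self.head = nn.Linear(2 * d_model, 1)

    def forward(self, x):                      # x: (batch, seq_len, n_features)
        a = self.attn(self.proj(x))[:, -1]     # attention branch, last time step
        _, h = self.gru(x)                     # GRU branch, final hidden state
        return self.head(torch.cat([a, h[-1]], dim=-1)).squeeze(-1)

model = AttnGRUForecaster(n_features=3)
print(model(torch.randn(8, 30, 3)).shape)      # torch.Size([8])
```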

[LG-40] Conditional Diffusion-Based Retrieval of Atmospheric CO2 from Earth Observing Spectroscopy WWW ICLR2025

链接: https://arxiv.org/abs/2504.17074
作者: William R. Keely,Otto Lamminpää,Steffen Mauceri,Sean M. R. Crowell,Christopher W. O’Dell,Gregory R. McGarragh
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
*备注: Published as a workshop paper in “Tackling Climate Change with Machine Learning”, ICLR 2025. this https URL

点击查看摘要

Abstract:Satellite-based estimates of greenhouse gas (GHG) properties from observations of reflected solar spectra are integral for understanding and monitoring complex terrestrial systems and their impact on the carbon cycle due to their near global coverage. Known as retrieval, making GHG concentration estimations from these observations is a non-linear Bayesian inverse problem, which is operationally solved using a computationally expensive algorithm called Optimal Estimation (OE), providing a Gaussian approximation to a non-Gaussian posterior. This leads to issues in solver algorithm convergence, and to unrealistically confident uncertainty estimates for the retrieved quantities. Upcoming satellite missions will provide orders of magnitude more data than the current constellation of GHG observers. Development of fast and accurate retrieval algorithms with robust uncertainty quantification is critical. Doing so stands to provide substantial climate impact of moving towards the goal of near continuous real-time global monitoring of carbon sources and sinks which is essential for policy making. To achieve this goal, we propose a diffusion-based approach to flexibly retrieve a Gaussian or non-Gaussian posterior, for NASA’s Orbiting Carbon Observatory-2 spectrometer, while providing a substantial computational speed-up over the current operational state-of-the-art.

[LG-41] Sparse Phased Array Optimization Using Deep Learning

链接: https://arxiv.org/abs/2504.17073
作者: David Lu,Lior Maman,Jackson Earls,Amir Boag,Pierre Baldi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Antenna arrays are widely used in wireless communication, radar systems, radio astronomy, and military defense to enhance signal strength, directivity, and interference suppression. We introduce a deep learning-based optimization approach that enhances the design of sparse phased arrays by reducing grating lobes. This approach begins by generating sparse array configurations to address the non-convex challenges and extensive degrees of freedom inherent in array design. We use neural networks to approximate the non-convex cost function that estimates the energy ratio between the main and side lobes. This differentiable approximation facilitates cost function minimization through gradient descent, optimizing the antenna elements’ coordinates and leading to an improved layout. Additionally, we incorporate a tailored penalty mechanism that includes various physical and design constraints into the optimization process, enhancing its robustness and practical applicability. We demonstrate the effectiveness of our method by applying it to the ten array configurations with the lowest initial costs, achieving further cost reductions ranging from 411% to 643%, with an impressive average improvement of 552%. By significantly reducing side lobe levels in antenna arrays, this breakthrough paves the way for ultra-precise beamforming, enhanced interference mitigation, and next-generation wireless and radar systems with unprecedented efficiency and clarity.
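
A minimal sketch of the optimization loop the abstract outlines: a neural surrogate stands in for the non-convex side-lobe cost, and element coordinates are updated by gradient descent through it, with a minimum-spacing penalty as one example of a physical constraint. The surrogate here is untrained and purely illustrative.

```python
# Minimal sketch: gradient descent on element coordinates through a neural
# surrogate of the array cost. The surrogate is untrained (illustrative),
# and the 0.5 minimum-spacing constraint is an assumption.
import torch
import torch.nn as nn

n_elements = 16
surrogate = nn.Sequential(                    # stands in for a trained cost model
    nn.Linear(2 * n_elements, 64), nn.ReLU(), nn.Linear(64, 1)
)
coords = torch.randn(2 * n_elements, requires_grad=True)  # (x, y) per element
opt = torch.optim.Adam([coords], lr=1e-2)

for step in range(100):
    opt.zero_grad()
    cost = surrogate(coords)                  # predicted side/main lobe energy ratio
    penalty = torch.relu(0.5 - torch.pdist(coords.view(-1, 2))).sum()  # spacing penalty
    (cost + penalty).backward()
    opt.step()
print("final surrogate cost:", surrogate(coords).item())
```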

[LG-42] In-Context Learning can distort the relationship between sequence likelihoods and biological fitness

链接: https://arxiv.org/abs/2504.17068
作者: Pranav Kantroo,Günter P. Wagner,Benjamin B. Machta
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Language models have emerged as powerful predictors of the viability of biological sequences. During training these models learn the rules of the grammar obeyed by sequences of amino acids or nucleotides. Once trained, these models can take a sequence as input and produce a likelihood score as an output; a higher likelihood implies adherence to the learned grammar and correlates with experimental fitness measurements. Here we show that in-context learning can distort the relationship between fitness and likelihood scores of sequences. This phenomenon most prominently manifests as anomalously high likelihood scores for sequences that contain repeated motifs. We use protein language models with different architectures trained on the masked language modeling objective for our experiments, and find transformer-based models to be particularly vulnerable to this effect. This behavior is mediated by a look-up operation where the model seeks the identity of the masked position by using the other copy of the repeated motif as a reference. This retrieval behavior can override the model’s learned priors. This phenomenon persists for imperfectly repeated sequences, and extends to other kinds of biologically relevant features such as reversed complement motifs in RNA sequences that fold into hairpin structures.

[LG-43] Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching

链接: https://arxiv.org/abs/2504.17066
作者: Kewen Peng,Yicheng Yang,Hao Zhuo,Tim Menzies
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Software Engineering (cs.SE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Fairness-aware learning aims to mitigate discrimination against specific protected social groups (e.g., those categorized by gender, ethnicity, age) while minimizing predictive performance loss. Despite efforts to improve fairness in machine learning, prior studies have shown that many models remain unfair when measured against various fairness metrics. In this paper, we examine whether the way training and testing data are sampled affects the reliability of reported fairness metrics. Since training and test sets are often randomly sampled from the same population, bias present in the training data may still exist in the test data, potentially skewing fairness assessments. To address this, we propose FairMatch, a post-processing method that applies propensity score matching to evaluate and mitigate bias. FairMatch identifies control and treatment pairs with similar propensity scores in the test set and adjusts decision thresholds for different subgroups accordingly. For samples that cannot be matched, we perform probabilistic calibration using fairness-aware loss functions. Experimental results demonstrate that our approach can (a) precisely locate subsets of the test data where the model is unbiased, and (b) significantly reduce bias on the remaining data. Overall, propensity score matching offers a principled way to improve both fairness evaluation and mitigation, without sacrificing predictive performance.
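
A minimal sketch of the propensity score matching step (not the FairMatch implementation): fit a propensity model for the protected attribute and pair each "treatment" sample with the control sample of closest score.

```python
# Minimal sketch of propensity score matching: logistic regression estimates
# the propensity of group membership, then nearest-score controls are matched.
# Data and the treatment/control framing of the protected attribute are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                 # non-protected features
group = (rng.random(200) < 0.4).astype(int)   # protected attribute (1 = "treatment")

ps = LogisticRegression().fit(X, group).predict_proba(X)[:, 1]
treated, control = np.where(group == 1)[0], np.where(group == 0)[0]
pairs = [(i, control[np.argmin(np.abs(ps[control] - ps[i]))]) for i in treated]
print("first matched pairs:", pairs[:3])
```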

[LG-44] Antenna Near-Field Reconstruction from Far-Field Data Using Convolutional Neural Networks

链接: https://arxiv.org/abs/2504.17065
作者: Sahar Bagherkhani,Jackson Christopher Earls,Franco De Flaviis,Pierre Baldi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electromagnetic field reconstruction is crucial in many applications, including antenna diagnostics, electromagnetic interference analysis, and system modeling. This paper presents a deep learning-based approach for Far-Field to Near-Field (FF-NF) transformation using Convolutional Neural Networks (CNNs). The goal is to reconstruct near-field distributions from the far-field data of an antenna without relying on explicit analytical transformations. The CNNs are trained on paired far-field and near-field data and evaluated using mean squared error (MSE). The best model achieves a training error of 0.0199 and a test error of 0.3898. Moreover, visual comparisons between the predicted and true near-field distributions demonstrate the model’s effectiveness in capturing complex electromagnetic field behavior, highlighting the potential of deep learning in electromagnetic field reconstruction.

[LG-45] Safety Pretraining: Toward the Next Generation of Safe AI

链接: https://arxiv.org/abs/2504.16980
作者: Pratyush Maini,Sachin Goyal,Dylan Sam,Alex Robey,Yash Savani,Yiding Jiang,Andy Zou,Zachary C. Lipton,J. Zico Kolter
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. We present a data-centric pretraining framework that builds safety into the model from the start. Our contributions include: (i) a safety classifier trained on 10,000 GPT-4 labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date (100B tokens) generated via recontextualization of harmful web data; (iii) RefuseWeb and Moral Education datasets that convert harmful prompts into refusal dialogues and web-style educational material; (iv) Harmfulness-Tag annotations injected during pretraining to flag unsafe content and steer away inference from harmful generations; and (v) safety evaluations measuring base model behavior before instruction tuning. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% with no performance degradation on standard LLM safety benchmarks.

[LG-46] STFM: A Spatio-Temporal Information Fusion Model Based on Phase Space Reconstruction for Sea Surface Temperature Prediction

链接: https://arxiv.org/abs/2504.16970
作者: Yin Wang,Chunlin Gong,Xiang Wu,Hanleran Zhang
类目: Machine Learning (cs.LG)
*备注: 19 pages, 14 figures

点击查看摘要

Abstract:The sea surface temperature (SST), a key environmental parameter, is crucial to optimizing production planning, making its accurate prediction a vital research topic. However, the inherent nonlinearity of the marine dynamic system presents significant challenges. Current forecasting methods mainly include physics-based numerical simulations and data-driven machine learning approaches. The former, while describing SST evolution through differential equations, suffers from high computational complexity and limited applicability, whereas the latter, despite its computational benefits, requires large datasets and faces interpretability challenges. This study presents a prediction framework based solely on data-driven techniques. Using phase space reconstruction, we construct initial-delay attractor pairs with a mathematical homeomorphism and design a Spatio-Temporal Fusion Mapping (STFM) to uncover their intrinsic connections. Unlike conventional models, our method captures SST dynamics efficiently through phase space reconstruction and achieves high prediction accuracy with minimal training data in comparative tests.
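
A minimal sketch of phase space reconstruction via time-delay embedding, the classical construction behind the attractor pairs mentioned above; the delay and embedding dimension are illustrative.

```python
# Minimal sketch of Takens-style time-delay embedding. The delay `tau` and
# dimension `dim` are illustrative choices, not the paper's settings.
import numpy as np

def delay_embed(x, dim=3, tau=2):
    """Stack delayed copies of a 1-D series into points in R^dim."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

t = np.linspace(0, 20, 500)
sst = 26 + 2 * np.sin(t) + 0.3 * np.sin(5 * t)   # toy SST-like signal
points = delay_embed(sst, dim=3, tau=10)
print(points.shape)                              # (480, 3)
```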

[LG-47] Engineering the Law-Machine Learning Translation Problem: Developing Legally Aligned Models

链接: https://arxiv.org/abs/2504.16969
作者: Mathias Hanson,Gregory Lewkowicz,Sam Verboven
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 16 pages, 1 figure

点击查看摘要

Abstract:Organizations developing machine learning-based (ML) technologies face the complex challenge of achieving high predictive performance while respecting the law. This intersection between ML and the law creates new complexities. As ML model behavior is inferred from training data, legal obligations cannot be operationalized in source code directly. Rather, legal obligations require “indirect” operationalization. However, choosing context-appropriate operationalizations presents two compounding challenges: (1) laws often permit multiple valid operationalizations for a given legal obligation-each with varying degrees of legal adequacy; and, (2) each operationalization creates unpredictable trade-offs among the different legal obligations and with predictive performance. Evaluating these trade-offs requires metrics (or heuristics), which are in turn difficult to validate against legal obligations. Current methodologies fail to fully address these interwoven challenges as they either focus on legal compliance for traditional software or on ML model development without adequately considering legal complexities. In response, we introduce a five-stage interdisciplinary framework that integrates legal and ML-technical analysis during ML model development. This framework facilitates designing ML models in a legally aligned way and identifying high-performing models that are legally justifiable. Legal reasoning guides choices for operationalizations and evaluation metrics, while ML experts ensure technical feasibility, performance optimization and an accurate interpretation of metric values. This framework bridges the gap between more conceptual analysis of law and ML models’ need for deterministic specifications. We illustrate its application using a case study in the context of anti-money laundering.

[LG-48] A Novel Graph Transformer Framework for Gene Regulatory Network Inference

链接: https://arxiv.org/abs/2504.16961
作者: Binon Teji,Swarup Roy
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Genomics (q-bio.GN); Molecular Networks (q-bio.MN); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:The inference of gene regulatory networks (GRNs) is a foundational stride towards deciphering the fundamentals of complex biological systems. Inferring a possible regulatory link between two genes can be formulated as a link prediction problem. Inference of GRNs via gene coexpression profiling data may not always reflect true biological interactions, due to its susceptibility to noise and its tendency to misrepresent true biological regulatory relationships. Most GRN inference methods face several challenges in the network reconstruction phase. Therefore, it is important to encode gene expression values, leverage the prior knowledge gained from the available inferred network structures, and exploit the positional information of the input network nodes towards inferring a better and more confident GRN reconstruction. In this paper, we explore the integration of multiple inferred networks to enhance the inference of Gene Regulatory Networks (GRNs). Primarily, we employ autoencoder embeddings to capture gene expression patterns directly from raw data, preserving intricate biological signals. Then, we embed the prior knowledge from GRN structures, transforming them into a text-like representation using random walks, which are then encoded with a masked language model, BERT, to generate global embeddings for each gene across all networks. Additionally, we embed the positional encodings of the input gene networks to better identify the position of each unique gene within the graph. These embeddings are integrated into a graph transformer-based model, termed GT-GRN, for GRN inference. The GT-GRN model effectively utilizes the topological structure of the ground truth network while incorporating the enriched encoded information. Experimental results demonstrate that GT-GRN significantly outperforms existing GRN inference methods, achieving superior accuracy and highlighting the robustness of our approach.

[LG-49] Flexibility of German gas-fired generation: evidence from clustering empirical operation

链接: https://arxiv.org/abs/2504.16943
作者: Chiara Fusar Bassini,Alice Lixuan Xu,Jorge Sánchez Canales,Lion Hirth,Lynn H. Kaack
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 29 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Key inputs to energy models are assumptions about the flexibility of power generation units, i.e., how quickly and often they can start up. These assumptions are usually calibrated on the technical characteristics of the units, such as installed capacity or technology type. However, even if power generation units technically can dispatch flexibly, service obligations and market incentives may constrain their operation. Here, we cluster over 60% of German national gas generation (generation units of 100 MWp or above) based on their empirical flexibility. We process the hourly dispatch of sample units between 2019 and 2023 using a novel deep learning approach that transforms time series into easy-to-cluster representations. We identify two clusters of peaker units and two clusters of non-peaker units, whose different empirical flexibility is quantified by cluster-level ramp rates. Non-peaker units, around half of the sample, are empirically less flexible than peakers, and make up more than 83% of sample must-run generation. Regulatory changes addressing the low market responsiveness of non-peakers are needed to unlock their flexibility.

[LG-50] Evaluating Uncertainty in Deep Gaussian Processes

链接: https://arxiv.org/abs/2504.17719
作者: Matthijs van der Lende,Jeremias Lino Ferrao,Niclas Müller-Hof
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable uncertainty estimates are crucial in modern machine learning. Deep Gaussian Processes (DGPs) and Deep Sigma Point Processes (DSPPs) extend GPs hierarchically, offering promising methods for uncertainty quantification grounded in Bayesian principles. However, their empirical calibration and robustness under distribution shift relative to baselines like Deep Ensembles remain understudied. This work evaluates these models on regression (CASP dataset) and classification (ESR dataset) tasks, assessing predictive performance (MAE, Accuracy), calibration using Negative Log-Likelihood (NLL) and Expected Calibration Error (ECE), alongside robustness under various synthetic feature-level distribution shifts. Results indicate DSPPs provide strong in-distribution calibration leveraging their sigma point approximations. However, compared to Deep Ensembles, which demonstrated superior robustness in both performance and calibration under the tested shifts, the GP-based methods showed vulnerabilities, exhibiting particular sensitivity in the observed metrics. Our findings underscore ensembles as a robust baseline, suggesting that while deep GP methods offer good in-distribution calibration, their practical robustness under distribution shift requires careful evaluation. To facilitate reproducibility, we make our code available at this https URL.
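
For reference, a minimal sketch of Expected Calibration Error (ECE), one of the calibration metrics used in this comparison: bin predictions by confidence and average the |accuracy − confidence| gap, weighted by bin mass.

```python
# Minimal sketch of Expected Calibration Error with equal-width bins.
# Synthetic, roughly calibrated predictions are used for illustration.
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.random(1000) < conf).astype(float)   # calibrated by construction
print(expected_calibration_error(conf, correct))
```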

[LG-51] On the Generalization of Adversarially Trained Quantum Classifiers

链接: https://arxiv.org/abs/2504.17690
作者: Petros Georgiou,Aaron Mark Thomas,Sharu Theresa Jose,Osvaldo Simeone
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 22 pages, 6 figures

点击查看摘要

Abstract:Quantum classifiers are vulnerable to adversarial attacks that manipulate their input classical or quantum data. A promising countermeasure is adversarial training, where quantum classifiers are trained by using an attack-aware, adversarial loss function. This work establishes novel bounds on the generalization error of adversarially trained quantum classifiers when tested in the presence of perturbation-constrained adversaries. The bounds quantify the excess generalization error incurred to ensure robustness to adversarial attacks as scaling with the training sample size $m$ as $1/\sqrt{m}$, while yielding insights into the impact of the quantum embedding. For quantum binary classifiers employing rotation embedding, we find that, in the presence of adversarial attacks on classical inputs $\mathbf{x}$, the increase in sample complexity due to adversarial training over conventional training vanishes in the limit of high-dimensional inputs $\mathbf{x}$. In contrast, when the adversary can directly attack the quantum state $\rho(\mathbf{x})$ encoding the input $\mathbf{x}$, the excess generalization error depends on the choice of embedding only through its Hilbert space dimension. The results are also extended to multi-class classifiers. We validate our theoretical findings with numerical experiments.

[LG-52] Likelihood-Free Variational Autoencoders

链接: https://arxiv.org/abs/2504.17622
作者: Chen Xu,Qiang Wang,Lijun Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Variational Autoencoders (VAEs) typically rely on a probabilistic decoder with a predefined likelihood, most commonly an isotropic Gaussian, to model the data conditional on latent variables. While convenient for optimization, this choice often leads to likelihood misspecification, resulting in blurry reconstructions and poor data fidelity, especially for high-dimensional data such as images. In this work, we propose EnVAE, a novel likelihood-free generative framework that has a deterministic decoder and employs the energy score – a proper scoring rule – to build the reconstruction loss. This enables likelihood-free inference without requiring explicit parametric density functions. To address the computational inefficiency of the energy score, we introduce a fast variant, FEnVAE, based on the local smoothness of the decoder and the sharpness of the posterior distribution of latent variables. This yields an efficient single-sample training objective that integrates seamlessly into existing VAE pipelines with minimal overhead. Empirical results on standard benchmarks demonstrate that EnVAE achieves superior reconstruction and generation quality compared to likelihood-based baselines. Our framework offers a general, scalable, and statistically principled alternative for flexible and nonparametric distribution learning in generative modeling.
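
A minimal sketch of the energy score as a sample-based reconstruction loss, following its standard definition ES(P, y) = E‖X − y‖ − ½·E‖X − X′‖ rather than the paper's specific estimator.

```python
# Minimal sketch of a Monte Carlo energy score estimate from k decoder
# samples per input; standard definition, not the paper's exact estimator.
import torch

def energy_score(samples, target):
    """samples: (k, d) draws from the model; target: (d,) observation."""
    k = samples.shape[0]
    term1 = torch.cdist(samples, target.unsqueeze(0)).mean()        # E||X - y||
    term2 = torch.cdist(samples, samples).sum() / (k * (k - 1))     # E||X - X'||, off-diagonal mean
    return term1 - 0.5 * term2

samples = torch.randn(8, 4)
target = torch.zeros(4)
print(energy_score(samples, target))
```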

[LG-53] Quantum Autoencoder for Multivariate Time Series Anomaly Detection

链接: https://arxiv.org/abs/2504.17548
作者: Kilian Tscharke,Maximilian Wendlinger,Afrae Ahouzi,Pallavi Bhardwaj,Kaweh Amoi-Taleghani,Michael Schrödl-Baumann,Pascal Debus
类目: Quantum Physics (quant-ph); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Submitted to IEEE International Conference on Quantum Computing and Engineering (QCE) 2025

点击查看摘要

Abstract:Anomaly Detection (AD) defines the task of identifying observations or events that deviate from typical - or normal - patterns, a critical capability in IT security for recognizing incidents such as system misconfigurations, malware infections, or cyberattacks. In enterprise environments like SAP HANA Cloud systems, this task often involves monitoring high-dimensional, multivariate time series (MTS) derived from telemetry and log data. With the advent of quantum machine learning offering efficient calculations in high-dimensional latent spaces, many avenues open for dealing with such complex data. One approach is the Quantum Autoencoder (QAE), an emerging and promising method with potential for application in both data compression and AD. However, prior applications of QAEs to time series AD have been restricted to univariate data, limiting their relevance for real-world enterprise systems. In this work, we introduce a novel QAE-based framework designed specifically for MTS AD towards enterprise scale. We theoretically develop and experimentally validate the architecture, demonstrating that our QAE achieves performance competitive with neural-network-based autoencoders while requiring fewer trainable parameters. We evaluate our model on datasets that closely reflect SAP system telemetry and show that the proposed QAE is a viable and efficient alternative for semisupervised AD in real-world enterprise settings.

[LG-54] An introduction to R package mvs

链接: https://arxiv.org/abs/2504.17546
作者: Wouter van Loon
类目: Computation (stat.CO); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 15 pages, 4 figures. Package vignette corresponding to this https URL

点击查看摘要

Abstract:In biomedical science, a set of objects or persons can often be described by multiple distinct sets of features obtained from different data sources or modalities (called “multi-view data”). Classical machine learning methods ignore the multi-view structure of such data, limiting model interpretability and performance. The R package mvs provides methods that were designed specifically for dealing with multi-view data, based on the multi-view stacking (MVS) framework. MVS is a form of supervised (machine) learning used to train multi-view classification or prediction models. MVS works by training a learning algorithm on each view separately, estimating the predictive power of each view-specific model through cross-validation, and then using another learning algorithm to assign weights to the view-specific models based on their estimated predictions. MVS is a form of ensemble learning, dividing the large multi-view learning problem into smaller sub-problems. Most of these sub-problems can be solved in parallel, making it computationally attractive. Additionally, the number of features of the sub-problems is greatly reduced compared with the full multi-view learning problem. This makes MVS especially useful when the total number of features is larger than the number of observations (i.e., high-dimensional data). MVS can still be applied even if the sub-problems are themselves high-dimensional by adding suitable penalty terms to the learning algorithms. Furthermore, MVS can be used to automatically select the views which are most important for prediction. The R package mvs makes fitting MVS models, including such penalty terms, easily and openly accessible. mvs allows for the fitting of stacked models with any number of levels, with different penalty terms, different outcome distributions, and provides several options for missing data handling.

[LG-55] HydroStartML: A combined machine learning and physics-based approach to reduce hydrological model spin-up time

链接: https://arxiv.org/abs/2504.17420
作者: Louisa Pawusch,Stefania Scheurer,Wolfgang Nowak,Reed Maxwell
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注: 13 pages, 14 figures. To be published in Advances in Water Resources

点击查看摘要

Abstract:Finding the initial depth-to-water table (DTWT) configuration of a catchment is a critical challenge when simulating the hydrological cycle with integrated models, significantly impacting simulation outcomes. Traditionally, this involves iterative spin-up computations, where the model runs under constant atmospheric settings until steady-state is achieved. These so-called model spin-ups are computationally expensive, often requiring many years of simulated time, particularly when the initial DTWT configuration is far from steady state. To accelerate the model spin-up process we developed HydroStartML, a machine learning emulator trained on steady-state DTWT configurations across the contiguous United States. HydroStartML predicts, based on available data like conductivity and surface slopes, a DTWT configuration of the respective watershed, which can be used as an initial DTWT. Our results show that initializing spin-up computations with HydroStartML predictions leads to faster convergence than with other initial configurations like spatially constant DTWTs. The emulator accurately predicts configurations close to steady state, even for terrain configurations not seen in training, and allows especially significant reductions in computational spin-up effort in regions with deep DTWTs. This work opens the door for hybrid approaches that blend machine learning and traditional simulation, enhancing predictive accuracy and efficiency in hydrology for improving water resource management and understanding complex environmental interactions.

[LG-56] Dargana: fine-tuning EarthPT for dynamic tree canopy mapping from space ICLR2025

链接: https://arxiv.org/abs/2504.17321
作者: Michael J. Smith,Luke Fleming,James E. Geach,Ryan J. Roberts,Freddie Kalaitzis,James Banister
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注: 9 pages, 6 figures, spotlight at 'Tackling Climate Change with Machine Learning', ICLR 2025

点击查看摘要

Abstract:We present Dargana, a fine-tuned variant of the EarthPT time-series foundation model that achieves specialisation using 3% of its pre-training data volume and 5% of its pre-training compute. Dargana is fine-tuned to generate regularly updated classification of tree canopy cover at 10m resolution, distinguishing conifer and broadleaved tree types. Using Cornwall, UK, as a test case, the model achieves a pixel-level ROC-AUC of 0.98 and a PR-AUC of 0.83 on unseen satellite imagery. Dargana can identify fine structures like hedgerows and coppice below the training sample limit, and can track temporal changes to canopy cover such as new woodland establishment. Our results demonstrate how pre-trained Large Observation Models like EarthPT can be specialised for granular, dynamic land cover monitoring from space, providing a valuable, scalable tool for natural capital management and conservation.

[LG-57] Causal rule ensemble approach for multi-arm data

链接: https://arxiv.org/abs/2504.17166
作者: Ke Wan,Kensuke Tanioka,Toshio Shimokawa
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Heterogeneous treatment effect (HTE) estimation is critical in medical research. It provides insights into how treatment effects vary among individuals, which can provide statistical evidence for precision medicine. While most existing methods focus on binary treatment situations, real-world applications often involve multiple interventions. However, current HTE estimation methods are primarily designed for binary comparisons and often rely on black-box models, which limit their applicability and interpretability in multi-arm settings. To address these challenges, we propose an interpretable machine learning framework for HTE estimation in multi-arm trials. Our method employs a rule-based ensemble approach consisting of rule generation, rule ensemble, and HTE estimation, ensuring both predictive accuracy and interpretability. Through extensive simulation studies and real data applications, the performance of our method was evaluated against state-of-the-art multi-arm HTE estimation approaches. The results indicate that our approach achieved lower bias and higher estimation accuracy compared with those of existing methods. Furthermore, the interpretability of our framework allows clearer insights into how covariates influence treatment effects, facilitating clinical decision making. By bridging the gap between accuracy and interpretability, our study contributes a valuable tool for multi-arm HTE estimation, supporting precision medicine.

[LG-58] Reinforcement learning framework for the mechanical design of microelectronic components under multiphysics constraints

链接: https://arxiv.org/abs/2504.17142
作者: Siddharth Nair,Timothy F. Walsh,Greg Pickrell,Fabio Semperlotti
类目: Computational Physics (physics.comp-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 27 pages of main text, 15 figures

点击查看摘要

Abstract:This study focuses on the development of reinforcement learning based techniques for the design of microelectronic components under multiphysics constraints. While traditional design approaches based on global optimization approaches are effective when dealing with a small number of design parameters, as the complexity of the solution space and of the constraints increases different techniques are needed. This is an important reason that makes the design and optimization of microelectronic components (characterized by large solution space and multiphysics constraints) very challenging for traditional methods. By taking as prototypical elements an application-specific integrated circuit (ASIC) and a heterogeneously integrated (HI) interposer, we develop and numerically test an optimization framework based on reinforcement learning (RL). More specifically, we consider the optimization of the bonded interconnect geometry for an ASIC chip as well as the placement of components on a HI interposer while satisfying thermoelastic and design constraints. This placement problem is particularly interesting because it features a high-dimensional solution space.

[LG-59] Physics-informed features in supervised machine learning

链接: https://arxiv.org/abs/2504.17112
作者: Margherita Lampani,Sabrina Guastavino,Michele Piana,Federico Benvenuto
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Supervised machine learning involves approximating an unknown functional relationship from a limited dataset of features and corresponding labels. The classical approach to feature-based machine learning typically relies on applying linear regression to standardized features, without considering their physical meaning. This may limit model explainability, particularly in scientific applications. This study proposes a physics-informed approach to feature-based machine learning that constructs non-linear feature maps informed by physical laws and dimensional analysis. These maps enhance model interpretability and, when physical laws are unknown, allow for the identification of relevant mechanisms through feature ranking. The method aims to improve both predictive performance in regression tasks and classification skill scores by integrating domain knowledge into the learning process, while also enabling the potential discovery of new physical equations within the context of explainable machine learning.
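
A minimal sketch of the idea on a toy example: instead of regressing on standardized raw features, form a dimensionless physics-informed feature (here a Reynolds-number-style combination, chosen as a generic assumption) and fit a linear model on it.

```python
# Minimal sketch of a physics-informed feature map: combine raw features
# into a dimensionless group before regression. The Reynolds-number example
# and the synthetic target are assumptions, illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
velocity = rng.uniform(0.1, 10, 500)      # m/s
length = rng.uniform(0.01, 1, 500)        # m
viscosity = rng.uniform(1e-6, 1e-4, 500)  # m^2/s

reynolds = velocity * length / viscosity  # dimensionless physics-informed feature
y = 0.3 * np.log(reynolds) + rng.normal(scale=0.05, size=500)

feat = np.log(reynolds).reshape(-1, 1)
model = LinearRegression().fit(feat, y)
print("R^2:", model.score(feat, y))
```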

[LG-60] Neural Contraction Metrics with Formal Guarantees for Discrete-Time Nonlinear Dynamical Systems

链接: https://arxiv.org/abs/2504.17102
作者: Haoyu Li,Xiangru Zhong,Bin Hu,Huan Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted by L4DC 2025

点击查看摘要

Abstract:Contraction metrics are crucial in control theory because they provide a powerful framework for analyzing stability, robustness, and convergence of various dynamical systems. However, identifying these metrics for complex nonlinear systems remains an open challenge due to the lack of scalable and effective tools. This paper explores the approach of learning verifiable contraction metrics parametrized as neural networks (NNs) for discrete-time nonlinear dynamical systems. While prior works on formal verification of contraction metrics for general nonlinear systems have focused on convex optimization methods (e.g. linear matrix inequalities, etc) under the assumption of continuously differentiable dynamics, the growing prevalence of NN-based controllers, often utilizing ReLU activations, introduces challenges due to the non-smooth nature of the resulting closed-loop dynamics. To bridge this gap, we establish a new sufficient condition for establishing formal neural contraction metrics for general discrete-time nonlinear systems assuming only the continuity of the dynamics. We show that from a computational perspective, our sufficient condition can be efficiently verified using the state-of-the-art neural network verifier $\alpha,\beta$-CROWN, which scales up non-convex neural network verification via novel integration of symbolic linear bound propagation and branch-and-bound. Built upon our analysis tool, we further develop a learning method for synthesizing neural contraction metrics from sampled data. Finally, our approach is validated through the successful synthesis and verification of NN contraction metrics for various nonlinear examples.

[LG-61] Mathematical Modeling of Protein Structures: A Cohomology-Based Approach to the Flagellar Motor

链接: https://arxiv.org/abs/2504.16941
作者: Zakaria Lamine,Abdelatif Hafid,Mohamed Rahouti
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:This study presents a novel mathematical model derived from cohomology, leveraging the KEEL-proven theorem that establishes cohomology as tautological, generated by boundary classes of curves with fixed dual graphs. Simplicial complexes are constructed using skew-commutative graded algebra, and the structure theorem is applied to connect distinct homologies, enabling precise interpretations of the resulting geometric forms. The proposed model is utilized for protein structure analysis and prediction, with a specific application to the Flagellar Motor structure. This approach offers new insights into the geometric and algebraic foundations of biological macromolecular modeling, highlighting its potential for advancement in structural biology.

信息检索

[IR-0] Quadratic Interest Network for Multimodal Click-Through Rate Prediction

链接: https://arxiv.org/abs/2504.17699
作者: Honghao Li,Hanwei Li,Jing Zhang,Yi Zhang,Ziniu Yu,Lei Sang,Yiwen Zhang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multimodal click-through rate (CTR) prediction is a key technique in industrial recommender systems. It leverages heterogeneous modalities such as text, images, and behavioral logs to capture high-order feature interactions between users and items, thereby enhancing the system’s understanding of user interests and its ability to predict click behavior. The primary challenge in this field lies in effectively utilizing the rich semantic information from multiple modalities while satisfying the low-latency requirements of online inference in real-world applications. To foster progress in this area, the Multimodal CTR Prediction Challenge Track of the WWW 2025 EReL@MIR Workshop formulates the problem into two tasks: (1) Task 1 of Multimodal Item Embedding: this task aims to explore multimodal information extraction and item representation learning methods that enhance recommendation tasks; and (2) Task 2 of Multimodal CTR Prediction: this task aims to explore what multimodal recommendation model can effectively leverage multimodal embedding features and achieve better performance. In this paper, we propose a novel model for Task 2, named Quadratic Interest Network (QIN) for Multimodal CTR Prediction. Specifically, QIN employs adaptive sparse target attention to extract multimodal user behavior features, and leverages Quadratic Neural Networks to capture high-order feature interactions. As a result, QIN achieved an AUC of 0.9798 on the leaderboard and ranked second in the competition. The model code, training logs, hyperparameter configurations, and checkpoints are available at this https URL.
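
A minimal sketch of a quadratic interaction layer in the spirit of the Quadratic Neural Networks the abstract cites: each output mixes a linear term with an explicit second-order term x^T W_o x. The parametrization is an assumption; QIN's actual layer may differ.

```python
# Minimal sketch of a quadratic layer: linear term plus explicit pairwise
# feature interactions. Parametrization is an assumption, not QIN's code.
import torch
import torch.nn as nn

class QuadraticLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out)
        self.W = nn.Parameter(torch.randn(d_out, d_in, d_in) * 0.01)

    def forward(self, x):                       # x: (batch, d_in)
        quad = torch.einsum("bi,oij,bj->bo", x, self.W, x)  # x^T W_o x per output
        return self.linear(x) + quad

layer = QuadraticLayer(d_in=16, d_out=4)
print(layer(torch.randn(32, 16)).shape)         # torch.Size([32, 4])
```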

[IR-1] Replication and Exploration of Generative Retrieval over Dynamic Corpora SIGIR2025

链接: https://arxiv.org/abs/2504.17519
作者: Zhen Zhang,Xinyu Ma,Weiwei Sun,Pengjie Ren,Zhumin Chen,Shuaiqiang Wang,Dawei Yin,Maarten de Rijke,Zhaochun Ren
类目: Information Retrieval (cs.IR)
*备注: Accepted at SIGIR 2025 (Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)

点击查看摘要

Abstract:Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR). However, most existing GR models are developed and evaluated using a static document collection, and their performance in dynamic corpora where document collections evolve continuously is rarely studied. In this paper, we first reproduce and systematically evaluate various representative GR approaches over dynamic corpora. Through extensive experiments, we reveal that existing GR models with text-based docids show superior generalization to unseen documents. We observe that the more fine-grained the docid design in the GR model, the better its performance over dynamic corpora, surpassing BM25 and even being comparable to dense retrieval methods. While GR models with numeric-based docids show high efficiency, their performance drops significantly over dynamic corpora. Furthermore, our experiments find that the underperformance of numeric-based docids is partly due to their excessive tendency toward the initial document set, which likely results from overfitting on the training set. We then conduct an in-depth analysis of the best-performing GR methods. We identify three critical advantages of text-based docids in dynamic corpora: (i) semantic alignment with language models’ pretrained knowledge, (ii) fine-grained docid design, and (iii) high lexical diversity. Building on these insights, we finally propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids, achieving improved performance in dynamic corpora without requiring additional retraining. Our work offers empirical evidence for advancing GR methods over dynamic corpora and paves the way for developing more generalized yet efficient GR models in real-world search engines.

[IR-2] Adaptive Orchestration of Modular Generative Information Access Systems SIGIR2025

Link: https://arxiv.org/abs/2504.17454
Authors: Mohanna Hoveyda, Harrie Oosterhuis, Arjen P. de Vries, Maarten de Rijke, Faegheh Hasibi
Subjects: Information Retrieval (cs.IR)
*Comments: Accepted at SIGIR 2025 Perspective Paper Track

Click to view abstract

Abstract:Advancements in large language models (LLMs) have driven the emergence of complex new systems for providing access to information, which we will collectively refer to as modular generative information access (GenIA) systems. They integrate a broad and evolving range of specialized components, including LLMs, retrieval models, and a heterogeneous set of sources and tools. While modularity offers flexibility, it also raises critical challenges: How can we systematically characterize the space of possible modules and their interactions? How can we automate and optimize interactions among these heterogeneous components? And, how do we enable this modular system to dynamically adapt to varying user query requirements and evolving module capabilities? In this perspective paper, we argue that the architecture of future modular generative information access systems will not just assemble powerful components, but enable a self-organizing system through real-time adaptive orchestration, in which components’ interactions are dynamically configured for each user input, maximizing information relevance while minimizing computational overhead. We give provisional answers to the questions raised above with a roadmap that depicts the key principles and methods for designing such an adaptive modular system. We identify pressing challenges and propose avenues for addressing them in the years ahead. This perspective urges the IR community to rethink modular system designs for developing adaptive, self-optimizing, and future-ready architectures that evolve alongside their rapidly advancing underlying technologies.
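
A toy sketch of the core idea, per-query configuration of module interactions, appears below. The module names and the routing rule are hypothetical stand-ins for illustration, not components proposed in the paper.

```python
# Illustrative-only sketch of per-query adaptive orchestration: a router
# selects a pipeline of modules for each input. All module names and
# routing heuristics here are invented for the example.
from typing import Callable

MODULES: dict[str, Callable[[str], str]] = {
    "retrieve": lambda q: f"[docs for: {q}]",
    "generate": lambda q: f"[answer drafted from {q}]",
    "verify":   lambda q: f"[checked: {q}]",
}

def route(query: str) -> list[str]:
    # Toy policy: factual-looking queries get retrieval plus verification;
    # everything else goes straight to generation.
    if any(w in query.lower() for w in ("who", "when", "where", "how many")):
        return ["retrieve", "generate", "verify"]
    return ["generate"]

def run(query: str) -> str:
    state = query
    for name in route(query):  # interactions configured per user input
        state = MODULES[name](state)
    return state

print(run("Who proposed generative retrieval?"))
```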

[IR-3] Beyond Whole Dialogue Modeling: Contextual Disentanglement for Conversational Recommendation

Link: https://arxiv.org/abs/2504.17427
Authors: Guojia An, Jie Zou, Jiwei Wei, Chaoning Zhang, Fuming Sun, Yang Yang
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Conversational recommender systems aim to provide personalized recommendations by analyzing and utilizing contextual information related to dialogue. However, existing methods typically model the dialogue context as a whole, neglecting the inherent complexity and entanglement within the dialogue. Specifically, a dialogue comprises both focus information and background information, which mutually influence each other. Current methods tend to model these two types of information jointly without separating them, which leads to misinterpretation of users’ actual needs and lowers the accuracy of recommendations. To address this issue, this paper proposes a novel model, DisenCRS, that introduces contextual disentanglement to improve conversational recommender systems. DisenCRS employs a dual disentanglement framework, including self-supervised contrastive disentanglement and counterfactual inference disentanglement, to effectively distinguish focus information from background information in the dialogue context under unsupervised conditions. Moreover, we design an adaptive prompt learning module to automatically select the most suitable prompt based on the specific dialogue context, fully leveraging the power of large language models. Experimental results on two widely used public datasets demonstrate that DisenCRS significantly outperforms existing conversational recommendation models, achieving superior performance on both item recommendation and response generation tasks.
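
The sketch below illustrates one plausible flavor of contrastive disentanglement: penalizing similarity between the focus and background views of the same dialogue. It is an assumption about the general shape of such an objective, not DisenCRS’s actual self-supervised or counterfactual losses.

```python
import torch
import torch.nn.functional as F

def disentangle_loss(focus: torch.Tensor, background: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """Toy objective pushing focus and background embeddings of the same
    dialogue apart. This is an illustrative assumption; see the paper for
    DisenCRS's actual objectives."""
    f = F.normalize(focus, dim=-1)
    b = F.normalize(background, dim=-1)
    # Cosine similarity between each dialogue's two views.
    cross_sim = (f * b).sum(-1) / temperature
    # Minimizing this term drives focus-background similarity down.
    return F.softplus(cross_sim).mean()

focus = torch.randn(4, 32, requires_grad=True)
background = torch.randn(4, 32)
loss = disentangle_loss(focus, background)
loss.backward()
print(float(loss))
```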

[IR-4] DataScout: Automatic Data Fact Retrieval for Statement Augmentation with an LLM-Based Agent

Link: https://arxiv.org/abs/2504.17334
Authors: Chuer Chen, Yuqi Liu, Danqing Shi, Shixiong Cao, Nan Cao
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:A data story typically integrates data facts from multiple perspectives and stances to construct a comprehensive and objective narrative. However, retrieving these facts is time-consuming and demands strong analytical skills from the creator. In this work, we introduce DataScout, an interactive system that automatically performs reasoning and stance-based data fact retrieval to augment the user’s statement. In particular, DataScout leverages an LLM-based agent to construct a retrieval tree, enabling collaborative control of its expansion between users and the agent. The interface visualizes the retrieval tree as a mind map, making it easy for users to steer the retrieval direction intuitively and to engage in reasoning and analysis effectively. We evaluate the proposed system through case studies and in-depth expert interviews. Our evaluation demonstrates that DataScout can effectively retrieve multifaceted data facts from different stances, helping users verify their statements and enhance the credibility of their stories.
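
As a rough picture of the retrieval tree, the sketch below defines a stance-labelled tree node that a user or an agent can expand with sub-queries. The node fields and the expand() helper are hypothetical, not DataScout’s actual API.

```python
# Minimal sketch of a stance-labelled retrieval tree that a user and an
# LLM-based agent could expand collaboratively; fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class FactNode:
    query: str
    stance: str = "neutral"  # e.g. "support", "oppose", "neutral"
    fact: str | None = None
    children: list["FactNode"] = field(default_factory=list)

    def expand(self, sub_queries: list[tuple[str, str]]) -> None:
        # In the real system an LLM agent would propose these sub-queries;
        # here the caller supplies (query, stance) pairs directly.
        self.children.extend(FactNode(q, s) for q, s in sub_queries)

root = FactNode("Is city air quality improving?")
root.expand([("PM2.5 trend 2015-2024", "support"),
             ("Ozone exceedance days", "oppose")])
print([c.query for c in root.children])
```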

[IR-5] Dynamic Superblock Pruning for Fast Learned Sparse Retrieval SIGIR25

Link: https://arxiv.org/abs/2504.17045
Authors: Parker Carlson, Wentai Xie, Shanxiu He, Tao Yang
Subjects: Information Retrieval (cs.IR)
*Comments: 6 pages, 3 figures, SIGIR 25

Click to view abstract

Abstract:This paper proposes superblock pruning (SP) during top-k online document retrieval for learned sparse representations. SP structures the sparse index as a set of superblocks on a sequence of document blocks and conducts a superblock-level selection to decide if some superblocks can be pruned before visiting their child blocks. SP generalizes previous flat block- and cluster-based pruning, allowing the early detection of groups of documents that cannot or are less likely to appear in the final top-k list. SP can accelerate sparse retrieval in a rank-safe or approximate manner under a high-relevance competitiveness constraint. Our experiments show that the proposed scheme significantly outperforms state-of-the-art baselines on MS MARCO passages on a single-threaded CPU.
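
The sketch below captures the rank-safe pruning rule: a superblock is skipped whenever its per-term score upper bound cannot beat the current k-th best score. The two-level index layout here is a simplification of the paper’s design, invented for illustration.

```python
import heapq

def superblock_topk(superblocks, query, k):
    """Rank-safe sketch of superblock pruning. `superblocks` is a list of
    (upper_bounds, blocks) pairs, where upper_bounds maps each term to its
    maximum contribution within the superblock - a simplification of the
    paper's two-level index."""
    heap: list[float] = []  # min-heap holding the current top-k scores
    for upper_bounds, blocks in superblocks:
        bound = sum(upper_bounds.get(t, 0.0) for t in query)
        if len(heap) == k and bound <= heap[0]:
            continue  # prune: no document inside can enter the top-k
        for block in blocks:  # visit child blocks only when needed
            for doc_id, weights in block:
                score = sum(weights.get(t, 0.0) for t in query)
                if len(heap) < k:
                    heapq.heappush(heap, score)
                elif score > heap[0]:
                    heapq.heapreplace(heap, score)
    return sorted(heap, reverse=True)

# Toy index: one superblock with two blocks of (doc_id, term->weight) entries.
sb = [({"ir": 2.0, "sparse": 1.5},
       [[(0, {"ir": 1.8}), (1, {"sparse": 1.2})],
        [(2, {"ir": 0.4, "sparse": 0.3})]])]
print(superblock_topk(sb, ["ir", "sparse"], k=2))  # [1.8, 1.2]
```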

Attachment download

Click to download today's full paper list