This post contains the latest paper listing retrieved from Arxiv.org on 2025-04-28, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is retrieved from Arxiv.org daily and updated automatically at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.

Table of Contents

Overview (2025-04-28)

397 new papers were updated today, including:

  • Natural Language Processing: 44 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 87 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 89 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 106 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] TRACE Back from the Future: A Probabilistic Reasoning Approach to Controllable Language Generation

[Quick Read]: This paper addresses the difficulty of effectively controlling the outputs of language models (LMs) during generation so that they align with human values or desired attributes (e.g., detoxification, personalization, topic control). Existing approaches either fine-tune or post-train the model for each new attribute, which is expensive and inflexible, or approximate the Expected Attribute Probability (EAP) of future sequences via sampling or training, which is slow and unreliable for rare attributes.

论文的关键解决方案是提出了一种名为TRACE(Tractable Probabilistic Reasoning for Adaptable Controllable gEneration)的新框架。该框架通过可追踪的概率推理和轻量级控制,高效计算EAP并适应新的属性。具体而言,TRACE从LM蒸馏出一个隐马尔可夫模型(Hidden Markov Model, HMM),并与一个小分类器结合,用于估计属性概率,从而在HMM预测的未来序列上实现精确的EAP计算。基于此EAP,TRACE重新加权LM的下一个词概率,以生成全局一致的继续文本。实验表明,TRACE在去毒化任务中实现了最先进的性能,仅增加10%的解码开销,并能在数秒内适配76个低资源个性化LLMs,同时无缝扩展到复合属性。

Link: https://arxiv.org/abs/2504.18535
Authors: Gwen Yidou Weng, Benjie Wang, Guy Van den Broeck
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:As large language models (LMs) advance, there is an increasing need to control their outputs to align with human values (e.g., detoxification) or desired attributes (e.g., personalization, topic). However, autoregressive models focus on next-token predictions and struggle with global properties that require looking ahead. Existing solutions either tune or post-train LMs for each new attribute - expensive and inflexible - or approximate the Expected Attribute Probability (EAP) of future sequences by sampling or training, which is slow and unreliable for rare attributes. We introduce TRACE (Tractable Probabilistic Reasoning for Adaptable Controllable gEneration), a novel framework that efficiently computes EAP and adapts to new attributes through tractable probabilistic reasoning and lightweight control. TRACE distills a Hidden Markov Model (HMM) from an LM and pairs it with a small classifier to estimate attribute probabilities, enabling exact EAP computation over the HMM’s predicted futures. This EAP is then used to reweigh the LM’s next-token probabilities for globally compliant continuations. Empirically, TRACE achieves state-of-the-art results in detoxification with only 10% decoding overhead, adapts to 76 low-resource personalized LLMs within seconds, and seamlessly extends to composite attributes.
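The reweighting step at the heart of TRACE can be illustrated with a small sketch. The toy numbers and function name below are hypothetical; the paper computes EAP exactly over an HMM's predicted futures, which is abstracted here as a given array:

```python
import numpy as np

def reweight_next_token(lm_probs, eap, strength=1.0):
    """Reweight LM next-token probabilities by the expected attribute
    probability (EAP) of futures starting with each candidate token.

    lm_probs: next-token probabilities from the LM.
    eap:      estimated P(attribute | prefix + token), e.g. computed
              over an HMM's predicted futures as in TRACE.
    strength: exponent controlling how strongly the attribute is enforced.
    """
    weighted = lm_probs * np.power(eap, strength)
    return weighted / weighted.sum()  # renormalize to a distribution

# Toy example: 4 candidate tokens; token 2 leads to futures far more
# likely to satisfy the attribute (e.g. non-toxic continuations).
lm_probs = np.array([0.4, 0.3, 0.2, 0.1])
eap      = np.array([0.1, 0.2, 0.9, 0.5])
adjusted = reweight_next_token(lm_probs, eap)
```

With these toy values the mass shifts toward token 2 even though the raw LM preferred token 0, which is the "globally compliant continuation" effect described in the abstract.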

[NLP-1] Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues SIGDIAL

[Quick Read]: This paper investigates how large language models (LLMs) can engage as explainers in co-constructive explanation dialogues. The key is to assess whether LLMs can dynamically monitor the explainee's understanding and adapt their explanations accordingly to advance understanding of the topic. A user study, in which some LLMs were instructed to explain a predefined topic co-constructively, shows that current LLMs exhibit some co-constructive behaviors, such as asking verification questions, which foster the explainees' engagement and can improve understanding. However, their ability to accurately monitor the current understanding and scaffold explanations accordingly remains limited.

Link: https://arxiv.org/abs/2504.18483
Authors: Leandra Fichtel, Maximilian Spliethöver, Eyke Hüllermeier, Patricia Jimenez, Nils Klowait, Stefan Kopp, Axel-Cyrille Ngonga Ngomo, Amelie Robrecht, Ingrid Scharlau, Lutz Terfloth, Anna-Lisa Vollmer, Henning Wachsmuth
Affiliations: Leibniz University Hannover (Institute of Artificial Intelligence); LMU Munich (MCML); Paderborn University; Bielefeld University (CITEC)
Subjects: Computation and Language (cs.CL)
Comments: Submitted to the SIGDial Conference 2025

Abstract:The ability to generate explanations that are understood by explainees is the quintessence of explainable artificial intelligence. Since understanding depends on the explainee’s background and needs, recent research has focused on co-constructive explanation dialogues, where the explainer continuously monitors the explainee’s understanding and adapts explanations dynamically. We investigate the ability of large language models (LLMs) to engage as explainers in co-constructive explanation dialogues. In particular, we present a user study in which explainees interact with LLMs, of which some have been instructed to explain a predefined topic co-constructively. We evaluate the explainees’ understanding before and after the dialogue, as well as their perception of the LLMs’ co-constructive behavior. Our results indicate that current LLMs show some co-constructive behaviors, such as asking verification questions, that foster the explainees’ engagement and can improve understanding of a topic. However, their ability to effectively monitor the current understanding and scaffold the explanations accordingly remains limited.

[NLP-2] Generative Induction of Dialogue Task Schemas with Streaming Refinement and Simulated Interactions ACL2025

[Quick Read]: This paper addresses Slot Schema Induction (SSI) in task-oriented dialogue (TOD) systems, i.e., automatically identifying key information slots from dialogue data without manual intervention. The key idea is to cast SSI as a text generation task in which a language model incrementally constructs and refines a slot schema over a stream of dialogue data. To support new task domains, the authors generate data with high-quality state labels via a fully automatic LLM-driven TOD simulation method. They also address data leakage and poor metric alignment with human judgment in existing SSI evaluation by creating new evaluation data using their simulation method with human guidance and correction, and by designing improved evaluation metrics. These contributions lay a foundation for future SSI research and advance the state of the art in dialogue understanding and system development.

Link: https://arxiv.org/abs/2504.18474
Authors: James D. Finch, Yasasvi Josyula, Jinho D. Choi
Affiliations: Emory University
Subjects: Computation and Language (cs.CL)
Comments: Accepted (B) to TACL 2025

Abstract:In task-oriented dialogue (TOD) systems, Slot Schema Induction (SSI) is essential for automatically identifying key information slots from dialogue data without manual intervention. This paper presents a novel state-of-the-art (SoTA) approach that formulates SSI as a text generation task, where a language model incrementally constructs and refines a slot schema over a stream of dialogue data. To develop this approach, we present a fully automatic LLM-based TOD simulation method that creates data with high-quality state labels for novel task domains. Furthermore, we identify issues in SSI evaluation due to data leakage and poor metric alignment with human judgment. We resolve these by creating new evaluation data using our simulation method with human guidance and correction, as well as designing improved evaluation metrics. These contributions establish a foundation for future SSI research and advance the SoTA in dialogue understanding and system development.

[NLP-3] Fast-Slow Thinking for Large Vision-Language Model Reasoning

[Quick Read]: This paper targets the "overthinking" phenomenon in large vision-language models (LVLMs), where models generate verbose reasoning on every task regardless of what the question actually requires. The key solution is FAST, a framework that introduces a fast-slow thinking mechanism to dynamically adapt reasoning depth to question characteristics. Built from three components (model-based metrics for question characterization, an adaptive thinking reward mechanism, and difficulty-aware KL regularization), FAST achieves over 10% relative accuracy improvement over the base model while reducing token usage by 32.7%-67.3% compared to slow-thinking approaches, effectively balancing reasoning length and accuracy.

Link: https://arxiv.org/abs/2504.18458
Authors: Wenyi Xiao, Leilei Gan, Weilong Dai, Wanggui He, Ziwei Huang, Haoyuan Li, Fangxun Shu, Zhelun Yu, Peng Zhang, Hao Jiang, Fei Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 5 figures, and 12 tables

Abstract:Recent advances in large vision-language models (LVLMs) have revealed an "overthinking" phenomenon, where models generate verbose reasoning across all tasks regardless of questions. To address this issue, we present FAST, a novel Fast-Slow Thinking framework that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. We develop FAST-GRPO with three components: model-based metrics for question characterization, an adaptive thinking reward mechanism, and difficulty-aware KL regularization. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
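As a rough illustration of how a difficulty-aware reward could trade correctness against reasoning length, here is a toy sketch. The function name, weights, and reward shape are hypothetical, not the paper's actual FAST-GRPO formulation:

```python
def adaptive_thinking_reward(correct, n_tokens, difficulty, budget=512):
    """Hypothetical sketch of a difficulty-aware reward in the spirit of
    FAST-GRPO: correctness dominates, and the length penalty is relaxed
    for hard questions (difficulty in [0, 1]) so that long "slow thinking"
    traces are only discouraged when the question is easy."""
    accuracy_reward = 1.0 if correct else 0.0
    # easy questions (difficulty near 0) get the full penalty for long traces
    length_penalty = (1.0 - difficulty) * min(n_tokens / budget, 1.0)
    return accuracy_reward - 0.5 * length_penalty

# The same verbose, correct trace is rewarded less on an easy question
# (1 - 0.5 * 0.9 = 0.55) than on a hard one (1 - 0.5 * 0.1 = 0.95).
easy = adaptive_thinking_reward(True, 512, difficulty=0.1)
hard = adaptive_thinking_reward(True, 512, difficulty=0.9)
```

Under this toy shaping, the policy is pushed toward short answers on easy inputs while retaining the freedom to reason at length on hard ones.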

[NLP-4] Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation

[Quick Read]: This paper addresses the lack of structured reasoning in existing radiology report generation models, which undermines clinical trust and explainability, particularly their failure to link visual findings to precise anatomical locations. The key solution is BoxMed-RL, a unified training framework for spatially verifiable and explainable report generation with two integrated phases: (1) a pretraining phase that uses medical concept learning with Chain-of-Thought supervision to internalize a radiologist-like workflow, followed by reinforcement learning with bounding-box-based spatial verification to align findings with anatomical locations; and (2) a downstream adapter phase that freezes the pretrained weights and fine-tunes an adapter to produce fluent and clinically credible reports. This design precisely mimics the radiologist's workflow and compels the model to connect high-level medical concepts with definitive anatomical evidence. Experiments show an average 7% improvement in METEOR and ROUGE-L over state-of-the-art methods and a further 5% average improvement on LLM-based metrics, demonstrating robust, high-quality radiology report generation.

Link: https://arxiv.org/abs/2504.18453
Authors: Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Huichi Zhou, Zhengqing Yuan, Zhifan Gao, Lei Zhu, Giorgos Papanastasiou, Yingying Fang, Guang Yang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Radiology report generation is critical for efficiency but current models lack the structured reasoning of experts, hindering clinical trust and explainability by failing to link visual findings to precise anatomical locations. This paper introduces BoxMed-RL, a groundbreaking unified training framework for generating spatially verifiable and explainable radiology reports. Built on a large vision-language model, BoxMed-RL revolutionizes report generation through two integrated phases: (1) In the Pretraining Phase, we refine the model via medical concept learning, using Chain-of-Thought supervision to internalize the radiologist-like workflow, followed by spatially verifiable reinforcement, which applies reinforcement learning to align medical findings with bounding boxes. (2) In the Downstream Adapter Phase, we freeze the pretrained weights and train a downstream adapter to ensure fluent and clinically credible reports. This framework precisely mimics radiologists’ workflow, compelling the model to connect high-level medical concepts with definitive anatomical evidence. Extensive experiments on public datasets demonstrate that BoxMed-RL achieves an average 7% improvement in both METEOR and ROUGE-L metrics compared to state-of-the-art methods. An average 5% improvement in large language model-based metrics further underscores BoxMed-RL’s robustness in generating high-quality radiology reports.

[NLP-5] PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

[Quick Read]: This paper addresses the lack of difficulty coverage, language diversity, and translation quality in multilingual mathematical reasoning benchmarks. The key contribution is PolyMath, a benchmark spanning 18 languages and 4 easy-to-hard difficulty levels, ensuring comprehensive difficulty coverage, diverse languages, and high-quality translations. Evaluating advanced LLMs on this highly discriminative benchmark reveals several key challenges in multilingual reasoning: reasoning performance varies widely across languages, input-output language consistency is low, and thinking length differs significantly by language. The study also shows that controlling the output language in the instructions can affect reasoning performance, especially for low-resource languages, suggesting a promising direction for improving the multilingual capabilities of LLMs.

Link: https://arxiv.org/abs/2504.18428
Authors: Yiming Wang, Pei Zhang, Jialong Tang, Haoran Wei, Baosong Yang, Rui Wang, Chenshu Sun, Feitong Sun, Jiran Zhang, Junxuan Wu, Qiqian Cang, Yichang Zhang, Fei Huang, Junyang Lin, Fei Huang, Jingren Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation for advanced LLMs and find that even Deepseek-R1-671B and Qwen-QwQ-32B, achieve only 43.4 and 41.8 benchmark scores, with less than 30% accuracy under the highest level. From a language perspective, our benchmark reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning performance varies widely across languages for current LLMs; (2) Input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) The thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.

[NLP-6] BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

[Quick Read]: This paper tackles the problem that efficient deployment of 1-bit large language models (LLMs) is hindered by activation outliers, which complicate low-bit quantization. The key solution is the BitNet v2 framework, which enables native 4-bit activation quantization via an H-BitLinear module that applies an online Hadamard transformation before activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms better suited to low-bit representation. Experiments show that BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58, and suffers minimal degradation when trained with native 4-bit activations, significantly reducing the memory footprint and computational cost of batched inference.

Link: https://arxiv.org/abs/2504.18415
Authors: Hongyu Wang, Shuming Ma, Furu Wei
Affiliations: Microsoft Research; University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Work in progress

Abstract:Efficient deployment of 1-bit Large Language Models (LLMs) is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.
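The effect of a Hadamard rotation on an outlier-heavy activation vector can be sketched numerically. This is a toy illustration of the idea, not the H-BitLinear implementation:

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so the transform preserves norms

def quantize_4bit(x):
    """Symmetric 4-bit quantization to integer levels in [-8, 7]."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

# An activation vector with one sharp outlier, the case H-BitLinear targets.
x = np.array([0.1, -0.2, 0.05, 8.0, 0.0, 0.3, -0.1, 0.2])
H = hadamard(8)
x_rot = H @ x               # outlier energy is spread across all coordinates
q, scale = quantize_4bit(x_rot)
x_rec = H.T @ (q * scale)   # inverse rotation after dequantization
```

Because the rotation spreads the single outlier's energy across all coordinates, the per-tensor scale is no longer dominated by one extreme value, which is why the rotated vector survives 4-bit quantization much better.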

[NLP-7] Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers

[Quick Read]: This paper examines whether a large language model (LLM) should be used as a therapist, probing the viability of LLMs replacing mental health providers. The key method is a mapping review of therapy guides used by major medical institutions to identify crucial aspects of therapeutic relationships, such as the importance of a therapeutic alliance, followed by experiments assessing how well current LLMs (e.g., gpt-4o) reproduce and adhere to these aspects. Contrary to best practices in the medical community, LLMs express stigma toward people with mental health conditions and respond inappropriately to certain common and critical conditions in naturalistic therapy settings, e.g., encouraging clients' delusional thinking. The paper also notes foundational and practical barriers to adopting LLMs as therapists, such as the fact that a therapeutic alliance requires human characteristics (e.g., identity and stakes). It concludes that LLMs should not replace therapists and discusses alternative roles for LLMs in clinical therapy.

Link: https://arxiv.org/abs/2504.18412
Authors: Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, Nick Haber
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Should a large language model (LLM) be used as a therapist? In this paper, we investigate the use of LLMs to replace mental health providers, a use case promoted in the tech startup and research space. We conduct a mapping review of therapy guides used by major medical institutions to identify crucial aspects of therapeutic relationships, such as the importance of a therapeutic alliance between therapist and client. We then assess the ability of LLMs to reproduce and adhere to these aspects of therapeutic relationships by conducting several experiments investigating the responses of current LLMs, such as gpt-4o. Contrary to best practices in the medical community, LLMs 1) express stigma toward those with mental health conditions and 2) respond inappropriately to certain common (and critical) conditions in naturalistic therapy settings – e.g., LLMs encourage clients’ delusional thinking, likely due to their sycophancy. This occurs even with larger and newer LLMs, indicating that current safety practices may not address these gaps. Furthermore, we note foundational and practical barriers to the adoption of LLMs as therapists, such as that a therapeutic alliance requires human characteristics (e.g., identity and stakes). For these reasons, we conclude that LLMs should not replace therapists, and we discuss alternative roles for LLMs in clinical therapy.

[NLP-8] HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?

[Quick Read]: This paper addresses the lack of a comprehensive benchmark for evaluating high-resolution image (HRI) understanding in Vision Large Language Models (VLMs). Although VLMs can allegedly handle HRIs, existing benchmarks do not adequately reflect real-world performance. To fill this gap, the authors introduce HRScene, a unified benchmark of 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 × 1,024 to 35,503 × 26,627, covering scenarios from microscopy to astronomy. The key is the re-annotation of these datasets to span a wide range of HRI scenarios, plus synthetic diagnostic sets designed to test how effectively models utilize image regions. Experiments show that current VLMs achieve only around 50% average accuracy on real-world tasks, revealing a significant understanding gap, while results on the synthetic data expose Regional Divergence and lost-in-the-middle effects, pointing to directions for future research.

Link: https://arxiv.org/abs/2504.18406
Authors: Yusen Zhang, Wenliang Zheng, Aashrith Madasu, Peng Shi, Ryo Kamoi, Hao Zhou, Zhuoyang Zou, Shu Zhao, Sarkar Snigdha Sarathi Das, Vipul Gupta, Xiaoxin Lu, Nan Zhang, Ranran Haoran Zhang, Avitej Iyer, Renze Lou, Wenpeng Yin, Rui Zhang
Affiliations: Penn State University; Amazon Web Services
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, 8 figures

Abstract:High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) can allegedly handle HRIs, however, there is a lack of a comprehensive benchmark for VLMs to evaluate HRI understanding. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 × 1,024 to 35,503 × 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators, covering 25 scenarios, ranging from microscopic to radiology images, street views, long-range pictures, and telescope images. It includes HRIs of real-world objects, scanned documents, and composite multi-image. The two diagnostic evaluation datasets are synthesized by combining the target image with the gold answer and distracting images in different orders, assessing how well models utilize regions in HRI. We conduct extensive experiments involving 28 VLMs, including Gemini 2.0 Flash and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of around 50% on real-world tasks, revealing significant gaps in HRI understanding. Results on synthetic datasets reveal that VLMs struggle to effectively utilize HRI regions, showing significant Regional Divergence and lost-in-middle, shedding light on future research.

[NLP-9] A UD Treebank for Bohairic Coptic

[Quick Read]: This paper addresses the critical lack of resources for Bohairic Coptic, a major classical language. The key contribution is the construction and evaluation of the first syntactically annotated corpus of Bohairic Coptic, sampling a range of works including Biblical text, saints' lives, and Christian ascetic writing. By comparison with the existing Universal Dependencies treebank of Sahidic Coptic, the commonly studied classical dialect, the authors analyze the main differences between the two dialects and conduct joint and cross-dialect parsing experiments, revealing Bohairic as a related but distinct variety.

Link: https://arxiv.org/abs/2504.18386
Authors: Amir Zeldes, Nina Speransky, Nicholas Wagner, Caroline T. Schroeder
Affiliations: Georgetown University; Hebrew University of Jerusalem; Duke University; University of Oklahoma
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Despite recent advances in digital resources for other Coptic dialects, especially Sahidic, Bohairic Coptic, the main Coptic dialect for pre-Mamluk, late Byzantine Egypt, and the contemporary language of the Coptic Church, remains critically under-resourced. This paper presents and evaluates the first syntactically annotated corpus of Bohairic Coptic, sampling data from a range of works, including Biblical text, saints’ lives and Christian ascetic writing. We also explore some of the main differences we observe compared to the existing UD treebank of Sahidic Coptic, the classical dialect of the language, and conduct joint and cross-dialect parsing experiments, revealing the unique nature of Bohairic as a related, but distinct variety from the more often studied Sahidic.

[NLP-10] Pushing the boundary on Natural Language Inference

[Quick Read]: This paper addresses the over-reliance of current Natural Language Inference (NLI) systems on annotated datasets, which often contain annotation artifacts and biases that limit generalization and real-world applicability. The key solution is a reinforcement learning approach using Group Relative Policy Optimization (GRPO) for Chain-of-Thought (CoT) learning, which removes the need for labeled rationales and enables training on more challenging datasets such as ANLI. The authors fine-tune 7B, 14B, and 32B language models with parameter-efficient techniques (LoRA and QLoRA), and with aggressive quantization (AWQ) achieve strong performance within a 22GB memory constraint, surpassing state-of-the-art results on adversarial NLI benchmarks.

Link: https://arxiv.org/abs/2504.18376
Authors: Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho
Affiliations: Technical University of Madrid
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural Language Inference (NLI) is a central task in natural language understanding with applications in fact-checking, question answering, and information retrieval. Despite its importance, current NLI systems heavily rely on supervised learning with datasets that often contain annotation artifacts and biases, limiting generalization and real-world applicability. In this work, we apply a reinforcement learning-based approach using Group Relative Policy Optimization (GRPO) for Chain-of-Thought (CoT) learning in NLI, eliminating the need for labeled rationales and enabling this type of training on more challenging datasets such as ANLI. We fine-tune 7B, 14B, and 32B language models using parameter-efficient techniques (LoRA and QLoRA), demonstrating strong performance across standard and adversarial NLI benchmarks. Our 32B AWQ-quantized model surpasses state-of-the-art results on 7 out of 11 adversarial sets – or on all of them considering our replication – within a 22GB memory footprint, showing that robust reasoning can be retained under aggressive quantization. This work provides a scalable and practical framework for building robust NLI systems without sacrificing inference quality.
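The group-relative advantage computation that GRPO uses in place of a learned critic can be sketched in a few lines. This is a generic illustration; the reward design and toy labels below are hypothetical:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: each sampled completion
    in a group is scored against the group's own mean and standard
    deviation, removing the need for a learned value function."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A group of 4 sampled CoT completions for one NLI example, rewarded 1.0
# if the predicted label (entailment/neutral/contradiction) is correct.
rewards = [1.0, 0.0, 1.0, 1.0]
adv = grpo_advantages(rewards)
```

Completions that beat the group average get positive advantage and are reinforced; the lone incorrect completion gets a strongly negative advantage, all without any rationale labels.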

[NLP-11] Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant

[Quick Read]: This paper addresses the absence of benchmark datasets tailored to multi-agent frameworks powered by large language models (LLMs). The key contribution is Auto-SLURP, which extends the original SLURP dataset by relabeling the data and integrating simulated servers and external services, enabling a comprehensive end-to-end evaluation pipeline covering language understanding, task execution, and response generation. Experiments show that Auto-SLURP poses a significant challenge to current state-of-the-art multi-agent frameworks, indicating that truly reliable and intelligent personal assistant systems remain a work in progress.

Link: https://arxiv.org/abs/2504.18373
Authors: Lei Shen, Xiaoyu Shen
Affiliations: GEB Tech; Ningbo Institute of Digital Twin, EIT, Ningbo
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In recent years, multi-agent frameworks powered by large language models (LLMs) have advanced rapidly. Despite this progress, there is still a notable absence of benchmark datasets specifically tailored to evaluate their performance. To bridge this gap, we introduce Auto-SLURP, a benchmark dataset aimed at evaluating LLM-based multi-agent frameworks in the context of intelligent personal assistants. Auto-SLURP extends the original SLURP dataset – initially developed for natural language understanding tasks – by relabeling the data and integrating simulated servers and external services. This enhancement enables a comprehensive end-to-end evaluation pipeline, covering language understanding, task execution, and response generation. Our experiments demonstrate that Auto-SLURP presents a significant challenge for current state-of-the-art frameworks, highlighting that truly reliable and intelligent multi-agent personal assistants remain a work in progress. The dataset and related code are available at this https URL.

[NLP-12] Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review

[Quick Read]: This paper addresses hallucination in large language models (LLMs), i.e., confidently outputting incorrect information, and the question of how to accurately assess and quantify LLM uncertainty. The key contribution is a systematic review of representative prior work on Uncertainty Quantification (UQ) and calibration techniques for LLMs, together with a rigorous benchmark enabling insightful comparison among existing solutions. Using two widely used reliability datasets, the authors empirically evaluate six related methods, corroborating the significant findings of the review. The paper also outlines key future directions and open challenges. To the authors' knowledge, this is the first dedicated survey of calibration methods and relevant metrics for LLMs.

Link: https://arxiv.org/abs/2504.18346
Authors: Toghrul Abbasli, Kentaroh Toyoda, Yuan Wang, Leon Witt, Muhammad Asif Ali, Yukai Miao, Dan Li, Qingsong Wei
Affiliations: Department of Computer Science and Technology, Tsinghua University, Beijing, China; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE), King Abdullah University of Science and Technology, Thuwal, Makkah, Kingdom of Saudi Arabia; Zhongguancun Laboratory, Beijing, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have been transformative across many domains. However, hallucination – confidently outputting incorrect information – remains one of the leading challenges for LLMs. This raises the question of how to accurately assess and quantify the uncertainty of LLMs. Extensive literature on traditional models has explored Uncertainty Quantification (UQ) to measure uncertainty and employed calibration techniques to address the misalignment between uncertainty and accuracy. While some of these methods have been adapted for LLMs, the literature lacks an in-depth analysis of their effectiveness and does not offer a comprehensive benchmark to enable insightful comparison among existing solutions. In this work, we fill this gap via a systematic survey of representative prior works on UQ and calibration for LLMs and introduce a rigorous benchmark. Using two widely used reliability datasets, we empirically evaluate six related methods, which justify the significant findings of our review. Finally, we provide outlooks for key future directions and outline open challenges. To the best of our knowledge, this survey is the first dedicated study to review the calibration methods and relevant metrics for LLMs.
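One widely used metric in this calibration literature is the Expected Calibration Error (ECE), which measures the misalignment between confidence and accuracy that the survey discusses. A minimal binned implementation (the survey's actual metric choices may differ):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted average gap between mean confidence
    and empirical accuracy within each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # bin weight = fraction of samples
    return ece

# Well-calibrated toy case: 80%-confident answers, right 80% of the time.
conf = [0.8] * 10
corr = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
ece = expected_calibration_error(conf, corr)  # ≈ 0.0 (well calibrated)
```

A hallucinating model that answers with high confidence but low accuracy would instead produce a large ECE, which is exactly the confidence-accuracy misalignment that calibration techniques target.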

[NLP-13] Adversarial Attacks on LLM -as-a-Judge Systems: Insights from Prompt Injections

[Quick Read]: This paper addresses the vulnerability of LLM-as-a-judge systems (used to assess text quality, code correctness, and argument strength) to prompt injection attacks. The key contribution is a framework that separates content-author attacks from system-prompt attacks, evaluating five models (Gemma 3.27B, Gemma 3.4B, Llama 3.2 3B, GPT-4, and Claude 3 Opus) on four tasks with fifty prompts per condition under various defenses, and analyzing attack success rates, model vulnerability, and transferability. Results show that smaller models are more vulnerable, attack success reaches up to 73.8%, and transferability ranges from 50.5% to 62.6%. The authors recommend multi-model committees and comparative scoring, and release all code and datasets.

Link: https://arxiv.org/abs/2504.18333
Authors: Narek Maloyan, Dmitry Namiot
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:LLM-as-judge systems used to assess text quality, code correctness, and argument strength are vulnerable to prompt injection attacks. We introduce a framework that separates content-author attacks from system-prompt attacks and evaluate five models (Gemma 3.27B, Gemma 3.4B, Llama 3.2 3B, GPT-4, and Claude 3 Opus) on four tasks with various defenses, using fifty prompts per condition. Attacks achieved up to 73.8% success; smaller models proved more vulnerable, and transferability ranged from 50.5% to 62.6%. Our results contrast with Universal Prompt Injection and AdvPrompter. We recommend multi-model committees and comparative scoring, and release all code and datasets.

[NLP-14] TextTIGER: Text-based Intelligent Generation with Entity Prompt Refinement for Text-to-Image Generation

[Quick Read]: This paper addresses the difficulty of generating images from prompts containing specific entities, since fully memorizing entity-specific knowledge is impractical given the vast number of entities and their continuous emergence. The key solution, Text-based Intelligent Generation with Entity prompt Refinement (TextTIGER), augments knowledge about the entities in a prompt and then summarizes the augmented descriptions with Large Language Models (LLMs) to mitigate the performance degradation caused by longer inputs. Evaluated on the newly introduced WiT-Cub dataset (captions, images, and an entity list), experiments across four image generation models and five LLMs show that TextTIGER improves image generation on standard metrics (IS, FID, and CLIPScore) over caption-only prompts, while multiple annotators confirm that the summarized descriptions are more informative, validating LLMs' ability to produce concise yet rich descriptions.

Link: https://arxiv.org/abs/2504.18269
Authors: Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Jingun Kwon, Hidetaka Kamigaito, Katsuhiko Hayashi, Manabu Okumura, Taro Watanabe
Affiliations: Nara Institute of Science and Technology (NAIST), Japan; The University of Tokyo, Japan; Chungnam National University, Korea; Institute of Science Tokyo, Japan
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Generating images from prompts containing specific entities requires models to retain as much entity-specific knowledge as possible. However, fully memorizing such knowledge is impractical due to the vast number of entities and their continuous emergence. To address this, we propose Text-based Intelligent Generation with Entity prompt Refinement (TextTIGER), which augments knowledge on entities included in the prompts and then summarizes the augmented descriptions using Large Language Models (LLMs) to mitigate performance degradation from longer inputs. To evaluate our method, we introduce WiT-Cub (WiT with Captions and Uncomplicated Background-explanations), a dataset comprising captions, images, and an entity list. Experiments on four image generation models and five LLMs show that TextTIGER improves image generation performance in standard metrics (IS, FID, and CLIPScore) compared to caption-only prompts. Additionally, multiple annotators’ evaluation confirms that the summarized descriptions are more informative, validating LLMs’ ability to generate concise yet rich descriptions. These findings demonstrate that refining prompts with augmented and summarized entity-related descriptions enhances image generation capabilities. The code and dataset will be available upon acceptance.

[NLP-15] MAGI: Multi-Agent Guided Interview for Psychiatric Assessment

[Quick Read]: This paper addresses the difficulty of automating structured clinical interviews for psychiatric assessment, where existing large language model (LLM) approaches fail to align with psychiatric diagnostic protocols. The key contribution is MAGI, a framework that transforms the gold-standard Mini International Neuropsychiatric Interview (MINI) into automatic computational workflows through coordinated multi-agent collaboration, with four specialized agents: (1) an interview-tree navigation agent adhering to the MINI's branching structure; (2) an adaptive question agent blending diagnostic probing, explaining, and empathy; (3) a judgment agent validating whether participants' responses satisfy the current node; and (4) a diagnosis agent generating Psychometric Chain-of-Thought (PsyCoT) traces that explicitly map symptoms to clinical criteria. Results on 1,002 real-world participants covering depression, generalized anxiety, social anxiety, and suicidality show that MAGI advances LLM-assisted mental health assessment by combining clinical rigor, conversational adaptability, and explainable reasoning.

Link: https://arxiv.org/abs/2504.18260
Authors: Guanqun Bi, Zhuang Chen, Zhoufu Liu, Hongkai Wang, Xiyao Xiao, Yuqiang Xie, Wen Zhang, Yongkang Huang, Yuxuan Chen, Libiao Peng, Yi Feng, Minlie Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: In progress

Abstract:Automating structured clinical interviews could revolutionize mental healthcare accessibility, yet existing large language models (LLMs) approaches fail to align with psychiatric diagnostic protocols. We present MAGI, the first framework that transforms the gold-standard Mini International Neuropsychiatric Interview (MINI) into automatic computational workflows through coordinated multi-agent collaboration. MAGI dynamically navigates clinical logic via four specialized agents: 1) an interview tree guided navigation agent adhering to the MINI's branching structure, 2) an adaptive question agent blending diagnostic probing, explaining, and empathy, 3) a judgment agent validating whether the response from participants meet the node, and 4) a diagnosis agent generating Psychometric Chain-of-Thought (PsyCoT) traces that explicitly map symptoms to clinical criteria. Experimental results on 1,002 real-world participants covering depression, generalized anxiety, social anxiety and suicide shows that MAGI advances LLM-assisted mental health assessment by combining clinical rigor, conversational adaptability, and explainable reasoning.

[NLP-16] Efficient Single-Pass Training for Multi-Turn Reasoning

[Quick Read]: This paper addresses a challenge in fine-tuning LLMs on multi-turn reasoning datasets: generated reasoning tokens must be excluded from subsequent inputs, which prevents processing an entire conversation in a single forward pass, an optimization readily available when fine-tuning on multi-turn non-reasoning datasets. The key solution overcomes this limitation through response token duplication and a custom attention mask that enforces appropriate visibility constraints, significantly reducing training time and enabling efficient fine-tuning on multi-turn reasoning datasets.

Link: https://arxiv.org/abs/2504.18246
Authors: Ritesh Goru, Shanay Mehta, Prateek Jain
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Abstract:Training Large Language Models (LLMs) to generate explicit reasoning before they produce an answer has been shown to improve their performance across various tasks such as mathematics and coding. However, fine-tuning LLMs on multi-turn reasoning datasets presents a unique challenge: LLMs must generate reasoning tokens that are excluded from subsequent inputs to the LLM. This discrepancy prevents us from processing an entire conversation in a single forward pass – an optimization readily available when we fine-tune on a multi-turn non-reasoning dataset. This paper proposes a novel approach that overcomes this limitation through response token duplication and a custom attention mask that enforces appropriate visibility constraints. Our approach significantly reduces the training time and allows efficient fine-tuning on multi-turn reasoning datasets.
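The visibility idea behind response duplication can be sketched as a segment-level mask. The segment layout, names, and lengths below are a hypothetical illustration, not the paper's exact construction: answer tokens appear twice (a trained copy that may see its own reasoning, and a context copy that later turns see), while reasoning tokens are hidden from all later turns:

```python
import numpy as np

# Hypothetical layout for a 2-turn conversation in one packed sequence:
# u1 (user), r1 (reasoning), a1 (answer, trained copy),
# a1_ctx (answer, context copy), u2, r2, a2.
segments = ["u1", "r1", "a1", "a1_ctx", "u2", "r2", "a2"]
lengths  = [3, 4, 2, 2, 3, 4, 2]

# Which segments each segment may attend to: later turns see the context
# copy of earlier answers (a1_ctx) but never the reasoning tokens r1.
visible = {
    "u1":     ["u1"],
    "r1":     ["u1", "r1"],
    "a1":     ["u1", "r1", "a1"],
    "a1_ctx": ["u1", "a1_ctx"],
    "u2":     ["u1", "a1_ctx", "u2"],
    "r2":     ["u1", "a1_ctx", "u2", "r2"],
    "a2":     ["u1", "a1_ctx", "u2", "r2", "a2"],
}

def build_mask(segments, lengths, visible):
    """Boolean attention mask: mask[i, j] is True iff token i may attend to j."""
    starts = np.cumsum([0] + lengths[:-1])
    span = {s: (st, st + ln) for s, st, ln in zip(segments, starts, lengths)}
    n = sum(lengths)
    mask = np.zeros((n, n), dtype=bool)
    for s in segments:
        i0, i1 = span[s]
        for t in visible[s]:
            j0, j1 = span[t]
            mask[i0:i1, j0:j1] = True
    # keep causal ordering within every allowed block
    mask &= np.tril(np.ones((n, n), dtype=bool))
    return mask

mask = build_mask(segments, lengths, visible)
```

With a mask like this, the whole conversation can be trained in one forward pass while each turn still sees exactly the context it would see at inference time.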
zh
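论文摘要中提到的“响应 token 复制 + 自定义注意力掩码”并未附带公开伪代码。下面用纯 Python 勾勒可见性规则本身:每个 token 标注为 (turn_id, is_reasoning),后续轮次屏蔽之前轮次的推理 token(标注方式与规则细节均为本文的假设性示意,并非论文官方实现,响应 token 复制从略):

```python
# 示意:多轮推理微调中的可见性约束掩码(假设性实现)。
# 规则:token i 只能关注 j <= i(因果掩码),
# 且若 j 是推理 token,则必须与 i 属于同一轮
# (之前轮次的推理 token 不出现在后续输入中)。

def build_visibility_mask(tokens):
    """tokens: list of (turn_id, is_reasoning);返回 mask[i][j] = True 表示 i 可关注 j。"""
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, (ti, _) in enumerate(tokens):
        for j, (tj, rj) in enumerate(tokens):
            if j > i:
                continue  # 因果性:不能看未来
            if rj and tj != ti:
                continue  # 屏蔽之前轮次的推理 token
            mask[i][j] = True
    return mask
```

实际训练中,该掩码会以张量形式传入 Transformer 注意力层,从而在一次前向传递中处理整个对话。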

[NLP-17] Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family

【速读】: 该论文试图解决多语言环境下检索增强生成(RAG)模型在标准化基准测试中的性能下降以及跨语言引用定位不一致的问题。解决方案的关键在于引入了Pleias-RAG-350m和Pleias-RAG-1B两个新世代的小型推理模型,它们通过在大规模合成数据集上进行中间训练,模拟从Common Corpus中检索多种语言开放源的能力,同时支持直接引用和精准引用定位,并重新整合了与RAG工作流相关的多项功能(如查询路由、查询重写和源重排序)。这些模型不仅在HotPotQA和2Wiki等标准化RAG基准测试中超越了参数量小于40亿的SLMs,还与更大规模的流行模型(如Qwen-2.5-7B、Llama-3.1-8B和Gemma-3-4B)竞争,且是目前唯一能够在主要欧洲语言中保持一致RAG性能并确保语句系统性引用定位的SLMs。其设计使得模型在受限基础设施上易于部署的同时具备更高的事实准确性,从而解锁了生成式AI (Generative AI) 的更多应用场景。

链接: https://arxiv.org/abs/2504.18225
作者: Pierre-Carl Langlais,Pavel Chizhov,Mattia Nee,Carlos Rosas Hinostroza,Matthieu Delsart,Irène Girard,Othman Hicheur,Anastasia Stasenko,Ivan P. Yamshchikov
机构: PleIAs(普莱亚斯), Paris, France
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce a new generation of small reasoning models for RAG, search, and source summarization. Pleias-RAG-350m and Pleias-RAG-1B are mid-trained on a large synthetic dataset emulating the retrieval of a wide variety of multilingual open sources from the Common Corpus. They provide native support for citation and grounding with literal quotes and reintegrate multiple features associated with RAG workflows, such as query routing, query reformulation, and source reranking. Pleias-RAG-350m and Pleias-RAG-1B outperform SLMs below 4 billion parameters on standardized RAG benchmarks (HotPotQA, 2wiki) and are competitive with popular larger models, including Qwen-2.5-7B, Llama-3.1-8B, and Gemma-3-4B. They are the only SLMs to date maintaining consistent RAG performance across leading European languages and ensuring systematic reference grounding for statements. Due to their size and ease of deployment on constrained infrastructure and higher factuality by design, the models unlock a range of new use cases for generative AI.
zh
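Pleias-RAG 强调“以逐字引用为语句提供出处”(grounding with literal quotes)。下面是一个与具体模型无关的小工具,用于事后核验输出中的引号片段是否逐字出现在检索到的源文档中(提取引号的正则与核验方式是本文假设的简化做法,并非该模型家族的实际实现):

```python
# 示意:核验回答中的"直接引用"是否逐字出现在任一源文档中(假设性简化)。
import re

def verify_quotes(answer, sources):
    """提取 answer 中双引号内的片段,返回 (引用, 是否逐字命中) 列表。"""
    quotes = re.findall(r'"([^"]+)"', answer)
    return [(q, any(q in s for s in sources)) for q in quotes]
```

这种逐字匹配只能检查“引用是否存在”,不检查引用是否支持相应论断,后者仍需模型或人工判断。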

[NLP-18] Optimising ChatGPT for creativity in literary translation: A case study from English into Dutch, Chinese, Catalan and Spanish

【速读】: 本文旨在研究ChatGPT机器翻译(MT)在四种语言下六种不同配置中的输出变异性,重点关注文学文本中的创造性表现。论文通过在不同文本粒度、温度设置和提示策略下评估GPT翻译,并采用创造力评分公式,探索如何提升其创造性翻译能力。关键解决方案在于使用极简指令提示ChatGPT,例如以“以富有创意的方式将以下文本翻译成[TG]”并在温度设置为1.0时进行操作,这种方法在西班牙语、荷兰语和中文中优于其他配置及DeepL。然而,ChatGPT的整体表现仍明显逊于人工翻译(HT)。

链接: https://arxiv.org/abs/2504.18221
作者: Shuxiang Du,Ana Guerberof Arenas,Antonio Toral,Kyo Gerrits,Josep Marco Borillo
机构: Centre for Language and Cognition, University of Groningen (格罗宁根大学语言与认知中心); Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant (阿利坎特大学语言与信息系统系); Departament de Traducció i Comunicació, Universitat Jaume I (哈伊梅一世大学翻译与传播系)
类目: Computation and Language (cs.CL)
备注: This paper has been accepted to the MT Summit 2025 to be held in Geneva on June 23-27 2025

点击查看摘要

Abstract:This study examines the variability of ChatGPT machine translation (MT) outputs across six different configurations in four languages, with a focus on creativity in a literary text. We evaluate GPT translations in different text granularity levels, temperature settings and prompting strategies with a Creativity Score formula. We found that prompting ChatGPT with a minimal instruction yields the best creative translations, with “Translate the following text into [TG] creatively” at the temperature of 1.0 outperforming other configurations and DeepL in Spanish, Dutch, and Chinese. Nonetheless, ChatGPT consistently underperforms compared to human translation (HT).
zh

[NLP-19] Aligning Language Models for Icelandic Legal Text Summarization

【速读】: 该论文试图解决法律领域语言模型在处理专业术语、细微语言表达及正式文体时面临的挑战,以提升其生成符合法律领域标准且满足用户偏好的文本能力。论文的关键解决方案在于采用基于偏好(preference-based)的训练技术,特别是通过从人类反馈中强化学习(Reinforcement Learning from Human Feedback, RLHF)和直接偏好优化(Direct Preference Optimization, DPO),与传统的监督学习方法进行对比,评估其在改善生成的冰岛语法律摘要质量和法律准确性方面的效果。

链接: https://arxiv.org/abs/2504.18180
作者: Þórir Hrafn Harðarson,Hrafn Loftsson,Stefán Ólafsson
机构: Reykjavik University (雷克雅未克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published at NoDaLiDa 2025

点击查看摘要

Abstract:The integration of language models in the legal domain holds considerable promise for streamlining processes and improving efficiency in managing extensive workloads. However, the specialized terminology, nuanced language, and formal style of legal texts can present substantial challenges. This study examines whether preference-based training techniques, specifically Reinforcement Learning from Human Feedback and Direct Preference Optimization, can enhance models’ performance in generating Icelandic legal summaries that align with domain-specific language standards and user preferences. We compare models fine-tuned with preference training to those using conventional supervised learning. Results indicate that preference training improves the legal accuracy of generated summaries over standard fine-tuning but does not significantly enhance the overall quality of Icelandic language usage. Discrepancies between automated metrics and human evaluations further underscore the importance of qualitative assessment in developing language models for the legal domain.
zh
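摘要中对比的 Direct Preference Optimization (DPO) 有标准的逐样本损失形式:-log σ(β[(log πθ(y_w) − log π_ref(y_w)) − (log πθ(y_l) − log π_ref(y_l))]),其中 y_w/y_l 为偏好与非偏好回答。下面用纯 Python 写出该公式以便对照(β 取值仅为示例,与论文的具体设置无关):

```python
# 示意:DPO 的逐样本损失(标准公式,非论文专有实现)。
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_*: 策略模型对数概率; ref_logp_*: 参考模型对数概率。"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

当策略相对参考模型更偏向偏好回答时 margin 变大、损失变小;margin 为 0 时损失为 log 2。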

[NLP-20] EDU-NER-2025: Named Entity Recognition in Urdu Educational Texts using XLM-RoBERTa with X (formerly Twitter)

【速读】: 该论文致力于解决乌尔都语教育领域命名实体识别(NER)任务中因缺乏标注数据导致的研究不足问题。现有模型难以准确识别学术角色、课程名称及机构术语等特定领域的实体,这限制了其性能表现。为应对这一挑战,论文的关键贡献在于构建了一个名为EDU-NER-2025的手动标注教育领域数据集,包含13种与教育相关的独特且重要的实体类型;详细描述了标注过程与准则,并探讨了标注过程中面临的挑战;同时分析了乌尔都语正式文本中存在的形态复杂性和歧义性等语言学难题。这些努力旨在填补乌尔都语教育领域NER研究的空白,并为后续工作提供必要的资源支持。

链接: https://arxiv.org/abs/2504.18142
作者: Fida Ullah,Muhammad Ahmad,Muhammad Tayyab Zamir,Muhammad Arif,Grigori Sidorov,Edgardo Manuel Felipe Riverón,Alexander Gelbukh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Named Entity Recognition (NER) plays a pivotal role in various Natural Language Processing (NLP) tasks by identifying and classifying named entities (NEs) from unstructured data into predefined categories such as person, organization, location, date, and time. While extensive research exists for high-resource languages and general domains, NER in Urdu, particularly within domain-specific contexts like education, remains significantly underexplored. This is due to the lack of annotated datasets for educational content, which limits the ability of existing models to accurately identify entities such as academic roles, course names, and institutional terms, underscoring the urgent need for targeted resources in this domain. To the best of our knowledge, no dataset exists in the domain of the Urdu language for this purpose. To achieve this objective this study makes three key contributions. Firstly, we created a manually annotated dataset in the education domain, named EDU-NER-2025, which contains 13 unique and most important entity types related to the education domain. Second, we describe our annotation process and guidelines in detail and discuss the challenges of labelling the EDU-NER-2025 dataset. Third, we addressed and analyzed key linguistic challenges, such as morphological complexity and ambiguity, which are prevalent in formal Urdu texts.
zh
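构建此类人工标注 NER 数据集时,常见做法是把实体区间标注转换为 token 级 BIO 标签再喂给模型。下面是一个通用的转换函数(示例中的实体类型 ROLE/COURSE 仅为说明,并非 EDU-NER-2025 的官方标签集):

```python
# 示意:将标注的实体区间转换为 BIO 标签(通用 NER 数据构建步骤)。

def spans_to_bio(tokens, spans):
    """tokens: 词列表;spans: [(start, end, label)],end 为右开区间的 token 下标。"""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # 实体首 token
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # 实体内部 token
    return tags
```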

[NLP-21] Temporal Entailment Pretraining for Clinical Language Models over EHR Data

【速读】: 该论文旨在解决现有临床语言模型在处理电子健康记录(EHR)时忽视患者轨迹随时间演变和因果交织的问题。大多数方法将EHR视为静态文档,而未充分利用其时间动态特性。为了解决这一局限,论文提出了一种新颖的时间蕴含预训练目标(temporal entailment pretraining objective),通过将EHR片段建模为按时间顺序排列的句子对,并训练模型判断后一状态是否由前一状态蕴含、矛盾或中立,从而实现对临床推理的时间建模。关键在于利用这种时间结构化的预训练任务,使模型能够学习到潜在的临床推理能力,进而提升其在预测和诊断等下游任务中的泛化性能。实验结果表明,该方法在时间临床问答(temporal clinical QA)、早期预警预测以及疾病进展建模等任务上取得了最先进的性能。

链接: https://arxiv.org/abs/2504.18128
作者: Tatsunori Tanaka,Fi Zheng,Kai Sato,Zhifeng Li,Yuanyun Zhang,Shi Li
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Clinical language models have achieved strong performance on downstream tasks by pretraining on domain specific corpora such as discharge summaries and medical notes. However, most approaches treat the electronic health record as a static document, neglecting the temporally-evolving and causally entwined nature of patient trajectories. In this paper, we introduce a novel temporal entailment pretraining objective for language models in the clinical domain. Our method formulates EHR segments as temporally ordered sentence pairs and trains the model to determine whether a later state is entailed by, contradictory to, or neutral with respect to an earlier state. Through this temporally structured pretraining task, models learn to perform latent clinical reasoning over time, improving their ability to generalize across forecasting and diagnosis tasks. We pretrain on a large corpus derived from MIMIC IV and demonstrate state of the art results on temporal clinical QA, early warning prediction, and disease progression modeling.
zh
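论文将 EHR 片段组织为按时间排序的句子对,再让模型判断后一状态相对前一状态是蕴含、矛盾还是中立。下面示意“配对”这一步(蕴含标签需由规则或标注器另行给出;时间戳与文本均为虚构示例,非 MIMIC IV 数据):

```python
# 示意:把按时间排序的 EHR 记录组织为 (前状态, 后状态) 有序句对。

def make_temporal_pairs(notes):
    """notes: [(timestamp, text)];返回所有按时间先后排列的句对。"""
    notes = sorted(notes, key=lambda x: x[0])
    return [(notes[i][1], notes[j][1])
            for i in range(len(notes))
            for j in range(i + 1, len(notes))]
```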

[NLP-22] Evaluating Evaluation Metrics – The Mirage of Hallucination Detection

【速读】: 该论文试图解决语言模型幻觉(hallucinations)测量不准确的问题,这一问题阻碍了模型的可靠性和广泛应用。论文的关键在于通过大规模实证研究评估了6组多样化的幻觉检测指标在4个数据集、5个解码方法以及来自5个家族的37个语言模型上的表现。研究发现现有评估指标存在与人工判断不一致、对问题理解过于狭隘以及在参数扩展中收益不稳定等令人担忧的局限性。然而,论文指出基于大型语言模型(LLM)特别是GPT-4的评估表现出最佳整体效果,且模式搜索(mode-seeking)解码方法能够有效减少知识驱动场景中的幻觉现象。这些结果强调了开发更稳健的幻觉量化指标及改进缓解策略的重要性。

链接: https://arxiv.org/abs/2504.18114
作者: Atharva Kulkarni,Yuan Zhang,Joel Ruben Antony Moniz,Xiou Ge,Bo-Hsiang Tseng,Dhivya Piraviperumal,Swabha Swayamdipta,Hong Yu
机构: University of Southern California (南加州大学); Apple Inc. (苹果公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.
zh
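衡量“指标得分与人工判断是否一致”常用 Spearman 等级相关。下面给出其标准定义的纯 Python 实现(无并列秩修正;论文实际使用的统计方法未在摘要中说明,此处仅作概念演示):

```python
# 示意:Spearman 等级相关,用于比较自动指标得分与人工判断的一致性。

def spearman_rho(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))   # 标准公式(假设无并列秩)
```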

[NLP-23] Comparative Study on the Discourse Meaning of Chinese and English Media in the Paris Olympics Based on LDA Topic Modeling Technology and LLM Prompt Engineering

【速读】: 该论文旨在通过主题建模(Topic Modeling)、大型语言模型提示工程(LLM Prompt Engineering)及语料库短语学方法,分析中英媒体对巴黎奥运会报道的相似性和差异性,特别是探讨话语建构方式及态度意义的异同。论文的关键在于采用跨学科方法论,结合定量与定性分析手段,以揭示中英媒体在话题聚焦(如开幕仪式、运动员表现、赞助品牌等)及情感极性表达上的特点,从而为理解不同文化背景下媒介叙事提供实证依据。

链接: https://arxiv.org/abs/2504.18106
作者: Yinglong Yu,Zhaopu Yao,Fang Yuan
机构: Communication University of China (中国传媒大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study analyzes Chinese and English media reports on the Paris Olympics using topic modeling, Large Language Model (LLM) prompt engineering, and corpus phraseology methods to explore similarities and differences in discourse construction and attitudinal meanings. Common topics include the opening ceremony, athlete performance, and sponsorship brands. Chinese media focus on specific sports, sports spirit, doping controversies, and new technologies, while English media focus on female athletes, medal wins, and eligibility controversies. Chinese reports show more frequent prepositional co-occurrences and positive semantic prosody in describing the opening ceremony and sports spirit. English reports exhibit positive semantic prosody when covering female athletes but negative prosody in predicting opening ceremony reactions and discussing women’s boxing controversies.
zh
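摘要中的 LDA 主题建模在实践中一般直接调用 gensim 或 scikit-learn;为说明其原理,下面给出一个极简的 collapsed Gibbs 采样实现(超参数、迭代次数与语料均为随意设定,仅作教学示意,与论文的实验配置无关):

```python
# 示意:LDA 的极简 collapsed Gibbs 采样(教学用途)。
import random

def lda_gibbs(docs, K, V, iters=30, alpha=0.1, beta=0.01, seed=0):
    """docs: 每篇文档为词 id 列表;返回主题-词分布 phi (K x V)。"""
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]          # 文档-主题计数
    nkw = [[0] * V for _ in range(K)]      # 主题-词计数
    nk = [0] * K
    z = []
    for d, doc in enumerate(docs):         # 随机初始化主题指派
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                # 先移除当前指派
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [[(nkw[k][w] + beta) / (nk[k] + V * beta) for w in range(V)]
            for k in range(K)]
```

每个主题的 phi 行即该主题的词分布,按概率排序取前若干词即得报道语料的主题关键词。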

[NLP-24] Application and Optimization of Large Models Based on Prompt Tuning for Fact-Check-Worthiness Estimation

【速读】: 该论文旨在应对全球化和信息化背景下日益严重的 misinformation(虚假信息)问题,提出了一种基于 prompt tuning(提示微调)的可事实核查性评估分类方法。论文的关键解决方案在于通过在大语言模型中应用设计好的提示模板,利用 in-context learning(上下文学习)和 prompt tuning 技术,在有限或未标注数据情况下提升判断声明是否具有事实核查价值的准确性。实验结果表明,该方法在包括 BERT、GPT-3.5 和 GPT-4 等多种基线模型的对比中表现出色,特别是在 F1 分数和准确率等评估指标上具有一定优势,有效验证了其在事实核查性评估任务中的有效性与先进性。

链接: https://arxiv.org/abs/2504.18104
作者: Yinglong Yu,Hao Shen,Zhengyi Lyu,Qi He
机构: Communication University of China (中国传媒大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In response to the growing problem of misinformation in the context of globalization and informatization, this paper proposes a classification method for fact-check-worthiness estimation based on prompt tuning. We construct a model for fact-check-worthiness estimation at the methodological level using prompt tuning. By applying designed prompt templates to large language models, we establish in-context learning and leverage prompt tuning technology to improve the accuracy of determining whether claims have fact-check-worthiness, particularly when dealing with limited or unlabeled data. Through extensive experiments on public datasets, we demonstrate that the proposed method surpasses or matches multiple baseline methods in the classification task of fact-check-worthiness estimation assessment, including classical pre-trained models such as BERT, as well as recent popular large models like GPT-3.5 and GPT-4. Experiments show that the prompt tuning-based method proposed in this study exhibits certain advantages in evaluation metrics such as F1 score and accuracy, thereby effectively validating its effectiveness and advancement in the task of fact-check-worthiness estimation.
zh
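提示微调与上下文学习的核心之一是提示模板和 few-shot 示例的拼装。下面示意一个可核查性判断的模板构造(指令措辞与示例均为本文假设,并非论文使用的原始 prompt):

```python
# 示意:可核查性判断的 few-shot 提示构造(模板措辞为假设)。

def build_prompt(claim, examples):
    """examples: [(陈述, 标签)];返回拼装好的完整 prompt 字符串。"""
    lines = ["判断下列陈述是否值得进行事实核查,回答 Yes 或 No。"]
    for text, label in examples:
        lines.append(f"陈述: {text}\n回答: {label}")
    lines.append(f"陈述: {claim}\n回答:")
    return "\n\n".join(lines)
```

拼装出的字符串以“回答:”结尾,交由大模型续写即可得到分类结果。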

[NLP-25] Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture

【速读】: 该论文旨在解决通过语音声学信号预测舌部及唇部构音特征的问题。解决方案的关键在于采用了一种结合堆叠双向长短期记忆网络(stacked Bidirectional Long Short-Term Memory, BiLSTM)与一维卷积神经网络(Convolutional Neural Network, CNN)的架构,并使用固定权重初始化方法进行后处理。这种设计不仅有效捕捉了语音信号的时间依赖性,还通过CNN进一步优化了特征表示,从而提高了预测精度。实验结果表明,该固定权重方法在较少训练轮次内实现了优于自适应权重初始化的性能,特别是在不同说话人和语料库条件下均表现出了良好的泛化能力。

链接: https://arxiv.org/abs/2504.18099
作者: Leena G Pillai,D. Muhammad Noorul Mubarak,Elizabeth Sherly
机构: University of Kerala (喀拉拉大学); Digital University Kerala (数字大学喀拉拉)
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 10 pages with 8 figures. This paper was presented at an international conference

点击查看摘要

Abstract:Speech production is a complex sequential process which involves the coordination of various articulatory features. Among them, the tongue is a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intelligible, clear, and distinct. This paper presents a novel approach for predicting tongue and lip articulatory features involved in a given speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing with fixed weights initialization. The proposed network is trained with two datasets consisting of simultaneously recorded speech and Electromagnetic Articulography (EMA) datasets, each introducing variations in terms of geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), corpus dependent (CD) and cross corpus (CC) modes. Experimental results indicate that the proposed model with the fixed-weight approach outperformed adaptive weight initialization within a relatively minimal number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.
zh
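论文中固定权重初始化的一维 CNN 用于后处理,其直观效果类似对预测的构音轨迹做固定核平滑。下面用一个固定的平滑核演示这一思路(核取值与填充方式均为本文假设,并非论文的实际权重):

```python
# 示意:固定权重的一维卷积后处理(same padding,边缘复制填充;核取值为假设)。

def conv1d_fixed(signal, kernel=(0.25, 0.5, 0.25)):
    """对一条轨迹做固定核一维卷积平滑,权重不参与训练。"""
    r = len(kernel) // 2
    padded = [signal[0]] * r + list(signal) + [signal[-1]] * r  # 边缘复制填充
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]
```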

[NLP-26] Random-Set Large Language Models

【速读】: 本文旨在解决大型语言模型(Large Language Models, LLMs)生成文本可信度的问题。具体而言,研究关注如何量化LLMs中的不确定性(Uncertainty Quantification)。论文提出了一种新颖的随机集大型语言模型(Random-Set Large Language Model, RSLLM)方法,通过预测令牌空间上的有限随机集(belief functions),而非传统的概率向量,来实现这一目标。关键在于引入基于层次聚类的方法,提取并利用一组“焦点”子集(focal subsets),使得预测过程在保持效率的同时具备可扩展性。RSLLMs通过与训练集规模和多样性相关的credal集大小,编码生成过程中产生的认识论不确定性(epistemic uncertainty)。实验结果表明,该方法在CoQA和OBQA数据集上优于标准模型,不仅提高了答案的正确性,还展示了估计二级不确定性和检测幻觉(hallucination)的能力。

链接: https://arxiv.org/abs/2504.18085
作者: Muhammad Mubashar,Shireen Kudukkil Manchingal,Fabio Cuzzolin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are known to produce very high-quality text and responses to our queries. But how much can we trust this generated text? In this paper, we study the problem of uncertainty quantification in LLMs. We propose a novel Random-Set Large Language Model (RSLLM) approach which predicts finite random sets (belief functions) over the token space, rather than probability vectors as in classical LLMs. In order to do so efficiently, we also present a methodology based on hierarchical clustering to extract and use a budget of “focal” subsets of tokens upon which the belief prediction is defined, rather than using all possible collections of tokens, making the method scalable yet effective. RS-LLMs encode the epistemic uncertainty induced in their generation process by the size and diversity of its training set via the size of the credal sets associated with the predicted belief functions. The proposed approach is evaluated on CoQA and OBQA datasets using Llama2-7b, Mistral-7b and Phi-2 models and is shown to outperform the standard model in both datasets in terms of correctness of answer while also showing potential in estimating the second level uncertainty in its predictions and providing the capability to detect when it is hallucinating.
zh
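RSLLM 预测的是“焦点子集”上的质量分配而非单点概率。给定焦点子集及其质量,某个候选 token 集合的概率下界/上界即 Dempster-Shafer 理论中标准定义的 belief 与 plausibility(如何从词表层次聚类得到焦点子集不在此示意范围内,示例数据为虚构):

```python
# 示意:由焦点子集的质量分配计算 belief(下界)与 plausibility(上界)。
# Dempster-Shafer 标准定义:Bel(A) = sum of m(S) for S ⊆ A;Pl(A) = sum for S ∩ A ≠ ∅。

def belief_plausibility(focal_masses, query):
    """focal_masses: {frozenset(tokens): mass};query: 候选 token 集合。"""
    bel = sum(m for s, m in focal_masses.items() if s <= query)
    pl = sum(m for s, m in focal_masses.items() if s & query)
    return bel, pl
```

区间 [Bel, Pl] 的宽度即可作为认识论不确定性的一种度量:区间越宽,模型越"不知道"。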

[NLP-27] Stabilizing Reasoning in Medical LLM s with Continued Pretraining and Reasoning Preference Optimization

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在医学领域应用中面临的三个主要挑战:事实准确性、特定语言(如日语)的局限性以及生成可信赖推理解释的可靠性。在临床应用中,后者尤为重要,因为可靠的推理解释是建立信任的前提条件。论文的关键解决方案是提出Preferred-MedLLM-Qwen-72B模型,并采用两阶段微调策略:首先通过Continued Pretraining (CPT) 在全面的日文医学语料库上训练以深入掌握领域知识;其次通过基于偏好的推理偏好优化(Reasoning Preference Optimization, RPO),在保持高答案准确性的同时增强生成可靠推理路径的能力。这种两阶段方法有效解决了现有模型在需要生成解释时准确性显著下降的问题,证明了RPO在稳定推理生成方面的有效性。

链接: https://arxiv.org/abs/2504.18080
作者: Wataru Kawakami,Keita Suzuki,Junichiro Iwasawa
机构: Preferred Networks Inc. (首选网络公司), Tokyo, Japan; Graduate School of Information Science and Technology, The University of Tokyo (东京大学信息科学与技术研究生院), Tokyo, Japan
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) show potential in medicine, yet clinical adoption is hindered by concerns over factual accuracy, language-specific limitations (e.g., Japanese), and critically, their reliability when required to generate reasoning explanations – a prerequisite for trust. This paper introduces Preferred-MedLLM-Qwen-72B, a 72B-parameter model optimized for the Japanese medical domain to achieve both high accuracy and stable reasoning. We employ a two-stage fine-tuning process on the Qwen2.5-72B base model: first, Continued Pretraining (CPT) on a comprehensive Japanese medical corpus instills deep domain knowledge. Second, Reasoning Preference Optimization (RPO), a preference-based method, enhances the generation of reliable reasoning pathways while preserving high answer accuracy. Evaluations on the Japanese Medical Licensing Exam benchmark (IgakuQA) show Preferred-MedLLM-Qwen-72B achieves state-of-the-art performance (0.868 accuracy), surpassing strong proprietary models like GPT-4o (0.866). Crucially, unlike baseline or CPT-only models which exhibit significant accuracy degradation (up to 11.5% and 3.8% respectively on IgakuQA) when prompted for explanations, our model maintains its high accuracy (0.868) under such conditions. This highlights RPO’s effectiveness in stabilizing reasoning generation. This work underscores the importance of optimizing for reliable explanations alongside accuracy. We release the Preferred-MedLLM-Qwen-72B model weights to foster research into trustworthy LLMs for specialized, high-stakes applications.
zh

[NLP-28] PropRAG : Guiding Retrieval with Beam Search over Proposition Paths

【速读】: 该论文旨在解决标准 Retrieval Augmented Generation (RAG) 方法在处理复杂推理(关联性)和上下文理解(意义建构)时未能捕捉人类记忆相互关联性的局限性。同时,尽管结构化 RAG 方法如 HippoRAG 利用知识图谱(Knowledge Graphs, KGs),但其固有的上下文丢失限制了检索的准确性。论文的关键解决方案是提出 PropRAG 框架,该框架通过利用富含语境的命题(propositions)以及一种新颖的基于命题路径的束搜索算法,显式地发现多步推理链。PropRAG 的在线检索过程完全避免调用生成式大语言模型(Generative LLMs),而是依赖高效的图遍历和预计算嵌入,从而消除了在线 LLM 推理的成本及证据收集过程中潜在的一致性问题。此外,PropRAG 在离线阶段利用 LLM 提取高质量命题,并在检索后用于答案生成。通过引入更丰富的表示方式和明确的 LLM-免费在线路径查找,PropRAG 在 PopQA、2Wiki、HotpotQA 和 MuSiQue 等任务上实现了最先进的零样本 Recall@5 结果和顶级 F1 分数,显著提升了非参数连续学习的能力。

链接: https://arxiv.org/abs/2504.18070
作者: Jingjin Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code and data to be released at: this https URL

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) has become the standard non-parametric approach for equipping Large Language Models (LLMs) with up-to-date knowledge and mitigating catastrophic forgetting common in continual learning. However, standard RAG, relying on independent passage retrieval, fails to capture the interconnected nature of human memory crucial for complex reasoning (associativity) and contextual understanding (sense-making). While structured RAG methods like HippoRAG utilize knowledge graphs (KGs) built from triples, the inherent context loss limits fidelity. We introduce PropRAG, a framework leveraging contextually rich propositions and a novel beam search algorithm over proposition paths to explicitly discover multi-step reasoning chains. Crucially, PropRAG’s online retrieval process operates entirely without invoking generative LLMs, relying instead on efficient graph traversal and pre-computed embeddings. This avoids online LLM inference costs and potential inconsistencies during evidence gathering. LLMs are used effectively offline for high-quality proposition extraction and post-retrieval for answer generation. PropRAG achieves state-of-the-art zero-shot Recall@5 results on PopQA (55.3%), 2Wiki (93.7%), HotpotQA (97.0%), and MuSiQue (77.3%), alongside top F1 scores (e.g., 52.4% on MuSiQue). By improving evidence retrieval through richer representation and explicit, LLM-free online path finding, PropRAG advances non-parametric continual learning.
zh
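PropRAG 的在线检索是在命题图上做 beam search、不调用生成式 LLM。其骨架可如下示意(这里用预计算的边得分累加代替嵌入相似度,图结构、得分与剪枝策略均为本文假设,并非论文的实际实现):

```python
# 示意:命题图上的 beam search,搜索得分最高的多步推理链(假设性骨架)。

def beam_search_paths(graph, scores, start, depth, beam_width):
    """graph: {节点: [邻居]};scores: {(u, v): 边得分};返回 (路径, 累计得分) 列表。"""
    beams = [([start], 0.0)]
    for _ in range(depth):
        candidates = []
        for path, score in beams:
            for nxt in graph.get(path[-1], []):
                if nxt in path:
                    continue  # 避免环路
                candidates.append((path + [nxt],
                                   score + scores.get((path[-1], nxt), 0.0)))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_width]  # 只保留前 beam_width 条
    return beams
```

实际系统中边得分通常来自预计算的命题嵌入相似度,因而整个检索阶段无需在线 LLM 推理。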

[NLP-29] Exploring Personality-Aware Interactions in Salesperson Dialogue Agents

【速读】: 本文旨在解决对话代理在销售领域的应用中,如何有效理解和适应具有不同人格特质用户的交互问题。研究通过引入基于迈尔斯-布里格斯性格分类法(Myers-Briggs Type Indicator, MBTI)定义的用户人格类型,评估预训练对话代理在交互质量、任务完成率及对话自然性方面的表现,并探索其适应性和个性化能力。关键在于开发了基于人格定义的用户模拟器,这些模拟器能够不受领域限制,为构建更适应性强、以用户为中心的销售领域对话系统提供实用洞见,并推动个性化对话系统的广泛应用潜力。

链接: https://arxiv.org/abs/2504.18058
作者: Sijia Cheng,Wen-Yu Chang,Yun-Nung Chen
机构: National Taiwan University (台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by IWSDS 2025

点击查看摘要

Abstract:The integration of dialogue agents into the sales domain requires a deep understanding of how these systems interact with users possessing diverse personas. This study explores the influence of user personas, defined using the Myers-Briggs Type Indicator (MBTI), on the interaction quality and performance of sales-oriented dialogue agents. Through large-scale testing and analysis, we assess the pre-trained agent’s effectiveness, adaptability, and personalization capabilities across a wide range of MBTI-defined user types. Our findings reveal significant patterns in interaction dynamics, task completion rates, and dialogue naturalness, underscoring the future potential for dialogue agents to refine their strategies to better align with varying personality traits. This work not only provides actionable insights for building more adaptive and user-centric conversational systems in the sales domain but also contributes broadly to the field by releasing persona-defined user simulators. These simulators, unconstrained by domain, offer valuable tools for future research and demonstrate the potential for scaling personalized dialogue systems across diverse applications.
zh

[NLP-30] DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models NAACL2025

【速读】: 本文旨在解决多模态大语言模型(MLLMs)因视觉和文本数据集成而引入的新维度潜在攻击和复杂风险组合所带来的独特安全挑战。论文的关键在于通过逐步推理对多模态输入中的风险进行系统性解耦,以显著提高MLLMs的风险意识。为此,作者提出了DREAM方法,它通过有监督微调和基于AI反馈的迭代强化学习(RLAIF)增强MLLMs的安全对齐。实验结果表明,与GPT-4V相比,DREAM在推理和训练阶段均显著提升了安全性,同时保持正常任务性能不变(即不过度安全),SIUO安全-有效分数提高了16.17%。

链接: https://arxiv.org/abs/2504.18053
作者: Jianyu Liu,Hangyu Guo,Ranjie Duan,Xingyuan Bu,Yancheng He,Shilong Li,Hui Huang,Jiaheng Liu,Yucheng Wang,Chenchen Jing,Xingwei Qu,Xiao Zhang,Yingshui Tan,Yanan Wu,Jihao Gu,Yangguang Li,Jianke Zhu
机构: Alibaba Group (阿里巴巴集团); Zhejiang University (浙江大学); M-A-P; The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: [NAACL 2025] The first four authors contribute equally, 23 pages, repo at this https URL

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, thereby introducing new dimensions of potential attacks and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. Via leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (namely oversafety), achieving a 16.17% improvement in the SIUO safe&effective score compared to GPT-4V. The data and code are available at this https URL.
zh

[NLP-31] RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models NAACL2025

【速读】: 该论文试图解决在Retrieval-Augmented Generation (RAG)框架下,大型语言模型 (Large Language Models, LLMs) 安全性研究不足的问题。论文的关键在于通过详细比较RAG与非RAG框架下的模型行为,揭示RAG可能导致模型安全性下降,并分析其根本原因,包括安全模型与安全文档组合仍可能产生不安全输出的现象。此外,论文评估了现有红队测试方法在RAG场景中的有效性,发现其效果不如在非RAG场景中显著。最终,研究强调了针对RAG LLMs专门设计安全研究和红队测试方法的必要性。

链接: https://arxiv.org/abs/2504.18041
作者: Bang An,Shiyue Zhang,Mark Dredze
机构: University of Maryland (马里兰大学); Bloomberg AI (彭博人工智能); Johns Hopkins University (约翰斯·霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NAACL 2025

点击查看摘要

Abstract:Efforts to ensure the safety of large language models (LLMs) include safety fine-tuning, evaluation, and red teaming. However, despite the widespread use of the Retrieval-Augmented Generation (RAG) framework, AI safety work focuses on standard LLMs, which means we know little about how RAG use cases change a model’s safety profile. We conduct a detailed comparative analysis of RAG and non-RAG frameworks with eleven LLMs. We find that RAG can make models less safe and change their safety profile. We explore the causes of this change and find that even combinations of safe models with safe documents can cause unsafe generations. In addition, we evaluate some existing red teaming methods for RAG settings and show that they are less effective than when used for non-RAG settings. Our work highlights the need for safety research and red-teaming methods specifically tailored for RAG LLMs.
zh

[NLP-32] SMARTFinRAG : Interactive Modularized Financial RAG Benchmark

【速读】: 该论文致力于解决金融领域专用Retrieval-Augmented Generation (RAG) 系统评估的三个关键挑战:(1) 提出了一种完全模块化的架构,支持运行时动态组件替换;(2) 设计了一种以文档为中心的评估范式,从新摄入的金融文档中生成特定领域的问答对;(3) 开发了一个直观的界面以弥合研究与实现之间的鸿沟。其解决方案的关键在于通过模块化架构提升灵活性,结合文档生成的领域特定评估方法增强针对性,并借助用户友好的接口促进实际应用部署。平台的开源特性进一步推动了透明且可复现的研究,同时应对金融机构在RAG系统部署中的实际难题。

链接: https://arxiv.org/abs/2504.18024
作者: Yiwei Zha
机构: Khoury College of Computer Science (Khoury 计算机科学学院), Northeastern University (东北大学)
类目: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: For open source github repo, see this https URL

点击查看摘要

Abstract:Financial sectors are rapidly adopting language model technologies, yet evaluating specialized RAG systems in this domain remains challenging. This paper introduces SMARTFinRAG, addressing three critical gaps in financial RAG assessment: (1) a fully modular architecture where components can be dynamically interchanged during runtime; (2) a document-centric evaluation paradigm generating domain-specific QA pairs from newly ingested financial documents; and (3) an intuitive interface bridging research-implementation divides. Our evaluation quantifies both retrieval efficacy and response quality, revealing significant performance variations across configurations. The platform’s open-source architecture supports transparent, reproducible research while addressing practical deployment challenges faced by financial institutions implementing RAG systems.
zh
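SMARTFinRAG 强调的“运行时动态组件替换”可以用注册表模式来理解:组件按 (类型, 名称) 注册,装配管线时按名称构建即可在运行时换用不同实现。下面是一个最小示意(组件名与接口均为本文假设,并非该平台的实际 API):

```python
# 示意:可在运行时按名称替换组件的注册表(假设性最小实现)。

class Registry:
    def __init__(self):
        self._components = {}

    def register(self, kind, name, factory):
        """按 (组件类型, 名称) 登记一个工厂函数。"""
        self._components[(kind, name)] = factory

    def build(self, kind, name, **kwargs):
        """按名称构建组件实例;换一个 name 即完成组件替换。"""
        return self._components[(kind, name)](**kwargs)

registry = Registry()
registry.register("retriever", "bm25", lambda **kw: ("bm25-retriever", kw))
registry.register("retriever", "dense", lambda **kw: ("dense-retriever", kw))
```

真实系统中工厂返回的会是实现统一检索接口的对象,评测脚本只需改配置中的组件名即可对比不同配置。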

[NLP-33] Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation

【速读】: 该论文试图解决多模态机器翻译(MMT)中预训练编码器和解码器的作用及其影响机制这一问题。论文的关键在于系统性地研究不同预训练策略(从从头训练到使用部分冻结的预训练组件)对多模态翻译模型性能的影响,并揭示预训练在多模态设置下的关键但不对称的作用:预训练解码器能够显著提升输出的流畅性和准确性,而预训练编码器的效果则依赖于视觉-文本对齐的质量。此外,论文还探讨了模态融合与预训练组件之间的相互作用,为未来多模态翻译系统的架构设计提供了指导。

链接: https://arxiv.org/abs/2504.18012
作者: Zhuang Yu,Shiliang Sun,Jing Zhao,Tengfei Song,Hao Yang
机构: Department of Automation, Shanghai Jiao Tong University (上海交通大学自动化系); School of Computer Science and Technology, East China Normal University (华东师范大学计算机科学与技术学院); 2012 Labs, Huawei Technologies CO., LTD (华为技术有限公司2012实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Machine Translation (MMT) aims to improve translation quality by leveraging auxiliary modalities such as images alongside textual input. While recent advances in large-scale pre-trained language and vision models have significantly benefited unimodal natural language processing tasks, their effectiveness and role in MMT remain underexplored. In this work, we conduct a systematic study on the impact of pre-trained encoders and decoders in multimodal translation models. Specifically, we analyze how different training strategies, from training from scratch to using pre-trained and partially frozen components, affect translation performance under a unified MMT framework. Experiments are carried out on the Multi30K and CoMMuTE dataset across English-German and English-French translation tasks. Our results reveal that pre-training plays a crucial yet asymmetrical role in multimodal settings: pre-trained decoders consistently yield more fluent and accurate outputs, while pre-trained encoders show varied effects depending on the quality of visual-text alignment. Furthermore, we provide insights into the interplay between modality fusion and pre-trained components, offering guidance for future architecture design in multimodal translation systems.
zh

[NLP-34] Improving LLM Personas via Rationalization with Psychological Scaffolds

【速读】: 该论文试图解决现有基于用户人口统计属性和/或先前判断构建persona的方法未能捕捉用户判断背后深层推理的问题。论文提出了一种名为PBJ(心理学行为与判断)的新框架,其关键是通过引入由大语言模型(LLMs)生成且基于心理支架(psychological scaffolds)的rationales来增强persona的构建,这些心理支架基于人格特质理论(如Big 5人格特征)和基本世界观理论(Primal World Beliefs)。这些rationales旨在从用户的经历、性格特征或信念角度解释其行为和判断。实验结果表明,结合PBJ rationales的LLM persona在公共意见和电影偏好预测任务中优于仅依赖人口统计数据和判断的方法,并且使用描述用户信念的心理支架构建的persona与人工编写rationales的方法表现相当。

链接: https://arxiv.org/abs/2504.17993
作者: Brihi Joshi,Xiang Ren,Swabha Swayamdipta,Rik Koncel-Kedziorski,Tim Paek
机构: University of Southern California (南加州大学); Apple (苹果)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language models prompted with a user description or persona can predict a user’s preferences and opinions, but existing approaches to building personas – based solely on a user’s demographic attributes and/or prior judgments – fail to capture the underlying reasoning behind said user judgments. We introduce PBJ (Psychology of Behavior and Judgments), a framework that improves LLM personas by incorporating rationales of why a user might make specific judgments. These rationales are LLM-generated, and aim to reason about a user’s behavior on the basis of their experiences, personality traits or beliefs. This is done using psychological scaffolds – structured frameworks grounded in theories such as the Big 5 Personality Traits and Primal World Beliefs – that help provide structure to the generated rationales. Experiments on public opinion and movie preference prediction tasks demonstrate that LLM personas augmented with PBJ rationales consistently outperform methods using only a user’s demographics and/or judgments. Additionally, LLM personas constructed using scaffolds describing user beliefs perform competitively with those using human-written rationales.
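为直观理解上文"在 persona 中注入心理支架 rationale"的思路,下面给出一个极简的提示拼装示意(模板与字段名均为笔者假设,并非论文官方实现):

```python
def build_pbj_prompt(demographics, judgments, rationales):
    """PBJ 思路的示意:在人口统计与历史判断之外,
    把基于心理支架(如 Big 5)生成的 rationale 一并注入 persona 提示。
    模板为假设,仅用于说明信息如何组合。"""
    lines = [f"用户画像:{demographics}"]
    lines += [f"- 历史判断:{j}" for j in judgments]
    lines += [f"- 心理支架 rationale:{r}" for r in rationales]
    lines.append("请以该用户的视角回答下列问题。")
    return "\n".join(lines)

prompt = build_pbj_prompt(
    "35岁,教师",
    ["给《星际穿越》打了 5 星"],
    ["开放性较高,偏好探索新颖题材(Big 5)"],
)
print(prompt)
```

拼装后的提示可直接作为 LLM 的系统消息,使模型在预测偏好时能参考判断背后的推理,而不仅是人口统计属性。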
zh

[NLP-35] Optimism, Expectation or Sarcasm? Multi-Class Hope Speech Detection in Spanish and English

【速读】: 该论文试图解决自然语言处理系统在检测希望(Hope)这一复杂情感状态时面临的挑战,特别是希望的不同亚类型(Generalized、Realistic、Unrealistic和Sarcastic)难以被准确识别的问题。论文的关键解决方案在于引入了PolyHope V2数据集,这是一个包含超过30,000条经过标注的英语和西班牙语推文的多语言、细粒度希望表达数据集,并通过显式标注讽刺(Sarcastic)实例来增强现有数据集。此外,论文通过微调预训练Transformer模型与基于提示的大语言模型(LLMs)如GPT-4和Llama 3进行对比实验,在零样本和少量样本设置下评估性能,发现微调后的Transformer模型在区分细微的希望类别和讽刺方面表现更优。

链接: https://arxiv.org/abs/2504.17974
作者: Sabur Butt,Fazlourrahman Balouchzahi,Ahmad Imam Amjad,Maaz Amjad,Hector G. Ceballos,Salud Maria Jimenez-Zafra
机构: tec.mx (蒙特雷科技大学); cic.ipn.mx (墨西哥国立理工学院计算研究中心); ttu.edu (德克萨斯理工大学); ujaen.es (哈恩大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hope is a complex and underexplored emotional state that plays a significant role in education, mental health, and social interaction. Unlike basic emotions, hope manifests in nuanced forms ranging from grounded optimism to exaggerated wishfulness or sarcasm, making it difficult for Natural Language Processing systems to detect accurately. This study introduces PolyHope V2, a multilingual, fine-grained hope speech dataset comprising over 30,000 annotated tweets in English and Spanish. This resource distinguishes between four hope subtypes Generalized, Realistic, Unrealistic, and Sarcastic and enhances existing datasets by explicitly labeling sarcastic instances. We benchmark multiple pretrained transformer models and compare them with large language models (LLMs) such as GPT 4 and Llama 3 under zero-shot and few-shot regimes. Our findings show that fine-tuned transformers outperform prompt-based LLMs, especially in distinguishing nuanced hope categories and sarcasm. Through qualitative analysis and confusion matrices, we highlight systematic challenges in separating closely related hope subtypes. The dataset and results provide a robust foundation for future emotion recognition tasks that demand greater semantic and contextual sensitivity across languages.
zh

[NLP-36] Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

【速读】: 该论文旨在研究大型语言模型(LLMs)如何自适应地协作以执行复杂的具身推理任务。论文提出了解决方案的关键在于提升多智能体协作中的自然语言通信效率,尤其是在具身场景下的协作能力。为此,作者开发了MINDcraft平台和MineCollab基准测试集,并通过实验发现当前最先进的智能体在协作过程中面临的主要瓶颈是高效的自然语言交流,当需要详细规划任务完成步骤时,智能体的表现会下降多达15%。论文结论指出现有的LLM智能体在多智能体协作方面优化不足,特别是具身场景中,强调需要采用超越上下文学习和模仿学习的方法。

链接: https://arxiv.org/abs/2504.17950
作者: Isadora White,Kolby Nottingham,Ayush Maniar,Max Robinson,Hansen Lillemark,Mehul Maheshwari,Lianhui Qin,Prithviraj Ammanabrolu
机构: University of California, San Diego (加州大学圣地亚哥分校); Latitude Games (纬度游戏); Emergent Garden (新兴花园)
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
备注: 9 pages of main paper with 6 main figures, overall 28 pages

点击查看摘要

Abstract:Collaboration is ubiquitous and essential in day-to-day life – from exchanging ideas, to delegating tasks, to generating plans together. This work studies how LLMs can adaptively collaborate to perform complex embodied reasoning tasks. To this end we introduce MINDcraft, an easily extensible platform built to enable LLM agents to control characters in the open-world game of Minecraft; and MineCollab, a benchmark to test the different dimensions of embodied and collaborative reasoning. An experimental study finds that the primary bottleneck in collaborating effectively for current state-of-the-art agents is efficient natural language communication, with agent performance dropping as much as 15% when they are required to communicate detailed task completion plans. We conclude that existing LLM agents are ill-optimized for multi-agent collaboration, especially in embodied scenarios, and highlight the need to employ methods beyond in-context and imitation learning. Our website can be found here: this https URL
zh

[NLP-37] Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents

【速读】: 该论文旨在解决大型语言模型(LLMs)驱动的图形用户界面(GUI)代理在处理敏感数据时面临的隐私和安全风险问题。尽管GUI代理通过提升自动化能力带来了显著的技术进步,但其在有限人工监管下操作敏感信息的能力也引发了重要的隐私和安全隐忧。论文识别出GUI代理的三个关键风险,并对比了其与传统GUI自动化及通用自主代理的区别。然而,现有评估主要集中于性能指标,而对隐私和安全性的评估尚显不足。此外,论文回顾了当前GUI和通用LLM代理的评估指标,并提出了将人类评估者融入GUI代理评估的五个关键挑战。为填补这些空白,论文倡导一种以用户为中心的评估框架,该框架强调风险评估、通过上下文同意增强用户意识,并将隐私和安全考量嵌入到GUI代理的设计与评估过程中。关键在于提出一个综合考虑隐私、安全和用户体验的全新评估方法论。

链接: https://arxiv.org/abs/2504.17934
作者: Chaoran Chen,Zhiping Zhang,Ibrahim Khalilov,Bingcan Guo,Simret A Gebreegziabher,Yanfang Ye,Ziang Xiao,Yaxing Yao,Tianshi Li,Toby Jia-Jun Li
机构: University of Notre Dame(圣母大学); Northeastern University(东北大学); Virginia Tech(弗吉尼亚理工学院); University of Washington(华盛顿大学); Johns Hopkins University(约翰斯·霍普金斯大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) has revolutionized Graphical User Interface (GUI) automation through LLM-powered GUI agents, yet their ability to process sensitive data with limited human oversight raises significant privacy and security risks. This position paper identifies three key risks of GUI agents and examines how they differ from traditional GUI automation and general autonomous agents. Despite these risks, existing evaluations focus primarily on performance, leaving privacy and security assessments largely unexplored. We review current evaluation metrics for both GUI and general LLM agents and outline five key challenges in integrating human evaluators for GUI agent assessments. To address these gaps, we advocate for a human-centered evaluation framework that incorporates risk assessments, enhances user awareness through in-context consent, and embeds privacy and security considerations into GUI agent design and evaluation.
zh

[NLP-38] CAMU: Context Augmentation for Meme Understanding ACM-MM2025

【速读】: 该论文旨在解决社交平台中仇恨言论检测在 meme 图像(包含视觉与文本信息)这一复杂领域中的挑战,特别是如何有效结合视觉与语言线索以识别文化语境下的仇恨内容。论文提出了一种名为 CAMU 的新框架,其关键在于利用大型视觉-语言模型生成更详细的描述性字幕,并通过字幕评分的神经网络突出与仇恨相关的线索,同时采用参数高效的 CLIP 文本编码器的微调方法,提升多模态 meme 理解能力。实验表明,选择性地调整深层文本编码器层显著提升了所有评估指标的表现,且 CAMU 在 Hateful Memes 数据集上的准确率(0.807)和 F1 值(0.806)与当前最先进的方法相当,但效率更高;此外,CAMU 在 MultiOFF 数据集中也展示了其通用性,取得了最佳的 F1 分数(0.673)。论文进一步指出,稳健的视觉定位能力和精细的文本表征对于可靠的仇恨及冒犯性内容检测至关重要。

链接: https://arxiv.org/abs/2504.17902
作者: Girish A. Koushik,Diptesh Kanojia,Helen Treharne,Aditya Joshi
机构: NICE Research, University of Surrey (萨里大学), Guildford (吉尔福德), United Kingdom; Surrey Centre for Cyber Security (萨里网络安全中心), University of Surrey (萨里大学), Guildford (吉尔福德), United Kingdom; University of New South Wales (新南威尔士大学), Sydney (悉尼), Australia
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Under review at ACM MM 2025

点击查看摘要

Abstract:Social media memes are a challenging domain for hate detection because they intertwine visual and textual cues into culturally nuanced messages. We introduce a novel framework, CAMU, which leverages large vision-language models to generate more descriptive captions, a caption-scoring neural network to emphasise hate-relevant content, and parameter-efficient fine-tuning of CLIP’s text encoder for an improved multimodal understanding of memes. Experiments on publicly available hateful meme datasets show that simple projection layer fine-tuning yields modest gains, whereas selectively tuning deeper text encoder layers significantly boosts performance on all evaluation metrics. Moreover, our approach attains high accuracy (0.807) and F1-score (0.806) on the Hateful Memes dataset, at par with the existing SoTA framework while being much more efficient, offering practical advantages in real-world scenarios that rely on fixed decision thresholds. CAMU also achieves the best F1-score of 0.673 on the MultiOFF dataset for offensive meme identification, demonstrating its generalisability. Additional analyses on benign confounders reveal that robust visual grounding and nuanced text representations are crucial for reliable hate and offence detection. We will publicly release CAMU along with the resultant models for further research. Disclaimer: This paper includes references to potentially disturbing, hateful, or offensive content due to the nature of the task.
zh

[NLP-39] Token Sequence Compression for Efficient Multimodal Computing

【速读】: 该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在跨模态推理能力提升过程中面临的高计算成本问题,特别关注视觉语言模型。论文的关键在于揭示当前视觉编码器中存在的冗余性和低效性,并提出一种针对多模态数据的自适应压缩方法。研究通过基准测试和定性分析,系统地评估了多种视觉标记选择与合并策略,发现简单的聚类级标记聚合方法在标记选择与合并任务中表现优于现有最先进的技术,包括基于视觉编码器级别的合并和注意力机制的方法。此外,论文通过跨模态注意力可视化揭示了视觉标记选择的某些令人困惑的趋势,强调了现有视觉编码器中的冗余性。这项工作首次尝试提高高维数据编码和处理的有效性,为构建更具扩展性和可持续性的多模态系统奠定了基础。

链接: https://arxiv.org/abs/2504.17892
作者: Yasmine Omri,Parth Shroff,Thierry Tambe
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency in current vision encoders, and seek to construct an adaptive compression method for multimodal data. In this work, we characterize a panoply of visual token selection and merging approaches through both benchmarking and qualitative analysis. In particular, we demonstrate that simple cluster-level token aggregation outperforms prior state-of-the-art works in token selection and merging, including merging at the vision encoder level and attention-based approaches. We underline the redundancy in current vision encoders, and shed light on several puzzling trends regarding principles of visual token selection through cross-modal attention visualizations. This work is a first effort towards more effective encoding and processing of high-dimensional data, and paves the way for more scalable and sustainable multimodal systems.
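摘要指出"简单的聚类级 token 聚合"即可超越更复杂的选择/合并方法。下面用 k-means 给出一个假设性的最小实现示意(聚类方式与超参数均为笔者假设,并非论文官方代码):

```python
import numpy as np

def cluster_aggregate_tokens(tokens: np.ndarray, k: int, iters: int = 10, seed: int = 0) -> np.ndarray:
    """将 N 个视觉 token 压缩为 k 个聚类中心(每簇取均值)。tokens: (N, d)。"""
    rng = np.random.default_rng(seed)
    n, _ = tokens.shape
    centers = tokens[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):
        # 按最近中心分配每个 token,再用簇内均值更新中心
        dists = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(0)
    return centers

# 示例:把 196 个 64 维视觉 token 压缩为 16 个
tokens = np.random.default_rng(1).normal(size=(196, 64))
compressed = cluster_aggregate_tokens(tokens, k=16)
print(compressed.shape)  # (16, 64)
```

压缩后的 16 个聚合 token 可替代原 196 个 token 输入语言模型,直接降低注意力计算量。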
zh

[NLP-40] Unsupervised Corpus Poisoning Attacks in Continuous Space for Dense Retrieval SIGIR2025

【速读】: 本文关注于密集信息检索中的语料库投毒攻击问题,其中对手通过向语料库注入少量恶意生成的文档来破坏搜索算法的排序性能。论文解决了当前文献中的两个局限性:首先,现有的对抗梯度驱动的词替换攻击在离散词汇空间中进行,而检索本身发生在连续嵌入空间中。为此,作者提出了一种直接在嵌入空间中操作的优化方法。具体而言,训练一个扰动模型,以保持原始文档嵌入与对抗文档嵌入之间的几何距离,同时最大化原始文档与对抗文档在词级别上的差异。其次,现有相关工作通常假设对手对查询有先验知识,本文聚焦于更具有挑战性的无监督场景,即对手对查询分布一无所知。核心贡献是一种快速且有效的对抗语料库攻击方法。实验结果表明,在域内和域外数据集上,无论是白盒还是黑盒设置下,该方法针对top-1攻击和语料库投毒攻击均表现出色,生成每个目标文档的对抗样本耗时不到两分钟,比文献中最快速的基于梯度的词替换方法快四倍(使用相同硬件)。此外,所提出的对抗生成方法产生的文本更符合自然文本的分布特性(低困惑度),因而更难被检测。

链接: https://arxiv.org/abs/2504.17884
作者: Yongkang Li,Panagiotis Eustratiadis,Simon Lupart,Evangelos Kanoulas
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: This paper has been accepted as a full paper at SIGIR 2025 and will be presented orally

点击查看摘要

Abstract:This paper concerns corpus poisoning attacks in dense information retrieval, where an adversary attempts to compromise the ranking performance of a search algorithm by injecting a small number of maliciously generated documents into the corpus. Our work addresses two limitations in the current literature. First, attacks that perform adversarial gradient-based word substitution search do so in the discrete lexical space, while retrieval itself happens in the continuous embedding space. We thus propose an optimization method that operates in the embedding space directly. Specifically, we train a perturbation model with the objective of maintaining the geometric distance between the original and adversarial document embeddings, while also maximizing the token-level dissimilarity between the original and adversarial documents. Second, it is common for related work to have a strong assumption that the adversary has prior knowledge about the queries. In this paper, we focus on a more challenging variant of the problem where the adversary assumes no prior knowledge about the query distribution (hence, unsupervised). Our core contribution is an adversarial corpus attack that is fast and effective. We present comprehensive experimental results on both in- and out-of-domain datasets, focusing on two related tasks: a top-1 attack and a corpus poisoning attack. We consider attacks under both a white-box and a black-box setting. Notably, our method can generate successful adversarial examples in under two minutes per target document; four times faster compared to the fastest gradient-based word substitution methods in the literature with the same hardware. Furthermore, our adversarial generation method generates text that is more likely to occur under the distribution of natural text (low perplexity), and is therefore more difficult to detect.
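论文在嵌入空间中训练扰动模型,其目标是"保持嵌入几何距离、同时最大化 token 级差异"。下面给出该目标函数的一个极简示意(以 Jaccard 相似度近似 token 级差异,权重 lam 为笔者假设的简化形式):

```python
import numpy as np

def poisoning_objective(e_orig, e_adv, toks_orig, toks_adv, lam=1.0):
    """语料投毒扰动目标的示意(越小越好):
    - 嵌入项:让对抗文档与原文档的嵌入距离尽量小(保持检索排名)
    - 词项:最大化 token 级差异(这里用 1 - Jaccard 相似度近似)"""
    embed_dist = float(np.linalg.norm(e_adv - e_orig) ** 2)
    s_o, s_a = set(toks_orig), set(toks_adv)
    jaccard = len(s_o & s_a) / max(len(s_o | s_a), 1)
    token_dissim = 1.0 - jaccard
    return embed_dist - lam * token_dissim

e = np.ones(4)
# 嵌入几乎不变、词面完全不同 → 目标值低(理想的对抗文档)
good = poisoning_objective(e, e + 0.01, ["a", "b", "c"], ["x", "y", "z"])
# 嵌入漂移大、词面相同 → 目标值高(无效扰动)
bad = poisoning_objective(e, e + 1.0, ["a", "b", "c"], ["a", "b", "c"])
print(good < bad)  # True
```

实际方法中该目标由可学习的扰动模型在连续嵌入空间内用梯度优化,这里仅演示两项之间的权衡方向。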
zh

[NLP-41] Unveiling the Hidden: Movie Genre and User Bias in Spoiler Detection

【速读】: 本文旨在解决电影评论中剧透(spoilers)检测的问题,现有方法主要基于文本分析,但忽视了电影类型(genre)和用户偏好的影响,导致效果受限。为应对这一挑战,论文提出了一种新的框架——GUSD(Genre-aware and User-specific Spoiler Detection),其关键是结合了电影类型的特定数据以及用户行为偏差的建模。具体而言,用户偏差通过动态图模型分析评论历史计算得出,R2GFormer模块融合了RetGAT处理图信息与GenreFormer进行类型特定聚合,而GMoE(Genre-Aware Mixture of Experts)模型则依据类型将评论分配给专门的专家处理。实验表明,GUSD在基准数据集上取得了最先进的性能,通过识别类型和用户特定模式显著提升了电影评论平台的用户体验。

链接: https://arxiv.org/abs/2504.17834
作者: Haokai Zhang,Shengtao Zhang,Zijian Cai,Heng Wang,Ruixuan Zhu,Zinan Zeng,Minnan Luo
机构: Xi’an Jiaotong University (西安交通大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 11 pages, 6 figures, under review

点击查看摘要

Abstract:Spoilers in movie reviews are important on platforms like IMDb and Rotten Tomatoes, offering benefits and drawbacks. They can guide some viewers’ choices but also affect those who prefer no plot details in advance, making effective spoiler detection essential. Existing spoiler detection methods mainly analyze review text, often overlooking the impact of movie genres and user bias, limiting their effectiveness. To address this, we analyze movie review data, finding genre-specific variations in spoiler rates and identifying that certain users are more likely to post spoilers. Based on these findings, we introduce a new spoiler detection framework called GUSD (The code is available at this https URL) (Genre-aware and User-specific Spoiler Detection), which incorporates genre-specific data and user behavior bias. User bias is calculated through dynamic graph modeling of review history. Additionally, the R2GFormer module combines RetGAT (Retentive Graph Attention Network) for graph information and GenreFormer for genre-specific aggregation. The GMoE (Genre-Aware Mixture of Experts) model further assigns reviews to specialized experts based on genre. Extensive testing on benchmark datasets shows that GUSD achieves state-of-the-art results. This approach advances spoiler detection by addressing genre and user-specific patterns, enhancing user experience on movie review platforms.
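GMoE 按电影类型把评论分配给专门的专家。下面是基于类型门控的混合专家前向过程的极简示意(线性专家与 softmax 门控均为笔者假设的简化形式,非论文实现):

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def genre_moe(feature, genre_onehot, experts, gate_w):
    """类型感知 MoE 示意:门控只看电影类型(genre),
    据此对各专家输出加权汇聚。
    experts: 每个专家为线性映射 (d_out, d_in);gate_w: (n_expert, n_genre)。"""
    gate = softmax(gate_w @ genre_onehot)            # 基于类型的专家权重
    outs = np.stack([W @ feature for W in experts])  # 各专家输出 (n_expert, d_out)
    return (gate[:, None] * outs).sum(0)             # 加权汇聚

rng = np.random.default_rng(0)
d_in, d_out, n_expert, n_genre = 8, 4, 3, 5
experts = [rng.normal(size=(d_out, d_in)) for _ in range(n_expert)]
gate_w = rng.normal(size=(n_expert, n_genre))
genre = np.eye(n_genre)[2]  # 假设该影评属于第 3 个类型
y = genre_moe(rng.normal(size=d_in), genre, experts, gate_w)
print(y.shape)  # (4,)
```

同一条评论在不同类型下会被路由到不同的专家组合,从而捕捉摘要中提到的"类型特定的剧透率差异"。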
zh

[NLP-42] VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension

【速读】: 该论文试图解决视频理解能力评估中文化、语言和领域多样性不足的问题。现有视频评估基准大多局限于单一语言(通常是英语)和以西方文化为核心的视频内容,缺乏跨文化交流与多语言支持。为应对这一挑战,论文提出了VideoVista-CulturalLingo,这是一种旨在弥合文化、语言和领域鸿沟的视频评估基准。其关键创新在于:1) 提供涵盖中国、北美及欧洲文化的多样化数据集;2) 使用中文和英语两种广泛使用的语言设计问题;3) 涵盖来自数百个不同人类创造领域的视频内容。通过构建包含1,389段视频和3,134组问答对的数据集,并测试24种近期开源或专有视频大模型的表现,研究发现现有模型在处理与中国历史相关等中文中心问题时表现较差,且在时间理解方面存在局限性,特别是事件定位任务中的表现仅达45.2%。同时,主流模型在科学问题上表现较强,而开源模型在数学问题上的表现较弱。因此,该论文的关键解决方案在于通过构建一个更加全面和多样化的评估基准来推动多模态AI系统在全球化场景下的视频理解能力提升。

链接: https://arxiv.org/abs/2504.17821
作者: Xinyu Chen,Yunxin Li,Haoyuan Shi,Baotian Hu,Wenhan Luo,Yaowei Wang,Min Zhang
机构: Harbin Institute of Technology (哈尔滨工业大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. In this paper, we present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divide in video comprehension. Our work differs from existing benchmarks in the following ways: 1) Cultural diversity, incorporating cultures from China, North America, and Europe; 2) Multi-linguistics, with questions presented in Chinese and English-two of the most widely spoken languages; and 3) Broad domain, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source or proprietary video large models. From the experiment results, we observe that: 1) Existing models perform worse on Chinese-centric questions than Western-centric ones, particularly those related to Chinese history; 2) Current open-source models still exhibit limitations in temporal understanding, especially in the Event Localization task, achieving a maximum score of only 45.2%; 3) Mainstream models demonstrate strong performance in general scientific questions, while open-source models demonstrate weak performance in mathematics.
zh

[NLP-43] Kimi-Audio Technical Report

【速读】: 该论文旨在构建一个在音频理解、生成和对话方面表现卓越的开源基础模型Kimi-Audio。为实现这一目标,论文的关键解决方案在于:首先设计了一种基于大规模预训练语言模型(LLM)的新架构,采用连续特征作为输入、离散标记作为输出,并开发了一种基于流匹配的分块流式解码器;其次,构建了一个包含超过1300万小时跨多种模态(语音、声音和音乐)数据的预训练数据集,并通过精心设计的任务在音频和文本数据上进行持续预训练与微调;最后,提出了一种从预训练到多样化下游任务适配的有效方法。这些创新点共同推动了Kimi-Audio在语音识别、音频理解、问答及对话等领域的性能达到当前最优水平。

链接: https://arxiv.org/abs/2504.18425
作者: KimiTeam,Ding Ding,Zeqian Ju,Yichong Leng,Songxiang Liu,Tong Liu,Zeyu Shang,Kai Shen,Wei Song,Xu Tan,Heyi Tang,Zhengtao Wang,Chu Wei,Yifei Xin,Xinran Xu,Jianwei Yu,Yutao Zhang,Xinyu Zhou,Y. Charles,Jun Chen,Yanru Chen,Yulun Du,Weiran He,Zhenxing Hu,Guokun Lai,Qingcheng Li,Yangyang Liu,Weidong Sun,Jianzhou Wang,Yuzhi Wang,Yuefeng Wu,Yuxin Wu,Dongchao Yang,Hao Yang,Ying Yang,Zhilin Yang,Aoxiong Yin,Ruibin Yuan,Yutong Zhang,Zaida Zhou
机构: Kimi Team
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
备注:

点击查看摘要

Abstract:We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continual pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the codes, model checkpoints, as well as the evaluation toolkits in this https URL.
zh

计算机视觉

[CV-0] Augmenting Perceptual Super-Resolution via Image Quality Predictors

【速读】:该论文旨在解决传统超分辨率(Super-resolution, SR)任务中生成结果偏向于像素级误差最小化而导致的非感知性失真问题,提出了一种更关注人类视觉感知质量的替代方案。论文的关键在于利用强大的无参考图像质量评估(Non-Reference Image Quality Assessment, NR-IQA)模型来引导超分辨率学习。具体而言,作者首先分析了NR-IQA指标在人工生成的超分辨率数据上的表现,评估其准确性与互补性;随后探索了两种应用NR-IQA模型的方法:一是通过改进现有的多Ground Truth框架调整数据采样策略;二是直接优化可微的质量评分。这种方案的核心在于实现感知保真度与人类调优的NR-IQA度量之间的平衡,从而获得更符合人类感知需求的超分辨率结果。

链接: https://arxiv.org/abs/2504.18524
作者: Fengjia Zhang,Samrudhdhi B. Rangrej,Tristan Aumentado-Armstrong,Afsaneh Fazly,Alex Levinshtein
机构: AI Center – Toronto, Samsung Electronics (三星电子 AI 中心 – 多伦多)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Super-resolution (SR), a classical inverse problem in computer vision, is inherently ill-posed, inducing a distribution of plausible solutions for every input. However, the desired result is not simply the expectation of this distribution, which is the blurry image obtained by minimizing pixelwise error, but rather the sample with the highest image quality. A variety of techniques, from perceptual metrics to adversarial losses, are employed to this end. In this work, we explore an alternative: utilizing powerful non-reference image quality assessment (NR-IQA) models in the SR context. We begin with a comprehensive analysis of NR-IQA metrics on human-derived SR data, identifying both the accuracy (human alignment) and complementarity of different metrics. Then, we explore two methods of applying NR-IQA models to SR learning: (i) altering data sampling, by building on an existing multi-ground-truth SR framework, and (ii) directly optimizing a differentiable quality score. Our results demonstrate a more human-centric perception-distortion tradeoff, focusing less on non-perceptual pixel-wise distortion, instead improving the balance between perceptual fidelity and human-tuned NR-IQA measures.
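论文方法之一是在多 Ground-Truth 框架上按 NR-IQA 分数调整数据采样。下面是一个示意:按(假设的)质量分数对多个候选目标做 softmax 加权,得到偏向高感知质量的训练目标(加权方式与温度参数均为笔者假设):

```python
import numpy as np

def iqa_weighted_target(candidates, iqa_scores, temperature=0.1):
    """多 Ground-Truth 采样思路的示意:按 NR-IQA 分数
    对若干候选高分辨率目标做 softmax 加权。
    candidates: (k, H, W);iqa_scores: (k,),分数越高质量越好。"""
    z = (np.asarray(iqa_scores) - max(iqa_scores)) / temperature
    w = np.exp(z)
    w /= w.sum()
    return np.tensordot(w, candidates, axes=1)

# 两个候选:全 0 图(低质量分)与全 1 图(高质量分)
cands = np.stack([np.full((2, 2), v) for v in (0.0, 1.0)])
target = iqa_weighted_target(cands, iqa_scores=[0.2, 0.9])
print(target[0, 0] > 0.5)  # 目标偏向高分候选 → True
```

降低 temperature 会使加权趋近于"只取质量最高的候选",对应更激进的感知导向采样。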
zh

[CV-1] E-VLC: A Real-World Dataset for Event-based Visible Light Communication And Localization CVPR

【速读】:该论文旨在解决基于事件相机的LED调制光通信在实际场景中解码与定位的缺乏公开数据集的问题,并提出了一种新颖的定位方法。论文的关键在于构建了一个包含事件相机、帧相机及精确同步真值姿态的首个公开数据集,该数据集涵盖了室内和室外多种光照条件下的丰富场景运动样本。此外,论文提出利用对比度最大化框架进行运动估计与补偿的新方法,以提升基于LED标记的定位性能。实验结果表明,该方法在定位精度方面优于传统基于AR标记的方法,为事件相机在移动设备上的广泛应用奠定了基础。

链接: https://arxiv.org/abs/2504.18521
作者: Shintaro Shiba,Quan Kong,Norimasa Kobori
机构: Woven by Toyota
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Signal Processing (eess.SP)
备注: 10 pages, 9 figures, 5 tables, CVPRW on EventVision 2025

点击查看摘要

Abstract:Optical communication using modulated LEDs (e.g., visible light communication) is an emerging application for event cameras, thanks to their high spatio-temporal resolutions. Event cameras can be used simply to decode the LED signals and also to localize the camera relative to the LED marker positions. However, there is no public dataset to benchmark the decoding and localization in various real-world settings. We present, to the best of our knowledge, the first public dataset that consists of an event camera, a frame camera, and ground-truth poses that are precisely synchronized with hardware triggers. It provides various camera motions with various sensitivities in different scene brightness settings, both indoor and outdoor. Furthermore, we propose a novel method of localization that leverages the Contrast Maximization framework for motion estimation and compensation. The detailed analysis and experimental results demonstrate the advantages of LED-based localization with events over the conventional AR-marker–based one with frames, as well as the efficacy of the proposed method in localization. We hope that the proposed dataset serves as a future benchmark for both motion-related classical computer vision tasks and LED marker decoding tasks simultaneously, paving the way to broadening applications of event cameras on mobile devices. this https URL
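论文提出的定位方法依赖对比度最大化(Contrast Maximization)框架:按候选运动参数反向补偿事件,补偿正确时事件重新聚焦、累积图像的对比度最高。下面是该思想的一个玩具级示意(合成数据,非论文实现):

```python
import numpy as np

def contrast(events, v, size=32):
    """按候选速度 v 反向补偿事件并累积成图,返回图像方差(对比度)。
    events: (N, 3),列为 (x, y, t);此处忽略事件极性。"""
    x = events[:, 0] - v[0] * events[:, 2]
    y = events[:, 1] - v[1] * events[:, 2]
    img = np.zeros((size, size))
    xi = np.clip(np.round(x).astype(int), 0, size - 1)
    yi = np.clip(np.round(y).astype(int), 0, size - 1)
    np.add.at(img, (yi, xi), 1.0)  # 逐事件累积
    return img.var()

# 合成数据:一个以速度 (5, 0) 移动的点反复触发事件
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 500)
ev = np.stack([10 + 5 * t, 16 + 0 * t, t], axis=1)
# 用正确速度补偿后事件重新聚焦,对比度高于不补偿
print(contrast(ev, (5.0, 0.0)) > contrast(ev, (0.0, 0.0)))  # True
```

实际方法在此基础上对运动参数做优化搜索,取对比度最大的参数作为运动估计,再用于 LED 标记定位。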
zh

[CV-2] Examining the Impact of Optical Aberrations to Image Classification and Object Detection Models

【速读】:该论文试图解决深度神经网络(DNNs)在面对真实光学模糊(optical blur)影响时的鲁棒性评估问题。现有基准测试通常以过于简化的形式近似模糊,如仅考虑离焦(defocus),而忽略了由光学系统产生的不同模糊核形状。为研究模型在现实光学模糊条件下的鲁棒性,论文的关键解决方案是提出了两个新的模糊腐败数据集——OpticsBench 和 LensCorruptions。OpticsBench 针对初级像差(如彗差、离焦和散光)进行建模,这些像差可以通过调整泽尼克多项式(Zernike polynomials)的单个参数来表示;而 LensCorruptions 则通过采样泽尼克多项式向量空间中的线性组合,模拟了100种实际镜头的效果。这种设计超越了传统基于理论但合成的初级像差设置,从而更真实地反映实际场景中的模糊情况。

链接: https://arxiv.org/abs/2504.18510
作者: Patrick Müller,Alexander Braun,Margret Keuper
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: v1.0

点击查看摘要

Abstract:Deep neural networks (DNNs) have proven to be successful in various computer vision applications such that models even infer in safety-critical situations. Therefore, vision models have to behave in a robust way to disturbances such as noise or blur. While seminal benchmarks exist to evaluate model robustness to diverse corruptions, blur is often approximated in an overly simplistic way to model defocus, while ignoring the different blur kernel shapes that result from optical systems. To study model robustness against realistic optical blur effects, this paper proposes two datasets of blur corruptions, which we denote OpticsBench and LensCorruptions. OpticsBench examines primary aberrations such as coma, defocus, and astigmatism, i.e. aberrations that can be represented by varying a single parameter of Zernike polynomials. To go beyond the principled but synthetic setting of primary aberrations, LensCorruptions samples linear combinations in the vector space spanned by Zernike polynomials, corresponding to 100 real lenses. Evaluations for image classification and object detection on ImageNet and MSCOCO show that for a variety of different pre-trained models, the performance on OpticsBench and LensCorruptions varies significantly, indicating the need to consider realistic image corruptions to evaluate a model’s robustness against blur.
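OpticsBench 通过改变泽尼克多项式的单个系数来刻画初级像差。下面示意如何由离焦项生成一个点扩散函数(PSF)作为模糊核(光瞳采样与系数取值均为笔者假设的简化设定):

```python
import numpy as np

def defocus_psf(size=64, pupil_radius=0.9, w_defocus=1.0):
    """由离焦泽尼克项 Z_2^0 = sqrt(3)(2r^2 - 1) 生成 PSF 的示意:
    在圆形光瞳上施加相位畸变,PSF = |FFT(光瞳函数)|^2。"""
    y, x = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
    r = np.hypot(x, y)
    pupil = (r <= pupil_radius).astype(float)
    phase = w_defocus * np.sqrt(3) * (2 * r**2 - 1)  # 离焦像差(单位:波长)
    field = pupil * np.exp(1j * 2 * np.pi * phase)
    psf = np.abs(np.fft.fftshift(np.fft.fft2(field))) ** 2
    return psf / psf.sum()  # 归一化为能量守恒的模糊核

psf = defocus_psf(w_defocus=0.5)
print(psf.shape)  # (64, 64)
```

将生成的 PSF 与图像做卷积即可得到对应像差下的受扰图像;改用彗差或散光的泽尼克项,即可生成不同形状的模糊核。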
zh

[CV-3] Eval3D: Interpretable and Fine-grained Evaluation for 3D Generation CVPR2025

【速读】:该论文试图解决的问题是如何有效地评估生成式 3D 数据的质量,尤其是在视觉吸引力、几何一致性和语义一致性方面的表现。当前的 3D 评估指标往往忽视了几何质量或仅依赖于黑箱多模态大语言模型进行粗略评估。论文的关键解决方案是提出 Eval3D,这是一种细粒度且可解释的评估工具,能够基于多种不同但互补的标准来忠实地评价生成的 3D 资产质量。其核心观察点在于,许多期望的 3D 生成特性(如语义和几何一致性)可以通过测量不同基础模型和工具之间的一致性来有效捕捉,因此作者利用一组多样化的模型和工具作为探针,从多个方面评估生成 3D 资产的不一致性。相比以往的工作,Eval3D 提供像素级测量,支持精确的空间反馈,并更紧密地与人类判断保持一致。

链接: https://arxiv.org/abs/2504.18509
作者: Shivam Duggal,Yushi Hu,Oscar Michel,Aniruddha Kembhavi,William T. Freeman,Noah A. Smith,Ranjay Krishna,Antonio Torralba,Ali Farhadi,Wei-Chiu Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project page and codes: this https URL

点击查看摘要

Abstract:Despite the unprecedented progress in the field of 3D generation, current systems still often fail to produce high-quality 3D assets that are visually appealing and geometrically and semantically consistent across multiple viewpoints. To effectively assess the quality of the generated 3D data, there is a need for a reliable 3D evaluation tool. Unfortunately, existing 3D evaluation metrics often overlook the geometric quality of generated assets or merely rely on black-box multimodal large language models for coarse assessment. In this paper, we introduce Eval3D, a fine-grained, interpretable evaluation tool that can faithfully evaluate the quality of generated 3D assets based on various distinct yet complementary criteria. Our key observation is that many desired properties of 3D generation, such as semantic and geometric consistency, can be effectively captured by measuring the consistency among various foundation models and tools. We thus leverage a diverse set of models and tools as probes to evaluate the inconsistency of generated 3D assets across different aspects. Compared to prior work, Eval3D provides pixel-wise measurement, enables accurate 3D spatial feedback, and aligns more closely with human judgments. We comprehensively evaluate existing 3D generation models using Eval3D and highlight the limitations and challenges of current models.
zh

[CV-4] An Improved ResNet50 Model for Predicting Pavement Condition Index (PCI) Directly from Pavement Images

【速读】:该论文旨在解决从路面图像准确预测路面状况指数(Pavement Condition Index, PCI)这一基础设施维护中的关键问题。论文提出了一种改进的ResNet50架构,通过集成卷积块注意力模块(Convolutional Block Attention Module, CBAM),实现了直接从路面图像预测PCI而无需额外标注的目标。解决方案的关键在于利用CBAM自动突出图像中的重要特征,从而显著提升预测精度,相较于原始的ResNet50和DenseNet161模型,其平均绝对百分比误差(Mean Absolute Percentage Error, MAPE)分别降低了12.6%和7.32%。这一结果验证了引入注意力机制以优化特征提取的有效性,为自动化路面分析提供了更精准且高效的手段。

链接: https://arxiv.org/abs/2504.18490
作者: Andrews Danyo,Anthony Dontoh,Armstrong Aboah
机构: TriQuint Semiconductor; XLIM Laboratory, UMR 7252, University of Limoges (XLIM实验室, UMR 7252, 利摩日大学); Department of Electrical, Computer and Energy Engineering, University of Colorado, Boulder, CO, 80309-0425 USA (科罗拉多大学博尔德分校电气、计算机与能源工程系); Qualcomm Inc. (高通公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurately predicting the Pavement Condition Index (PCI), a measure of roadway conditions, from pavement images is crucial for infrastructure maintenance. This study proposes an enhanced version of the Residual Network (ResNet50) architecture, integrated with a Convolutional Block Attention Module (CBAM), to predict PCI directly from pavement images without additional annotations. By incorporating CBAM, the model autonomously prioritizes critical features within the images, improving prediction accuracy. Compared to the original baseline ResNet50 and DenseNet161 architectures, the enhanced ResNet50-CBAM model achieved a significantly lower mean absolute percentage error (MAPE) of 58.16%, compared to the baseline models that achieved 70.76% and 65.48% respectively. These results highlight the potential of using attention mechanisms to refine feature extraction, ultimately enabling more accurate and efficient assessments of pavement conditions. This study emphasizes the importance of targeted feature refinement in advancing automated pavement analysis through attention mechanisms.
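CBAM 的通道注意力是该改进模型的核心组件。下面给出其前向计算的一个 numpy 极简示意(权重随机初始化,仅演示计算流程,并非论文训练后的模型):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_channel_attention(feat, w1, w2):
    """CBAM 通道注意力前向过程的示意:
    对特征图做全局平均/最大池化,经共享 MLP 后相加、过 sigmoid,
    得到各通道权重并重标定特征。
    feat: (C, H, W);w1: (C//r, C);w2: (C, C//r)。"""
    avg = feat.mean(axis=(1, 2))  # 全局平均池化 (C,)
    mx = feat.max(axis=(1, 2))    # 全局最大池化 (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # 共享 MLP(ReLU)
    attn = sigmoid(mlp(avg) + mlp(mx))            # 通道权重 (C,)
    return feat * attn[:, None, None]

rng = np.random.default_rng(0)
C, r = 8, 2
feat = rng.normal(size=(C, 6, 6))
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
out = cbam_channel_attention(feat, w1, w2)
print(out.shape)  # (8, 6, 6)
```

完整的 CBAM 还包含一个空间注意力子模块(沿通道维池化后过卷积),两者级联后插入 ResNet50 的残差块中。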
zh

[CV-5] RGS-DR: Reflective Gaussian Surfels with Deferred Rendering for Shiny Objects

【速读】:本文介绍了一种名为RGS-DR的新颖逆向渲染方法,旨在重建和渲染带有光泽和反射特性的物体,并支持灵活的重照明与场景编辑。现有方法(如NeRF和3D高斯点绘)在处理视点相关效应方面存在困难,而RGS-DR通过采用2D高斯surfel表示法,能够精确估计几何结构与表面法线,这是高质量逆向渲染的关键属性之一。该方法通过可学习的基元显式建模几何与材质特性,并将其光栅化到延迟着色管道中,有效减少了渲染伪影并保留了锐利的反射效果。利用多级立方体 mipmap,RGS-DR能够准确近似环境光照积分,从而实现高质量的重建与重照明。此外,基于球面mipmap的方向编码残差通道进一步优化了外观建模。实验表明,RGS-DR在重建和渲染闪亮物体的质量上表现出色,通常优于无法进行重照明的状态-of-the-art重建方法。论文的核心在于引入了基于2D Gaussian surfel的几何与材质建模方式以及多级立方体 mipmap 的环境光照近似技术,以解决复杂视点相关效应下的高质量逆向渲染挑战。

链接: https://arxiv.org/abs/2504.18468
作者: Georgios Kouros,Minye Wu,Tinne Tuytelaars
机构: KU Leuven (鲁汶大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce RGS-DR, a novel inverse rendering method for reconstructing and rendering glossy and reflective objects with support for flexible relighting and scene editing. Unlike existing methods (e.g., NeRF and 3D Gaussian Splatting), which struggle with view-dependent effects, RGS-DR utilizes a 2D Gaussian surfel representation to accurately estimate geometry and surface normals, an essential property for high-quality inverse rendering. Our approach explicitly models geometric and material properties through learnable primitives rasterized into a deferred shading pipeline, effectively reducing rendering artifacts and preserving sharp reflections. By employing a multi-level cube mipmap, RGS-DR accurately approximates environment lighting integrals, facilitating high-quality reconstruction and relighting. A residual pass with spherical-mipmap-based directional encoding further refines the appearance modeling. Experiments demonstrate that RGS-DR achieves high-quality reconstruction and rendering quality for shiny objects, often outperforming reconstruction-exclusive state-of-the-art methods incapable of relighting.
zh
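摘要中的"延迟着色 + 环境光积分近似"流程,可用如下示意感受:先取几何缓冲中的法线与反照率,再以一组离散环境光方向求和近似光照积分(论文中由多级 cube mipmap 完成)。以下为笔者的假设性简化示意,非论文实现:

```python
import numpy as np

def deferred_shade(normals, albedo, env_dirs, env_radiance):
    """延迟着色示意: 先光栅化得到几何缓冲(法线/反照率),
    再逐像素按环境光方向求和做着色, 以离散方向求和近似环境光积分。"""
    H, W, _ = normals.shape
    out = np.zeros((H, W, 3))
    for d, L in zip(env_dirs, env_radiance):
        ndotl = np.clip(normals @ d, 0.0, None)   # 兰伯特项, 背向光不贡献
        out += albedo * ndotl[..., None] * L
    return out / len(env_dirs)

n = np.zeros((2, 2, 3)); n[..., 2] = 1.0          # 所有法线朝 +z
alb = np.ones((2, 2, 3)) * 0.5
dirs = [np.array([0, 0, 1.0]), np.array([0, 0, -1.0])]
rad = [np.ones(3), np.ones(3)]
img = deferred_shade(n, alb, dirs, rad)
# 只有朝向法线的那一盏光有贡献: 0.5 * 1 / 2 = 0.25
assert np.allclose(img, 0.25)
```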

[CV-6] NoiseController: Towards Consistent Multi-view Video Generation via Noise Decomposition and Collaboration

【速读】:该论文旨在解决视频生成中时空一致性难以保障的问题,当前方法通常依赖注意力机制或噪声修改来实现一致性的视频生成,但忽略了全局时空信息的作用。论文的关键解决方案是提出NoiseController,包含多级噪声分解(Multi-Level Noise Decomposition)、多帧噪声协作(Multi-Frame Noise Collaboration)以及联合去噪(Joint Denoising)。其中,多级噪声分解通过分解初始噪声为场景级别的前景/背景噪声,并进一步细分为共享与残差成分以保留一致性同时保持多样性;多帧噪声协作引入跨视图时空协作矩阵和视图内影响协作矩阵以捕捉跨视图交互与历史跨帧影响;联合去噪则利用两个并行U-Net模型去除场景级别噪声,相互增强视频生成效果。实验表明,NoiseController在公共数据集上的视频生成及下游任务中表现出最先进的性能。

链接: https://arxiv.org/abs/2504.18448
作者: Haotian Dong,Xin Wang,Di Lin,Yipeng Wu,Qin Chen,Ruonan Liu,Kairui Yang,Ping Li,Qing Guo
机构: Tianjin University (天津大学); The Hong Kong Polytechnic University (香港理工大学); Shanghai Jiao Tong University (上海交通大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-quality video generation is crucial for many fields, including the film industry and autonomous driving. However, generating videos with spatiotemporal consistencies remains challenging. Current methods typically utilize attention mechanisms or modify noise to achieve consistent videos, neglecting global spatiotemporal information that could help ensure spatial and temporal consistency during video generation. In this paper, we propose the NoiseController, consisting of Multi-Level Noise Decomposition, Multi-Frame Noise Collaboration, and Joint Denoising, to enhance spatiotemporal consistencies in video generation. In multi-level noise decomposition, we first decompose initial noises into scene-level foreground/background noises, capturing distinct motion properties to model multi-view foreground/background variations. Furthermore, each scene-level noise is further decomposed into individual-level shared and residual components. The shared noise preserves consistency, while the residual component maintains diversity. In multi-frame noise collaboration, we introduce an inter-view spatiotemporal collaboration matrix and an intra-view impact collaboration matrix , which captures mutual cross-view effects and historical cross-frame impacts to enhance video quality. The joint denoising contains two parallel denoising U-Nets to remove each scene-level noise, mutually enhancing video generation. We evaluate our NoiseController on public datasets focusing on video generation and downstream tasks, demonstrating its state-of-the-art performance.
zh
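其中"个体级共享分量 + 残差分量"的噪声分解思想,可用如下示意说明(函数与权重均为笔者假设,非论文官方实现):

```python
import numpy as np

def decompose_noise(noises, alpha=0.5):
    """将一组帧级初始噪声分解为共享分量与残差分量(示意)。
    shared 保持跨帧一致性, residual 保持多样性。"""
    noises = np.asarray(noises)
    shared = noises.mean(axis=0)   # 跨帧共享分量
    residual = noises - shared     # 各帧残差分量
    # 以权重 alpha 重新组合: 每帧噪声 = 共享 + alpha * 残差
    recombined = shared + alpha * residual
    return shared, residual, recombined

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 8))
shared, residual, out = decompose_noise(frames)
assert np.allclose(residual.mean(axis=0), 0)  # 残差在帧维度上均值为零
assert np.allclose(shared + residual, frames) # alpha=1 时可精确重构原噪声
```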

[CV-7] Iterative Event-based Motion Segmentation by Variational Contrast Maximization CVPR

【速读】:该论文旨在解决事件相机中事件数据的运动分割问题,即如何将场景中的视觉变化产生的事件数据分类为不同的运动类型(如背景主导运动假设和独立运动残差),以支持诸如物体检测和视觉伺服等任务。论文的关键在于提出了一种迭代式的运动分割方法,通过扩展对比最大化(Contrast Maximization)框架实现背景与前景事件的有效分离。该方法的核心创新在于其迭代分类策略,能够精准地区分公共数据集和自记录数据集中的事件簇,并生成清晰的、运动补偿后的边缘图像。实验结果表明,该方法在移动物体检测基准测试中达到了最先进的准确率,较现有方法提升了超过30%,同时展示了其在复杂真实场景中的应用潜力。

链接: https://arxiv.org/abs/2504.18447
作者: Ryo Yamaki,Shintaro Shiba,Guillermo Gallego,Yoshimitsu Aoki
机构: Keio University (庆应义塾大学), Japan; Technische Universität Berlin (柏林工业大学); Einstein Center Digital Future, Robotics Institute Germany, and Science of Intelligence Excellence Cluster, Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 11 pages, 9 figures, 3 tables, CVPR Workshop 2025

点击查看摘要

Abstract:Event cameras provide rich signals that are suitable for motion estimation since they respond to changes in the scene. As any visual changes in the scene produce event data, it is paramount to classify the data into different motions (i.e., motion segmentation), which is useful for various tasks such as object detection and visual servoing. We propose an iterative motion segmentation method, by classifying events into background (e.g., dominant motion hypothesis) and foreground (independent motion residuals), thus extending the Contrast Maximization framework. Experimental results demonstrate that the proposed method successfully classifies event clusters both for public and self-recorded datasets, producing sharp, motion-compensated edge-like images. The proposed method achieves state-of-the-art accuracy on moving object detection benchmarks with an improvement of over 30%, and demonstrates its possibility of applying to more complex and noisy real-world scenes. We hope this work broadens the sensitivity of Contrast Maximization with respect to both motion parameters and input events, thus contributing to theoretical advancements in event-based motion segmentation estimation. this https URL
zh
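对比最大化(Contrast Maximization)的核心可以用一个小例子体会:按候选运动对事件做补偿后累积成图,正确的运动假设会使事件聚焦、图像方差(对比度)最大。以下是笔者的简化示意(事件为合成数据,非论文代码):

```python
import numpy as np

def contrast(events, v, t_ref=0.0, size=(32, 32)):
    """按候选速度 v 对事件做运动补偿, 统计成像后的方差(对比度)。
    events: (N,3) 数组, 每行为 (x, y, t)。"""
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    # 运动补偿: 把事件沿速度 v 反向搬回参考时刻
    xw = np.clip(np.round(x - v[0] * (t - t_ref)).astype(int), 0, size[1] - 1)
    yw = np.clip(np.round(y - v[1] * (t - t_ref)).astype(int), 0, size[0] - 1)
    img = np.zeros(size)
    np.add.at(img, (yw, xw), 1.0)   # 事件累积成图
    return img.var()

# 构造沿 v_true = (5, 0) 平移的点状事件流
rng = np.random.default_rng(1)
t = rng.uniform(0, 1, 200)
events = np.stack([10 + 5 * t, 16 + 0 * t, t], axis=1)
# 正确的运动假设应获得最高对比度(事件聚焦成锐利的类边缘图像)
scores = {v: contrast(events, (v, 0.0)) for v in [0.0, 5.0, 10.0]}
best = max(scores, key=scores.get)
assert best == 5.0
```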

[CV-8] LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

【速读】:该论文试图解决从单幅图像进行未知几何推理的问题。传统深度估计方法局限于可见表面,而本文提出的分层射线交点(Layered Ray Intersections, LaRI)方法通过分层点图建模相机射线相交的多个表面,实现了完整、高效且视角一致的几何推理,从而统一物体级和场景级任务。解决方案的关键在于LaRI的分层表示方式以及预测射线停止索引(ray stopping index),后者用于从LaRI输出中识别有效相交的像素和层次,并构建了一个包含合成与真实数据的完整训练数据生成管道,确保了方法的通用性和有效性。

链接: https://arxiv.org/abs/2504.18424
作者: Rui Li,Biao Zhang,Zhenyu Li,Federico Tombari,Peter Wonka
机构: KAUST (沙特国王科技大学); Google (谷歌); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present layered ray intersections (LaRI), a new method for unseen geometry reasoning from a single image. Unlike conventional depth estimation that is limited to the visible surface, LaRI models multiple surfaces intersected by the camera rays using layered point maps. Benefiting from the compact and layered representation, LaRI enables complete, efficient, and view-aligned geometric reasoning to unify object- and scene-level tasks. We further propose to predict the ray stopping index, which identifies valid intersecting pixels and layers from LaRI’s output. We build a complete training data generation pipeline for synthetic and real-world data, including 3D objects and scenes, with necessary data cleaning steps and coordination between rendering engines. As a generic method, LaRI’s performance is validated in two scenarios: It yields comparable object-level results to the recent large generative model using 4% of its training data and 17% of its parameters. Meanwhile, it achieves scene-level occluded geometry reasoning in only one feed-forward.
zh
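射线停止索引(ray stopping index)的作用可以用如下玩具示意理解:在分层深度输出中数出每条射线的有效相交层数。实现方式为笔者假设,非论文代码:

```python
import numpy as np

def ray_stopping_index(layer_depths, valid_thr=0.0):
    """射线停止索引示意: 对每条射线的分层深度预测, 统计有效
    (深度 > 阈值)的层数, 其后的层视为无交点。假设有效层从前往后连续排列。"""
    valid = layer_depths > valid_thr   # (L, N): 每层每射线是否有交点
    return valid.sum(axis=0)           # 停止索引 = 有效层数

# 3 层、4 条射线: 第 0 条穿过 3 个表面, 第 3 条未命中任何表面
depths = np.array([[1.0, 1.2, 0.9, 0.0],
                   [2.5, 2.0, 0.0, 0.0],
                   [4.0, 0.0, 0.0, 0.0]])
assert ray_stopping_index(depths).tolist() == [3, 2, 1, 0]
```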

[CV-9] A Multimodal Hybrid Late-Cascade Fusion Network for Enhanced 3D Object Detection

【速读】:本文提出了一种新的方法,用于从多模态输入中检测三维物体,通过结合激光雷达(LiDAR)和RGB摄像头,在混合式后级级联方案中利用RGB检测网络与三维激光雷达检测器。论文的关键在于采用晚期融合原则减少激光雷达误报(False Positives),通过将激光雷达边界框投影到图像上来匹配激光雷达检测结果与RGB检测结果;同时借助级联融合原则,利用不同视角RGB检测所生成的极线约束(epipolar constraints)与视锥(frustums)来恢复激光雷达漏检(False Negatives)。此解决方案可适配于任何底层单模态检测器之上,支持灵活的训练过程,可以利用预训练的激光雷达和RGB检测器,或单独训练两个分支。实验结果显示,在KITTI目标检测基准上的性能显著提升,特别是在行人和骑车者(Pedestrians 和 Cyclists)的检测方面。

链接: https://arxiv.org/abs/2504.18419
作者: Carlo Sgaravatti,Roberto Basla,Riccardo Pieroni,Matteo Corno,Sergio M. Savaresi,Luca Magri,Giacomo Boracchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We present a new way to detect 3D objects from multimodal inputs, leveraging both LiDAR and RGB cameras in a hybrid late-cascade scheme, that combines an RGB detection network and a 3D LiDAR detector. We exploit late fusion principles to reduce LiDAR False Positives, matching LiDAR detections with RGB ones by projecting the LiDAR bounding boxes on the image. We rely on cascade fusion principles to recover LiDAR False Negatives leveraging epipolar constraints and frustums generated by RGB detections of separate views. Our solution can be plugged on top of any underlying single-modal detectors, enabling a flexible training process that can take advantage of pre-trained LiDAR and RGB detectors, or train the two branches separately. We evaluate our results on the KITTI object detection benchmark, showing significant performance improvements, especially for the detection of Pedestrians and Cyclists.
zh
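其中"晚期融合剔除 LiDAR 误报"的匹配逻辑,可以用投影后 2D 框的 IoU 匹配来示意(阈值与函数名为笔者假设,非论文实现):

```python
def iou(a, b):
    """两个 [x1, y1, x2, y2] 框的 IoU。"""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def late_fusion_filter(lidar_boxes_img, rgb_boxes, thr=0.5):
    """晚期融合示意: LiDAR 检测框(已投影到图像平面)若能与某个
    RGB 检测框以 IoU >= thr 匹配则保留, 否则视为 LiDAR 误报剔除。"""
    kept = []
    for i, lb in enumerate(lidar_boxes_img):
        if any(iou(lb, rb) >= thr for rb in rgb_boxes):
            kept.append(i)
    return kept

lidar = [[10, 10, 50, 50], [200, 200, 240, 260]]
rgb = [[12, 8, 52, 48]]
assert late_fusion_filter(lidar, rgb) == [0]  # 第二个 LiDAR 框无匹配, 当作误报剔除
```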

[CV-10] Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

【速读】:该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉线索利用上的局限性问题,特别是视觉链式思维(Visual Chain-of-Thought, Visual CoT)研究的不足。目前的视觉CoT方法主要依赖于有监督的微调(Supervised Fine-Tuning, SFT),需要大量的标注边界框数据且难以泛化到未见过的场景。为此,论文提出了一种名为无监督视觉链式思维(Unsupervised Visual CoT, UV-CoT)的新框架,通过偏好优化实现图像级的CoT推理。UV-CoT的关键在于无需边界框标注,而是通过自动数据生成管道获取偏好数据:首先由目标MLLM(如LLaVA-1.5-7B)基于模板提示生成候选边界框并对每个区域作答,再由评估MLLM(如OmniLLM-12B)对答案进行排名,这些排名作为监督信号,通过最小化负对数似然损失来训练目标MLLM。这种方法模拟人类感知过程,识别关键区域并基于此进行推理,从而提升视觉理解能力,尤其是在仅靠文本描述不足以完成的空间推理任务中表现出色。实验表明,UV-CoT在六个数据集上优于现有最先进的文本和视觉CoT方法,并在四个未见数据集的零样本测试中展现了强大的泛化能力。

链接: https://arxiv.org/abs/2504.18397
作者: Kesen Zhao,Beier Zhu,Qianru Sun,Hanwang Zhang
机构: Nanyang Technological University (南洋理工大学); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches are focused on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work is based on supervised fine-tuning (SFT) that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception–identifying key regions and reasoning based on them–UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT. The code is available in this https URL.
zh
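偏好优化中的负对数似然损失,常见的 Bradley-Terry/DPO 式写法如下(具体目标函数以论文为准,此处仅为笔者的示意):

```python
import math

def preference_nll(logp_win, logp_lose, beta=0.1):
    """偏好优化的负对数似然示意: loss = -log sigmoid(beta * (logp_win - logp_lose))。
    logp_* 为模型对"偏好/非偏好"边界框所得回答的对数概率。"""
    margin = beta * (logp_win - logp_lose)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# 偏好回答的概率越高于非偏好回答, 损失越小
hi = preference_nll(-5.0, -20.0)
lo = preference_nll(-20.0, -5.0)
assert hi < lo
# 两者相等时退化为 -log 0.5 = log 2
assert abs(preference_nll(-7.0, -7.0) - math.log(2.0)) < 1e-9
```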

[CV-11] Fast Autoregressive Models for Continuous Latent Generation

【速读】:该论文旨在解决连续域图像生成中自回归模型面临的挑战,特别是现有方法如掩码自回归模型(MAR)因迭代去噪过程计算成本高昂而导致推理速度慢的问题。论文的关键创新在于提出Fast AutoRegressive模型(FAR),通过用轻量级捷径头替代MAR的扩散头,实现了高效的小步采样,同时保持了自回归原理的一致性。此外,FAR能够无缝集成到因果Transformer中,将离散令牌生成扩展到连续令牌生成,而无需修改原有架构。实验结果表明,FAR的推理速度比MAR快2.3倍,同时保持了竞争性的FID和IS分数。这一工作首次建立了高效高保真连续空间图像生成的自回归范式,弥合了视觉自回归建模中质量与可扩展性之间的关键差距。

链接: https://arxiv.org/abs/2504.18391
作者: Tiankai Hang,Jianmin Bao,Fangyun Wei,Dong Chen
机构: Microsoft Research Asia (微软亚洲研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but their extension to continuous-domain image generation presents significant challenges. Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head but suffers from slow inference due to the high computational cost of the iterative denoising process. To address this, we propose the Fast AutoRegressive model (FAR), a novel framework that replaces MAR’s diffusion head with a lightweight shortcut head, enabling efficient few-step sampling while preserving autoregressive principles. Additionally, FAR seamlessly integrates with causal Transformers, extending them from discrete to continuous token generation without requiring architectural modifications. Experiments demonstrate that FAR achieves 2.3\times faster inference than MAR while maintaining competitive FID and IS scores. This work establishes the first efficient autoregressive paradigm for high-fidelity continuous-space image generation, bridging the critical gap between quality and scalability in visual autoregressive modeling.
zh
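"少步数采样到达同一终点"的直觉,可借一个已知恒定速度场的欧拉积分小例子体会:当每步的更新量被准确学到时,少步与多步得到相同结果,这正是 shortcut 头希望达到的效果。示例为笔者的玩具演示,非论文实现:

```python
import numpy as np

def euler_sample(x0, velocity, steps):
    """以不同步数从初始点积分到终点(扩散/流匹配采样的共同骨架)。"""
    x, t = x0.copy(), 0.0
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

x0_ref = np.array([0.0, 0.0])
target = np.array([3.0, -1.0])
v = lambda x, t: target - x0_ref   # 线性插值路径对应的恒定速度场

few = euler_sample(x0_ref, v, steps=4)     # 少步采样
many = euler_sample(x0_ref, v, steps=100)  # 多步采样
assert np.allclose(few, many) and np.allclose(few, target)
```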

[CV-12] COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization

【速读】:该论文旨在解决现有图像篡改检测与定位(Image Manipulation Detection and Localization, IMDL)方法主要聚焦于拼接或复制-移动伪造,而缺乏针对基于修补(inpainting)操作的专用基准数据集的问题。论文的关键解决方案是提出了COCOInpaint基准,它专门设计用于修补检测,并包含三个核心贡献:1)由六种最先进的修补模型生成的高质量修补样本;2)通过四种掩码生成策略实现的多样化生成场景,且支持可选文本引导;3)覆盖258,266张具有丰富语义多样性的修补图像。该基准强调修补区域与真实区域之间的内在不一致性,而非表面的语义伪影。论文还建立了严格的评估协议,使用三种标准指标来评估现有的IMDL方法。

链接: https://arxiv.org/abs/2504.18361
作者: Haozhen Yan,Yan Hong,Jiahui Zhan,Yikun Ji,Jun Lan,Huijia Zhu,Weiqiang Wang,Jianfu Zhang
机构: Shanghai Jiao Tong University(上海交通大学); Ant Group(蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Recent advancements in image manipulation have achieved unprecedented progress in generating photorealistic content, but also simultaneously eliminating barriers to arbitrary manipulation and editing, raising concerns about multimedia authenticity and cybersecurity. However, existing Image Manipulation Detection and Localization (IMDL) methodologies predominantly focus on splicing or copy-move forgeries, lacking dedicated benchmarks for inpainting-based manipulations. To bridge this gap, we present COCOInpaint, a comprehensive benchmark specifically designed for inpainting detection, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage with 258,266 inpainted images with rich semantic diversity. Our benchmark is constructed to emphasize intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We establish a rigorous evaluation protocol using three standard metrics to assess existing IMDL approaches. The dataset will be made publicly available to facilitate future research in this area.
zh

[CV-13] Interpretable Affordance Detection on 3D Point Clouds with Probabilistic Prototypes

【速读】:该论文旨在解决传统基于深度学习的 affordance 检测模型(如 PointNet++、DGCNN 等)作为黑箱模型缺乏可解释性的问题,同时探索如何将具备可解释性的原型学习方法(Prototypical Learning)应用于三维点云数据的 affordance 检测任务。论文的关键在于提出利用原型网络(ProtoPNet)等原型学习方法,通过“this looks like that”案例推理的方式,在保持高性能的同时赋予模型内在的可解释性,从而为需要增强信任和安全的人机交互场景提供更优解决方案。

链接: https://arxiv.org/abs/2504.18355
作者: Maximilian Xiling Li,Korbinian Rudolf,Nils Blank,Rudolf Lioutikov
机构: Intuitive Robots Lab (直观机器人实验室); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)(德国)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Robotic agents need to understand how to interact with objects in their environment, both autonomously and during human-robot interactions. Affordance detection on 3D point clouds, which identifies object regions that allow specific interactions, has traditionally relied on deep learning models like PointNet++, DGCNN, or PointTransformerV3. However, these models operate as black boxes, offering no insight into their decision-making processes. Prototypical Learning methods, such as ProtoPNet, provide an interpretable alternative to black-box models by employing a “this looks like that” case-based reasoning approach. However, they have been primarily applied to image-based tasks. In this work, we apply prototypical learning to models for affordance detection on 3D point clouds. Experiments on the 3D-AffordanceNet benchmark dataset show that prototypical models achieve competitive performance with state-of-the-art black-box models and offer inherent interpretability. This makes prototypical models a promising candidate for human-robot interaction scenarios that require increased trust and safety.
zh
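"this looks like that"式的原型推理可以用如下示意表达:计算特征到各原型的距离、转为相似度,再加权得到类别得分(相似度变换沿用 ProtoPNet 论文中的 log 形式;整体仅为笔者示意,非论文代码):

```python
import numpy as np

def prototype_logits(features, prototypes, class_weights):
    """原型网络式推理示意: features (d,), prototypes (P, d),
    class_weights (C, P)。距离越小, 相似度越高。"""
    d2 = ((prototypes - features) ** 2).sum(axis=1)
    sim = np.log((d2 + 1.0) / (d2 + 1e-4))   # ProtoPNet 的相似度变换
    return class_weights @ sim, sim

protos = np.array([[0.0, 0.0], [10.0, 10.0]])
W = np.eye(2)   # 每个原型对应一个类别
logits, sim = prototype_logits(np.array([0.1, -0.1]), protos, W)
assert logits.argmax() == 0   # 特征"最像"原型 0, 判为类 0
```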

[CV-14] Revisiting Data Auditing in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(VLMs)在成员推理(Membership Inference, MI)审计中的分布偏移问题及其性能高估现象。论文的关键在于揭示当前MI基准测试中存在的成员样本与非成员样本之间的分布差异(distribution shift),这些差异引入了捷径线索(shortcut cues),导致MI性能被高估。为解决此问题,作者提出基于最优传输(optimal transport)的度量方法来量化这种分布差异,并构建了新的MI基准数据集,确保成员和非成员图像的独立同分布(i.i.d.)。此外,通过理论分析VLM嵌入空间中的贝叶斯最优性,进一步验证了MI任务的不可约误差率较高。尽管如此,论文还探讨了在微调(fine-tuning)、访问真实文本标签以及基于集合的推理等实际场景下,实现可信数据审计的可能性。这一研究为未来VLMs的可信审计提供了系统性的见解和指导。

链接: https://arxiv.org/abs/2504.18349
作者: Hongyu Zhu,Sichu Liang,Wenwen Wang,Boheng Li,Tongxin Yuan,Fangqi Li,ShiLin Wang,Zhuosheng Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Southeast University (东南大学); Carnegie Mellon University (卡内基梅隆大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:With the surge of large language models (LLMs), Large Vision-Language Models (VLMs)–which integrate vision encoders with LLMs for accurate visual grounding–have shown great potential in tasks like generalist agents and robotic control. However, VLMs are typically trained on massive web-scraped images, raising concerns over copyright infringement and privacy violations, and making data auditing increasingly urgent. Membership inference (MI), which determines whether a sample was used in training, has emerged as a key auditing technique, with promising results on open-source VLMs like LLaVA (AUC 80%). In this work, we revisit these advances and uncover a critical issue: current MI benchmarks suffer from distribution shifts between member and non-member images, introducing shortcut cues that inflate MI performance. We further analyze the nature of these shifts and propose a principled metric based on optimal transport to quantify the distribution discrepancy. To evaluate MI in realistic settings, we construct new benchmarks with i.i.d. member and non-member images. Existing MI methods fail under these unbiased conditions, performing only marginally better than chance. Further, we explore the theoretical upper bound of MI by probing the Bayes Optimality within the VLM’s embedding space and find the irreducible error rate remains high. Despite this pessimistic outlook, we analyze why MI for VLMs is particularly challenging and identify three practical scenarios–fine-tuning, access to ground-truth texts, and set-based inference–where auditing becomes feasible. Our study presents a systematic view of the limits and opportunities of MI for VLMs, providing guidance for future efforts in trustworthy data auditing.
zh
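用最优传输度量成员/非成员分布差异的思路,可借一维经验分布的 W1 距离示意(同样本量时等于排序后逐点平均绝对差;论文所用度量以原文为准):

```python
import numpy as np

def wasserstein_1d(a, b):
    """等样本量下一维经验分布的 W1 距离: 排序后逐点求平均绝对差。
    用于量化成员/非成员样本某统计量的分布偏移(示意实现)。"""
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    assert a.shape == b.shape
    return np.abs(a - b).mean()

rng = np.random.default_rng(0)
member = rng.normal(0.0, 1.0, 2000)    # 成员样本的某种统计量
shifted = rng.normal(0.5, 1.0, 2000)   # 存在分布偏移的非成员样本
iid = rng.normal(0.0, 1.0, 2000)       # i.i.d. 非成员样本
# 分布偏移越大, 传输代价越高; i.i.d. 情形下距离接近 0
assert wasserstein_1d(member, shifted) > wasserstein_1d(member, iid)
```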

[CV-15] SCL:Multi-party loss Balancing scheme for deep learning Image steganography based on Curriculum learning

【速读】:该论文试图解决在基于深度学习的图像隐写框架中,如何通过自适应调整损失权重来平衡多种损失函数的问题。传统方法通常采用固定权重的损失函数进行训练优化,但这种设置与隐写任务本身的重要性及训练过程缺乏关联性。为了解决这一问题,论文提出了一种两阶段课程学习损失调度器(TSCL),其关键在于通过两个阶段实现多任务学习的动态平衡:第一阶段通过控制多边对抗训练中的损失权重,引导模型依次专注于信息嵌入、解码精度提升以及生成抗分析的隐写图像;第二阶段则通过计算迭代前后损失的下降幅度评估各任务的学习速度,进一步平衡每个任务的学习过程。实验结果表明,TSCL策略显著提升了隐写质量、解码准确率和安全性。

链接: https://arxiv.org/abs/2504.18348
作者: Fengchun Liu, Tong Zhang, Chunying Zhang
机构: North China University of Science and Technology (华北理工大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:For deep learning-based image steganography frameworks, in order to ensure the invisibility and recoverability of the information embedding, the loss function usually contains several losses such as embedding loss, recovery loss and steganalysis loss. In previous research works, fixed loss weights are usually chosen for training optimization, and this setting is not linked to the importance of the steganography task itself and the training process. In this paper, we propose a Two-stage Curriculum Learning loss scheduler (TSCL) for balancing multinomial losses in deep learning image steganography algorithms. TSCL consists of two phases: a priori curriculum control and loss dynamics control. The first phase firstly focuses the model on learning the information embedding of the original image by controlling the loss weights in the multi-party adversarial training; secondly, it makes the model shift its learning focus to improving the decoding accuracy; and finally, it makes the model learn to generate a steganographic image that is resistant to steganalysis. In the second stage, the learning speed of each training task is evaluated by calculating the loss drop of the before and after iteration rounds to balance the learning of each task. Experimental results on three large public datasets, ALASKA2, VOC2012 and ImageNet, show that the proposed TSCL strategy improves the quality of steganography, decoding accuracy and security.
zh
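第二阶段"按损失下降幅度平衡各任务"的思想,可用如下假设性示意表达(具体归一化公式以论文为准):

```python
import math

def balance_weights(prev_losses, curr_losses, T=1.0):
    """损失动态控制示意: 以前后两轮各任务损失之比估计学习速度,
    下降慢(比值大)的任务获得更大权重, 经 softmax 归一化。"""
    rates = [c / max(p, 1e-12) for p, c in zip(prev_losses, curr_losses)]
    exp = [math.exp(r / T) for r in rates]
    s = sum(exp)
    return [e / s for e in exp]

# 任务 0 下降快(0.9 -> 0.3), 任务 1 几乎没降(0.9 -> 0.85):
# 学习慢的任务 1 应分到更大权重
w = balance_weights([0.9, 0.9], [0.3, 0.85])
assert abs(sum(w) - 1.0) < 1e-9
assert w[1] > w[0]
```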

[CV-16] SSD-Poser: Avatar Pose Estimation with State Space Duality from Sparse Observations ICMR2025

【速读】:该论文旨在解决在Head-Mounted Displays (HMDs) 提供头部和手部关节信号的情况下,如何实现精确且实时的全身动作姿态估计,特别是面对不受约束的下肢运动时的挑战。现有的方法通常依赖于传统的神经网络(如Transformers)和生成模型(如扩散模型),但在保持高精度姿态重建的同时实现快速推理速度方面存在困难。论文的关键解决方案是提出了一种轻量级且高效的模型SSD-Poser,它通过设计一个名为State Space Attention Encoders的混合编码器来利用状态空间对偶性适应复杂的运动姿态,并实现实时真实的姿态重建。此外,引入的Frequency-Aware Decoder能够缓解由变频运动信号引起的抖动问题,显著提升动作的平滑度。实验结果表明,SSD-Poser在AMASS数据集上实现了卓越的精度和计算效率,其推理效率优于当前最先进的方法。

链接: https://arxiv.org/abs/2504.18332
作者: Shuting Zhao,Linxin Bai,Liangjing Shao,Ye Zhang,Xinrong Chen
机构: Academy for Engineering & Technology, Fudan University(工程与技术学院, 复旦大学); College of Vocational and Technical Teacher Education, Shanghai Polytechnic University(职业技术师范学院, 上海第二工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 9 pages, 6 figures, conference ICMR 2025

点击查看摘要

Abstract:The growing applications of AR/VR increase the demand for real-time full-body pose estimation from Head-Mounted Displays (HMDs). Although HMDs provide joint signals from the head and hands, reconstructing a full-body pose remains challenging due to the unconstrained lower body. Recent advancements often rely on conventional neural networks and generative models to improve performance in this task, such as Transformers and diffusion models. However, these approaches struggle to strike a balance between achieving precise pose reconstruction and maintaining fast inference speed. To overcome these challenges, a lightweight and efficient model, SSD-Poser, is designed for robust full-body motion estimation from sparse observations. SSD-Poser incorporates a well-designed hybrid encoder, State Space Attention Encoders, to adapt the state space duality to complex motion poses and enable real-time realistic pose reconstruction. Moreover, a Frequency-Aware Decoder is introduced to mitigate jitter caused by variable-frequency motion signals, remarkably enhancing the motion smoothness. Comprehensive experiments on the AMASS dataset demonstrate that SSD-Poser achieves exceptional accuracy and computational efficiency, showing outstanding inference efficiency compared to state-of-the-art methods.
zh
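变频运动信号导致的"抖动"及其抑制,可用一个一阶低通滤波的小例子体会(这只是动机演示,并非论文中的 Frequency-Aware Decoder):

```python
import numpy as np

def smooth_motion(poses, alpha=0.3):
    """一阶低通滤波示意: 抑制高频抖动, 保留低频运动趋势。"""
    out = np.empty_like(poses)
    out[0] = poses[0]
    for t in range(1, len(poses)):
        out[t] = alpha * poses[t] + (1 - alpha) * out[t - 1]
    return out

t = np.linspace(0, 2 * np.pi, 200)
rng = np.random.default_rng(0)
noisy = np.sin(t) + 0.3 * rng.normal(size=t.shape)   # 带抖动的运动信号
smoothed = smooth_motion(noisy)
# 帧间差分的方差(抖动程度)应显著下降
assert np.diff(smoothed).var() < np.diff(noisy).var()
```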

[CV-17] Depth3DLane: Monocular 3D Lane Detection via Depth Prior Distillation ICCV2025

【速读】:该论文致力于解决单目3D车道检测中由于单目图像难以捕捉深度信息所带来的挑战。传统方法通常通过逆透视映射(Inverse Perspective Mapping, IPM)将前视图(Front-View, FV)图像转换为鸟瞰图(Bird’s-Eye View, BEV)空间以进行车道检测,但IPM的平面地面假设以及上下文信息的丢失会导致3D信息重建(尤其是高度方向)的不准确性。为解决这些问题,论文提出了一种基于鸟瞰图的框架,其关键在于引入了一个分层深度感知头(Hierarchical Depth-Aware Head),通过提供多尺度深度特征增强不同深度下的空间感知能力,从而缓解平面地面假设带来的限制;同时利用深度先验蒸馏(Depth Prior Distillation)从教师模型中迁移语义深度知识,以捕获更丰富的结构化和上下文信息。此外,为了进一步优化车道连续性并确保平滑的车道重建,论文还设计了一个条件随机场模块(Conditional Random Field Module)来强制车道预测的空间一致性。大量实验验证了该方法在z轴误差上的表现达到最先进水平,并在整体性能上优于其他现有方法。

链接: https://arxiv.org/abs/2504.18325
作者: Dongxin Lyu,Han Huang,Cheng Tan,Zimu Li
机构: Jilin University; Fuzhou University, China; Zhejiang University & Westlake University; Shanghai Jiao Tong University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitting to ICCV2025

点击查看摘要

Abstract:Monocular 3D lane detection is challenging due to the difficulty in capturing depth information from single-camera images. A common strategy involves transforming front-view (FV) images into bird’s-eye-view (BEV) space through inverse perspective mapping (IPM), facilitating lane detection using BEV features. However, IPM’s flat-ground assumption and loss of contextual information lead to inaccuracies in reconstructing 3D information, especially height. In this paper, we introduce a BEV-based framework to address these limitations and improve 3D lane detection accuracy. Our approach incorporates a Hierarchical Depth-Aware Head that provides multi-scale depth features, mitigating the flat-ground assumption by enhancing spatial awareness across varying depths. Additionally, we leverage Depth Prior Distillation to transfer semantic depth knowledge from a teacher model, capturing richer structural and contextual information for complex lane structures. To further refine lane continuity and ensure smooth lane reconstruction, we introduce a Conditional Random Field module that enforces spatial coherence in lane predictions. Extensive experiments validate that our method achieves state-of-the-art performance in terms of z-axis error and outperforms other methods in the field in overall performance. The code is released at: this https URL.
zh
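文中批评的 IPM"平面地面假设",可用针孔相机下的反投影小例子说明:像素射线与假设的地面平面求交即得 BEV 坐标,一旦地面并非平面,该坐标(尤其高度)就会失真。以下为笔者的简化设定(相机无旋转、光轴水平),非论文实现:

```python
import numpy as np

def ipm_ground_point(u, v, K, h):
    """逆透视映射(IPM)示意: 假设地面为距相机 h 的水平平面(相机坐标系
    y 轴向下), 将像素 (u, v) 反投影到地面坐标 (横向, 前向深度)。"""
    x, y, _ = np.linalg.inv(K) @ np.array([u, v, 1.0])
    assert y > 0, "射线不指向地面"
    s = h / y                      # 射线 (x, y, 1) * s 与地面 y = h 相交
    return np.array([x * s, s])

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pt = ipm_ground_point(320, 340, K, h=1.5)   # 图像中心正下方的像素
assert abs(pt[0]) < 1e-9   # 横向偏移为 0
assert pt[1] > 0           # 位于相机前方某一深度
```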

[CV-18] Outlier-aware Tensor Robust Principal Component Analysis with Self-guided Data Augmentation

【速读】:该论文致力于解决传统 Tensor Robust Principal Component Analysis (TRPCA) 方法在处理具有结构化干扰(structured corruptions)的数据分解任务时效果不佳的问题。现有方法通常基于稀疏异常值假设,但在面对复杂或系统性干扰时表现欠佳。为了解决这一挑战,论文提出了一种自引导数据增强方法,其关键在于引入一种自适应加权机制,通过动态识别并抑制异常值的影响,将原始 TRPCA 问题重新表述为标准的 Tensor Principal Component Analysis (TPCA) 问题。这种方法的核心创新点在于设计了一个优化驱动的加权方案,在张量增强过程中实时调整异常值的贡献权重。此外,论文开发了一种高效的近端块坐标下降算法,结合封闭形式的更新规则,确保计算效率,并通过结合块坐标下降与主导化最小化(majorization-minimization)原则的理论框架证明了算法的收敛性。数值实验表明,该方法在人脸恢复、背景减除和高光谱去噪等任务中能够有效应对多种污染模式,同时在精度和计算效率方面均优于现有最先进的方法。

链接: https://arxiv.org/abs/2504.18323
作者: Yangyang Xu,Kexin Li,Li Yang,You-Wei Wen
机构: School of Mathematics and Statistics, Hunan Normal University, Changsha 410081, Hunan, China (湖南师范大学数学与统计学院,长沙 410081,中国); School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming 650221, Yunnan, China (云南财经大学统计与数学学院,昆明 650221,中国)
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Tensor Robust Principal Component Analysis (TRPCA) is a fundamental technique for decomposing multi-dimensional data into a low-rank tensor and an outlier tensor, yet existing methods relying on sparse outlier assumptions often fail under structured corruptions. In this paper, we propose a self-guided data augmentation approach that employs adaptive weighting to suppress outlier influence, reformulating the original TRPCA problem into a standard Tensor Principal Component Analysis (TPCA) problem. The proposed model involves an optimization-driven weighting scheme that dynamically identifies and downweights outlier contributions during tensor augmentation. We develop an efficient proximal block coordinate descent algorithm with closed-form updates to solve the resulting optimization problem, ensuring computational efficiency. Theoretical convergence is guaranteed through a framework combining block coordinate descent with majorization-minimization principles. Numerical experiments on synthetic and real-world datasets, including face recovery, background subtraction, and hyperspectral denoising, demonstrate that our method effectively handles various corruption patterns. The results show the improvements in both accuracy and computational efficiency compared to state-of-the-art methods.
zh
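"残差大则降权"的自适应加权思想可用如下示意体会(权重函数形式为常见选择之一,并非论文原式):

```python
import numpy as np

def adaptive_weights(residual, eps=1e-3):
    """自引导加权示意: 残差越大的元素越可能是离群值, 权重越小。
    这里取 w = eps / (|r| + eps), 干净位置权重为 1。"""
    return eps / (np.abs(residual) + eps)

X = np.outer(np.arange(5.0), np.ones(4))   # 秩 1 的干净数据
Xc = X.copy()
Xc[2, 1] += 100.0                          # 注入一个结构化离群值
# 以当前低秩估计(此处用真值示意)计算残差并加权
w = adaptive_weights(Xc - X)
assert w[2, 1] < 0.01       # 离群位置被强烈降权
assert w[0, 0] == 1.0       # 干净位置权重为 1
```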

[CV-19] STP4D: Spatio-Temporal-Prompt Consistent Modeling for Text-to-4D Gaussian Splatting

【速读】:该论文旨在解决现有文本到4D生成方法中存在的时空建模不足和提示对齐问题,这些问题导致生成内容存在时间不一致性、几何失真或质量低下等缺陷。为了解决这些问题,论文提出了一种名为STP4D的新方法,其关键是通过三个精心设计的模块——时变提示嵌入(Time-varying Prompt Embedding)、几何信息增强(Geometric Information Enhancement)以及时间扩展变形(Temporal Extension Deformation),实现全面的时空提示一致性建模,从而生成高质量的4D内容。此外,STP4D是首批利用扩散模型生成4D高斯(4D Gaussians)的方法之一,将4DGS的细粒度建模能力与实时渲染过程同扩散模型的快速推理速度相结合,进一步提升了生成效率和质量。

链接: https://arxiv.org/abs/2504.18318
作者: Yunze Deng,Haijun Xiong,Bin Feng,Xinggang Wang,Wenyu Liu
机构: School of Electronic Information and Communications, Huazhong University of Science and Technology (华中科技大学电子与信息通信学院), Wuhan, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-4D generation is rapidly developing and widely applied in various scenarios. However, existing methods often fail to incorporate adequate spatio-temporal modeling and prompt alignment within a unified framework, resulting in temporal inconsistencies, geometric distortions, or low-quality 4D content that deviates from the provided texts. Therefore, we propose STP4D, a novel approach that aims to integrate comprehensive spatio-temporal-prompt consistency modeling for high-quality text-to-4D generation. Specifically, STP4D employs three carefully designed modules: Time-varying Prompt Embedding, Geometric Information Enhancement, and Temporal Extension Deformation, which collaborate to accomplish this goal. Furthermore, STP4D is among the first methods to exploit the Diffusion model to generate 4D Gaussians, combining the fine-grained modeling capabilities and the real-time rendering process of 4DGS with the rapid inference speed of the Diffusion model. Extensive experiments demonstrate that STP4D excels in generating high-fidelity 4D content with exceptional efficiency (approximately 4.6s per asset), surpassing existing methods in both quality and speed.
zh
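时变提示嵌入的最小化直觉:给每个时间步一个随时间平滑变化的条件向量。下面用两端嵌入间的线性插值示意(实际模块为可学习映射,此处仅为笔者假设):

```python
import numpy as np

def time_varying_prompt(p_start, p_end, num_frames):
    """时变提示嵌入的最简示意: 在两个提示嵌入之间按时间线性插值,
    使每个时间步获得随时间平滑过渡的条件向量, 返回 (T, d)。"""
    ts = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - ts) * p_start + ts * p_end

p0, p1 = np.zeros(4), np.ones(4)
emb = time_varying_prompt(p0, p1, 5)
assert emb.shape == (5, 4)
assert np.allclose(emb[0], p0) and np.allclose(emb[-1], p1)
assert np.allclose(emb[2], 0.5)   # 中间时刻为两端的均值
```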

[CV-20] ask-Oriented Communications for Visual Navigation with Edge-Aerial Collaboration in Low Altitude Economy

【速读】:该论文旨在解决低空经济(LAE)领域中无人机(UAV)在城市环境中无全球定位系统(GPS)信号情况下的精确定位问题。传统基于视觉的方法受限于轻量级无人机的带宽、内存和处理能力,而论文提出了一种面向任务的通信框架,通过多摄像头系统提取紧凑的多视角特征,并将定位任务卸载到边缘服务器以缓解资源限制。方案的关键在于引入了正交约束变分信息瓶颈编码器(O-VIB),其结合自动相关性确定(ARD)来剪枝非信息性特征,同时施加正交约束以最小化冗余,从而以最低的传输成本实现高效且精确的定位。

链接: https://arxiv.org/abs/2504.18317
作者: Zhengru Fang,Zhenghao Liu,Jingjing Wang,Senkang Hu,Yu Guo,Yiqin Deng,Yuguang Fang
机构: Hong Kong JC STEM Lab of Smart City and Department of Computer Science, City University of Hong Kong (香港城市大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
备注: Code and dataset will be made publicly available: this https URL

点击查看摘要

Abstract:To support the Low Altitude Economy (LAE), precise unmanned aerial vehicle (UAV) localization is required in urban areas where global positioning system (GPS) signals are unavailable. Vision-based methods offer a viable alternative but face severe bandwidth, memory and processing constraints on lightweight UAVs. Inspired by mammalian spatial cognition, we propose a task-oriented communication framework, where UAVs equipped with multi-camera systems extract compact multi-view features and offload localization tasks to edge servers. We introduce the Orthogonally-constrained Variational Information Bottleneck encoder (O-VIB), which incorporates automatic relevance determination (ARD) to prune non-informative features while enforcing orthogonality to minimize redundancy. This enables efficient and accurate localization with minimal transmission cost. Extensive evaluation on a dedicated LAE UAV dataset shows that O-VIB achieves high-precision localization under stringent bandwidth budgets. Code and dataset will be made publicly available: this https URL.
zh
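O-VIB 中“施加正交约束以最小化冗余”的思想可以用一个极简的数值示意来说明。以下代码为假设性示意,并非论文实现;惩罚项取常见的 ||W^T W - I||_F^2 形式:

```python
import numpy as np

def orthogonality_penalty(W):
    # ||W^T W - I||_F^2:列向量越接近两两正交且单位范数,惩罚越小
    d = W.shape[1]
    G = W.T @ W
    return float(np.sum((G - np.eye(d)) ** 2))

rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.normal(size=(8, 4)))[0]   # 构造一个列正交矩阵
W = rng.normal(size=(8, 4))                     # 一般的随机矩阵

print(orthogonality_penalty(Q) < 1e-9)   # True:正交列几乎无惩罚
print(orthogonality_penalty(W) > 1.0)    # True:相关的列被惩罚
```

实际的 O-VIB 还结合 ARD 对非信息维度进行剪枝,此处仅示意正交约束这一部分。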

[CV-21] Enhancing Long-Term Re-Identification Robustness Using Synthetic Data: A Comparative Analysis ICML

【速读】:该论文试图解决材料再识别中因材料磨损和老化导致的性能下降问题。论文的关键解决方案包括两方面:一是通过使用合成训练数据(synthetic training data),在模型训练中引入人工生成的数据,利用10%的人工数据显著提升了Rank-1准确性达13%,同时增强了模型在未见过数据上的泛化能力;二是采用动态更新的再识别画廊(continuously updating gallery),逐步考虑材料老化的影响,使平均Rank-1准确性提高了24%。此外,论文还开源了一个包含2,696张Euro托盘图像的新数据集(pallet-block-2696),用于模拟自然老化和人为损坏过程,进一步支持研究。

链接: https://arxiv.org/abs/2504.18286
作者: Christian Pionzewski,Rebecca Rademacher,Jérôme Rutinowski,Antonia Ponikarov,Stephan Matzke,Tim Chilla,Pia Schreynemackers,Alice Kirchheim
机构: Fraunhofer Institute for Material Flow and Logistics (弗劳恩霍夫物料流与物流研究所); TU Dortmund University (多特蒙德工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in: 2024 International Conference on Machine Learning and Applications (ICMLA), IEEE. 6 pages, 3 figures

点击查看摘要

Abstract:This contribution explores the impact of synthetic training data usage and the prediction of material wear and aging in the context of re-identification. Different experimental setups and gallery set expanding strategies are tested, analyzing their impact on performance over time for aging re-identification subjects. Using a continuously updating gallery, we were able to increase our mean Rank-1 accuracy by 24%, as material aging was taken into account step by step. In addition, using models trained with 10% artificial training data, Rank-1 accuracy could be increased by up to 13%, in comparison to a model trained on only real-world data, significantly boosting generalized performance on hold-out data. Finally, this work introduces a novel, open-source re-identification dataset, pallet-block-2696. This dataset contains 2,696 images of Euro pallets, taken over a period of 4 months. During this time, natural aging processes occurred and some of the pallets were damaged during their usage. These wear and tear processes significantly changed the appearance of the pallets, providing a dataset that can be used to generate synthetically aged pallets or other wooden materials.
zh
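文中“动态更新的再识别画廊”的思路可以用如下极简示意理解:每次匹配后将最新的查询特征加入画廊,使画廊逐步跟上材料老化带来的外观变化。以下为假设性代码,与论文实现无关:

```python
import numpy as np

class UpdatingGallery:
    """持续更新的再识别画廊示意:匹配后把最新嵌入加入画廊以跟踪外观漂移。"""
    def __init__(self):
        self.feats, self.ids = [], []

    def enroll(self, feat, pid):
        self.feats.append(feat / np.linalg.norm(feat))
        self.ids.append(pid)

    def match(self, feat, update=True):
        q = feat / np.linalg.norm(feat)
        sims = np.array(self.feats) @ q          # 余弦相似度
        pid = self.ids[int(np.argmax(sims))]
        if update:                               # 关键:画廊随时间扩充
            self.enroll(feat, pid)
        return pid

g = UpdatingGallery()
g.enroll(np.array([1.0, 0.0]), "pallet-1")
g.enroll(np.array([0.0, 1.0]), "pallet-2")
print(g.match(np.array([0.9, 0.1])))  # pallet-1
```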

[CV-22] Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator CVPR2025

【速读】:该论文旨在解决现有音频-视觉生成模型无法从混合音频(multi-class audio)生成图像的问题,而现有方法仅限于单类别音频输入。论文的关键贡献在于提出了一种名为Audio-Visual Generation and Separation (AV-GAS) 的新模型,通过引入音频-视觉分离模块来实现从混合音频生成多类别的图像。其解决方案的核心是利用音频-视觉分离技术,将混合音频分解为多个单独的音频类别,并为每个类别生成对应的图像,同时提出新的评估指标Class Representation Score (CRS) 和改进的R@K以衡量生成结果的质量。论文在VGGSound数据集上的实验表明,该方法在生成真实感图像方面优于当前最先进的方法,提升了7%的CRS和4%的R@2*。

链接: https://arxiv.org/abs/2504.18283
作者: Minjae Kang,Martim Brandão
机构: King’s College London (伦敦国王学院), United Kingdom
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Originally submitted to CVPR 2025 on 2024-11-15 with paper ID 15808

点击查看摘要

Abstract:Recent audio-visual generative models have made substantial progress in generating images from audio. However, existing approaches focus on generating images from single-class audio and fail to generate images from mixed audio. To address this, we propose an Audio-Visual Generation and Separation model (AV-GAS) for generating images from soundscapes (mixed audio containing multiple classes). Our contribution is threefold: First, we propose a new challenge in the audio-visual generation task, which is to generate an image given a multi-class audio input, and we propose a method that solves this task using an audio-visual separator. Second, we introduce a new audio-visual separation task, which involves generating separate images for each class present in a mixed audio input. Lastly, we propose new evaluation metrics for the audio-visual generation task: Class Representation Score (CRS) and a modified R@K. Our model is trained and evaluated on the VGGSound dataset. We show that our method outperforms the state-of-the-art, achieving 7% higher CRS and 4% higher R@2* in generating plausible images with mixed audio.
zh

[CV-23] SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology CVPR2025

【速读】:该论文试图解决全球生物多样性地图绘制等宏观生态研究面临的挑战,特别是由于数据标注稀缺和现有遥感数据集偏差(倾向于人类活动频繁区域)导致的问题。此外,现有的多时相影像数据集通常基于日历季节而非局部物候周期,无法充分捕捉植被的季节性变化。论文的关键解决方案是提出了一种基于物候信息的采样策略,并构建了一个名为SSL4Eco的多时相Sentinel-2数据集,通过引入季节对比目标训练模型,从而改善表征学习的质量并在多种生态下游任务中达到最先进的性能。

链接: https://arxiv.org/abs/2504.18256
作者: Elena Plekhanova,Damien Robert,Johannes Dollinger,Emilia Arens,Philipp Brun,Jan Dirk Wegner,Niklaus Zimmermann
机构: Swiss Federal Research Institute WSL (瑞士联邦森林、雪与景观研究所); University of Zurich (苏黎世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025, EarthVision workshop

点击查看摘要

Abstract:With the exacerbation of the biodiversity and climate crises, macroecological pursuits such as global biodiversity mapping become more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has enabled learning representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce corresponding SSL4Eco, a multi-date Sentinel-2 dataset, on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state of the art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at this https URL.
zh
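论文采用“季节对比”目标训练模型。下面给出一个基于 InfoNCE 的对比损失极简示意(纯 NumPy,仅说明思想:同一地点不同物候日期的影像互为正样本,批内其他地点为负样本;SSL4Eco 的具体损失形式以原文为准):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    # 行 i 的正样本在对角线上,其余样本为负样本
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    logits -= logits.max(axis=1, keepdims=True)   # 数值稳定
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
loss_aligned = info_nce(z, z)                      # 完全对齐的两组视图
loss_random = info_nce(z, rng.normal(size=(4, 16)))
print(loss_aligned < loss_random)                  # 对齐视图损失更低
```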

[CV-24] Event-Based Eye Tracking. 2025 Event-based Vision Workshop

【速读】:该论文旨在解决基于事件相机的眼动追踪任务,具体目标是通过处理事件相机记录的眼动数据来预测瞳孔中心位置。论文的关键在于总结和分析在2025 CVPR事件驱动视觉工作坊组织的Event-Based Eye Tracking挑战赛中表现优异团队所提出的创新方法,这些方法在精度、模型大小以及操作次数等方面进行了详细报告。此外,论文还从硬件设计的角度探讨了事件驱动眼动追踪技术的发展方向。

链接: https://arxiv.org/abs/2504.18249
作者: Qinyu Chen,Chang Gao,Min Liu,Daniele Perrone,Yan Ru Pei,Zuowen Wang,Zhuo Zou,Shihang Tan,Tao Han,Guorui Lu,Zhen Xu,Junyuan Ding,Ziteng Wang,Zongwei Wu,Han Han,Yuliang Wu,Jinze Chen,Wei Zhai,Yang Cao,Zheng-jun Zha,Nuwan Bandara,Thivya Kandappu,Archan Misra,Xiaopeng Lin,Hongxiang Huang,Hongwei Ren,Bojun Cheng,Hoang M. Truong,Vinh-Thuan Ly,Huy G. Tran,Thuan-Phat Nguyen,Tram T. Doan
机构: Leiden University (莱顿大学); Delft University of Technology (代尔夫特理工大学); DVSense; Prophesee; NVIDIA; Institute of Neuroinformatics, UZH/ETH Zurich (苏黎世大学和瑞士联邦理工学院神经信息学研究所); Fudan University (复旦大学); University of Würzburg (维尔茨堡大学); University of Science and Technology of China (中国科学技术大学); Singapore Management University (新加坡管理大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); University of Science, VNU-HCM, Ho Chi Minh City, Vietnam (越南国家大学胡志明市科学技术大学); Vietnam National University, Ho Chi Minh City, Vietnam (越南国立大学胡志明市校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This survey serves as a review for the 2025 Event-Based Eye Tracking Challenge organized as part of the 2025 CVPR event-based vision workshop. This challenge focuses on the task of predicting the pupil center by processing event camera recorded eye movement. We review and summarize the innovative methods from the top-ranking teams in the challenge to advance future event-based eye tracking research. For each method, accuracy, model size, and number of operations are reported. In this survey, we also discuss event-based eye tracking from the perspective of hardware design.
zh

[CV-25] BiasBench: A reproducible benchmark for tuning the biases of event cameras CVPR2025

【速读】:该论文旨在解决事件相机(Event-based Cameras)在实际应用中因偏置(biases)配置不当导致输出质量下降的问题。传统帧相机拥有先进的自动配置算法,而事件相机缺乏类似的工具来优化这些偏置参数。论文的关键在于提出BiasBench,这是一个包含多个场景的新型事件数据集,其偏置设置以网格模式采样,从而实现可重复性。此外,论文还引入了一种基于强化学习(RL)的新方法,用于促进在线偏置调整,解决了现有事件模拟器不适用于偏置调优的局限性。

链接: https://arxiv.org/abs/2504.18235
作者: Andreas Ziegler,David Joseph,Thomas Gossard,Emil Moldovan,Andreas Zell
机构: Cognitive Systems Group, University of Tübingen, Germany (认知系统小组, 图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to CVPR 2025 Workshop on Event-based Vision

点击查看摘要

Abstract:Event-based cameras are bio-inspired sensors that detect light changes asynchronously for each pixel. They are increasingly used in fields like computer vision and robotics because of several advantages over traditional frame-based cameras, such as high temporal resolution, low latency, and high dynamic range. As with any camera, the output’s quality depends on how well the camera’s settings, called biases for event-based cameras, are configured. While frame-based cameras have advanced automatic configuration algorithms, there are very few such tools for tuning these biases. A systematic testing framework would require observing the same scene with different biases, which is tricky since event cameras only generate events when there is movement. Event simulators exist, but since biases heavily depend on the electrical circuit and the pixel design, available simulators are not well suited for bias tuning. To allow reproducibility, we present BiasBench, a novel event dataset containing multiple scenes with settings sampled in a grid-like pattern. We present three different scenes, each with a quality metric of the downstream application. Additionally, we present a novel, RL-based method to facilitate online bias adjustments.
zh

[CV-26] Dense Geometry Supervision for Underwater Depth Estimation

【速读】:该论文致力于解决现有单目深度估计算法在水下环境中的挑战,特别是在数据稀缺和方法支持不足的情况下。论文的关键解决方案在于构建了一个经济高效的水下场景数据集,通过多视图深度估计生成监督信号和增强图像,并提出了一种基于水下光学成像原理的纹理-深度融合模块(Texture-Depth Fusion Module),以有效利用和整合纹理线索中的深度信息。实验结果表明,该方法显著提升了模型在水下环境中的精度和适应性。

链接: https://arxiv.org/abs/2504.18233
作者: Wenxiang Gu,Lin Qi
机构: School of Computer Science and Technology, Ocean University of China (计算机科学与技术学院, 中国海洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The field of monocular depth estimation is continually evolving with the advent of numerous innovative models and extensions. However, research on monocular depth estimation methods specifically for underwater scenes remains limited, compounded by a scarcity of relevant data and methodological support. This paper proposes a novel approach to address the existing challenges in current monocular depth estimation methods for underwater environments. We construct an economically efficient dataset suitable for underwater scenarios by employing multi-view depth estimation to generate supervisory signals and corresponding enhanced underwater images. We introduce a texture-depth fusion module, designed according to the underwater optical imaging principles, which aims to effectively exploit and integrate depth information from texture cues. Experimental results on the FLSea dataset demonstrate that our approach significantly improves the accuracy and adaptability of models in underwater settings. This work offers a cost-effective solution for monocular underwater depth estimation and holds considerable promise for practical applications.
zh

[CV-27] Unify3D: An Augmented Holistic End-to-end Monocular 3D Human Reconstruction via Anatomy Shaping and Twins Negotiating

【速读】:该论文旨在解决单目彩色人体三维重建任务中因单一RGB图像缺乏几何信息而导致的重建精度受限问题。传统方法通常依赖于前置几何模型来提供显式的几何表示,这限制了重建的整体性,并且容易受到前置模型的约束。为了解决这一问题,论文提出了一种全新的端到端范式,将人体重建视为一个整体过程,直接从2D图像预测完整的3D人体 avatar,无需任何显式的中间几何表示。

解决方案的关键在于两个核心组件:首先,引入了解剖形状提取模块(Anatomy Shaping Extraction Module),该模块考虑了人体解剖学的特性以捕获隐式的形状特征;其次,提出了双模态协商重建U-Net(Twins Negotiating Reconstruction U-Net),通过两种不同模态U-Net之间的特征交互来增强重建效果。此外,还设计了漫画数据增强策略(Comic Data Augmentation)并构建了超过15,000个3D人体扫描数据集,以提高模型在复杂输入条件下的性能。实验结果表明,该方法在多个测试集和真实场景案例中均优于现有最先进的方法。

链接: https://arxiv.org/abs/2504.18215
作者: Nanjie Yao,Gangjian Zhang,Wenhao Shen,Jian Shu,Hao Wang
机构: HKUST(GZ); Nanyang Technological University (南洋理工大学); HKUST(GZ); HKUST(GZ)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular 3D clothed human reconstruction aims to create a complete 3D avatar from a single image. To tackle the human geometry lacking in one RGB image, current methods typically resort to a preceding model for an explicit geometric representation. For the reconstruction itself, focus is on modeling both it and the input image. This routine is constrained by the preceding model, and overlooks the integrity of the reconstruction task. To address this, this paper introduces a novel paradigm that treats human reconstruction as a holistic process, utilizing an end-to-end network for direct prediction from 2D image to 3D avatar, eliminating any explicit intermediate geometry display. Based on this, we further propose a novel reconstruction framework consisting of two core components: the Anatomy Shaping Extraction module, which captures implicit shape features taking into account the specialty of human anatomy, and the Twins Negotiating Reconstruction U-Net, which enhances reconstruction through feature interaction between two U-Nets of different modalities. Moreover, we propose a Comic Data Augmentation strategy and construct 15k+ 3D human scans to bolster model performance in more complex case input. Extensive experiments on two test sets and many in-the-wild cases show the superiority of our method over SOTA methods. Our demos can be found in : this https URL.
zh

[CV-28] A Data-Centric Approach to 3D Semantic Segmentation of Railway Scenes

【速读】:该论文旨在解决自动驾驶列车领域中基于激光雷达(LiDAR)的语义分割在不同距离下预测准确性的问题,特别是在远距离场景中行人和轨道分割性能不足的挑战。为了解决这一问题,论文提出了两种针对性的数据增强方法:行人实例粘贴方法通过引入真实变化来提升远距离行人的分割精度;轨道稀疏化方法则通过重新分布激光雷达扫描中的点云密度,改善远距离轨道分割效果,同时对近距离分割精度影响较小。解决方案的关键在于结合数据增强技术与先进的3D语义分割网络,验证了数据为中心的方法在应对铁路特定挑战方面的潜力。

链接: https://arxiv.org/abs/2504.18213
作者: Nicolas Münger,Max Peter Ronecker,Xavier Diaz,Michael Karner,Daniel Watzenig,Jan Skaloud
机构: SETLabs Research GmbH; Graz University of Technology (格拉茨技术大学); EPFL - Swiss Federal Technology Institute of Lausanne (瑞士洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 28th Computer Vision Winter Workshop 2025

点击查看摘要

Abstract:LiDAR-based semantic segmentation is critical for autonomous trains, requiring accurate predictions across varying distances. This paper introduces two targeted data augmentation methods designed to improve segmentation performance on the railway-specific OSDaR23 dataset. The person instance pasting method enhances segmentation of pedestrians at distant ranges by injecting realistic variations into the dataset. The track sparsification method redistributes point density in LiDAR scans, improving track segmentation at far distances with minimal impact on close-range accuracy. Both methods are evaluated using a state-of-the-art 3D semantic segmentation network, demonstrating significant improvements in distant-range performance while maintaining robustness in close-range predictions. We establish the first 3D semantic segmentation benchmark for OSDaR23, demonstrating the potential of data-centric approaches to address railway-specific challenges in autonomous train perception.
zh
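其中“轨道稀疏化”的核心想法是按距离重新分配点云密度。以下是一个按距离做概率性降采样的极简示意(假设性实现,保留概率函数为虚构,并非论文的具体规则):

```python
import numpy as np

def sparsify_by_distance(points, keep_fn, seed=42):
    """按距离概率性降采样点云。
    points: (N, 3);keep_fn 把水平距离(米)映射为保留概率 [0, 1]。"""
    rng = np.random.default_rng(seed)
    ranges = np.linalg.norm(points[:, :2], axis=1)
    keep = rng.random(len(points)) < keep_fn(ranges)
    return points[keep]

# 构造一条 1~200 米的直线“轨道”点云
pts = np.column_stack([np.linspace(1, 200, 10_000),
                       np.zeros(10_000), np.zeros(10_000)])
# 近处点以高概率保留,远处点被稀疏化
sparse = sparsify_by_distance(pts, lambda r: np.clip(50.0 / r, 0.05, 1.0))
print(0 < len(sparse) < len(pts))  # True:整体点数减少但非空
```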

[CV-29] Gradient Descent as a Shrinkage Operator for Spectral Bias

【速读】:该论文试图解决神经网络中谱偏差(spectral bias)的影响机制及其调控问题。论文的关键在于揭示梯度下降(Gradient Descent, GD)如何通过调整神经网络雅可比矩阵的奇异值来重新解释为一种隐式的频域选择操作,从而控制谱偏差。具体而言,论文提出了一种显式关系,将梯度下降超参数(如学习率和迭代次数)与带宽(活跃成分的数量)关联起来,并指出仅当使用单调激活函数时,梯度下降正则化才有效。此外,论文强调了非单调激活函数(如sinc、高斯函数)在高效模拟谱偏差方面的实用性。

链接: https://arxiv.org/abs/2504.18207
作者: Simon Lucey
机构: Australian Institute for Machine Learning (AIML) (澳大利亚机器学习研究所); University of Adelaide (阿德莱德大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We generalize the connection between activation function and spline regression/smoothing and characterize how this choice may influence spectral bias within a 1D shallow network. We then demonstrate how gradient descent (GD) can be reinterpreted as a shrinkage operator that masks the singular values of a neural network’s Jacobian. Viewed this way, GD implicitly selects the number of frequency components to retain, thereby controlling the spectral bias. An explicit relationship is proposed between the choice of GD hyperparameters (learning rate, number of iterations) and bandwidth (the number of active components). GD regularization is shown to be effective only with monotonic activation functions. Finally, we highlight the utility of non-monotonic activation functions (sinc, Gaussian) as iteration-efficient surrogates for spectral bias.
zh
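论文的核心观察(梯度下降等价于对奇异值做收缩/滤波)在线性最小二乘情形下可以直接数值验证:从零初始化做 t 步、步长为 η 的梯度下降,等价于在每个奇异方向上乘以滤波因子 1 - (1 - η σ_i^2)^t。以下为可运行的验证脚本(浅层网络情形的结论以论文为准,此处仅演示线性情形):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

eta, t = 0.001, 200
x = np.zeros(5)
for _ in range(t):
    x -= eta * A.T @ (A @ x - b)   # 对 0.5*||Ax-b||^2 做梯度下降

# 闭式解:GD 在每个奇异方向上施加滤波因子 1-(1-eta*s_i^2)^t
filt = 1 - (1 - eta * s**2) ** t
x_closed = Vt.T @ (filt / s * (U.T @ b))
print(np.allclose(x, x_closed))  # True:GD 等价于谱收缩
```

迭代次数越多、学习率越大,滤波因子越接近 1,保留的“频率成分”(奇异方向)也就越多,这正是超参数与带宽之间的显式关系。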

[CV-30] Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Preference Understanding

【速读】:该论文旨在解决生成式 AI 在实现高分辨率图像输出且与用户细粒度偏好一致方面的挑战,并提出通过多轮交互确保生成图像满足预期。传统方法仅基于奖励反馈优化提示,但未利用多轮对话数据进行优化。论文的关键解决方案是引入了一个名为 Visual Co-Adaptation (VCA) 的框架,该框架结合了人机协作反馈,利用与人类偏好对齐的预训练奖励模型。VCA 框架使用多样化的多轮对话数据集,采用多样性、一致性及偏好反馈等多种奖励函数,并通过 LoRA 技术微调扩散模型,从而根据用户输入优化图像生成。此外,论文构建了与用户意图对齐的多轮提示与图像对数据集。实验表明,该方法在提升图像一致性和用户满意度方面显著优于现有技术,尤其在多轮对话场景中表现突出。

链接: https://arxiv.org/abs/2504.18204
作者: Kun Li,Jianhui Wang,Yangfan He,Xinyuan Song,Ruoyu Wang,Hongyang He,Wenxin Zhang,Jiaqi Chen,Keqin Li,Sida Li,Miao Zhang,Tianyu Shi,Xueqian Wang
机构: Xiamen University (厦门大学); University of Electronic Science and Technology of China (电子科技大学); University of Minnesota—Twin Cities (明尼苏达大学双城分校); Emory University (埃默里大学); Tsinghua University (清华大学); University of Warwick (华威大学); University of the Chinese Academy of Sciences (中国科学院大学); George Washington University (乔治华盛顿大学); University of Toronto (多伦多大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: substantial text overlap with arXiv:2503.17660

点击查看摘要

Abstract:Generative AI has significantly changed industries by enabling text-driven image generation, yet challenges remain in achieving high-resolution outputs that align with fine-grained user preferences. Consequently, multi-round interactions are necessary to ensure the generated images meet expectations. Previous methods enhanced prompts via reward feedback but did not optimize over a multi-round dialogue dataset. In this work, we present a Visual Co-Adaptation (VCA) framework incorporating human-in-the-loop feedback, leveraging a well-trained reward model aligned with human preferences. Using a diverse multi-turn dialogue dataset, our framework applies multiple reward functions, such as diversity, consistency, and preference feedback, while fine-tuning the diffusion model through LoRA, thus optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompts and image pairs aligned with user intent. Experiments demonstrate that our method outperforms state-of-the-art baselines, significantly improving image consistency and alignment with user intent. Our approach consistently surpasses competing models in user satisfaction, especially in multi-turn dialogue scenarios.
zh

[CV-31] LiDAR-Guided Monocular 3D Object Detection for Long-Range Railway Monitoring

【速读】:该论文旨在解决铁路系统中长距离感知的问题,特别是针对德国等地区因老旧基础设施带来的挑战以及提高列车交通安全性需求。传统汽车系统的制动距离约为70米,而列车需要超过1公里的感知范围以实现早期危险检测(如平交道口的障碍物或轨道上的行人)。论文提出了一种基于深度学习的长距离3D目标检测方法,专为自动驾驶列车设计。该方案仅依赖单目图像,并受到Faraway-Frustum方法的启发,在训练过程中结合LiDAR数据以增强深度估计能力。其核心在于由四个关键模块组成的管道:(1) 用于2.5D目标检测的改进YOLOv9模型;(2) 深度估计网络;以及(3-4) 分别针对短程和远程的专用3D检测头。OSDaR23数据集上的评估表明,该方法在检测高达250米的目标时表现出色,展示了其在铁路自动化中的潜力,并指出了未来改进的方向。

链接: https://arxiv.org/abs/2504.18203
作者: Raul David Dominguez Sanchez,Xavier Diaz Ortiz,Xingcheng Zhou,Max Peter Ronecker,Michael Karner,Daniel Watzenig,Alois Knoll
机构: SETLabs Research GmbH (SETLabs Research GmbH); Chair of Robotics, Artificial Intelligence and Real-time Systems, Technical University of Munich (慕尼黑工业大学机器人、人工智能与实时系统讲席); Institute of Visual Computing, Graz University of Technology (格拉茨技术大学视觉计算研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for the Data-Driven Learning for Intelligent Vehicle Applications Workshop at the 36th IEEE Intelligent Vehicles Symposium (IV) 2025

点击查看摘要

Abstract:Railway systems, particularly in Germany, require high levels of automation to address legacy infrastructure challenges and increase train traffic safely. A key component of automation is robust long-range perception, essential for early hazard detection, such as obstacles at level crossings or pedestrians on tracks. Unlike automotive systems with braking distances of ~70 meters, trains require perception ranges exceeding 1 km. This paper presents a deep-learning-based approach for long-range 3D object detection tailored for autonomous trains. The method relies solely on monocular images, inspired by the Faraway-Frustum approach, and incorporates LiDAR data during training to improve depth estimation. The proposed pipeline consists of four key modules: (1) a modified YOLOv9 for 2.5D object detection, (2) a depth estimation network, and (3-4) dedicated short- and long-range 3D detection heads. Evaluations on the OSDaR23 dataset demonstrate the effectiveness of the approach in detecting objects up to 250 meters. Results highlight its potential for railway automation and outline areas for future improvement.
zh
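该流水线中“2D 检测 + 深度估计 → 3D 位置”的几何直觉,可用针孔相机反投影来示意:给定检测框中心 (u, v)、估计深度 d 与内参矩阵 K,即可恢复相机坐标系下的 3D 位置(内参数值为虚构):

```python
import numpy as np

def lift_to_3d(u, v, depth, K):
    """把 2D 检测框中心与估计深度反投影为相机坐标系下的 3D 点
    (frustum 式提升的基本步骤)。"""
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.array([x, y, depth])

# 假设的针孔内参:fx = fy = 1000,主点 (960, 540)
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
print(lift_to_3d(960.0, 540.0, 250.0, K))  # 光轴上的点:(0, 0, 250)
```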

[CV-32] Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition

【速读】:该论文致力于解决图像意图识别中由于隐式视觉线索的多样性和主观性,以及抽象概念(如“享受生活”)在不同类别内的表达变异性所导致的挑战。现有方法通过手工设计特征或从全局特征构建类别原型来应对这些问题,但难以处理每个意图类别的大范围视觉多样性。论文提出了一种名为多粒度组成视觉线索学习(Multi-grained Compositional visual Clue Learning, MCCL)的新方法,其关键是利用人类认知的系统组成性,将意图识别分解为视觉线索的组合,并整合多粒度特征,同时采用类别特定原型缓解数据不平衡问题,将意图识别视为多标签分类问题并通过图卷积网络注入先验知识。这一方案不仅提升了现有方法的准确性,还具备良好的可解释性。

链接: https://arxiv.org/abs/2504.18201
作者: Yin Tang,Jiankai Li,Hongyu Yang,Xuan Dong,Lifeng Fan,Weixin Li
机构: Beihang University (北京航空航天大学); Beijing Institute for General Artificial Intelligence (北京通用人工智能研究院); Beijing University of Post and Telecommunication (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In an era where social media platforms abound, individuals frequently share images that offer insights into their intents and interests, impacting individual life quality and societal stability. Traditional computer vision tasks, such as object detection and semantic segmentation, focus on concrete visual representations, while intent recognition relies more on implicit visual clues. This poses challenges due to the wide variation and subjectivity of such clues, compounded by the problem of intra-class variety in conveying abstract concepts, e.g. “enjoy life”. Existing methods seek to solve the problem by manually designing representative features or building prototypes for each class from global features. However, these methods still struggle to deal with the large visual diversity of each intent category. In this paper, we introduce a novel approach named Multi-grained Compositional visual Clue Learning (MCCL) to address these challenges for image intent recognition. Our method leverages the systematic compositionality of human cognition by breaking down intent recognition into visual clue composition and integrating multi-grained features. We adopt class-specific prototypes to alleviate data imbalance. We treat intent recognition as a multi-label classification problem, using a graph convolutional network to infuse prior knowledge through label embedding correlations. Demonstrated by a state-of-the-art performance on the Intentonomy and MDID datasets, our approach advances the accuracy of existing methods while also possessing good interpretability. Our work provides an attempt for future explorations in understanding complex and miscellaneous forms of human expression.
zh
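论文通过图卷积网络(GCN)在标签嵌入上注入标签相关性先验。GCN 单层传播的常见形式为 D^{-1/2}(A+I)D^{-1/2} H W,以下为极简示意(邻接矩阵与嵌入维度均为虚构,仅说明传播机制):

```python
import numpy as np

def gcn_layer(A, H, W):
    """一步 GCN 传播:归一化邻接矩阵在相关标签之间扩散信息,
    再做线性变换 + ReLU。"""
    A_hat = A + np.eye(len(A))                    # 加自环
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# 玩具例子:3 个意图标签的共现邻接矩阵,4 维标签嵌入
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))   # 标签嵌入(如来自词向量)
W = rng.normal(size=(4, 4))
out = gcn_layer(A, H, W)
print(out.shape)  # (3, 4)
```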

[CV-33] What is the Added Value of UDA in the VFM Era?

【速读】:该论文旨在解决两个关键问题:(1) 无监督领域适应(UDA)在更具代表性和多样性的数据场景中的表现如何;(2) 视觉基础模型(VFM)仅使用源域自适应能否在这些场景中同样有效。论文的关键解决方案在于评估UDA在合成到真实以及真实到真实的多种源域和目标域数据组合下的性能,并探究少量目标域标注数据对UDA的影响。研究结果表明,当使用更强的合成源数据时,UDA相对于仅源域微调的性能提升从+8 mIoU减少至+2 mIoU;而当使用更丰富的现实源数据时,UDA无明显附加价值。然而,在所有合成数据场景中,UDA的泛化能力始终优于仅源域微调,且在仅利用Cityscapes数据集中1/16标签的情况下,合成UDA仍能达到与全监督模型相同的最佳分割质量(85 mIoU)。

链接: https://arxiv.org/abs/2504.18190
作者: Brunó B. Englert,Tommie Kerssies,Gijs Dubbelman
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) can improve a perception model’s generalization to an unlabeled target domain starting from a labeled source domain. UDA using Vision Foundation Models (VFMs) with synthetic source data can achieve generalization performance comparable to fully-supervised learning with real target data. However, because VFMs have strong generalization from their pre-training, more straightforward, source-only fine-tuning can also perform well on the target. As data scenarios used in academic research are not necessarily representative for real-world applications, it is currently unclear (a) how UDA behaves with more representative and diverse data and (b) if source-only fine-tuning of VFMs can perform equally well in these scenarios. Our research aims to close these gaps and, similar to previous studies, we focus on semantic segmentation as a representative perception task. We assess UDA for synth-to-real and real-to-real use cases with different source and target data combinations. We also investigate the effect of using a small amount of labeled target data in UDA. We clarify that while these scenarios are more realistic, they are not necessarily more challenging. Our results show that, when using stronger synthetic source data, UDA’s improvement over source-only fine-tuning of VFMs reduces from +8 mIoU to +2 mIoU, and when using more diverse real source data, UDA has no added value. However, UDA generalization is always higher in all synthetic data scenarios than source-only fine-tuning and, when including only 1/16 of Cityscapes labels, synthetic UDA obtains the same state-of-the-art segmentation quality of 85 mIoU as a fully-supervised model using all labels. Considering the mixed results, we discuss how UDA can best support robust autonomous driving at scale.
zh

[CV-34] Label-independent hyperparameter-free self-supervised single-view deep subspace clustering

【速读】:该论文旨在解决深度子空间聚类(Deep Subspace Clustering, DSC)算法在广泛应用中存在的五个主要挑战:(1) 聚类质量通常仅基于编码器输出层评估,忽视了中间层中的有价值信息;(2) 大多数DSC方法将表征学习与子空间聚类视为独立任务,限制了其有效性;(3) 假设存在预留的数据集用于超参数调优,但在实际场景中往往不可行;(4) 学习终止通常依赖于聚类误差监控并需要外部标签;(5) 性能常依赖于依赖标注数据的后处理技术。为应对这些局限性,论文提出了一种新的单视图DSC方法,其关键是通过联合表示矩阵最小化逐层自表达损失、优化子空间结构范数以提升聚类质量、采用包含预训练和微调的多阶段顺序学习框架以实现无超参数调优的正则化项使用、引入基于相对误差的自停止机制以在无标签情况下终止训练,以及基于先验知识保留学习表示矩阵中的固定数量的领先系数。

链接: https://arxiv.org/abs/2504.18179
作者: Lovro Sindicic,Ivica Kopriva
机构: Division of Computing and Data Science, Ruđer Bošković Institute (鲁德·博斯科维奇研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 35 pages; 1 figure; 10 Tables

点击查看摘要

Abstract:Deep subspace clustering (DSC) algorithms face several challenges that hinder their widespread adoption across various application domains. First, clustering quality is typically assessed using only the encoder’s output layer, disregarding valuable information present in the intermediate layers. Second, most DSC approaches treat representation learning and subspace clustering as independent tasks, limiting their effectiveness. Third, they assume the availability of a held-out dataset for hyperparameter tuning, which is often impractical in real-world scenarios. Fourth, learning termination is commonly based on clustering error monitoring, requiring external labels. Finally, their performance often depends on post-processing techniques that rely on labeled data. To address these limitations, we introduce a novel single-view DSC approach that: (i) minimizes a layer-wise self-expression loss using a joint representation matrix; (ii) optimizes a subspace-structured norm to enhance clustering quality; (iii) employs a multi-stage sequential learning framework, consisting of pre-training and fine-tuning, enabling the use of multiple regularization terms without hyperparameter tuning; (iv) incorporates a relative error-based self-stopping mechanism to terminate training without labels; and (v) retains a fixed number of leading coefficients in the learned representation matrix based on prior knowledge. We evaluate the proposed method on six datasets representing faces, digits, and objects. The results show that our method outperforms most linear SC algorithms with carefully tuned hyperparameters while maintaining competitive performance with the best-performing linear approaches.
zh
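论文中“逐层自表达损失”的基本构件是经典的自表达模型:X ≈ XC 且 diag(C) = 0,即每个样本由同一子空间内的其他样本线性表示。以下为该损失的极简实现示意(逐层聚合与联合表示矩阵的细节以论文为准):

```python
import numpy as np

def self_expression_loss(X, C, lam=0.1):
    """||X - XC||_F^2 + lam * ||C||_1,要求 diag(C) = 0 以排除平凡自表示。
    X 的每一列为一个样本。"""
    assert np.allclose(np.diag(C), 0), "禁止样本用自身表示自己"
    recon = float(np.sum((X - X @ C) ** 2))
    return recon + lam * float(np.abs(C).sum())

X = np.array([[1.0, 2.0, 1.0],
              [0.0, 0.0, 0.0]])   # 第 0、2 列落在同一条射线上
C = np.zeros((3, 3))
C[2, 0], C[0, 2] = 1.0, 1.0       # 样本 0 与样本 2 互相表示
print(self_expression_loss(X, C, lam=0.0))  # 4.0:只有第 1 列无法被解释
```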

[CV-35] PerfCam: Digital Twinning for Production Lines Using 3D Gaussian Splatting and Vision Models

【速读】:本文旨在解决工业生产线上数字孪生(Digital Twin)技术在实时性能指标(Key Performance Indicators, KPIs)提取与可视化方面的自动化与精确性不足的问题。为实现这一目标,论文提出了一种名为PerfCam的开源概念验证(Proof-of-Concept, PoC)框架,其关键在于结合相机与传感器数据、3D高斯点 splatting 技术以及卷积神经网络(Convolutional Neural Networks, CNNs),通过半自动化的对象跟踪与空间映射方法,生成能够实时捕捉可用性、性能、整体设备效率(Overall Equipment Effectiveness, OEE)及输送带速率等KPIs的精确数字孪生模型。这种方案的核心创新在于利用3D重建与计算机视觉模型,实现了高效且精准的工业生产线数字孪生应用,从而为智能制造业提供实用的运营分析工具。

链接: https://arxiv.org/abs/2504.18165
作者: Michel Gokan Khan,Renan Guarese,Fabian Johnson,Xi Vincent Wang,Anders Bergman,Benjamin Edvinsson,Mario Romero,Jérémy Vachier,Jan Kronqvist
机构: School of Engineering Sciences, KTH Royal Institute of Technology (瑞典皇家理工学院工程科学学院); Digital Futures, KTH Royal Institute of Technology (瑞典皇家理工学院数字未来研究中心); School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology (瑞典皇家理工学院电气工程与计算机科学学院); AstraZeneca (阿斯利康); School of Industrial Engineering and Management, KTH Royal Institute of Technology (瑞典皇家理工学院工业工程与管理学院); Department of Science and Technology, Linköping University (林雪平大学科学与技术系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce PerfCam, an open source Proof-of-Concept (PoC) digital twinning framework that combines camera and sensory data with 3D Gaussian Splatting and computer vision models for digital twinning, object tracking, and Key Performance Indicators (KPIs) extraction in industrial production lines. By utilizing 3D reconstruction and Convolutional Neural Networks (CNNs), PerfCam offers a semi-automated approach to object tracking and spatial mapping, enabling digital twins that capture real-time KPIs such as availability, performance, Overall Equipment Effectiveness (OEE), and rate of conveyor belts in the production line. We validate the effectiveness of PerfCam through a practical deployment within realistic test production lines in the pharmaceutical industry and contribute an openly published dataset to support further research and development in the field. The results demonstrate PerfCam’s ability to deliver actionable insights through its precise digital twin capabilities, underscoring its value as an effective tool for developing usable digital twins in smart manufacturing environments and extracting operational analytics.
zh
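文中提到的 OEE(整体设备效率)是三个经典比率的乘积:可用率 × 性能率 × 质量率。以下用一组虚构数字演示其标准计算方式:

```python
def oee(availability, performance, quality):
    """Overall Equipment Effectiveness:三个 [0, 1] 区间比率的乘积。"""
    return availability * performance * quality

# 虚构示例:计划 480 分钟,停机 60 分钟;理想速率 100 件/分钟;
# 420 分钟实际运行共产出 38000 件,其中 37240 件通过质检
availability = 420 / 480            # 0.875
performance = 38000 / (420 * 100)   # 约 0.905
quality = 37240 / 38000             # 0.98
print(round(oee(availability, performance, quality), 4))  # 0.7758
```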

[CV-36] E-InMeMo: Enhanced Prompting for Visual In-Context Learning

【速读】:该论文旨在解决视觉领域中提示学习(In-Context Learning, ICL)效果依赖于提示质量的问题。传统方法在使用大规模预训练模型进行视觉任务时,通过提供固定的输入-输出图像对(即上下文对)作为提示,但其性能受限于提示的质量。为了解决这一问题,论文提出了一种名为增强指令更多(E-InMeMo)的新方法,其关键在于引入可学习的扰动(learnable perturbations)到上下文对中以优化提示生成过程。实验结果表明,E-InMeMo在前景分割和单对象检测等标准视觉任务中显著优于现有最先进的方法,分别提升了7.99和17.04的mIoU分数,证明了其作为一种轻量且有效的策略来提升视觉ICL性能的价值。

链接: https://arxiv.org/abs/2504.18158
作者: Jiahao Zhang,Bowen Wang,Hong Liu,Liangzhi Li,Yuta Nakashima,Hajime Nagahara
机构: D3 Center, The University of Osaka (大阪大学); SANKEN, The University of Osaka (大阪大学); School of Informatics, Xiamen University (厦门大学); Meetyou AI Lab, Xiamen Meet You Co., Ltd (厦门美柚)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: this https URL
zh

[CV-37] ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding

【速读】:该论文旨在解决人类动作和姿态在视频中的细粒度理解问题,特别是在人本主义多模态理解领域的研究瓶颈。当前大型多模态模型虽然在多种任务上表现良好,但在实现细粒度理解方面存在不足,主要归因于缺乏高质量的精细标注数据。为应对这一挑战,论文的关键解决方案是引入代理任务(proxy tasks),通过利用现有多模态大语言模型(MLLMs)自动生成的数据,减少对昂贵且难以扩展的手动标注的依赖。这些代理任务旨在增强模型在空间和时间维度上的感知能力,从而显著缩小与使用手工标注的细粒度数据所达到的性能差距。

链接: https://arxiv.org/abs/2504.18152
作者: Yi-Xing Peng,Qize Yang,Yu-Ming Tang,Shenghao Fu,Kun-Yu Lin,Xihan Wei,Wei-Shi Zheng
机构: Sun Yat-sen University (中山大学, China); Tongyi Lab, Alibaba Group (阿里云通义实验室); Peng Cheng Laboratory, Shenzhen, China (鹏城实验室, 中国深圳); Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China (教育部机器智能与先进计算重点实验室, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained understanding of human actions and poses in videos is essential for human-centric AI applications. In this work, we introduce ActionArt, a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding. Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios, each accompanied by detailed annotations that meticulously label every limb movement. We develop eight sub-tasks to evaluate the fine-grained understanding capabilities of existing large multimodal models across different dimensions. Experimental results indicate that, while current large multimodal models perform commendably on various tasks, they often fall short in achieving fine-grained understanding. We attribute this limitation to the scarcity of meticulously annotated data, which is both costly and difficult to scale manually. Since manual annotations are costly and hard to scale, we propose proxy tasks to enhance the model perception ability in both spatial and temporal dimensions. These proxy tasks are carefully crafted to be driven by data automatically generated from existing MLLMs, thereby reducing the reliance on costly manual labels. Experimental results show that the proposed proxy tasks significantly narrow the gap toward the performance achieved with manually annotated fine-grained data.
zh

[CV-38] MASF-YOLO: An Improved YOLOv11 Network for Small Object Detection on Drone View

【速读】:本文针对无人机(UAV)视角下的目标检测难题展开研究,主要聚焦于目标像素比例极小、物体尺度变化显著以及背景信息复杂等挑战,这些因素严重限制了无人机技术的实际应用。为了解决这些问题,论文提出了一种名为多尺度上下文聚合与尺度自适应融合 YOLO(Multi-scale Context Aggregation and Scale-adaptive Fusion YOLO, MASF-YOLO)的新网络架构,基于 YOLOv11 开发。

关键解决方案包括:首先设计了一个多尺度特征聚合模块(MFAM),通过并行多尺度卷积和特征融合显著提升了小目标检测的准确性;其次提出了改进的高效多尺度注意力模块(IEMA),通过特征分组、并行子网络和跨空间学习增强对目标区域的关注;第三引入了维度感知选择性集成模块(DASI),通过自适应加权和融合低维与高维特征进一步提升多尺度特征融合能力。实验表明,相比 YOLOv11-s,MASF-YOLO-s 在 VisDrone2019 验证集上的 mAP@0.5 提升了 4.6%,mAP@0.5:0.95 提升了 3.5%;同时,MASF-YOLO-s 的性能超越 YOLOv11-m,而仅需后者约 60% 的参数量和 65% 的计算成本。此外,与最先进的检测器对比显示,MASF-YOLO-s 在检测精度和模型效率方面均具有明显优势。
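MFAM 中"并行多尺度卷积 + 特征融合"的思路可以用如下一维示意代码表示(卷积核尺寸与均值融合方式均为假想简化,并非论文的网络结构):

```python
def conv1d(signal, kernel):
    """同长度("same")零填充一维卷积。"""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal))]

def mfam(signal):
    """并行多尺度卷积后逐点平均融合(MFAM 思想的一维示意)。"""
    branches = [
        conv1d(signal, [1.0]),            # 对应 1x1 分支
        conv1d(signal, [1 / 3] * 3),      # 对应 3x3 分支(均值核)
        conv1d(signal, [0.2] * 5),        # 对应 5x5 分支
    ]
    return [sum(b[i] for b in branches) / len(branches)
            for i in range(len(signal))]

out = mfam([0.0, 1.0, 0.0, 1.0, 0.0])
```

不同感受野的分支对同一位置给出不同响应,融合后同时保留了细粒度与较大尺度的上下文,这正是小目标检测所需要的性质。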

链接: https://arxiv.org/abs/2504.18136
作者: Liugang Lu,Dabin He,Congxiang Liu,Zhixiang Deng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid advancement of Unmanned Aerial Vehicle (UAV) and computer vision technologies, object detection from UAV perspectives has emerged as a prominent research area. However, challenges for detection brought by the extremely small proportion of target pixels, significant scale variations of objects, and complex background information in UAV images have greatly limited the practical applications of UAV. To address these challenges, we propose a novel object detection network Multi-scale Context Aggregation and Scale-adaptive Fusion YOLO (MASF-YOLO), which is developed based on YOLOv11. Firstly, to tackle the difficulty of detecting small objects in UAV images, we design a Multi-scale Feature Aggregation Module (MFAM), which significantly improves the detection accuracy of small objects through parallel multi-scale convolutions and feature fusion. Secondly, to mitigate the interference of background noise, we propose an Improved Efficient Multi-scale Attention Module (IEMA), which enhances the focus on target regions through feature grouping, parallel sub-networks, and cross-spatial learning. Thirdly, we introduce a Dimension-Aware Selective Integration Module (DASI), which further enhances multi-scale feature fusion capabilities by adaptively weighting and fusing low-dimensional features and high-dimensional features. Finally, we conducted extensive performance evaluations of our proposed method on the VisDrone2019 dataset. Compared to YOLOv11-s, MASF-YOLO-s achieves improvements of 4.6% in mAP@0.5 and 3.5% in mAP@0.5:0.95 on the VisDrone2019 validation set. Remarkably, MASF-YOLO-s outperforms YOLOv11-m while requiring only approximately 60% of its parameters and 65% of its computational cost. Furthermore, comparative experiments with state-of-the-art detectors confirm that MASF-YOLO-s maintains a clear competitive advantage in both detection accuracy and model efficiency.
zh

[CV-39] Salient Region-Guided Spacecraft Image Arbitrary-Scale Super-Resolution Network

【速读】:该论文旨在解决航天器图像超分辨率领域中现有任意尺度超分辨率方法存在的问题,这些方法在处理航天器低分辨率图像时,往往因忽视航天器核心区域与大面积黑色背景之间的特征差异,引入无关噪声。为应对这一挑战,论文提出了一种显著区域引导的航天器图像任意尺度超分辨率网络(Salient Region-Guided Spacecraft Image Arbitrary-Scale Super-Resolution Network, SGSASR)。其关键在于利用航天器核心显著区域的特征来指导潜在表示的调制,从而实现任意尺度的超分辨率重建。具体而言,设计了一个航天器核心区域识别块(Spacecraft Core Region Recognition Block, SCRRB),通过预训练的显著性检测模型识别图像中的核心显著区域;同时,提出了自适应加权特征融合增强机制(Adaptive-Weighted Feature Fusion Enhancement Mechanism, AFFEM),通过动态权重参数选择性聚合核心区域特征与一般图像特征,以强化核心显著区域的响应。实验结果表明,所提出的SGSASR方法优于当前最先进的技术。
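AFFEM 的"动态权重加权融合"可以用如下一维示意表示(用 sigmoid 将一个可学习标量变为权重,具体融合形式为假想简化,并非论文实现):

```python
import math

def affem(core_feat, general_feat, logit):
    """AFFEM 思想的一维示意:动态权重参数经 sigmoid 归一化后,
    对核心显著区域特征与一般图像特征做加权融合。"""
    w = 1.0 / (1.0 + math.exp(-logit))   # 动态权重,训练中可学习
    return [w * c + (1.0 - w) * g for c, g in zip(core_feat, general_feat)]

# logit=0 时权重为 0.5,两路特征等权融合;logit 增大则更偏向核心区域特征
fused = affem([1.0, 2.0], [0.0, 0.0], logit=0.0)
```

当核心显著区域的响应更可靠时,网络可把 logit 学得更大,从而抑制大面积黑色背景引入的无关噪声。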

链接: https://arxiv.org/abs/2504.18127
作者: Jingfan Yang,Hu Gao,Ying Zhang,Depeng Dang
机构: School of Artificial Intelligence, Beijing Normal University (人工智能学院, 北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spacecraft image super-resolution seeks to enhance low-resolution spacecraft images into high-resolution ones. Although existing arbitrary-scale super-resolution methods perform well on general images, they tend to overlook the difference in features between the spacecraft core region and the large black space background, introducing irrelevant noise. In this paper, we propose a salient region-guided spacecraft image arbitrary-scale super-resolution network (SGSASR), which uses features from the spacecraft core salient regions to guide latent modulation and achieve arbitrary-scale super-resolution. Specifically, we design a spacecraft core region recognition block (SCRRB) that identifies the core salient regions in spacecraft images using a pre-trained saliency detection model. Furthermore, we present an adaptive-weighted feature fusion enhancement mechanism (AFFEM) to selectively aggregate the spacecraft core region features with general image features by dynamic weight parameter to enhance the response of the core salient regions. Experimental results demonstrate that the proposed SGSASR outperforms state-of-the-art approaches.
zh

[CV-40] Study on Real-Time Road Surface Reconstruction Using Stereo Vision

【速读】:该论文旨在解决实时边缘设备上的路表重建(road surface reconstruction)效率与精度之间的权衡问题。解决方案的关键在于通过优化网络结构和推理流程来提升性能。具体而言,论文提出采用同构全局结构化剪枝(Isomorphic Global Structured Pruning)方法精简立体特征提取主干网络以降低复杂度,同时保持性能;此外,重新设计了头网络,引入优化的沙漏结构(hourglass structure)、动态注意力头(dynamic attention heads)、减少特征通道数、混合精度推理(mixed precision inference)以及高效的概率体计算(efficient probability volume computation)。这些改进显著提升了推理速度,并降低了重建误差,使其适用于自动驾驶中的实时路表重建任务。

链接: https://arxiv.org/abs/2504.18112
作者: Deepak Ghimire,Byoungjun Kim,Donghoon Kim,SungHwan Jeong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Stereo Vision, Efficient CNN, Pruning, Optimization. 2025 Intelligent Information and Control Conference (IICC 2025), Jeonju, Korea

点击查看摘要

Abstract:Road surface reconstruction plays a crucial role in autonomous driving, providing essential information for safe and smooth navigation. This paper enhances the RoadBEV [1] framework for real-time inference on edge devices by optimizing both efficiency and accuracy. To achieve this, we proposed to apply Isomorphic Global Structured Pruning to the stereo feature extraction backbone, reducing network complexity while maintaining performance. Additionally, the head network is redesigned with an optimized hourglass structure, dynamic attention heads, reduced feature channels, mixed precision inference, and efficient probability volume computation. Our approach improves inference speed while achieving lower reconstruction error, making it well-suited for real-time road surface reconstruction in autonomous driving.
zh

[CV-41] Disentangle Identity Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation

【速读】:该论文旨在解决现有Talking Head Generation (THG)方法在生成情感表达生动的人脸图像的同时保持说话者身份方面存在的挑战。具体而言,当前方法存在三个关键局限:未能充分利用音频中的内在情感线索、情感表征中的身份泄露以及情感相关性孤立学习的问题。为了解决这些问题,论文提出了一种名为DICE-Talk的新框架,其核心思想是解耦身份与情感,并使具有相似特征的情感进行协作。关键解决方案包括:开发一种解耦的情感嵌入器,通过跨模态注意力联合建模音视频情感线索,将情感表示为与身份无关的高斯分布;引入带有可学习情感银行的增强型情感条件模块,通过向量量化和基于注意力的特征聚合显式捕捉情感间的相互关系;设计一种情感判别目标,在扩散过程中通过潜在空间分类强制实现情感一致性。实验结果表明,该方法在MEAD和HDTF数据集上的表现优于现有技术,在保持唇同步性能的同时提高了情感准确性,并且能够生成保留身份且具有丰富相关情感表达的逼真人脸图像。
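情感银行(Emotion Bank)中向量量化的查询一步,本质上是把情感特征映射到库中最近的原型向量,可示意如下(情感原型的取值为假想,仅演示最近邻量化这一机制):

```python
def quantize(feature, emotion_bank):
    """向量量化示意:返回与情感特征欧氏距离最近的库条目及其索引。"""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(range(len(emotion_bank)),
               key=lambda i: dist(feature, emotion_bank[i]))
    return best, emotion_bank[best]

# 假想的三个情感原型向量(真实方法中原型是可学习的)
bank = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
idx, code = quantize([0.8, 0.6], bank)
```

量化后的离散索引使相似情感共享原型,配合注意力聚合即可显式建模情感间的相关性。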

链接: https://arxiv.org/abs/2504.18087
作者: Weipeng Tan,Chuming Lin,Chengming Xu,FeiFan Xu,Xiaobin Hu,Xiaozhong Ji,Junwei Zhu,Chengjie Wang,Yanwei Fu
机构: Fudan University (复旦大学); Youtu Lab, Tencent (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2409.03270

点击查看摘要

Abstract:Recent advances in Talking Head Generation (THG) have achieved impressive lip synchronization and visual quality through diffusion models; yet existing methods struggle to generate emotionally expressive portraits while preserving speaker identity. We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio’s inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed as DICE-Talk, following the idea of disentangling identity with emotion, and then cooperating emotions with similar characteristics. First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention, representing emotions as identity-agnostic Gaussian distributions. Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks that explicitly capture inter-emotion relationships through vector quantization and attention-based feature aggregation. Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process through latent-space classification. Extensive experiments on MEAD and HDTF datasets demonstrate our method’s superiority, outperforming state-of-the-art approaches in emotion accuracy while maintaining competitive lip-sync performance. Qualitative results and user studies further confirm our method’s ability to generate identity-preserving portraits with rich, correlated emotional expressions that naturally adapt to unseen identities.
zh

[CV-42] S3MOT: Monocular 3D Object Tracking with Selective State Space Model

【速读】:该论文旨在解决单目3D多目标跟踪(MOT)中的挑战,即如何从2D视频流中有效挖掘三维时空关联。为实现这一目标,论文提出了三项关键技术:(1) 引入匈牙利状态空间模型(Hungarian State Space Model, HSSM),通过全局感受野和动态权重,在线性复杂度下高效完成复杂的跟踪关联决策;(2) 提出全卷积单阶段嵌入(Fully Convolutional One-stage Embedding, FCOE),利用密集特征图直接进行对比学习以消除感兴趣区域(ROI)池化操作,提升在视角变化和光照条件下的目标再识别精度;(3) 借助VeloSSM编码器-解码器架构增强6自由度姿态估计,通过建模速度的时间依赖性捕捉运动动态,克服基于帧的3D推理局限性。这些创新方法的关键在于结合异构线索的高效融合与利用,从而显著提升了单目3D MOT任务的性能。
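作为对照,传统的数据关联通常被建模为最小代价线性分配问题(HSSM 正是针对这类依赖手工设计代价的分配算法提出改进);下面用标准库在小规模代价矩阵上穷举最优匹配,示意这一经典基线:

```python
from itertools import permutations

def min_cost_assignment(cost):
    """穷举求最小总代价的一一匹配(仅适用于小规模矩阵)。"""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost

# 行:已有轨迹;列:当前帧检测;元素为关联代价(数值为假想,越小越可能匹配)
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.3],
]
match, total = min_cost_assignment(cost)
```

穷举的复杂度是阶乘级的(匈牙利算法可降到三次方),而 HSSM 的卖点在于以线性复杂度、全局感受野和可学习权重完成同类分配决策。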

链接: https://arxiv.org/abs/2504.18068
作者: Zhuohao Yan,Shaoquan Feng,Xingxing Li,Yuxuan Zhou,Chunxi Xia,Shengyu Li
机构: School of Geodesy and Geomatics, Wuhan University, China (测绘学院,武汉大学,中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate and reliable multi-object tracking (MOT) in 3D space is essential for advancing robotics and computer vision applications. However, it remains a significant challenge in monocular setups due to the difficulty of mining 3D spatiotemporal associations from 2D video streams. In this work, we present three innovative techniques to enhance the fusion and exploitation of heterogeneous cues for monocular 3D MOT: (1) we introduce the Hungarian State Space Model (HSSM), a novel data association mechanism that compresses contextual tracking cues across multiple paths, enabling efficient and comprehensive assignment decisions with linear complexity. HSSM features a global receptive field and dynamic weights, in contrast to traditional linear assignment algorithms that rely on hand-crafted association costs. (2) We propose Fully Convolutional One-stage Embedding (FCOE), which eliminates ROI pooling by directly using dense feature maps for contrastive learning, thus improving object re-identification accuracy under challenging conditions such as varying viewpoints and lighting. (3) We enhance 6-DoF pose estimation through VeloSSM, an encoder-decoder architecture that models temporal dependencies in velocity to capture motion dynamics, overcoming the limitations of frame-based 3D inference. Experiments on the KITTI public test benchmark demonstrate the effectiveness of our method, achieving a new state-of-the-art performance of 76.86 HOTA at 31 FPS. Our approach outperforms the previous best by significant margins of +2.63 HOTA and +3.62 AssA, showcasing its robustness and efficiency for monocular 3D MOT tasks. The code and models are available at this https URL.
zh

[CV-43] POET: Prompt Offset Tuning for Continual Human Action Adaptation ECCV2024

【速读】:本文旨在解决隐私感知的少样本连续动作识别问题,目标是让用户和开发者能够以个性化的方式向沉浸式计算设备的模型中持续添加新的动作类别,同时确保这一过程无需存储或重放用户的敏感训练数据。论文的关键在于提出了一种名为POET(Prompt-Offset Tuning)的方法,它通过轻量化的主干网络实现了提示调优,而无需依赖大规模预训练的Transformer模型。POET创新性地引入了空间-时间可学习的提示偏移调优方法,并首次将此类提示调优应用于图神经网络。此外,研究还贡献了两个针对新问题设定的动作识别基准数据集:NTU RGB+D用于活动识别,SHREC-2017用于手部手势识别。实验结果表明,POET在多个基准测试中表现出色。

链接: https://arxiv.org/abs/2504.18059
作者: Prachi Garg,Joseph K J,Vineeth N Balasubramanian,Necati Cihan Camgoz,Chengde Wan,Kenrick Kin,Weiguang Si,Shugao Ma,Fernando De La Torre
机构: Carnegie Mellon University (卡内基梅隆大学); Indian Institute of Technology, Hyderabad (印度理工学院海得拉巴校区); Meta Reality Labs (Meta 实景实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2024 (Oral), webpage this https URL

点击查看摘要

Abstract:As extended reality (XR) is redefining how users interact with computing devices, research in human action recognition is gaining prominence. Typically, models deployed on immersive computing devices are static and limited to their default set of classes. The goal of our research is to provide users and developers with the capability to personalize their experience by adding new action classes to their device models continually. Importantly, a user should be able to add new classes in a low-shot and efficient manner, while this process should not require storing or replaying any of user’s sensitive training data. We formalize this problem as privacy-aware few-shot continual action recognition. Towards this end, we propose POET: Prompt-Offset Tuning. While existing prompt tuning approaches have shown great promise for continual learning of image, text, and video modalities; they demand access to extensively pretrained transformers. Breaking away from this assumption, POET demonstrates the efficacy of prompt tuning a significantly lightweight backbone, pretrained exclusively on the base class data. We propose a novel spatio-temporal learnable prompt offset tuning approach, and are the first to apply such prompt tuning to Graph Neural Networks. We contribute two new benchmarks for our new problem setting in human action recognition: (i) NTU RGB+D dataset for activity recognition, and (ii) SHREC-2017 dataset for hand gesture recognition. We find that POET consistently outperforms comprehensive benchmarks. Source code at this https URL.
zh

[CV-44] A BERT-Style Self-Supervised Learning CNN for Disease Identification from Retinal Images

【速读】:该论文旨在解决医疗影像领域深度学习方法对大量标注数据依赖的问题,特别是在高质量标签获取成本高且困难的情况下,如何有效利用未标注数据以提升模型性能。论文的关键解决方案是结合轻量级CNN框架(nn-MobileNet)与BERT风格的自监督学习方法,通过在UK Biobank提供的未标注视网膜眼底图像上进行预训练,从而显著提高下游任务(如阿尔茨海默病、帕金森病及多种视网膜疾病识别)的表现。这种方法充分利用了CNN在处理大规模未标注数据方面的潜力,同时规避了Vision Transformer因高计算复杂度和缺乏定位特性带来的局限性。
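BERT 风格自监督预训练的核心目标——遮挡部分图像块、仅在被遮挡位置计算重建误差——可用如下纯 Python 代码示意(其中用"可见块均值"充当预测模型是假想简化,论文中由 nn-MobileNet 完成预测):

```python
import random

def masked_reconstruction_loss(patches, predict, mask_ratio=0.5, seed=0):
    """随机遮挡部分图像块,仅在遮挡位置计算重建 MSE。"""
    rng = random.Random(seed)
    n = len(patches)
    masked_idx = rng.sample(range(n), int(n * mask_ratio))
    # 被遮挡位置置为 None,模拟掩码输入
    visible = [p if i not in masked_idx else None
               for i, p in enumerate(patches)]
    preds = predict(visible)
    errs = [(preds[i] - patches[i]) ** 2 for i in masked_idx]
    return sum(errs) / len(errs)

def mean_predictor(visible):
    """假想的"模型":用可见块均值填补所有遮挡位置。"""
    seen = [v for v in visible if v is not None]
    fill = sum(seen) / len(seen)
    return [v if v is not None else fill for v in visible]

loss = masked_reconstruction_loss([0.0, 0.2, 0.4, 0.6], mean_predictor)
```

这一目标不需要任何疾病标签,因而可以直接在 UK Biobank 的大规模未标注眼底图像上预训练。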

链接: https://arxiv.org/abs/2504.18049
作者: Xin Li,Wenhui Zhu,Peijie Qiu,Oana M. Dumitrascu,Amal Youssef,Yalin Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the field of medical imaging, the advent of deep learning, especially the application of convolutional neural networks (CNNs) has revolutionized the analysis and interpretation of medical images. Nevertheless, deep learning methods usually rely on large amounts of labeled data. In medical imaging research, the acquisition of high-quality labels is both expensive and difficult. The introduction of Vision Transformers (ViT) and self-supervised learning provides a pre-training strategy that utilizes abundant unlabeled data, effectively alleviating the label acquisition challenge while broadening the breadth of data utilization. However, ViT’s high computational density and substantial demand for computing power, coupled with the lack of localization characteristics of its operations on image patches, limit its efficiency and applicability in many application scenarios. In this study, we employ nn-MobileNet, a lightweight CNN framework, to implement a BERT-style self-supervised learning approach. We pre-train the network on the unlabeled retinal fundus images from the UK Biobank to improve downstream application performance. We validate the results of the pre-trained model on Alzheimer’s disease (AD), Parkinson’s disease (PD), and various retinal diseases identification. The results show that our approach can significantly improve performance in the downstream tasks. In summary, this study combines the benefits of CNNs with the capabilities of advanced self-supervised learning in handling large-scale unlabeled data, demonstrating the potential of CNNs in the presence of label scarcity.
zh

[CV-45] DMS-Net:Dual-Modal Multi-Scale Siamese Network for Binocular Fundus Image Classification

【速读】:该论文旨在解决眼科疾病诊断中传统方法及现有单眼深度学习方法难以有效捕捉双眼病理相关性的问题。为应对这一挑战,论文提出了一种名为DMS-Net的双模态多尺度孪生网络(Siamese Network)用于双眼眼底图像分类。该框架的关键在于利用权重共享的孪生ResNet-152主干网络从配对的眼底图像中提取深层语义特征,并通过引入多尺度上下文感知模块(MSCAM)解决病变边界模糊和病理分布分散等问题,该模块结合自适应池化与注意力机制实现多分辨率特征聚合。此外,双模态特征融合(DMFF)模块通过空间-语义重校准和双向注意力机制增强跨模态交互,有效整合全局上下文与局部边缘特征,从而显著提升对对称性病理检测的能力以及临床决策支持效果。

链接: https://arxiv.org/abs/2504.18046
作者: Guohao Huo,Zibo Lin,Zitong Wang,Ruiting Dai,Hao Tang
机构: School of Information and Software Engineering, University of Electronic Science and Technology of China (电子科技大学信息与软件工程学院); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ophthalmic diseases pose a significant global health challenge, yet traditional diagnosis methods and existing single-eye deep learning approaches often fail to account for binocular pathological correlations. To address this, we propose DMS-Net, a dual-modal multi-scale Siamese network for binocular fundus image classification. Our framework leverages weight-shared Siamese ResNet-152 backbones to extract deep semantic features from paired fundus images. To tackle challenges such as lesion boundary ambiguity and scattered pathological distributions, we introduce a Multi-Scale Context-Aware Module (MSCAM) that integrates adaptive pooling and attention mechanisms for multi-resolution feature aggregation. Additionally, a Dual-Modal Feature Fusion (DMFF) module enhances cross-modal interaction through spatial-semantic recalibration and bidirectional attention, effectively combining global context and local edge features. Evaluated on the ODIR-5K dataset, DMS-Net achieves state-of-the-art performance with 80.5% accuracy, 86.1% recall, and 83.8% Cohen’s kappa, demonstrating superior capability in detecting symmetric pathologies and advancing clinical decision-making for ocular diseases.
zh

[CV-46] Cabbage: A Differential Growth Framework for Open Surfaces

【速读】:该论文试图解决自然界中三维开放曲面(如花瓣卷曲)的屈曲行为建模问题。解决方案的关键在于提出了一种名为Cabbage的微分生长框架,其通过边细分实现离散分辨率的差异化增加(Cabbage-Shell),并结合特征感知的平滑与重网格化确保高质量三角网格生成,同时利用校正碰撞有效防止自碰撞。此外,论文还提供了近似替代方案Cabbage-Collision,并实现了可直接用于计算机辅助设计(CAD)的表面生成,从而在形态表达能力、网格质量以及长时间稳定模拟复杂模式方面超越现有技术。

链接: https://arxiv.org/abs/2504.18040
作者: Xiaoyi Liu,Hao Tang
机构: Washington University in St. Louis (圣路易斯华盛顿大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose Cabbage, a differential growth framework to model buckling behavior in 3D open surfaces found in nature, like the curling of flower petals. Cabbage creates high-quality triangular meshes free of self-intersection. Cabbage-Shell is driven by edge subdivision which differentially increases discretization resolution. Shell forces expands the surface, generating buckling over time. Feature-aware smoothing and remeshing ensures mesh quality. Corrective collision effectively prevents self-collision even in tight spaces. We additionally provide Cabbage-Collision, an approximate alternative, followed by CAD-ready surface generation. Cabbage is the first open-source effort with this calibre and robustness, outperforming SOTA methods in its morphological expressiveness, mesh quality, and stably generates large, complex patterns over hundreds of simulation steps. It is a source not only of computational modeling, digital fabrication, education, but also high-quality, annotated data for geometry processing and shape analysis.
zh

[CV-47] Enhancing Privacy-Utility Trade-offs to Mitigate Memorization in Diffusion Models CVPR2025

【速读】:该论文旨在解决文本到图像扩散模型在生成高度符合用户提示的图像时,因倾向于记忆训练集图像而导致生成图像原创性下降及隐私泄露的问题,这可能引发模型所有者和用户的法律风险,特别是当记忆的图像包含专有内容时。此外,虽然已有方法尝试缓解这些问题,但往往以显著降低输出效用为代价,如通过文本对齐分数体现。

论文的关键解决方案在于提出了一种名为PRSS的新方法,它通过结合提示重锚定(Prompt Re-anchoring, PR)来改进扩散模型中的无分类器引导(classifier-free guidance)方法,从而提升隐私保护,并引入语义提示搜索(Semantic Prompt Search, SS)以增强输出效用。实验表明,PRSS在不同隐私水平下均能有效改善隐私与效用之间的权衡,达到了新的技术水平。
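无分类器引导的标准形式是 eps = eps_uncond + w·(eps_cond − eps_uncond);提示重锚定(PR)的直观效果可以理解为把其中的无条件项替换为一个改写后的锚提示的预测。下面用标量数值示意这一公式(具体数值与"锚提示替换无条件项"的写法均为本文的假想简化,并非论文的精确算法):

```python
def cfg(eps_uncond, eps_cond, w):
    """标准无分类器引导:从无条件预测向条件预测方向外推。"""
    return eps_uncond + w * (eps_cond - eps_uncond)

def cfg_with_reanchor(eps_anchor, eps_cond, w):
    """提示重锚定示意:用锚提示的预测替换无条件项,
    使引导方向偏离可能被记忆的训练图像。"""
    return eps_anchor + w * (eps_cond - eps_anchor)

# 标量代替噪声预测张量,数值为假想
out_std = cfg(0.10, 0.30, w=2.0)
out_pr = cfg_with_reanchor(0.16, 0.30, w=2.0)
```

两个函数的外推结构完全相同,区别只在锚点的选取,这也说明 PR 可以作为对现有引导流程的轻量替换。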

链接: https://arxiv.org/abs/2504.18032
作者: Chen Chen,Daochang Liu,Mubarak Shah,Chang Xu
机构: School of Computer Science, Faculty of Engineering, The University of Sydney (悉尼大学); School of Physics, Mathematics and Computing, The University of Western Australia (西澳大利亚大学); Center for Research in Computer Vision, University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:Text-to-image diffusion models have demonstrated remarkable capabilities in creating images highly aligned with user prompts, yet their proclivity for memorizing training set images has sparked concerns about the originality of the generated images and privacy issues, potentially leading to legal complications for both model owners and users, particularly when the memorized images contain proprietary content. Although methods to mitigate these issues have been suggested, enhancing privacy often results in a significant decrease in the utility of the outputs, as indicated by text-alignment scores. To bridge the research gap, we introduce a novel method, PRSS, which refines the classifier-free guidance approach in diffusion models by integrating prompt re-anchoring (PR) to improve privacy and incorporating semantic prompt search (SS) to enhance utility. Extensive experiments across various privacy levels demonstrate that our approach consistently improves the privacy-utility trade-off, establishing a new state-of-the-art.
zh

[CV-48] A Large Vision-Language Model based Environment Perception System for Visually Impaired People IROS2024

【速读】:该论文旨在解决视障人士感知复杂自然场景周围环境的挑战性任务,这一难题限制了他们的个人和社会活动。论文提出的关键解决方案是基于大型视觉-语言模型(Large Vision-Language Model, LVLM)构建了一个环境感知系统。该系统通过可穿戴设备捕捉当前场景,并允许用户通过设备获取分析结果。其关键创新在于结合RGB图像分割结果作为外部知识输入到LVLM中,以减少模型的幻觉现象(hallucination),从而提高描述场景的准确性。实验结果表明,与Qwen-VL-Chat相比,该系统在POPE、MME和LLaVA-QA90数据集上的表现更为精准,初步探索还显示该系统能够有效帮助视障人士感知周围环境。
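把分割结果作为外部知识拼入 LVLM 输入的做法,可用如下提示构造代码示意(提示模板、函数名与示例标签均为假想,并非论文原文):

```python
def build_prompt(user_query, seg_labels):
    """将分割模型输出的物体类别去重后拼入 LVLM 提示,
    作为外部知识约束模型输出以减少幻觉。"""
    knowledge = "、".join(sorted(set(seg_labels)))
    return (f"已知图像分割检测到以下物体:{knowledge}。"
            f"请仅基于图像与上述物体列表回答:{user_query}")

prompt = build_prompt("前方有什么障碍物?",
                      ["person", "car", "person", "tree"])
```

由于提示显式列出了分割确认存在的物体,LVLM 更难凭空描述图像中不存在的内容。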

链接: https://arxiv.org/abs/2504.18027
作者: Zezhou Chen,Zhaoxiang Liu,Kai Wang,Kohou Wang,Shiguo Lian
机构: AI Innovation Center, China Unicom (中国联合网络通信有限公司人工智能创新中心); Unicom Digital Technology, China Unicom (中国联合网络通信有限公司数字科技有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted by IROS2024(9 pages, 8 figures)

点击查看摘要

Abstract:It is a challenging task for visually impaired people to perceive their surrounding environment due to the complexity of the natural scenes. Their personal and social activities are thus highly limited. This paper introduces a Large Vision-Language Model(LVLM) based environment perception system which helps them to better understand the surrounding environment, by capturing the current scene they face with a wearable device, and then letting them retrieve the analysis results through the device. The visually impaired people could acquire a global description of the scene by long pressing the screen to activate the LVLM output, retrieve the categories of the objects in the scene resulting from a segmentation model by tapping or swiping the screen, and get a detailed description of the objects they are interested in by double-tapping the screen. To help visually impaired people more accurately perceive the world, this paper proposes incorporating the segmentation result of the RGB image as external knowledge into the input of LVLM to reduce the LVLM’s hallucination. Technical experiments on POPE, MME and LLaVA-QA90 show that the system could provide a more accurate description of the scene compared to Qwen-VL-Chat, exploratory experiments show that the system helps visually impaired people to perceive the surrounding environment effectively.
zh

[CV-49] ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification

【速读】:该论文旨在解决跨模态可见光-红外行人再识别(VIReID)中的模态差异性和身份特征复杂性带来的挑战。现有方法仅依赖身份标签监督,难以充分提取高层语义信息;而引入视觉-语言预训练模型的方法虽通过生成文本描述增强语义建模,但未能显式建模关键的体型特征。为此,论文提出了一种有效的体型感知文本对齐(Body Shape-aware Textual Alignment, BSaTa)框架,其关键是设计了一个体型文本对齐(Body Shape Textual Alignment, BSTA)模块,利用人体解析模型提取体型信息并通过CLIP转换为结构化文本表示,并进一步通过文本-视觉一致性正则化(Text-Visual Consistency Regularizer, TVCR)确保体型文本表示与视觉体型特征的一致性。此外,引入形状感知表征学习(Shape-aware Representation Learning, SRL)机制,结合多文本监督与分布一致性约束,引导视觉编码器学习模态不变且具有判别性的身份特征。实验结果表明,该方法在SYSU-MM01和RegDB数据集上取得了优异性能,验证了其有效性。

链接: https://arxiv.org/abs/2504.18025
作者: Shuanglin Yan,Neng Dong,Shuang Li,Rui Yan,Hao Tang,Jing Qin
机构: Nanjing University of Science and Technology (南京理工大学); Chongqing University of Posts and Telecommunications (重庆邮电大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visible-Infrared Person Re-identification (VIReID) aims to match visible and infrared pedestrian images, but the modality differences and the complexity of identity features make it challenging. Existing methods rely solely on identity label supervision, which makes it difficult to fully extract high-level semantic information. Recently, vision-language pre-trained models have been introduced to VIReID, enhancing semantic information modeling by generating textual descriptions. However, such methods do not explicitly model body shape features, which are crucial for cross-modal matching. To address this, we propose an effective Body Shape-aware Textual Alignment (BSaTa) framework that explicitly models and utilizes body shape information to improve VIReID performance. Specifically, we design a Body Shape Textual Alignment (BSTA) module that extracts body shape information using a human parsing model and converts it into structured text representations via CLIP. We also design a Text-Visual Consistency Regularizer (TVCR) to ensure alignment between body shape textual representations and visual body shape features. Furthermore, we introduce a Shape-aware Representation Learning (SRL) mechanism that combines Multi-text Supervision and Distribution Consistency Constraints to guide the visual encoder to learn modality-invariant and discriminative identity features, thus enhancing modality invariance. Experimental results demonstrate that our method achieves superior performance on the SYSU-MM01 and RegDB datasets, validating its effectiveness.
zh

[CV-50] Federated Client-tailored Adapter for Medical Image Segmentation

【速读】:该论文旨在解决在分布式医疗数据孤岛场景下,传统集中式学习方法无法适用以及联邦学习在客户端领域异构(包括分布多样性与类别不平衡)情况下训练稳定性差的问题。论文提出了一种名为Federated Client-tailored Adapter (FCA) 的新框架,通过在预训练的医疗基础模型中引入联邦适配器来稳定联邦训练过程,并设计了两种自适应更新策略,将适配器分解为通用组件和特定客户端组件,分别对全局不变单元和特定客户端单元的参数进行独立更新。这种关键设计不仅提升了联邦学习的稳定性,还实现了针对各客户端最优而非全局折中分割模型的目标。实验结果验证了FCA框架在联邦医学图像分割任务中的有效性和优越性。
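FCA 中"通用分量全局平均、个性分量本地保留"的更新策略可以用如下极简代码示意(参数用字典表示、仅做一轮聚合,均为对论文机制的假想简化):

```python
def federated_update(clients):
    """FCA 更新策略示意:common 参数做全局平均后下发,
    individual 参数各客户端独立保留,不参与聚合。"""
    n = len(clients)
    keys = clients[0]["common"].keys()
    global_common = {k: sum(c["common"][k] for c in clients) / n
                     for k in keys}
    for c in clients:
        c["common"] = dict(global_common)   # 下发全局通用参数
        # c["individual"] 保持本地不变,承载客户端特有的分布/类别信息
    return clients

clients = [
    {"common": {"w": 1.0}, "individual": {"b": 0.1}},
    {"common": {"w": 3.0}, "individual": {"b": 0.9}},
]
clients = federated_update(clients)
```

这样每个客户端得到的模型是"全局共识 + 本地个性"的组合,对应论文中客户端最优而非全局折中的目标。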

链接: https://arxiv.org/abs/2504.18020
作者: Guyue Hu,Siyuan Song,Yukun Kang,Zhu Yin,Gangming Zhao,Chenglong Li,Jin Tang
机构: Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui Provincial Key Laboratory of Security Artificial Intelligence, and School of Artificial Intelligence, Anhui University, Hefei, 230601, China (合肥工业大学智能计算与信号处理教育部重点实验室, 安徽省人工智能安全性重点实验室, 安徽大学人工智能学院, 中国安徽省合肥市, 邮编: 230601); School of Internet, Anhui University, Hefei, 230601, China (安徽大学互联网学院, 中国安徽省合肥市, 邮编: 230601); Department of Computer Science, The University of Hong Kong, Hong Kong (香港大学计算机科学系, 中国香港)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image segmentation in X-ray images is beneficial for computer-aided diagnosis and lesion localization. Existing methods mainly fall into a centralized learning paradigm, which is inapplicable in the practical medical scenario that only has access to distributed data islands. Federated Learning has the potential to offer a distributed solution but struggles with heavy training instability due to client-wise domain heterogeneity (including distribution diversity and class imbalance). In this paper, we propose a novel Federated Client-tailored Adapter (FCA) framework for medical image segmentation, which achieves stable and client-tailored adaptive segmentation without sharing sensitive local data. Specifically, the federated adapter stirs universal knowledge in off-the-shelf medical foundation models to stabilize the federated training process. In addition, we develop two client-tailored federated updating strategies that adaptively decompose the adapter into common and individual components, then globally and independently update the parameter groups associated with common client-invariant and individual client-specific units, respectively. They further stabilize the heterogeneous federated learning process and realize optimal client-tailored instead of sub-optimal global-compromised segmentation models. Extensive experiments on three large-scale datasets demonstrate the effectiveness and superiority of the proposed FCA framework for federated medical segmentation.
zh

[CV-51] Diffusion-Driven Universal Model Inversion Attack for Face Recognition

【速读】:本文旨在解决面部识别系统中的隐私保护问题,特别是模型反演攻击对隐私的威胁。现有方法通常需要为每个目标模型训练特定的生成器,这在计算上非常昂贵。为此,论文提出了一种名为DiffUMI的新方法,这是一种无需训练的基于扩散驱动的通用模型反演攻击技术。DiffUMI的关键创新在于首次将扩散模型应用于模型反演中的无条件图像生成,使其成为一种通用方法,无需针对每个目标模型单独训练生成器。通过在固定的框架和预训练的扩散模型内运行,DiffUMI能够无缝适应不同的目标身份和模型,从而以最先进的成功率突破所谓的隐私保护面部识别系统的防御,并实现高效且高保真的面部重建。此外,论文还展示了如何利用模型反演技术进行域外检测(OODD),这是首次仅基于嵌入向量区分人脸输入与非人脸输入的应用。

链接: https://arxiv.org/abs/2504.18015
作者: Hanrui Wang,Shuo Wang,Chun-Shien Lu,Isao Echizen
机构: National Institute of Informatics (日本国立情报学研究所); Shanghai Jiao Tong University (上海交通大学); Academia Sinica (中央研究院)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Facial recognition technology poses significant privacy risks, as it relies on biometric data that is inherently sensitive and immutable if compromised. To mitigate these concerns, face recognition systems convert raw images into embeddings, traditionally considered privacy-preserving. However, model inversion attacks pose a significant privacy threat by reconstructing these private facial images, making them a crucial tool for evaluating the privacy risks of face recognition systems. Existing methods usually require training individual generators for each target model, a computationally expensive process. In this paper, we propose DiffUMI, a training-free diffusion-driven universal model inversion attack for face recognition systems. DiffUMI is the first approach to apply a diffusion model for unconditional image generation in model inversion. Unlike other methods, DiffUMI is universal, eliminating the need for training target-specific generators. It operates within a fixed framework and pretrained diffusion model while seamlessly adapting to diverse target identities and models. DiffUMI breaches privacy-preserving face recognition systems with state-of-the-art success, demonstrating that an unconditional diffusion model, coupled with optimized adversarial search, enables efficient and high-fidelity facial reconstruction. Additionally, we introduce a novel application of out-of-domain detection (OODD), marking the first use of model inversion to distinguish non-face inputs from face inputs based solely on embeddings.
zh

[CV-52] Back to Fundamentals: Low-Level Visual Features Guided Progressive Token Pruning

【速读】:该论文旨在解决视觉Transformer (Vision Transformer, ViT) 在语义分割任务中计算开销过大、难以部署于资源受限设备的问题;现有基于token剪枝的方法往往未能充分考虑视觉数据的基本特性。论文提出了一种名为"LVTP"的渐进式token剪枝框架,其关键在于以多尺度Tsallis熵与低级视觉特征为引导、配合两次聚类,融合高级语义信息与基础视觉属性以实现精确分割;同时引入基于多尺度Tsallis熵加权的新型动态评分机制,以克服传统单参数熵的局限。此外,该框架通过分析低级特征保留关键边缘信息,在优化计算成本的同时无需改动网络架构或额外训练,可作为即插即用模块使用。实验表明,LVTP在多个数据集上实现了20%-45%的计算成本降低,且性能损失可忽略不计,在复杂边缘区域的表现尤为突出。

链接: https://arxiv.org/abs/2504.17996
作者: Yuanbing Ouyang,Yizhuo Liang,Qingpeng Li,Xinfei Guo,Yiming Luo,Di Wu,Hao Wang,Yushan Pan
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource-constrained devices. Existing token pruning methods often overlook fundamental visual data characteristics. This study introduces ‘LVTP’, a progressive token pruning framework guided by multi-scale Tsallis entropy and low-level visual features with twice clustering. It integrates high-level semantics and basic visual attributes for precise segmentation. A novel dynamic scoring mechanism using multi-scale Tsallis entropy weighting overcomes limitations of traditional single-parameter entropy. The framework also incorporates low-level feature analysis to preserve critical edge information while optimizing computational cost. As a plug-and-play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show 20%-45% computational reductions with negligible performance loss, outperforming existing methods in balancing cost and accuracy, especially in complex edge regions.
zh
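摘要中的 Tsallis 熵有闭式定义:S_q(p) = (1 − Σᵢ pᵢ^q)/(q − 1),当 q→1 时退化为 Shannon 熵。下面给出一个纯 Python 的最小示意;其中多尺度加权评分函数 `multiscale_token_score` 及各参数名均为本文假设的示例命名,并非论文 LVTP 的实际实现:

```python
import math

def tsallis_entropy(probs, q):
    """Tsallis 熵: S_q(p) = (1 - sum_i p_i^q) / (q - 1); q -> 1 时退化为 Shannon 熵"""
    if abs(q - 1.0) < 1e-8:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

def multiscale_token_score(scale_probs, qs, weights):
    """对每个尺度上的 token 注意力分布计算 Tsallis 熵并加权求和, 作为保留/剪枝得分(示意)"""
    assert len(scale_probs) == len(qs) == len(weights)
    return sum(w * tsallis_entropy(p, q)
               for p, q, w in zip(scale_probs, qs, weights))
```

例如对 4 类均匀分布,q=2 时 S_2 = (1 − 4×0.25²)/1 = 0.75;通过为不同尺度选取不同的 q 与权重即可得到"多尺度加权"的效果。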

[CV-53] RSRNav: Reasoning Spatial Relationship for Image-Goal Navigation

【速读】:该论文旨在解决图像目标导航(ImageNav)中的两个主要挑战:(1) 目标语义特征难以提供精确的方向信息,导致冗余动作;(2) 当训练与应用之间的视角不一致时,性能显著下降。为了解决这些问题,论文提出了一种名为RSRNav的方法,其关键是通过构建目标与当前观测之间的空间关系相关性,并利用细粒度的交叉相关性和方向感知的相关性逐步优化这些相关性,从而为导航策略网络提供更精准的导航指导。这种基于空间关系推理的方案显著提升了导航性能,特别是在用户匹配目标场景下表现出色,展现了其在实际应用中的潜力。

链接: https://arxiv.org/abs/2504.17991
作者: Zheng Qin,Le Wang,Yabing Wang,Sanping Zhou,Gang Hua,Wei Tang
机构: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence (人机混合增强智能国家重点实验室), National Engineering Research Center for Visual Information and Applications (视觉信息处理与应用国家工程研究中心), Institute of Artificial Intelligence and Robotics (人工智能与机器人研究所), Xi’an Jiaotong University (西安交通大学), Xi’an, Shaanxi 710049, China; Dolby Laboratories (杜比实验室), Bellevue, WA 98004, USA; Department of Computer Science (计算机科学系), University of Illinois (伊利诺伊大学), Chicago, IL 60607, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Recent image-goal navigation (ImageNav) methods learn a perception-action policy by separately capturing semantic features of the goal and egocentric images, then passing them to a policy network. However, challenges remain: (1) Semantic features often fail to provide accurate directional information, leading to superfluous actions, and (2) performance drops significantly when viewpoint inconsistencies arise between training and application. To address these challenges, we propose RSRNav, a simple yet effective method that reasons spatial relationships between the goal and current observations as navigation guidance. Specifically, we model the spatial relationship by constructing correlations between the goal and current observations, which are then passed to the policy network for action prediction. These correlations are progressively refined using fine-grained cross-correlation and direction-aware correlation for more precise navigation. Extensive evaluation of RSRNav on three benchmark datasets demonstrates superior navigation performance, particularly in the “user-matched goal” setting, highlighting its potential for real-world applications.
zh
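摘要所述"构建目标与当前观测之间的相关性",在最简情形下就是两组特征向量之间的余弦相似度矩阵。以下为纯 Python 示意,函数名与输入形状均为本文假设;论文实际使用的细粒度互相关与方向感知相关要复杂得多:

```python
def correlation_volume(goal_feats, obs_feats):
    """目标特征与观测特征之间的余弦相似度矩阵 (len(goal) x len(obs)), 仅为示意"""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(a):
        n = dot(a, a) ** 0.5
        return n if n > 0 else 1.0  # 防止零向量除零

    return [[dot(g, o) / (norm(g) * norm(o)) for o in obs_feats]
            for g in goal_feats]
```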

[CV-54] From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval

【速读】:本文旨在解决零样本 (Zero-Shot, ZS) 组合图像检索 (Composed Image Retrieval, CIR) 中存在的三个关键挑战:(1) 伪词token (pseudo-word token) 表示能力不足;(2) 训练与推理阶段之间的不一致性;(3) 对大规模合成数据的依赖。为了解决这些问题,论文提出一种"从映射到组合"的两阶段训练框架。其关键在于:第一阶段通过引入视觉语义注入模块和软文本对齐目标增强图像到伪词token的学习,使伪词token能够捕获更丰富且细粒度的图像信息;第二阶段利用少量合成三元组数据优化文本编码器,使其结合伪词token与修改文本提取组合语义,以实现精确的目标图像检索。第一阶段建立的强视觉到伪词映射为第二阶段提供了坚实基础,使得方法兼容高质量和低质量的合成数据,并仅需少量合成数据即可显著提升性能。

链接: https://arxiv.org/abs/2504.17990
作者: Yabing Wang,Zhuotao Tian,Qingpei Guo,Zheng Qin,Sanping Zhou,Ming Yang,Le Wang
机构: Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University (西安交通大学人工智能与机器人研究所); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学(深圳)); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text. Due to the high cost of annotating CIR triplet datasets, zero-shot (ZS) CIR has gained traction as a promising alternative. Existing studies mainly focus on projection-based methods, which map an image to a single pseudo-word token. However, these methods face three critical challenges: (1) insufficient pseudo-word token representation capacity, (2) discrepancies between training and inference phases, and (3) reliance on large-scale synthetic data. To address these issues, we propose a two-stage framework where the training is accomplished from mapping to composing. In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module and a soft text alignment objective, enabling the token to capture richer and fine-grained image information. In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to effectively extract compositional semantics by combining pseudo-word tokens with modification text for accurate target image retrieval. The strong visual-to-pseudo mapping established in the first stage provides a solid foundation for the second stage, making our approach compatible with both high- and low-quality synthetic data, and capable of achieving significant performance gains with only a small amount of synthetic data. Extensive experiments were conducted on three public datasets, achieving superior performance compared to existing approaches.
zh
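"从映射到组合"的直观含义是:先把参考图像映射为一个伪词 token (pseudo-word token),再与修改文本拼成完整查询。下面的字符串级示意只演示"组合"这一步;真实系统中伪词对应的是文本编码器词表外的一个可学习嵌入而非普通单词,模板与函数名均为本文假设:

```python
def build_composed_query(pseudo_word, modification_text,
                         template="a photo of {} that {}"):
    """将伪词占位符与修改文本组合为检索查询文本(示意)"""
    return template.format(pseudo_word, modification_text.strip())
```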

[CV-55] VR-GS: Inverse Volume Rendering for Explorable Visualization via Editable 3D Gaussian Splatting

【速读】:该论文旨在解决体积可视化中传统方法因固定预设的转移函数(Transfer Function, TF)而限制用户交互探索的问题,同时降低大尺度体积数据实时渲染所需的硬件需求。论文提出的解决方案是基于高斯点阵化的逆向体积渲染(inverse volume rendering via Gaussian splatting, iVR-GS),其关键是组合多个与基本TF关联的iVR-GS模型以覆盖整个体积场景的可见部分,并利用每个基本模型中的三维可编辑高斯点(Gaussian)支持实时场景渲染和编辑。这种方法不仅降低了渲染成本,还实现了场景编辑功能,从而提升了用户交互探索的能力。

链接: https://arxiv.org/abs/2504.17954
作者: Kaiyuan Tang,Siyuan Yao,Chaoli Wang
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by IEEE Transactions on Visualization and Computer Graphics (TVCG)

点击查看摘要

Abstract:In volume visualization, users can interactively explore the three-dimensional data by specifying color and opacity mappings in the transfer function (TF) or adjusting lighting parameters, facilitating meaningful interpretation of the underlying structure. However, rendering large-scale volumes demands powerful GPUs and high-speed memory access for real-time performance. While existing novel view synthesis (NVS) methods offer faster rendering speeds with lower hardware requirements, the visible parts of a reconstructed scene are fixed and constrained by preset TF settings, significantly limiting user exploration. This paper introduces inverse volume rendering via Gaussian splatting (iVR-GS), an innovative NVS method that reduces the rendering cost while enabling scene editing for interactive volume exploration. Specifically, we compose multiple iVR-GS models associated with basic TFs covering disjoint visible parts to make the entire volumetric scene visible. Each basic model contains a collection of 3D editable Gaussians, where each Gaussian is a 3D spatial point that supports real-time scene rendering and editing. We demonstrate the superior reconstruction quality and composability of iVR-GS against other NVS solutions (Plenoxels, CCNeRF, and base 3DGS) on various volume datasets. The code is available at this https URL.
zh

[CV-56] Masked strategies for images with small objects

【速读】:该论文旨在解决小尺寸血细胞成分的检测与分类难题,尤其是当这些对象在图像中仅占像素级大小、且被大量相似对象包围时。传统深度学习方法(如基于预训练权重的残差网络和视觉变换器)在处理超出其学习表示域的图像时,性能往往难以令人满意。为克服这一局限,论文提出利用自监督模型来学习表征,并将所学权重应用于下游任务。然而,现有方法(如掩码自动编码器,MAE)在对象尺寸小于掩码尺寸时会因全局上下文信息丢失而几乎无法重建图像。为此,论文的关键创新在于通过调整掩码比例和分块大小优化掩码自动编码器的重建能力,并结合ViT编码器提取局部和全局上下文信息,进一步利用预训练权重提升小尺寸血细胞成分的语义分割效果。实验结果表明,较小的掩码比例和分块大小显著提升了图像重建质量,并在语义分割任务中有效改善了小尺寸目标的分类性能。

链接: https://arxiv.org/abs/2504.17935
作者: H. Martin Gillis,Ming Hill,Paul Hollensen,Alan Fine,Thomas Trappenberg
机构: Faculty of Computer Science, Dalhousie University (达尔豪西大学); Department of Computer Science, Boston University (波士顿大学); Alentic Microscience Inc. (Alentic 微科学公司); Department of Physiology and Biophysics, Dalhousie University (达尔豪西大学); School of Biomedical Engineering, Dalhousie University (达尔豪西大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The hematology analytics used for detection and classification of small blood components is a significant challenge. In particular, when objects exists as small pixel-sized entities in a large context of similar objects. Deep learning approaches using supervised models with pre-trained weights, such as residual networks and vision transformers have demonstrated success for many applications. Unfortunately, when applied to images outside the domain of learned representations, these methods often result with less than acceptable performance. A strategy to overcome this can be achieved by using self-supervised models, where representations are learned and weights are then applied for downstream applications. Recently, masked autoencoders have proven to be effective to obtain representations that captures global context information. By masking regions of an image and having the model learn to reconstruct both the masked and non-masked regions, weights can be used for various applications. However, if the sizes of the objects in images are less than the size of the mask, the global context information is lost, making it almost impossible to reconstruct the image. In this study, we investigated the effect of mask ratios and patch sizes for blood components using a MAE to obtain learned ViT encoder representations. We then applied the encoder weights to train a U-Net Transformer for semantic segmentation to obtain both local and global contextual information. Our experimental results demonstrates that both smaller mask ratios and patch sizes improve the reconstruction of images using a MAE. We also show the results of semantic segmentation with and without pre-trained weights, where smaller-sized blood components benefited with pre-training. Overall, our proposed method offers an efficient and effective strategy for the segmentation and classification of small objects.
zh
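摘要讨论的掩码比例与分块大小,可以用 MAE 风格的随机掩码来直观理解:ViT 先把图像切成 N 个 patch,再按给定比例随机遮挡其中一部分。以下纯 Python 示意仅处理索引划分,函数名为本文假设:

```python
import random

def random_patch_mask(num_patches, mask_ratio, seed=None):
    """按 mask_ratio 随机划分被遮挡 / 可见 patch 的索引 (MAE 风格, 示意)"""
    rng = random.Random(seed)
    idx = list(range(num_patches))
    rng.shuffle(idx)
    n_mask = int(round(num_patches * mask_ratio))
    return sorted(idx[:n_mask]), sorted(idx[n_mask:])
```

例如 224×224 图像按 16×16 分块得到 196 个 patch,mask_ratio=0.75 时遮挡 147 个;当目标物体小于单个被遮挡的块时,可见 patch 中就几乎不含该物体的信息,这正是摘要指出的失效场景。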

[CV-57] DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing

【速读】:该论文旨在解决通过扩散模型进行文本引导的图像编辑所引发的图像安全问题,特别是当攻击者利用这些工具对用户图像进行恶意编辑时。现有的防御方法尝试在像素空间中添加有限噪声来干扰扩散模型的功能,但这些方法引入的对抗噪声容易被肉眼察觉,并且在可行的像素预算下对JPEG等净化技术缺乏鲁棒性。

论文的关键解决方案在于提出了一种新颖的优化方法,直接在频率域中引入对抗扰动,具体是通过修改输入图像的离散余弦变换(Discrete Cosine Transform, DCT)系数实现。利用JPEG压缩的处理流程,该方法生成的对抗图像能够有效防止恶意图像编辑。实验结果表明,与现有方法相比,该方法在引入较少视觉伪影的同时,保持了相似的编辑防护能力,并增强了对噪声净化技术的鲁棒性。

链接: https://arxiv.org/abs/2504.17894
作者: Aniruddha Bala,Rohit Chowdhury,Rohan Jaiswal,Siddharth Roheda
机构: Samsung R&D Institute, Bangalore (三星研究院,班加罗尔)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advancements in diffusion models have enabled effortless image editing via text prompts, raising concerns about image security. Attackers with access to user images can exploit these tools for malicious edits. Recent defenses attempt to protect images by adding a limited noise in the pixel space to disrupt the functioning of diffusion-based editing models. However, the adversarial noise added by previous methods is easily noticeable to the human eye. Moreover, most of these methods are not robust to purification techniques like JPEG compression under a feasible pixel budget. We propose a novel optimization approach that introduces adversarial perturbations directly in the frequency domain by modifying the Discrete Cosine Transform (DCT) coefficients of the input image. By leveraging the JPEG pipeline, our method generates adversarial images that effectively prevent malicious image editing. Extensive experiments across a variety of tasks and datasets demonstrate that our approach introduces fewer visual artifacts while maintaining similar levels of edit protection and robustness to noise purification techniques.
zh
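在频率域(DCT 系数)上加扰动的流程大致是:正变换 → 修改系数 → 逆变换。下面用一维 DCT-II 及其逆变换给出纯 Python 示意;实际方法作用于 JPEG 流水线中的 8×8 二维块,且扰动由对抗优化得到,此处仅手工修改一个系数,函数名为本文假设:

```python
import math

def dct2(x):
    """一维 DCT-II: X_k = sum_n x_n * cos(pi * (n + 0.5) * k / N)"""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
            for k in range(N)]

def idct2(X):
    """DCT-II 的逆变换, 满足 idct2(dct2(x)) ≈ x"""
    N = len(X)
    return [X[0] / N + (2.0 / N) * sum(
                X[k] * math.cos(math.pi * (n + 0.5) * k / N) for k in range(1, N))
            for n in range(N)]

def perturb_dct(x, k, eps):
    """在第 k 个 DCT 系数上加 eps 后重建信号(示意)"""
    X = dct2(x)
    X[k] += eps
    return idct2(X)
```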

[CV-58] Set Phasers to Stun: Beaming Power and Control to Mobile Robots with Laser Light IROS2025

【速读】:本文旨在解决移动机器人在多米级范围内同时实现无线功率传输与通信的问题。解决方案的关键在于Phaser系统的设计:它结合了立体视觉追踪与高功率光束转向技术,并通过半自动校准流程实现两者融合;此外,Phaser复用激光光束作为数据通道,提出了一种低功耗的光学通信方案。这一设计不仅支持超过110 mW/cm²的光功率密度和无错误的数据传输,而且板载解码仅消耗0.3 mA电流(较蓝牙低功耗减少97%),从而有效提升了无电池机器人的运行速度和导航能力。

链接: https://arxiv.org/abs/2504.17865
作者: Charles J. Carver,Hadleigh Schwartz,Toma Itagaki,Zachary Englhardt,Kechen Liu,Megan Graciela Nauli Manik,Chun-Cheng Chang,Vikram Iyer,Brian Plancher,Xia Zhou
机构: Department of Computer Science, Columbia University (哥伦比亚大学); Lincoln Laboratory, Massachusetts Institute of Technology (麻省理工学院); Paul G. Allen School of Computer Science & Engineering, University of Washington (华盛顿大学); Barnard College, Columbia University (巴纳德学院, 哥伦比亚大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures, submitted to IROS 2025

点击查看摘要

Abstract:We present Phaser, a flexible system that directs narrow-beam laser light to moving robots for concurrent wireless power delivery and communication. We design a semi-automatic calibration procedure to enable fusion of stereo-vision-based 3D robot tracking with high-power beam steering, and a low-power optical communication scheme that reuses the laser light as a data channel. We fabricate a Phaser prototype using off-the-shelf hardware and evaluate its performance with battery-free autonomous robots. Phaser delivers optical power densities of over 110 mW/cm ^2 and error-free data to mobile robots at multi-meter ranges, with on-board decoding drawing 0.3 mA (97% less current than Bluetooth Low Energy). We demonstrate Phaser fully powering gram-scale battery-free robots to nearly 2x higher speeds than prior work while simultaneously controlling them to navigate around obstacles and along paths. Code, an open-source design guide, and a demonstration video of Phaser is available at this https URL.
zh

[CV-59] Fine-Tuning Adversarially-Robust Transformers for Single-Image Dehazing

【速读】:该论文旨在解决单图像去雾模型在面对对抗性噪声时鲁棒性不足的问题,特别是分析了最先进的图像到图像去雾Transformer对微小扰动的敏感性:即使仅改动1个像素,也可使峰值信噪比(PSNR)下降多达2.8 dB。论文的关键解决方案是提出两种轻量级微调策略,用于增强预训练Transformer模型的抗干扰能力。这些方法在保持清洁数据性能的同时,大幅提高了模型对对抗性数据的防御能力,并展示了其在两个遥感场景中的适用性,特别是在处理分布外数据时的稳健行为。

链接: https://arxiv.org/abs/2504.17829
作者: Vlad Vasilescu,Ana Neacsu,Daniela Faur
机构: CAMPUS Research Institute (CAMPUS 研究所); Politehnica Bucharest (布加勒斯特理工大学), Romania
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Single-image dehazing is an important topic in remote sensing applications, enhancing the quality of acquired images and increasing object detection precision. However, the reliability of such structures has not been sufficiently analyzed, which poses them to the risk of imperceptible perturbations that can significantly hinder their performance. In this work, we show that state-of-the-art image-to-image dehazing transformers are susceptible to adversarial noise, with even 1 pixel change being able to decrease the PSNR by as much as 2.8 dB. Next, we propose two lightweight fine-tuning strategies aimed at increasing the robustness of pre-trained transformers. Our methods results in comparable clean performance, while significantly increasing the protection against adversarial data. We further present their applicability in two remote sensing scenarios, showcasing their robust behavior for out-of-distribution data. The source code for adversarial fine-tuning and attack algorithms can be found at this http URL.
zh
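摘要中"单像素改动即可使 PSNR 下降 2.8 dB"所依赖的度量定义为 PSNR = 10·log₁₀(MAX²/MSE)。下面是纯 Python 的计算示意,函数与输入格式(二维像素列表)为本文假设:

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """峰值信噪比: PSNR = 10 * log10(MAX^2 / MSE); 两图完全相同时为 +inf"""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)
```

例如在 2×2 图像中把单个像素从 0 改为 255,MSE = 255²/4,PSNR = 10·log₁₀(4) ≈ 6.02 dB;图像越大,同样的单像素改动对 PSNR 的影响越小,这也反衬出摘要所述对抗扰动的威力。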

[CV-60] VEU-Bench: Towards Comprehensive Understanding of Video Editing CVPR2025

【速读】:该论文试图解决视频编辑理解(Video Editing Understanding, VEU)任务中现有Video Large Language Models (Vid-LLMs) 能力不足的问题。具体而言,尽管Vid-LLMs在通用视频理解任务上取得了显著进展,但它们在捕捉视频编辑组件及其复杂语义方面的表现尚未被充分探索。为填补这一空白,论文提出了VEU-Bench,一个涵盖从帧内特征到跨镜头属性的全面基准,包含识别、推理和判断三个阶段的19个细粒度任务。解决方案的关键在于开发了一个名为Oscars的专家模型,该模型基于精心设计的VEU-Bench数据集进行微调,显著提升了Vid-LLMs在VEU任务上的性能,相比现有开源模型提升了超过28.3%的准确率,并在多个推理任务中实现了与商用模型(如GPT-4o)相当的表现。此外,通过将VEU数据融入Vid-LLMs,还显著提高了其在通用视频理解任务中的性能,平均提升了8.3%。

链接: https://arxiv.org/abs/2504.17828
作者: Bozheng Li,Yongliang Wu,Yi Lu,Jiashuo Yu,Licheng Tang,Jiawang Cao,Wenqing Zhu,Yuyang Sun,Jay Wu,Wenbo Zhu
机构: Opus AI Research (Opus AI 研究院); Brown University (布朗大学); Southeast University (东南大学); University of Toronto (多伦多大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR2025

点击查看摘要

Abstract:Widely shared videos on the internet are often edited. Recently, although Video Large Language Models (Vid-LLMs) have made great progress in general video understanding tasks, their capabilities in video editing understanding (VEU) tasks remain unexplored. To address this gap, in this paper, we introduce VEU-Bench (Video Editing Understanding Benchmark), a comprehensive benchmark that categorizes video editing components across various dimensions, from intra-frame features like shot size to inter-shot attributes such as cut types and transitions. Unlike previous video editing understanding benchmarks that focus mainly on editing element classification, VEU-Bench encompasses 19 fine-grained tasks across three stages: recognition, reasoning, and judging. To enhance the annotation of VEU automatically, we built an annotation pipeline integrated with an ontology-based knowledge base. Through extensive experiments with 11 state-of-the-art Vid-LLMs, our findings reveal that current Vid-LLMs face significant challenges in VEU tasks, with some performing worse than random choice. To alleviate this issue, we develop Oscars, a VEU expert model fine-tuned on the curated VEU-Bench dataset. It outperforms existing open-source Vid-LLMs on VEU-Bench by over 28.3% in accuracy and achieves performance comparable to commercial models like GPT-4o. We also demonstrate that incorporating VEU data significantly enhances the performance of Vid-LLMs on general video understanding benchmarks, with an average improvement of 8.3% across nine reasoning tasks.
zh

[CV-61] FashionM3: Multimodal Multitask and Multiround Fashion Assistant based on Unified Vision-Language Model

【速读】:该论文旨在解决时尚个性化推荐及多轮交互场景下的语义理解与匹配问题,同时满足用户对多样化穿搭建议的需求。为实现这一目标,论文提出的关键解决方案是FashionM3,这是一种基于视觉-语言模型(Vision-Language Model, VLM)微调的多模态、多任务、多轮次时尚助手。其核心在于通过在FashionRec数据集上的微调,使系统能够提供上下文相关的个性化建议,并结合迭代优化机制,在基本推荐、个性化推荐及替代方案推荐等任务中实现高效且精准的服务。

链接: https://arxiv.org/abs/2504.17826
作者: Kaicheng Pang,Xingxing Zou,Waikeung Wong
机构: Laboratory for Artificial Intelligence in Design (人工智能设计实验室), Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fashion styling and personalized recommendations are pivotal in modern retail, contributing substantial economic value in the fashion industry. With the advent of vision-language models (VLM), new opportunities have emerged to enhance retailing through natural language and visual interactions. This work proposes FashionM3, a multimodal, multitask, and multiround fashion assistant, built upon a VLM fine-tuned for fashion-specific tasks. It helps users discover satisfying outfits by offering multiple capabilities including personalized recommendation, alternative suggestion, product image generation, and virtual try-on simulation. Fine-tuned on the novel FashionRec dataset, comprising 331,124 multimodal dialogue samples across basic, personalized, and alternative recommendation tasks, FashionM3 delivers contextually personalized suggestions with iterative refinement through multiround interactions. Quantitative and qualitative evaluations, alongside user studies, demonstrate FashionM3’s superior performance in recommendation effectiveness and practical value as a fashion assistant.
zh

[CV-62] Dual Prompting Image Restoration with Diffusion Transformers CVPR2025

【速读】:该论文试图解决现有最先进的图像恢复方法在高质量图像恢复方面受限的问题。尽管基于U-Net骨干网络的隐变量扩散模型(Latent Diffusion Models)已取得显著进展,但其能力仍有局限性。论文提出DPIR(Dual Prompting Image Restoration),一种通过多视角有效提取低质量图像条件信息的新方法作为解决方案。关键在于引入了包含轻量级模块的低质量图像条件分支与提供额外视觉提示的双提示控制分支组成的双分支架构:前者高效地将图像先验知识融入扩散变换器(Diffusion Transformers, DiTs),后者设计了一种双提示模块,结合全局上下文和局部外观的视觉线索,与文本提示共同形成双重提示,显著提升了图像恢复的质量。

链接: https://arxiv.org/abs/2504.17825
作者: Dehong Kong,Fan Li,Zhixin Wang,Jiaqi Xu,Renjing Pei,Wenbo Li,WenQi Ren
机构: School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区网络科学与技术学院); MoE Key Laboratory of Information Technology (教育部信息技术重点实验室); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); The Chinese University of Hong Kong (香港中文大学); Guangdong Provincial Key Laboratory of Information Security Technology (广东省信息安全技术重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR2025

点击查看摘要

Abstract:Recent state-of-the-art image restoration methods mostly adopt latent diffusion models with U-Net backbones, yet still facing challenges in achieving high-quality restoration due to their limited capabilities. Diffusion transformers (DiTs), like SD3, are emerging as a promising alternative because of their better quality with scalability. In this paper, we introduce DPIR (Dual Prompting Image Restoration), a novel image restoration method that effectively extracts conditional information of low-quality images from multiple perspectives. Specifically, DPIR consists of two branches: a low-quality image conditioning branch and a dual prompting control branch. The first branch utilizes a lightweight module to incorporate image priors into the DiT with high efficiency. More importantly, we believe that in image restoration, textual description alone cannot fully capture its rich visual characteristics. Therefore, a dual prompting module is designed to provide DiT with additional visual cues, capturing both global context and local appearance. The extracted global-local visual prompts as extra conditional control, alongside textual prompts to form dual prompts, greatly enhance the quality of the restoration. Extensive experimental results demonstrate that DPIR delivers superior image restoration performance.
zh

[CV-63] A multi-scale vision transformer-based multimodal GeoAI model for mapping Arctic permafrost thaw

【速读】:该论文旨在解决北极地区退化融滑坡(Retrogressive Thaw Slumps, RTS)在遥感影像中的精确检测与制图问题,其核心挑战在于RTS相对其他地貌特征尺度小、边界模糊且时空变化显著。为应对这些挑战,论文的关键解决方案是采用先进的深度学习模型——基于多尺度视觉变换器主干网络的级联掩膜区域卷积神经网络(Cascade Mask R-CNN),并通过引入两种创新策略优化多模态学习:一是特征级残差跨模态注意力融合策略,用于有效整合多模态数据的互补信息以提升模型对复杂模式与关系的理解能力;二是先进行单模态预训练再进行多模态微调的方法,以降低计算需求同时保持高性能。实验结果表明,该方法在多模态数据利用效率方面优于现有采用数据级融合、特征级卷积融合及多种注意力融合策略的模型。

链接: https://arxiv.org/abs/2504.17822
作者: Wenwen Li,Chia-Yu Hsu,Sizhe Wang,Zhining Gu,Yili Yang,Brendan M. Rogers,Anna Liljedahl
机构: School of Geographical Sciences and Urban Planning (地理科学与城市规划学院), Arizona State University (亚利桑那州立大学); Woodwell Climate Research Center (伍德韦尔气候研究中心), Falmouth, MA, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrogressive Thaw Slumps (RTS) in Arctic regions are distinct permafrost landforms with significant environmental impacts. Mapping these RTS is crucial because their appearance serves as a clear indication of permafrost thaw. However, their small scale compared to other landform features, vague boundaries, and spatiotemporal variation pose significant challenges for accurate detection. In this paper, we employed a state-of-the-art deep learning model, the Cascade Mask R-CNN with a multi-scale vision transformer-based backbone, to delineate RTS features across the Arctic. Two new strategies were introduced to optimize multimodal learning and enhance the model’s predictive performance: (1) a feature-level, residual cross-modality attention fusion strategy, which effectively integrates feature maps from multiple modalities to capture complementary information and improve the model’s ability to understand complex patterns and relationships within the data; (2) pre-trained unimodal learning followed by multimodal fine-tuning to alleviate high computing demand while achieving strong model performance. Experimental results demonstrated that our approach outperformed existing models adopting data-level fusion, feature-level convolutional fusion, and various attention fusion strategies, providing valuable insights into the efficient utilization of multimodal data for RTS mapping. This research contributes to our understanding of permafrost landforms and their environmental implications.
zh
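"特征级残差跨模态注意力融合"的最小形式可以写成 out = q + softmax(q·Kᵀ/√d)·V,其中 query 来自一种模态、key/value 来自另一种模态,残差连接保证原模态信息不丢失。下面是单 query、单头的纯 Python 示意;命名为本文假设,论文实际作用于整幅特征图:

```python
import math

def softmax(xs):
    """数值稳定的 softmax"""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def residual_cross_attention(query, keys, values):
    """单 query 残差跨模态注意力: out = query + softmax(q*K/sqrt(d)) @ V (示意)"""
    d = len(query)
    scores = softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                      for key in keys])
    attended = [sum(w * v[i] for w, v in zip(scores, values))
                for i in range(len(values[0]))]
    return [q + a for q, a in zip(query, attended)]
```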

[CV-64] Learning Underwater Active Perception in Simulation

【速读】:该论文旨在解决水下自主检测任务中因水质条件(如浊度和背向散射)导致的图像质量下降问题,这些问题会严重影响视觉感知和操作精度。论文的关键在于提出了一种基于主动感知框架的高效解决方案,通过训练一个多层感知机 (Multi-Layer Perceptron, MLP),使其能够根据目标距离和人造光源强度预测图像质量。为构建训练数据集,研究者修改了建模软件 Blender,以更准确地模拟不同浊度和背向散射条件下的水下光传播特性,并生成包含十种不同类型水体的大规模合成数据集。实验验证表明,该方法在仿真环境中显著提升了视觉覆盖范围和图像质量,优于传统方法。

链接: https://arxiv.org/abs/2504.17817
作者: Alexandre Cardaillac,Donald G. Dansereau
机构: Australian Centre for Robotics, School of Aerospace, Mechanical and Mechatronic Engineering, The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:When employing underwater vehicles for the autonomous inspection of assets, it is crucial to consider and assess the water conditions. Indeed, they have a significant impact on the visibility, which also affects robotic operations. Turbidity can jeopardise the whole mission as it may prevent correct visual documentation of the inspected structures. Previous works have introduced methods to adapt to turbidity and backscattering, however, they also include manoeuvring and setup constraints. We propose a simple yet efficient approach to enable high-quality image acquisition of assets in a broad range of water conditions. This active perception framework includes a multi-layer perceptron (MLP) trained to predict image quality given a distance to a target and artificial light intensity. We generated a large synthetic dataset including ten water types with different levels of turbidity and backscattering. For this, we modified the modelling software Blender to better account for the underwater light propagation properties. We validated the approach in simulation and showed significant improvements in visual coverage and quality of imagery compared to traditional approaches. The project code is available on our project page at this https URL.
zh

[CV-65] Subject-driven Video Generation via Disentangled Identity and Motion

【速读】:该论文旨在解决在零样本(zero-shot)条件下通过解耦主体特定学习与时间动态来训练个性化视频生成模型的问题。传统无调参(tuning-free)的视频定制方法通常依赖于大规模标注视频数据集,这些数据集计算成本高昂且需要大量人工标注。为克服这些问题,论文提出了一种创新方法,利用图像定制数据集直接训练视频定制模型,并将视频定制分解为两个部分:(1) 通过图像定制数据集注入身份信息;(2) 借助少量未标注视频通过图像到视频的训练方法保留时间建模。此外,在图像到视频的微调过程中采用随机图像令牌丢弃与随机图像初始化以缓解复制粘贴问题。为进一步增强学习效果,引入了在主体特定特征和时间特征联合优化中的随机切换机制,减轻灾难性遗忘现象。关键在于将主体身份注入与时间动态建模分离,并结合图像定制数据集与少量未标注视频实现高效、可扩展的视频定制,从而在零样本设置下超越现有方法,验证了所提框架的有效性。

链接: https://arxiv.org/abs/2504.17816
作者: Daneul Kim,Jingxu Zhang,Wonjoon Jin,Sunghyun Cho,Qi Dai,Jaesik Park,Chong Luo
机构: Seoul National University (首尔国立大学); Microsoft Research Asia (微软亚洲研究院); POSTECH (浦项科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Project Page : this https URL

点击查看摘要

Abstract:We propose to train a subject-driven customized video generation model through decoupling the subject-specific learning from temporal dynamics in zero-shot without additional tuning. A traditional method for video customization that is tuning-free often relies on large, annotated video datasets, which are computationally expensive and require extensive annotation. In contrast to the previous approach, we introduce the use of an image customization dataset directly on training video customization models, factorizing the video customization into two folds: (1) identity injection through image customization dataset and (2) temporal modeling preservation with a small set of unannotated videos through the image-to-video training method. Additionally, we employ random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce stochastic switching during joint optimization of subject-specific and temporal features, mitigating catastrophic forgetting. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings, demonstrating the effectiveness of our framework.
zh
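摘要提到的"随机图像 token 丢弃"可以写成一个简单的过滤:微调时以概率 drop_prob 独立地丢弃每个图像条件 token,使模型无法逐 token 照搬参考图,从而缓解复制粘贴问题。纯 Python 示意,函数名为本文假设:

```python
import random

def drop_image_tokens(tokens, drop_prob, rng=None):
    """以 drop_prob 独立地丢弃每个图像条件 token (示意)"""
    rng = rng or random.Random(0)
    return [t for t in tokens if rng.random() >= drop_prob]
```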

[CV-66] Visibility-Uncertainty-guided 3D Gaussian Inpainting via Scene Conceptional Learning ICCV

【速读】:该论文旨在解决3D场景中基于多视角输入的高保真三维高斯点云(3D Gaussian Splatting, 3DGS)修复问题,即通过3D Gaussian Inpainting (3DGI) 替换被遮挡或掩蔽的目标物体,并确保新生成的内容与周围环境无缝融合。不同于二维图像修复,3DGI 的挑战在于如何有效地利用来自多个输入视角的互补视觉和语义线索,特别是当某一视角中的遮挡区域在其他视角中可能可见时。

解决方案的关键在于引入一种方法,通过测量不同输入视角下三维点的可见性不确定性(VISibility-uncerTainty),以此引导3DGI 利用这些互补视觉线索。此外,论文还利用不确定性学习场景中未被掩蔽对象的语义概念(scene conceptuAl learning),并通过扩散模型(diffusion model)基于该概念填充输入图像中的掩蔽目标。最终,论文提出了一个名为VISTA 的新型框架,将基于可见性不确定性的引导修复与场景语义学习相结合,生成高质量的3DGS 模型,实现无伪影且自然的新型视图合成。此外,该方法进一步扩展到处理由时间动态变化引起的干扰物,增强了其在多样化场景重建中的适用性。
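
可见性不确定性的一个最小化示意:若以各输入视角下三维点的可见与否来估计"可见概率",则其伯努利方差可作为一种不确定性度量(这只是示意性替代,论文中的不确定性定义更为复杂):

```python
import numpy as np

def visibility_uncertainty(vis: np.ndarray) -> np.ndarray:
    """vis: (视角数, 点数) 的布尔可见性矩阵。

    返回每个三维点的不确定性,取值范围 [0, 0.25]:即观测可见频率 p
    的伯努利方差 p * (1 - p)。在所有视角下一致可见(或一致不可见)的点
    不确定性为 0;只在部分视角可见的点不确定性高。
    仅作为论文中可见性不确定性度量的玩具替代。
    """
    p = vis.mean(axis=0)
    return p * (1.0 - p)

vis = np.array([[1, 1, 0],
                [1, 0, 0],
                [1, 1, 0]], dtype=bool)
u = visibility_uncertainty(vis)
```

这类逐点不确定性随后可用于加权各视角提供的视觉线索:不确定的区域更依赖语义先验与扩散模型补全。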

链接: https://arxiv.org/abs/2504.17815
作者: Mingxuan Cui,Qing Guo,Yuyi Wang,Hongkai Yu,Di Lin,Qin Zou,Ming-Ming Cheng,Xi Li
机构: Zhejiang University; CFAR and IHPC, A*STAR (新加坡科技研究局), Singapore; CRRC Zhuzhou Institute & Tengen Intelligence Institute; Cleveland State University; Tianjin University; Wuhan University; Nankai University; Zhejiang University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 12 figures, ICCV

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful and efficient 3D representation for novel view synthesis. This paper extends 3DGS capabilities to inpainting, where masked objects in a scene are replaced with new contents that blend seamlessly with the surroundings. Unlike 2D image inpainting, 3D Gaussian inpainting (3DGI) is challenging in effectively leveraging complementary visual and semantic cues from multiple input views, as occluded areas in one view may be visible in others. To address this, we propose a method that measures the visibility uncertainties of 3D points across different input views and uses them to guide 3DGI in utilizing complementary visual cues. We also employ uncertainties to learn a semantic concept of scene without the masked object and use a diffusion model to fill masked objects in input images based on the learned concept. Finally, we build a novel 3DGI framework, VISTA, by integrating VISibility-uncerTainty-guided 3DGI with scene conceptuAl learning. VISTA generates high-quality 3DGS models capable of synthesizing artifact-free and naturally inpainted novel views. Furthermore, our approach extends to handling dynamic distractors arising from temporal object changes, enhancing its versatility in diverse scene reconstruction scenarios. We demonstrate the superior performance of our method over state-of-the-art techniques using two challenging datasets: the SPIn-NeRF dataset, featuring 10 diverse static 3D inpainting scenes, and an underwater 3D inpainting dataset derived from UTB180, including fast-moving fish as inpainting targets.
zh

[CV-67] CLOC: Contrastive Learning for Ordinal Classification with Multi-Margin N-pair Loss CVPR2025

【速读】:该论文试图解决传统序数分类方法未能充分考虑相邻类别误分类后果差异性的问题。现有方法通常将所有相邻类别视为同等重要,而忽略了不同类别边界的重要程度差异(如临床决策中的关键阈值)。为了解决这一局限性,论文提出了一种基于边界的对比学习方法CLOC(Contrastive Learning for Ordinal Classification),其关键是引入了一种新颖的多边距n元损失函数(multi-margin n-pair loss, MMNP)。通过优化多个边距,CLOC能够在关键相邻类别间构建灵活的决策边界,实现平滑的类别过渡,并降低过拟合训练数据中偏差的风险。实验结果表明,CLOC在多种真实图像数据集及模拟临床偏倚的合成数据集上优于现有方法,并展示了其在学习符合临床与实际需求的有意义有序表示方面的可解释性和可控性。
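
多边距思想可用如下玩具化的合页损失(hinge loss)示意:为不同类别边界分配不同的边距,关键边界使用更大的边距。注意这只是对"多边距"思想的示意,并非论文中 MMNP 损失的精确形式:

```python
import numpy as np

def multi_margin_hinge(d_pos: float, d_neg: np.ndarray, margins: np.ndarray) -> float:
    """玩具多边距合页损失:锚点到自身类别原型的距离 d_pos 应比到
    其他各类别原型的距离 d_neg[k] 至少小 margins[k]。临床上关键的
    边界可分配更大的边距。仅为多边距思想的示意。
    """
    return float(np.maximum(0.0, margins - (d_neg - d_pos)).sum())

d_pos = 0.2
d_neg = np.array([0.5, 1.5])      # 到另外两个类别原型的距离
margins = np.array([0.5, 1.0])    # 关键边界使用更大的边距
loss = multi_margin_hinge(d_pos, d_neg, margins)
```

示例中第一个边界未满足边距要求(0.5 - 0.2 < 0.5),产生正的损失;第二个边界已满足,不产生损失。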

链接: https://arxiv.org/abs/2504.17813
作者: Dileepa Pitawela,Gustavo Carneiro,Hsiang-Ting Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVPR 2025

点击查看摘要

Abstract:In ordinal classification, misclassifying neighboring ranks is common, yet the consequences of these errors are not the same. For example, misclassifying benign tumor categories is less consequential, compared to an error at the pre-cancerous to cancerous threshold, which could profoundly influence treatment choices. Despite this, existing ordinal classification methods do not account for the varying importance of these margins, treating all neighboring classes as equally significant. To address this limitation, we propose CLOC, a new margin-based contrastive learning method for ordinal classification that learns an ordered representation based on the optimization of multiple margins with a novel multi-margin n-pair loss (MMNP). CLOC enables flexible decision boundaries across key adjacent categories, facilitating smooth transitions between classes and reducing the risk of overfitting to biases present in the training data. We provide empirical discussion regarding the properties of MMNP and show experimental results on five real-world image datasets (Adience, Historical Colour Image Dating, Knee Osteoarthritis, Indian Diabetic Retinopathy Image, and Breast Carcinoma Subtyping) and one synthetic dataset simulating clinical decision bias. Our results demonstrate that CLOC outperforms existing ordinal classification methods and show the interpretability and controllability of CLOC in learning meaningful, ordered representations that align with clinical and practical needs.
zh

[CV-68] Object Learning and Robust 3D Reconstruction

【速读】:该论文致力于解决无监督条件下从图像中解析出感兴趣对象(foreground objects)的问题,特别是二维(2D)场景中区分前景对象与背景的挑战。其关键包含两部分:其一,提出名为FlowCapsules的方法,利用运动作为线索识别二维场景中的兴趣对象;其二,在三维(3D)应用中,利用场景的几何一致性检测动态不一致的对象,实现感兴趣对象的检测与去除。在此基础上,论文设计了瞬态对象掩模(transient object masks),用于构建鲁棒的优化核,从而改进非专业拍摄(casual capture)设置下的三维建模。核心在于结合运动线索与几何一致性实现无监督的显式对象表示,并探索无需人工标注即可定义兴趣对象的新方向。

链接: https://arxiv.org/abs/2504.17812
作者: Sara Sabour
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: PhD Thesis

点击查看摘要

Abstract:In this thesis we discuss architectural designs and training methods for a neural network to have the ability of dissecting an image into objects of interest without supervision. The main challenge in 2D unsupervised object segmentation is distinguishing between foreground objects of interest and background. FlowCapsules uses motion as a cue for the objects of interest in 2D scenarios. The last part of this thesis focuses on 3D applications where the goal is detecting and removal of the object of interest from the input images. In these tasks, we leverage the geometric consistency of scenes in 3D to detect the inconsistent dynamic objects. Our transient object masks are then used for designing robust optimization kernels to improve 3D modelling in a casual capture setup. One of our goals in this thesis is to show the merits of unsupervised object based approaches in computer vision. Furthermore, we suggest possible directions for defining objects of interest or foreground objects without requiring supervision. Our hope is to motivate and excite the community into further exploring explicit object representations in image understanding tasks.
zh

[CV-69] SmallGS: Gaussian Splatting-based Camera Pose Estimation for Small-Baseline Videos CVPR

【速读】:该论文旨在解决小基线运动动态视频中的相机位姿估计问题,这类视频因特征模糊、漂移累积及三角化约束不足而对现有位姿估计算法构成挑战。论文的关键创新在于提出SmallGS框架,其核心解决方案是利用高斯泼溅(Gaussian Splatting)技术优化连续帧的相机位姿:从每段视频的第一帧重建场景以提供稳定的参考,并借助高维特征图渲染增强鲁棒性。此外,通过冻结高斯泼溅模型并基于渲染特征优化相机视角,SmallGS无需显式的特征对应关系或强视差运动即可有效学习相机位姿,从而显著降低了传统方法对充分深度变化的依赖。

链接: https://arxiv.org/abs/2504.17810
作者: Yuxin Yao,Yan Zhang,Zhening Huang,Joan Lasenby
机构: University of Cambridge (剑桥大学); Meshcapade (Meshcapade)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10 pages, 4 figures, Accepted by CVPR workshop

点击查看摘要

Abstract:Dynamic videos with small baseline motions are ubiquitous in daily life, especially on social media. However, these videos present a challenge to existing pose estimation frameworks due to ambiguous features, drift accumulation, and insufficient triangulation constraints. Gaussian splatting, which maintains an explicit representation for scenes, provides a reliable novel view rasterization when the viewpoint change is small. Inspired by this, we propose SmallGS, a camera pose estimation framework that is specifically designed for small-baseline videos. SmallGS optimizes sequential camera poses using Gaussian splatting, which reconstructs the scene from the first frame in each video segment to provide a stable reference for the rest. The temporal consistency of Gaussian splatting within limited viewpoint differences reduced the requirement of sufficient depth variations in traditional camera pose estimation. We further incorporate pretrained robust visual features, e.g. DINOv2, into Gaussian splatting, where high-dimensional feature map rendering enhances the robustness of camera pose estimation. By freezing the Gaussian splatting and optimizing camera viewpoints based on rasterized features, SmallGS effectively learns camera poses without requiring explicit feature correspondences or strong parallax motion. We verify the effectiveness of SmallGS in small-baseline videos in TUM-Dynamics sequences, which achieves impressive accuracy in camera pose estimation compared to MonST3R and DORID-SLAM for small-baseline videos in dynamic scenes. Our project page is at: this https URL
zh

[CV-70] Spectral Dictionary Learning for Generative Image Modeling

【速读】:该论文旨在解决传统生成模型(如变分、对抗和扩散模型)在图像合成中的局限性,提出了一种基于频谱的全新生成模型。解决方案的关键在于引入了一种确定性的字典学习方法,在将图像展平为一维信号后,将其重建为一组学习到的频谱基函数的线性组合。每个基函数明确参数化为频率、相位和振幅,并联合学习全局频谱字典及其时间变化调制和每幅图像的混合系数,以量化各频谱成分的贡献。此外,通过拟合简单的概率模型到这些混合系数上,实现了从潜在空间采样生成新图像的能力。该方法不仅提供了比依赖随机推理或对抗训练更具可解释性和物理意义的表示,还通过引入基于短时傅里叶变换(STFT)的频域损失函数,确保了合成图像能够捕捉全局结构和精细的频谱细节。实验表明,该模型在CIFAR-10数据集上的重建质量和感知保真度具有竞争力,同时展现出更优的训练稳定性和计算效率。
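
论文所述"显式参数化的频谱字典 + 每幅图像的混合系数"可以用如下最小示例理解:字典列为以频率、相位、振幅参数化的正弦基,混合系数通过最小二乘拟合得到(频率、相位等参数取值均为示意):

```python
import numpy as np

def spectral_dictionary(freqs, phases, amps, t):
    """构造字典:每一列是由频率、相位、振幅显式参数化的正弦基,
    对应论文中显式的频谱基函数(此处参数为示意取值)。"""
    return np.stack([a * np.sin(2 * np.pi * f * t + ph)
                     for f, ph, a in zip(freqs, phases, amps)], axis=1)

t = np.linspace(0.0, 1.0, 64, endpoint=False)
D = spectral_dictionary(freqs=[1.0, 3.0, 5.0], phases=[0.0, 0.4, 1.1],
                        amps=[1.0, 1.0, 1.0], t=t)

# 一个位于字典张成空间内的"展平图像"信号。
signal = 2.0 * D[:, 0] - 0.5 * D[:, 2]

# 最小二乘求解每幅图像的混合系数;对系数向量建模并采样即可生成新信号。
coef, *_ = np.linalg.lstsq(D, signal, rcond=None)
recon = D @ coef
```

对混合系数拟合一个简单概率模型后,从该模型采样系数并与字典相乘,即对应论文中"从潜在空间采样生成新图像"的过程。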

链接: https://arxiv.org/abs/2504.17804
作者: Andrew Kiruluta
机构: UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a novel spectral generative model for image synthesis that departs radically from the common variational, adversarial, and diffusion paradigms. In our approach, images, after being flattened into one-dimensional signals, are reconstructed as linear combinations of a set of learned spectral basis functions, where each basis is explicitly parameterized in terms of frequency, phase, and amplitude. The model jointly learns a global spectral dictionary with time-varying modulations and per-image mixing coefficients that quantify the contributions of each spectral component. Subsequently, a simple probabilistic model is fitted to these mixing coefficients, enabling the deterministic generation of new images by sampling from the latent space. This framework leverages deterministic dictionary learning, offering a highly interpretable and physically meaningful representation compared to methods relying on stochastic inference or adversarial training. Moreover, the incorporation of frequency-domain loss functions, computed via the short-time Fourier transform (STFT), ensures that the synthesized images capture both global structure and fine-grained spectral details, such as texture and edge information. Experimental evaluations on the CIFAR-10 benchmark demonstrate that our approach not only achieves competitive performance in terms of reconstruction quality and perceptual fidelity but also offers improved training stability and computational efficiency. This new type of generative model opens up promising avenues for controlled synthesis, as the learned spectral dictionary affords a direct handle on the intrinsic frequency content of the images, thus providing enhanced interpretability and potential for novel applications in image manipulation and analysis.
zh

[CV-71] RSFR: A Coarse-to-Fine Reconstruction Framework for Diffusion Tensor Cardiac MRI with Semantic-Aware Refinement

【速读】:该论文旨在解决心脏扩散张量成像(Cardiac Diffusion Tensor Imaging, DTI)在临床应用中因低信噪比、混叠伪影以及对精确定量保真度需求所带来的技术挑战。论文提出了一种名为RSFR(Reconstruction, Segmentation, Fusion Refinement)的新框架,作为解决方案的关键,RSFR采用从粗到精的策略,通过Segment Anything Model引入零样本语义先验,并基于Vision Mamba构建稳健的重建主干网络。该框架有效整合语义特征以减轻伪影并提升成像精度,在高欠采样率下实现了最先进的重建质量和精确的扩散张量参数估计。实验结果表明,RSFR在性能上显著优于现有方法,展现出其鲁棒性、可扩展性以及向临床转化的潜力。

链接: https://arxiv.org/abs/2504.18520
作者: Jiahao Huang,Fanwen Wang,Pedro F. Ferreira,Haosen Zhang,Yinzhe Wu,Zhifan Gao,Lei Zhu,Angelica I. Aviles-Rivero,Carola-Bibiane Schonlieb,Andrew D. Scott,Zohya Khalique,Maria Dwornik,Ramyah Rajakulasingam,Ranil De Silva,Dudley J. Pennell,Guang Yang,Sonia Nielles-Vallespin
机构: Department of Bioengineering and Imperial-X, Imperial College London, London, United Kingdom; National Heart and Lung Institute, Imperial College London, London, United Kingdom; Cardiovascular Research Centre, Royal Brompton Hospital, London, United Kingdom; School of Biomedical Engineering, Shenzhen Campus of Sun Yat-sen University, Guangzhou, China; Robotics and Autonomous Systems Thrust & Data Science and Analytics Thrust, HKUST (GZ), Guangzhou, China; Yau Mathematical Sciences Centre, Tsinghua University, Beijing, China; Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom; School of Biomedical Engineering and Imaging Sciences, King’s College London, London, United Kingdom
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cardiac diffusion tensor imaging (DTI) offers unique insights into cardiomyocyte arrangements, bridging the gap between microscopic and macroscopic cardiac function. However, its clinical utility is limited by technical challenges, including a low signal-to-noise ratio, aliasing artefacts, and the need for accurate quantitative fidelity. To address these limitations, we introduce RSFR (Reconstruction, Segmentation, Fusion Refinement), a novel framework for cardiac diffusion-weighted image reconstruction. RSFR employs a coarse-to-fine strategy, leveraging zero-shot semantic priors via the Segment Anything Model and a robust Vision Mamba-based reconstruction backbone. Our framework integrates semantic features effectively to mitigate artefacts and enhance fidelity, achieving state-of-the-art reconstruction quality and accurate DT parameter estimation under high undersampling rates. Extensive experiments and ablation studies demonstrate the superior performance of RSFR compared to existing methods, highlighting its robustness, scalability, and potential for clinical translation in quantitative cardiac DTI.
zh

[CV-72] Nearly isotropic segmentation for medial temporal lobe subregions in multi-modality MRI

【速读】:该论文旨在解决基于 T2 加权 (T2w) 磁共振成像 (MRI) 的内侧颞叶 (MTL) 亚区厚度测量精度受限的问题。由于 T2w MRI 在海马区域具有较高的平面内分辨率,常用于海马亚区分割,但其较差的层厚分辨率降低了亚区厚度测量的准确性。为了解决这一问题,研究的关键在于开发了一种近乎各向同性的分割流水线,该流水线结合了图像与标签的上采样以及 T2w MRI 中的高分辨率分割技术。具体而言,首先基于现有各向异性图谱创建了一个高分辨率图谱,并通过非局部均值方法将图像和标签的分辨率提升至近乎各向同性的 0.4×0.4×0.52 mm³;其次,在此高分辨率图谱基础上训练了一种多模态深度学习分割模型。最终实验表明,这种近乎各向同性的亚区分割显著提高了 T2w MRI 中作为神经变性成像生物标志物的皮质厚度测量的准确性。

链接: https://arxiv.org/abs/2504.18442
作者: Yue Li,Pulkit Khandelwal,Long Xie,Laura E. M. Wisse,Nidhi Mundada,Christopher A. Brown,Emily McGrew,Amanda Denning,Sandhitsu R. Das,David A. Wolk,Paul A. Yushkevich
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Morphometry of medial temporal lobe (MTL) subregions in brain MRI is sensitive biomarker to Alzheimers Disease and other related conditions. While T2-weighted (T2w) MRI with high in-plane resolution is widely used to segment hippocampal subfields due to its higher contrast in hippocampus, its lower out-of-plane resolution reduces the accuracy of subregion thickness measurements. To address this issue, we developed a nearly isotropic segmentation pipeline that incorporates image and label upsampling and high-resolution segmentation in T2w MRI. First, a high-resolution atlas was created based on an existing anisotropic atlas derived from 29 individuals. Both T1-weighted and T2w images in the atlas were upsampled from their original resolution to a nearly isotropic resolution 0.4x0.4x0.52mm3 using a non-local means approach. Manual segmentations within the atlas were also upsampled to match this resolution using a UNet-based neural network, which was trained on a cohort consisting of both high-resolution ex vivo and low-resolution anisotropic in vivo MRI with manual segmentations. Second, a multi-modality deep learning-based segmentation model was trained within this nearly isotropic atlas. Finally, experiments showed the nearly isotropic subregion segmentation improved the accuracy of cortical thickness as an imaging biomarker for neurodegeneration in T2w MRI.
zh

[CV-73] HepatoGEN: Generating Hepatobiliary Phase MRI with Perceptual and Adversarial Models

【速读】:该论文旨在解决动态对比增强磁共振成像(Dynamic Contrast-Enhanced Magnetic Resonance Imaging, DCE-MRI)中获取肝胆期(Hepatobiliary Phase, HBP)图像需要较长扫描时间的问题,这可能会影响患者舒适度和设备使用效率。论文的关键解决方案是提出了一种基于深度学习的方法,利用早期对比剂相位(如预对比剂和过渡期)的数据合成HBP图像。为此,研究比较了三种生成模型:感知U-Net、感知GAN(perceptual GAN, pGAN)以及去噪扩散概率模型(denoising diffusion probabilistic model, DDPM),并通过引入对比演化评分(Contrast Evolution Score, CES)来优化训练数据质量,从而提升模型性能。这一方法展示了在不牺牲诊断价值的前提下减少扫描时间的可行性,突显了深度学习在肝脏MRI动态对比增强中的临床潜力。

链接: https://arxiv.org/abs/2504.18405
作者: Jens Hooge,Gerard Sanroma-Guell,Faidra Stavropoulou,Alexander Ullmann,Gesine Knobloch,Mark Klemens,Carola Schmidt,Sabine Weckbach,Andreas Bolz
机构: Bayer AG (拜耳公司); Bayer Inc. (拜耳加拿大公司)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays a crucial role in the detection and characterization of focal liver lesions, with the hepatobiliary phase (HBP) providing essential diagnostic information. However, acquiring HBP images requires prolonged scan times, which may compromise patient comfort and scanner throughput. In this study, we propose a deep learning based approach for synthesizing HBP images from earlier contrast phases (precontrast and transitional) and compare three generative models: a perceptual U-Net, a perceptual GAN (pGAN), and a denoising diffusion probabilistic model (DDPM). We curated a multi-site DCE-MRI dataset from diverse clinical settings and introduced a contrast evolution score (CES) to assess training data quality, enhancing model performance. Quantitative evaluation using pixel-wise and perceptual metrics, combined with qualitative assessment through blinded radiologist reviews, showed that pGAN achieved the best quantitative performance but introduced heterogeneous contrast in out-of-distribution cases. In contrast, the U-Net produced consistent liver enhancement with fewer artifacts, while DDPM underperformed due to limited preservation of fine structural details. These findings demonstrate the feasibility of synthetic HBP image generation as a means to reduce scan time without compromising diagnostic utility, highlighting the clinical potential of deep learning for dynamic contrast enhancement in liver MRI. A project demo is available at: this https URL
zh

[CV-74] A Multimodal Deep Learning Approach for White Matter Shape Prediction in Diffusion MRI Tractography

【速读】:本文旨在解决基于体素表示的传统白质束形状测量计算方法在大规模数据集上因计算复杂度高、耗时长而难以高效应用的问题。为应对这一挑战,论文提出了一种名为Tract2Shape的新型多模态深度学习框架,其关键在于利用几何(点云)特征和标量(表格)特征来预测十种白质束形状测量指标,并通过主成分分析(PCA)进行降维以提升模型效率。此外,通过在HCP-YA和PPMI两个独立获取的数据集上的训练与评估,验证了该框架在性能和泛化能力上的优越性,特别是在平均皮尔逊相关系数(Pearson’s r)和归一化均方误差(nMSE)方面的表现。
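
文中"用 PCA 将十个形状指标压缩为五个主成分"的做法可用如下基于 SVD 的最小示例说明(数据为合成的低秩玩具数据,并非论文数据):

```python
import numpy as np

def pca_components(Y: np.ndarray, k: int = 5):
    """将每个样本的形状指标降维到 k 个主成分,对应论文中
    预测五个主要成分而非直接预测全部十个指标的策略。
    此处为基于 SVD 的朴素 PCA 实现。"""
    mu = Y.mean(axis=0)
    Yc = Y - mu
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    scores = Yc @ Vt[:k].T          # k 维表示(主成分得分)
    recon = scores @ Vt[:k] + mu    # 映射回原始的 10 个指标
    return scores, recon

rng = np.random.default_rng(0)
latent = rng.standard_normal((40, 5))
mix = rng.standard_normal((5, 10))
Y = latent @ mix                    # 秩为 5 的数据:10 个相关指标
scores, recon = pca_components(Y, k=5)
```

模型只需回归 5 维主成分得分,再经主成分方向反投影即可恢复全部 10 个形状指标,从而降低输出维度、提升训练效率。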

链接: https://arxiv.org/abs/2504.18400
作者: Yui Lo,Yuqian Chen,Dongnan Liu,Leo Zekelman,Jarrett Rushmore,Yogesh Rathi,Nikos Makris,Alexandra J. Golby,Fan Zhang,Weidong Cai,Lauren J. O’Donnell
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 3 figures, 6 tables

点击查看摘要

Abstract:Shape measures have emerged as promising descriptors of white matter tractography, offering complementary insights into anatomical variability and associations with cognitive and clinical phenotypes. However, conventional methods for computing shape measures are computationally expensive and time-consuming for large-scale datasets due to reliance on voxel-based representations. We propose Tract2Shape, a novel multimodal deep learning framework that leverages geometric (point cloud) and scalar (tabular) features to predict ten white matter tractography shape measures. To enhance model efficiency, we utilize a dimensionality reduction algorithm for the model to predict five primary shape components. The model is trained and evaluated on two independently acquired datasets, the HCP-YA dataset, and the PPMI dataset. We evaluate the performance of Tract2Shape by training and testing it on the HCP-YA dataset and comparing the results with state-of-the-art models. To further assess its robustness and generalization ability, we also test Tract2Shape on the unseen PPMI dataset. Tract2Shape outperforms SOTA deep learning models across all ten shape measures, achieving the highest average Pearson’s r and the lowest nMSE on the HCP-YA dataset. The ablation study shows that both multimodal input and PCA contribute to performance gains. On the unseen testing PPMI dataset, Tract2Shape maintains a high Pearson’s r and low nMSE, demonstrating strong generalizability in cross-dataset evaluation. Tract2Shape enables fast, accurate, and generalizable prediction of white matter shape measures from tractography data, supporting scalable analysis across datasets. This framework lays a promising foundation for future large-scale white matter shape analysis.
zh

[CV-75] Partition Map-Based Fast Block Partitioning for VVC Inter Coding

【速读】:该论文旨在解决Versatile Video Coding (VVC) 编码器中因采用四叉树加嵌套多类型树(Quadtree with Nested Multi-Type Tree, QT+MTT)块结构而显著增加的编码复杂度问题。为应对这一挑战,论文提出了一种基于分割图的快速块分区算法,以加速运动补偿预测中的块划分过程。方案的关键在于通过分析VVC运动补偿编码的特点,改进了分割图设计,引入了多类型树掩码(MTT Mask)以实现早期终止,并进一步开发了一种结合空间与时间特征的神经网络来预测分割图。该网络包含堆叠的自顶向下和自底向上处理、量化参数调制层以及基于分区的自适应扭曲等特殊设计。此外,论文还提出了双阈值决策方案,以在降低计算复杂度与保持率失真性能之间实现精细平衡。实验结果显示,所提方法在随机接入配置下平均可节省51.30%的编码时间,同时仅增加2.12%的Bjontegaard Delta比特率(BDBR)。
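
双阈值决策方案的逻辑可示意如下:当网络预测的划分概率足够确定时直接采纳预测,只有落在两个阈值之间的不确定块才回退到完整的率失真(RD)搜索(阈值数值为占位示意,并非论文中调优得到的取值):

```python
def partition_decision(p_split: float, t_low: float, t_high: float) -> str:
    """双阈值方案(示意):将网络预测的划分概率 p_split 与两个阈值比较。
    高置信预测在两个方向上都跳过递归 RD 搜索;只有模糊的块才执行
    完整搜索,从而在复杂度与 RD 性能之间取得精细平衡。"""
    if p_split >= t_high:
        return "split"            # 直接按预测的划分方式切分
    if p_split <= t_low:
        return "no_split"         # 提前终止进一步划分
    return "full_rd_search"       # 不确定:执行常规的率失真优化

decisions = [partition_decision(p, t_low=0.2, t_high=0.8)
             for p in (0.05, 0.5, 0.95)]
```

调节 t_low 与 t_high 的间距即可在编码时间节省与 BDBR 损失之间连续权衡:间距越小,跳过的搜索越多、损失越大。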

链接: https://arxiv.org/abs/2504.18398
作者: Xinmin Feng,Zhuoyuan Li,Li Li,Dong Liu,Feng Wu
机构: MOE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China (中国科学技术大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 26 figures. Project page: this https URL

点击查看摘要

Abstract:Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (QT+MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm to pursue fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding, and thus improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion (RD) performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjontegaard Delta Bit Rate (BDBR) under the random access configuration.
zh

[CV-76] NUDF: Neural Unsigned Distance Fields for high resolution 3D medical image segmentation

【速读】:该论文旨在解决医学图像分割中高分辨率处理与内存需求之间的矛盾,同时克服下采样导致的重要细节丢失的问题。传统方法通常使用离散标签图(discrete labelmap)来表示解剖结构,但对于复杂且高度可变的形状(如左心耳,Left Atrial Appendage, LAA),这些方法难以准确捕捉细节。论文的关键创新在于提出了一种直接从图像学习神经无符号距离场(Neural Unsigned Distance Field, NUDF)的方法。NUDF 具有较小的内存需求,支持高分辨率处理,同时其连续场的特性能够生成任意拓扑结构的高分辨率三维网格模型(mesh model)。通过在 CT 图像中进行左心耳分割任务的实验验证,该方法成功预测了能够捕获 LAA 细节的 3D 网格模型,并达到了与 CT 图像体素间距相当的精度。
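
无符号距离场(UDF)的含义可用解析示例说明:UDF 将三维坐标映射为到曲面距离的绝对值,因此可表示开放曲面;下例以球面的解析 UDF 代替学习得到的神经场,并通过小阈值筛选近表面点,示意任意分辨率下的表面提取(球面几何仅为玩具示例):

```python
import numpy as np

def sphere_udf(points: np.ndarray, radius: float = 0.5) -> np.ndarray:
    """球面的解析无符号距离场,代替学习得到的神经场作示意:
    NUDF 将三维坐标映射为到曲面距离的绝对值(恒非负),
    因此能表示带符号距离场无法表示的开放曲面。"""
    return np.abs(np.linalg.norm(points, axis=-1) - radius)

# 采样一个粗网格,保留 UDF 低于小阈值的近表面点,
# 示意在任意分辨率下提取表面(网格分辨率与阈值均为示意取值)。
g = np.linspace(-1.0, 1.0, 21)
grid = np.stack(np.meshgrid(g, g, g, indexing="ij"), axis=-1).reshape(-1, 3)
d = sphere_udf(grid)
surface_pts = grid[d < 0.05]
```

由于场是连续的,查询分辨率不受训练图像体素网格限制,这正是 NUDF 能以较小内存获得高分辨率网格模型的原因。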

链接: https://arxiv.org/abs/2504.18344
作者: Kristine Sørensen,Oscar Camara,Ole de Backer,Klaus Kofoed,Rasmus Paulsen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image segmentation is often considered as the task of labelling each pixel or voxel as being inside or outside a given anatomy. Processing the images at their original size and resolution often result in insuperable memory requirements, but downsampling the images leads to a loss of important details. Instead of aiming to represent a smooth and continuous surface in a binary voxel-grid, we propose to learn a Neural Unsigned Distance Field (NUDF) directly from the image. The small memory requirements of NUDF allow for high resolution processing, while the continuous nature of the distance field allows us to create high resolution 3D mesh models of shapes of any topology (i.e. open surfaces). We evaluate our method on the task of left atrial appendage (LAA) segmentation from Computed Tomography (CT) images. The LAA is a complex and highly variable shape, being thus difficult to represent with traditional segmentation methods using discrete labelmaps. With our proposed method, we are able to predict 3D mesh models that capture the details of the LAA and achieve accuracy in the order of the voxel spacing in the CT images.
zh

[CV-77] owards a deep learning approach for classifying treatment response in glioblastomas

【速读】:该论文旨在解决胶质母细胞瘤(Glioblastomas)治疗反应评估的复杂性和耗时性问题,通过构建基于深度学习(Deep Learning, DL)的自动化分类管道,实现对Response Assessment in Neuro-Oncology (RANO)标准的自动分类。论文的关键在于设计了一个包含多种实验方案的深度学习框架,包括输入图像处理方式(如减法操作)、模态组合、模型架构选择(如Densenet264)、预训练任务以及临床数据的融合,并在开放数据集LUMIERE上进行验证。最终,采用Densenet264模型,在仅使用T1加权、T2加权和Fluid Attenuated Inversion Recovery (FLAIR)三种MRI模态作为输入且不进行任何预训练的情况下,实现了50.96%的平衡准确率(Balanced Accuracy)。此外,论文还结合解释性方法(如Saliency Maps和Grad-CAM),验证了模型在识别肿瘤区域方面的有效性与局限性。这一研究为基于RANO标准的胶质母细胞瘤治疗反应评估提供了基准,并强调了评估过程中潜在的异质性因素的重要性。

链接: https://arxiv.org/abs/2504.18268
作者: Ana Matoso,Catarina Passarinho,Marta P. Loureiro,José Maria Moreira,Patrícia Figueiredo,Rita G. Nunes
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Glioblastomas are the most aggressive type of glioma, having a 5-year survival rate of 6.9%. Treatment typically involves surgery, followed by radiotherapy and chemotherapy, and frequent magnetic resonance imaging (MRI) scans to monitor disease progression. To assess treatment response, radiologists use the Response Assessment in Neuro-Oncology (RANO) criteria to categorize the tumor into one of four labels based on imaging and clinical features: complete response, partial response, stable disease, and progressive disease. This assessment is very complex and time-consuming. Since deep learning (DL) has been widely used to tackle classification problems, this work aimed to implement the first DL pipeline for the classification of RANO criteria based on two consecutive MRI acquisitions. The models were trained and tested on the open dataset LUMIERE. Five approaches were tested: 1) subtraction of input images, 2) different combinations of modalities, 3) different model architectures, 4) different pretraining tasks, and 5) adding clinical data. The pipeline that achieved the best performance used a Densenet264 considering only T1-weighted, T2-weighted, and Fluid Attenuated Inversion Recovery (FLAIR) images as input without any pretraining. A median Balanced Accuracy of 50.96% was achieved. Additionally, explainability methods were applied. Using Saliency Maps, the tumor region was often successfully highlighted. In contrast, Grad-CAM typically failed to highlight the tumor region, with some exceptions observed in the Complete Response and Progressive Disease classes, where it effectively identified the tumor region. These results set a benchmark for future studies on glioblastoma treatment response assessment based on the RANO criteria while emphasizing the heterogeneity of factors that might play a role when assessing the tumor’s response to treatment.
zh

[CV-78] Physics-Driven Neural Compensation For Electrical Impedance Tomography

【速读】:该论文旨在解决电阻抗断层成像(Electrical Impedance Tomography, EIT)面临的两个主要挑战:其逆问题的不适定性以及空间上敏感度分布的可变性和位置依赖性。传统基于模型的方法通过正则化缓解不适定性,但未能充分考虑敏感度的可变性;监督深度学习方法虽然能够以数据驱动的方式建模,却需要大量标注数据且泛化能力有限。此外,近期基于神经场的方法虽引入隐式正则化技术,但通常忽视了EIT背后的物理原理,从而限制了其实效性。

论文提出了一种名为PhyNC(Physics-driven Neural Compensation)的无监督深度学习框架,该框架结合了EIT的物理原理。PhyNC的关键在于通过动态分配神经网络在低敏感区域的表示能力来同时应对逆问题的不适定性和敏感度分布不均的问题,从而实现精确且平衡的电导率重建。实验结果表明,无论是在模拟数据还是真实实验数据上,PhyNC在细节保留和抗伪影方面均优于现有方法,特别是在低敏感区域表现尤为突出。此方法不仅提升了EIT图像重建的鲁棒性,还提供了一个灵活的框架,可以推广应用于其他具有类似挑战的成像模态。
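
"向低敏感区域动态分配表示能力"的一种最简化理解,是按敏感度的倒数对各区域的损失加权(这只是示意性替代,论文的实际分配机制更为复杂):

```python
import numpy as np

def sensitivity_weights(sensitivity: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """补偿思想的示意:给低敏感区域更大的权重,使网络在这些区域
    投入更多表示能力。此处采用归一化的敏感度倒数;论文中的
    实际分配机制与此不同,仅作概念说明。"""
    w = 1.0 / (sensitivity + eps)
    return w / w.sum()

s = np.array([1.0, 0.5, 0.1])   # EIT 成像域中心通常敏感度最低
w = sensitivity_weights(s)
```

按这类权重加权重建损失后,低敏感区域(如域中心)的误差在优化中占更大比重,从而得到更均衡的电导率重建。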

链接: https://arxiv.org/abs/2504.18067
作者: Chuyu Wang,Huiting Deng,Dong Liu
机构: School of Biomedical Engineering and Suzhou Institute for Advanced Research, University of Science and Technology of China (中国科学技术大学生物医学工程学院和苏州先进技术研究院); CAS Key Laboratory of Microscale Magnetic Resonance, University of Science and Technology of China (中国科学技术大学微观磁共振重点实验室); Synergetic Innovation Center of Quantum Information and Quantum Physics, University of Science and Technology of China (中国科学技术大学量子信息与量子物理协同创新中心)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Electrical Impedance Tomography (EIT) provides a non-invasive, portable imaging modality with significant potential in medical and industrial applications. Despite its advantages, EIT encounters two primary challenges: the ill-posed nature of its inverse problem and the spatially variable, location-dependent sensitivity distribution. Traditional model-based methods mitigate ill-posedness through regularization but overlook sensitivity variability, while supervised deep learning approaches require extensive training data and lack generalization. Recent developments in neural fields have introduced implicit regularization techniques for image reconstruction, but these methods typically neglect the physical principles underlying EIT, thus limiting their effectiveness. In this study, we propose PhyNC (Physics-driven Neural Compensation), an unsupervised deep learning framework that incorporates the physical principles of EIT. PhyNC addresses both the ill-posed inverse problem and the sensitivity distribution by dynamically allocating neural representational capacity to regions with lower sensitivity, ensuring accurate and balanced conductivity reconstructions. Extensive evaluations on both simulated and experimental data demonstrate that PhyNC outperforms existing methods in terms of detail preservation and artifact resistance, particularly in low-sensitivity regions. Our approach enhances the robustness of EIT reconstructions and provides a flexible framework that can be adapted to other imaging modalities with similar challenges.
zh

[CV-79] Spectral Bias Correction in PINNs for Myocardial Image Registration of Pathological Data

【速读】:该论文致力于解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)在心肌图像配准中因频谱偏倚(spectral bias)导致的建模高频率变形能力不足的问题,这种不足会引发不准确且生物力学上不可信的结果,特别是在病理性数据中。论文的关键解决方案在于通过引入Fourier特征映射(Fourier Feature mappings)以及在PINN框架中加入调制策略(modulation strategies),从而提升PINN捕捉心肌病理性数据中复杂高频率变形的能力,同时保持结果的生物力学合理性。实验表明,所提出的方法显著提高了配准精度,并具备跨多个患者和病理情况的可扩展性。
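
Fourier 特征映射是缓解频谱偏倚的标准手段:γ(x) = [sin(2πBx), cos(2πBx)],其中 B 为随机频率矩阵,其尺度是超参数。下面是一个自包含的示意实现(B 的取值仅为示意,非论文配置):

```python
import numpy as np

def fourier_features(x: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Fourier 特征映射 gamma(x) = [sin(2*pi*B*x), cos(2*pi*B*x)]:
    先将坐标投影到随机频率 B 上,再取正余弦,使后续 MLP 能够
    拟合高频变形,从而缓解频谱偏倚。B 的尺度是超参数。"""
    proj = 2.0 * np.pi * x @ B.T
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

rng = np.random.default_rng(0)
B = rng.standard_normal((16, 2)) * 10.0   # 2 维输入的 16 个随机频率
x = rng.random((5, 2))                    # 5 个采样坐标
phi = fourier_features(x, B)
```

在 PINN 中,只需把原始坐标替换为 phi 作为网络输入;B 的尺度越大,网络越容易表示高频变形,但也越易过拟合。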

链接: https://arxiv.org/abs/2504.17945
作者: Bastien C. Baluyot,Marta Varela,Chen Qin
机构: Imperial College London (帝国理工学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Accurate myocardial image registration is essential for cardiac strain analysis and disease diagnosis. However, spectral bias in neural networks impedes modeling high-frequency deformations, producing inaccurate, biomechanically implausible results, particularly in pathological data. This paper addresses spectral bias in physics-informed neural networks (PINNs) by integrating Fourier Feature mappings and introducing modulation strategies into a PINN framework. Experiments on two distinct datasets demonstrate that the proposed methods enhance the PINN’s ability to capture complex, high-frequency deformations in cardiomyopathies, achieving superior registration accuracy while maintaining biomechanical plausibility - thus providing a foundation for scalable cardiac image registration and generalization across multiple patients and pathologies.
zh

[CV-80] Predicting Dairy Calf Body Weight from Depth Images Using Deep Learning (YOLOv8) and Threshold Segmentation with Cross-Validation and Longitudinal Analysis

【速读】:该论文旨在解决奶牛犊体重(BW)监测中存在的劳动密集、时间消耗大以及设施限制等问题,同时探索基于图像的非接触式方法在早期预测后期体重的可能性。论文的关键在于开发基于深度学习的分割模型以提取奶牛犊身体指标,并通过与基于阈值的方法进行比较,评估使用线性回归(LR)、极端梯度提升(XGBoost)及线性混合模型(LMM)进行单次和多次时间点交叉验证的体重预测性能。研究结果表明,You Only Look Once版本8(YOLOv8)深度学习分割模型的表现优于阈值法,而LMM在多次时间点交叉验证中的纵向体重预测最为准确。因此,深度学习技术在自动化体重预测中的应用潜力被揭示,这将有助于提高农场管理水平。
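
下面用 NumPy 在合成数据上示意论文评估流程中最基础的一环:用线性回归(LR)预测体重并以 MAPE 衡量误差。体尺特征与系数均为虚构假设,仅展示指标计算方式,不代表论文的真实数据或 YOLOv8 分割环节:

```python
import numpy as np

def fit_lr(X, y):
    """带截距项的最小二乘线性回归。"""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_lr(w, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

def mape(y_true, y_pred):
    """平均绝对百分比误差 (%)。"""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

rng = np.random.default_rng(1)
X = rng.uniform(0.5, 2.0, size=(40, 3))   # 假设的体长/体高/体宽三项体尺指标
y = 30 + 25 * X[:, 0] + 18 * X[:, 1] + 10 * X[:, 2] + rng.normal(0.0, 1.0, 40)
w = fit_lr(X[:30], y[:30])                # 前 30 头训练,后 10 头验证
err = mape(y[30:], predict_lr(w, X[30:]))
print(round(err, 2))                      # 合成数据上 MAPE 约在 1% 量级
```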

链接: https://arxiv.org/abs/2504.17943
作者: Mingsi Liao,Gota Morota,Ye Bi,Rebecca R. Cockrum
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Published on Animals, 18 March 2025

点击查看摘要

Abstract:Monitoring calf body weight (BW) before weaning is essential for assessing growth, feed efficiency, health, and weaning readiness. However, labor, time, and facility constraints limit BW collection. Additionally, Holstein calf coat patterns complicate image-based BW estimation, and few studies have explored non-contact measurements taken at early time points for predicting later BW. The objectives of this study were to (1) develop deep learning-based segmentation models for extracting calf body metrics, (2) compare deep learning segmentation with threshold-based methods, and (3) evaluate BW prediction using single-time-point cross-validation with linear regression (LR) and extreme gradient boosting (XGBoost) and multiple-time-point cross-validation with LR, XGBoost, and a linear mixed model (LMM). Depth images from Holstein (n = 63) and Jersey (n = 5) pre-weaning calves were collected, with 20 Holstein calves being weighed manually. Results showed that You Only Look Once version 8 (YOLOv8) deep learning segmentation (intersection over union = 0.98) outperformed threshold-based methods (0.89). In single-time-point cross-validation, XGBoost achieved the best BW prediction (R^2 = 0.91, mean absolute percentage error (MAPE) = 4.37%), while LMM provided the most accurate longitudinal BW prediction (R^2 = 0.99, MAPE = 2.39%). These findings highlight the potential of deep learning for automated BW prediction, enhancing farm management.
zh

[CV-81] Material Identification Via RFID For Smart Shopping

【速读】:该论文旨在解决无人收银商店中因商品被藏匿于背包、口袋或包内而导致的防损挑战。现有技术依赖计算机视觉和RFID标签来关联购物者与商品,但这些方法在面对隐藏物品时存在局限性。论文提出的解决方案通过利用不同容器对射频信号的衰减和散射特性,将现有的RFID标签物品转化为材质传感器。其关键是结合接收信号强度指示(RSSI)和相位角训练神经网络,以分类常见的七种容器,并进一步引入距离测量以提高系统准确性。该系统能够在零售环境中实时识别可疑事件,从而实现主动的无收银员零售防损,同时充分利用现有基础设施。
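
下面以最近质心分类器代替论文中的神经网络,示意如何从 (RSSI, 相位角) 特征区分容器材质。容器类别与信号均值均为虚构假设,且把两种量纲直接用欧氏距离比较只是简化处理:

```python
import numpy as np

def fit_centroids(X, y):
    """按类别求特征均值,构成最近质心分类器。"""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    return min(centroids, key=lambda c: float(np.linalg.norm(x - centroids[c])))

rng = np.random.default_rng(2)
# 每类容器假设一个 (RSSI/dBm, 相位角/rad) 均值,各采样 30 条读数
containers = {"backpack": (-60.0, 1.0), "pocket": (-52.0, 2.5), "foil_bag": (-75.0, 0.3)}
X_parts, y_parts = [], []
for name, (rssi, phase) in containers.items():
    X_parts.append(np.column_stack([rng.normal(rssi, 1.5, 30), rng.normal(phase, 0.2, 30)]))
    y_parts += [name] * 30
X, y = np.vstack(X_parts), np.array(y_parts)
cents = fit_centroids(X, y)
print(predict(cents, np.array([-59.0, 1.1])))
```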

链接: https://arxiv.org/abs/2504.17898
作者: David Wang,Derek Goh,Jiale Zhang
机构: University of Michigan (密歇根大学)
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 7 figures

点击查看摘要

Abstract:Cashierless stores rely on computer vision and RFID tags to associate shoppers with items, but concealed items placed in backpacks, pockets, or bags create challenges for theft prevention. We introduce a system that turns existing RFID tagged items into material sensors by exploiting how different containers attenuate and scatter RF signals. Using RSSI and phase angle, we trained a neural network to classify seven common containers. In a simulated retail environment, the model achieves 89% accuracy with one second samples and 74% accuracy from single reads. Incorporating distance measurements, our system achieves 82% accuracy across 0.3-2m tag to reader separations. When deployed at aisle or doorway choke points, the system can flag suspicious events in real time, prompting camera screening or staff intervention. By combining material identification with computer vision tracking, our system provides proactive loss prevention for cashierless retail while utilizing existing infrastructure.
zh

[CV-82] A Deep Bayesian Convolutional Spiking Neural Network-based CAD system with Uncertainty Quantification for Medical Images Classification

【速读】:该论文旨在解决基于深度尖峰神经网络(Deep Spiking Neural Network, SNN)的计算机辅助诊断(CAD)系统由于无法量化预测不确定性而导致的可靠性不足问题。论文的关键解决方案是提出了一种基于深度贝叶斯卷积尖峰神经网络的CAD系统,并引入了带有不确定性感知模块的设计。具体而言,通过使用蒙特卡洛丢弃法(Monte Carlo Dropout)作为贝叶斯近似方法来量化不确定性,该方法在多种医学图像分类任务中进行了验证。实验结果表明,所提出的模型不仅准确而且可靠,可作为传统深度学习方法在医学图像分类中的有效替代方案。
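
下面给出 Monte Carlo Dropout 的一个纯 NumPy 极简示意:推理阶段保留随机 dropout,对同一输入前向多次,以预测均值作为输出、标准差作为不确定性估计。网络结构与权重均为随机假设,并非论文所用的尖峰神经网络:

```python
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 3))

def mc_dropout_predict(x, T=100, p=0.5):
    """T 次带随机 dropout 的前向传播,返回 (预测均值, 标准差)。"""
    probs = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)          # ReLU 隐层
        mask = rng.random(h.shape) > p       # 推理阶段仍启用 dropout
        h = h * mask / (1.0 - p)
        logits = h @ W2
        e = np.exp(logits - logits.max())    # 数值稳定的 softmax
        probs.append(e / e.sum())
    probs = np.stack(probs)
    return probs.mean(axis=0), probs.std(axis=0)

mean, std = mc_dropout_predict(rng.normal(size=8))
print(mean.round(3), std.round(3))
```

标准差较大的预测即可视为“不确定”,提示需要人工复核。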

链接: https://arxiv.org/abs/2504.17819
作者: Mohaddeseh Chegini,Ali Mahloojifar
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Computer-Aided Diagnosis (CAD) systems facilitate accurate diagnosis of diseases. The development of CADs by leveraging the third generation of neural networks, namely the Spiking Neural Network (SNN), is essential to utilize the benefits of SNNs, such as their event-driven processing, parallelism, low power consumption, and the ability to process sparse temporal-spatial information. However, deep SNNs, as deep learning models, face challenges with unreliability. To deal with the unreliability caused by the inability to quantify the uncertainty of predictions, we propose a deep Bayesian Convolutional Spiking Neural Network-based CAD system with an uncertainty-aware module. In this study, the Monte Carlo Dropout method is used as a Bayesian approximation for uncertainty quantification. This method was applied to several medical image classification tasks. Our experimental results demonstrate that our proposed model is accurate and reliable and will be a proper alternative to conventional deep learning for medical image classification.
zh

人工智能

[AI-0] Generalization Capability for Imitation Learning

【速读】:该论文试图解决模仿学习(Imitation Learning)在有限数据集训练下难以泛化到训练分布之外的问题。论文从信息论和数据分布特性出发,提供了一个统一视角来分析模仿学习的泛化能力。解决方案的关键在于通过理论推导指出泛化差距可以由中间表征的条件信息瓶颈(conditional information bottleneck)以及模型参数与训练数据之间的互信息(mutual information)来上限约束,并进一步揭示输入到输出的高条件熵能够平滑似然景观(likelihood landscape),从而减小泛化差距的上限,同时缩短随机梯度下降(SGD)逃离尖锐局部最优的时间,增加达到全局最优的可能性。这些见解强调了不仅需要扩展输入数据的多样性,还需要丰富针对相同输入的输出标签的变异性的重要性。

链接: https://arxiv.org/abs/2504.18538
作者: Yixiao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Imitation learning holds the promise of equipping robots with versatile skills by learning from expert demonstrations. However, policies trained on finite datasets often struggle to generalize beyond the training distribution. In this work, we present a unified perspective on the generalization capability of imitation learning, grounded in both information theory and data distribution properties. We first show that the generalization gap can be upper bounded by (i) the conditional information bottleneck on intermediate representations and (ii) the mutual information between the model parameters and the training dataset. This characterization provides theoretical guidance for designing effective training strategies in imitation learning, particularly in determining whether to freeze, fine-tune, or train large pretrained encoders (e.g., vision-language models or vision foundation models) from scratch to achieve better generalization. Furthermore, we demonstrate that high conditional entropy from input to output induces a flatter likelihood landscape, thereby reducing the upper bound on the generalization gap. In addition, it shortens the stochastic gradient descent (SGD) escape time from sharp local minima, which may increase the likelihood of reaching global optima under fixed optimization budgets. These insights explain why imitation learning often exhibits limited generalization and underscore the importance of not only scaling the diversity of input data but also enriching the variability of output labels conditioned on the same input.
zh

[AI-1] Adapting Probabilistic Risk Assessment for AI

【速读】:该论文旨在解决现代通用人工智能系统带来的紧迫风险管理挑战,其核心问题是现有风险评估方法往往依赖于选择性测试和未记录的风险优先级假设,未能系统性地识别人工智能系统对社会和生物圈构成直接或间接风险的所有潜在路径。论文的关键解决方案是引入一种基于概率风险评估(PRA)的人工智能框架,该框架借鉴了高可靠性行业(如核能、航天)的成熟PRA技术,并针对先进人工智能的新挑战进行了适配。该框架通过三个方面实现其关键突破:(1) 面向方面的危害分析提供由人工智能系统方面(如能力、领域知识、可用性等)第一性原理分类指导的系统性危害覆盖;(2) 风险路径建模利用双向分析和前瞻性技术,分析从系统方面到社会影响的因果链;(3) 不确定性管理采用情景分解、参考尺度和显式追踪协议,以结构化方式处理新颖性或数据有限情况下的可信预测。此外,该框架通过整合证据生成量化且可比的绝对风险估计,从而协调多样化的评估方法,最终形成汇总所有评估风险的综合风险报告卡,为开发者、评估者和监管者提供实用工具。

链接: https://arxiv.org/abs/2504.18536
作者: Anna Katariina Wisakanto,Joe Rogero,Avyay M. Casheekar,Richard Mallah
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Systems and Control (eess.SY); Applications (stat.AP)
备注: for project website, see this https URL

点击查看摘要

Abstract:Modern general-purpose artificial intelligence (AI) systems present an urgent risk management challenge, as their rapidly evolving capabilities and potential for catastrophic harm outpace our ability to reliably assess their risks. Current methods often rely on selective testing and undocumented assumptions about risk priorities, frequently failing to make a serious attempt at assessing the set of pathways through which AI systems pose direct or indirect risks to society and the biosphere. This paper introduces the probabilistic risk assessment (PRA) for AI framework, adapting established PRA techniques from high-reliability industries (e.g., nuclear power, aerospace) for the new challenges of advanced AI. The framework guides assessors in identifying potential risks, estimating likelihood and severity, and explicitly documenting evidence, underlying assumptions, and analyses at appropriate granularities. The framework’s implementation tool synthesizes the results into a risk report card with aggregated risk estimates from all assessed risks. This systematic approach integrates three advances: (1) Aspect-oriented hazard analysis provides systematic hazard coverage guided by a first-principles taxonomy of AI system aspects (e.g. capabilities, domain knowledge, affordances); (2) Risk pathway modeling analyzes causal chains from system aspects to societal impacts using bidirectional analysis and incorporating prospective techniques; and (3) Uncertainty management employs scenario decomposition, reference scales, and explicit tracing protocols to structure credible projections with novelty or limited data. Additionally, the framework harmonizes diverse assessment methods by integrating evidence into comparable, quantified absolute risk estimates for critical decisions. We have implemented this as a workbook tool for AI developers, evaluators, and regulators, available on the project website.
zh

[AI-2] Scaling Laws For Scalable Oversight

【速读】:该论文试图解决如何量化和优化“可扩展监督”(Scalable Oversight)这一策略在控制未来超级智能系统中的有效性问题。具体而言,论文关注的是监督者与被监督系统之间能力差距所带来的挑战,并探索如何提高监督的成功概率。解决方案的关键在于提出一个框架,将监督建模为能力不匹配玩家之间的博弈,并引入监督特定和欺骗特定的Elo评分机制来量化监督效果。通过分析不同监督场景下的规模定律(Scaling Laws),论文进一步研究了嵌套式可扩展监督(Nested Scalable Oversight, NSO)的条件及其最优层级数,以最大化监督成功概率。
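
下面用几行 Python 示意摘要中涉及的两个基本构件:标准 Elo 胜率公式,以及带有上下平台的分段线性能力映射。其中平台端点等数值均为假设,仅用于说明计算方式:

```python
import numpy as np

def win_prob(elo_overseer, elo_houdini):
    """标准 Elo 胜率公式:监督方(overseer)战胜被监督方的概率。"""
    return 1.0 / (1.0 + 10.0 ** ((elo_houdini - elo_overseer) / 400.0))

def domain_elo(general_elo, lo=800.0, hi=2000.0):
    """摘要所述分段线性映射的示意:两端平台分别对应
    “任务无能力”与“任务饱和”(端点数值为假设)。"""
    return float(np.clip(general_elo, lo, hi))

print(win_prob(1400, 1400))            # 能力相当时成功率 0.5
print(round(win_prob(1000, 1400), 3))  # 落后 400 Elo 时约 0.091
```

摘要中“监督弱于被监督方 400 Elo 时 NSO 成功率低于 52%”的结论,正是建立在这类胜率曲线与嵌套层级的组合分析之上。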

链接: https://arxiv.org/abs/2504.18530
作者: Joshua Engels,David D. Baek,Subhash Kantamneni,Max Tegmark
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 34 pages, 17 figures

点击查看摘要

Abstract:Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific and deception-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: “Mafia”, “Debate”, “Backdoor Code” and “Wargames”. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability (using Chatbot Arena Elo as a proxy for general capability). We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. In our numerical examples, the NSO success rate is below 52% when overseeing systems that are 400 Elo points stronger than the baseline overseer, and it declines further for overseeing even stronger systems.
zh

[AI-3] DeSIA: Attribute Inference Attacks Against Limited Fixed Aggregate Statistics

【速读】:本文旨在解决通过固定聚合统计数据评估数据发布机制隐私风险的问题,尤其是在仅发布有限数量统计的情况下缺乏有效方法的挑战。论文的关键解决方案是提出了一种针对固定聚合统计数据的推理攻击框架以及一种名为DeSIA的属性推理攻击方法。DeSIA通过利用US Census PPMF数据集进行实例化验证,证明其在识别易受攻击用户方面显著优于基于重建的攻击方法,实现了在10⁻³误报率下的0.14真正阳性率。此外,研究还展示了DeSIA在处理不可验证属性、不同数量的聚合统计数据及噪声添加水平下的表现,并通过广泛的消融研究证明其可成功应用于成员推理任务。最终结果表明,即使发布的聚合数量较少,单纯的数据聚合不足以保护隐私,强调了在发布前采用正式隐私机制的重要性。
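
摘要中报告的“10⁻³ 假阳性率下 0.14 真正阳性率”属于固定 FPR 约束下的 TPR 评估指标,下面给出该指标的一个 NumPy 计算示意(攻击分数为合成数据,与论文实验无关):

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=1e-3):
    """固定假阳性率约束下的真正例率:
    取恰好允许 floor(target_fpr * N_neg) 个负样本越过的阈值,
    再统计得分高于该阈值的正样本比例。"""
    neg = np.sort(scores[labels == 0])[::-1]   # 负样本得分降序
    k = int(np.floor(target_fpr * neg.size))
    thresh = neg[k] if k < neg.size else -np.inf
    return float(np.mean(scores[labels == 1] > thresh))

rng = np.random.default_rng(4)
scores = np.concatenate([rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 5000)])
labels = np.concatenate([np.ones(500), np.zeros(5000)])
print(round(tpr_at_fpr(scores, labels, 1e-3), 3))
```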

链接: https://arxiv.org/abs/2504.18497
作者: Yifeng Mao,Bozhidar Stevanoski,Yves-Alexandre de Montjoye
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Empirical inference attacks are a popular approach for evaluating the privacy risk of data release mechanisms in practice. While an active attack literature exists to evaluate machine learning models or synthetic data release, we currently lack comparable methods for fixed aggregate statistics, in particular when only a limited number of statistics are released. We here propose an inference attack framework against fixed aggregate statistics and an attribute inference attack called DeSIA. We instantiate DeSIA against the U.S. Census PPMF dataset and show it to strongly outperform reconstruction-based attacks. In particular, we show DeSIA to be highly effective at identifying vulnerable users, achieving a true positive rate of 0.14 at a false positive rate of 10^-3 . We then show DeSIA to perform well against users whose attributes cannot be verified and when varying the number of aggregate statistics and level of noise addition. We also perform an extensive ablation study of DeSIA and show how DeSIA can be successfully adapted to the membership inference task. Overall, our results show that aggregation alone is not sufficient to protect privacy, even when a relatively small number of aggregates are being released, and emphasize the need for formal privacy mechanisms and testing before aggregate statistics are released.
zh

[AI-4] Action Flow Matching for Continual Robot Learning

【速读】:本文旨在解决机器人连续学习中的核心挑战,包括动态模型的持续精化、安全适应、灾难性遗忘、异常值管理、数据效率优化以及在任务约束和资源限制下平衡探索与利用。论文的关键创新在于提出了一种基于生成框架的流匹配方法,用于在线对齐机器人动力学模型。不同于传统通过一个不准确的模型进行探索,该方法通过调整计划动作本身,使其更接近于假设模型已对齐时的动作,从而以更高效的方式收集信息数据,加速学习过程。此外,该方法能够在模型不断演化且可能不完美时减少对回放缓冲区或历史模型快照的依赖。关键在于通过直接改进动作而非依赖误配模型来实现更高效的探索与数据利用。

链接: https://arxiv.org/abs/2504.18471
作者: Alejandro Murillo-Gonzalez,Lantao Liu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Robotics: Science and Systems 2025

点击查看摘要

Abstract:Continual learning in robotics seeks systems that can constantly adapt to changing environments and tasks, mirroring human adaptability. A key challenge is refining dynamics models, essential for planning and control, while addressing issues such as safe adaptation, catastrophic forgetting, outlier management, data efficiency, and balancing exploration with exploitation – all within task and onboard resource constraints. Towards this goal, we introduce a generative framework leveraging flow matching for online robot dynamics model alignment. Rather than executing actions based on a misaligned model, our approach refines planned actions to better match with those the robot would take if its model was well aligned. We find that by transforming the actions themselves rather than exploring with a misaligned model – as is traditionally done – the robot collects informative data more efficiently, thereby accelerating learning. Moreover, we validate that the method can handle an evolving and possibly imperfect model while reducing, if desired, the dependency on replay buffers or legacy model snapshots. We validate our approach using two platforms: an unmanned ground vehicle and a quadrotor. The results highlight the method’s adaptability and efficiency, with a record 34.2% higher task success rate, demonstrating its potential towards enabling continual robot learning. Code: this https URL.
zh

[AI-5] Pseudo-Boolean Proof Logging for Optimal Classical Planning ICAPS’2025

【速读】:该论文旨在解决经典规划任务中证明任务不可解性或计划最优性的验证问题,提出了一种通用框架,通过伪布尔约束生成下界证书(lower-bound certificates),以实现由独立第三方验证。解决方案的关键在于将证明过程与规划算法解耦,使得任何启发式方法只要其推理能够高效表达为伪布尔约束,均可利用此框架记录证明日志,从而实现优化证明。文中以A*算法为例,结合模式数据库启发式和h^max等具体实例,展示了如何在适度开销下生成最优性证明,并强调此方法适用于广泛的启发式策略。
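
下面仅示意“伪布尔约束”本身的语义,即带系数的布尔文字线性不等式的满足性检查;这只是概念演示,并不涉及论文中下界证书的生成与验证流程:

```python
def pb_satisfied(coeffs, lits, bound, assignment):
    """检查伪布尔约束 sum_i c_i * l_i >= bound 是否满足;
    文字 l_i 写作变量名 "x" 或其否定 "~x",变量取值 0/1。"""
    total = 0
    for c, lit in zip(coeffs, lits):
        val = assignment[lit.lstrip("~")]
        total += c * ((1 - val) if lit.startswith("~") else val)
    return total >= bound

# 约束示例:2x + y + ~z >= 2
print(pb_satisfied([2, 1, 1], ["x", "y", "~z"], 2, {"x": 1, "y": 0, "z": 1}))  # True
print(pb_satisfied([2, 1, 1], ["x", "y", "~z"], 2, {"x": 0, "y": 1, "z": 1}))  # False
```

证明日志即由一串此类约束及其推导步骤组成,第三方验证器只需逐条检查推理是否成立。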

链接: https://arxiv.org/abs/2504.18443
作者: Simon Dold,Malte Helmert,Jakob Nordström,Gabriele Röger,Tanja Schindler
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 35th International Conference on Automated Planning and Scheduling (ICAPS’2025)

点击查看摘要

Abstract:We introduce lower-bound certificates for classical planning tasks, which can be used to prove the unsolvability of a task or the optimality of a plan in a way that can be verified by an independent third party. We describe a general framework for generating lower-bound certificates based on pseudo-Boolean constraints, which is agnostic to the planning algorithm used. As a case study, we show how to modify the A^* algorithm to produce proofs of optimality with modest overhead, using pattern database heuristics and h^max as concrete examples. The same proof logging approach works for any heuristic whose inferences can be efficiently expressed as reasoning over pseudo-Boolean constraints.
zh

[AI-6] Enhancing Pre-Trained Model-Based Class-Incremental Learning through Neural Collapse

【速读】:该论文旨在解决在基于预训练模型(Pre-Trained Models, PTMs)的 Class-Incremental Learning (CIL) 中,理解特征如何在增量任务中演化与分布这一开放性挑战。论文的关键在于通过神经坍塌(Neural Collapse, NC)现象的视角,揭示特征空间的几何结构如何影响 CIL 的有效性,并提出一种新方法来动态调整特征空间以符合 NC 的优雅结构。具体而言,所提出的 Neural Collapse-inspired Pre-Trained Model-based CIL (NCPTM-CIL) 方法通过优化特征分布与 NC 几何的一致性,显著提升了连续学习过程中的性能。实验结果表明,NCPTM-CIL 在四个基准数据集上超越了现有最先进的方法。
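
神经坍塌的核心几何结构是单纯形等角紧框架(simplex ETF):各类原型等范数、任意两类原型夹角余弦均为 -1/(K-1)。下面用 NumPy 按标准定义构造并验证这一结构(仅为几何示意,非论文方法的实现):

```python
import numpy as np

def simplex_etf(K):
    """构造 K 类单纯形等角紧框架 (simplex ETF):
    M = sqrt(K/(K-1)) * (I - (1/K) * 11^T),各列即类原型。"""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

M = simplex_etf(4)
norms = np.linalg.norm(M, axis=0)
cos = (M.T @ M) / np.outer(norms, norms)
print(norms.round(3))        # 各类原型范数相等(均为 1)
print(round(cos[0, 1], 3))   # 任意两列夹角余弦 -1/(K-1) = -0.333
```

此类方法的思路即是让增量学习过程中的特征分布向这一“等角、良分离”的结构对齐。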

链接: https://arxiv.org/abs/2504.18437
作者: Kun He,Zijian Song,Shuoxi Zhang,John E. Hopcroft
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Class-Incremental Learning (CIL) is a critical capability for real-world applications, enabling learning systems to adapt to new tasks while retaining knowledge from previous ones. Recent advancements in pre-trained models (PTMs) have significantly advanced the field of CIL, demonstrating superior performance over traditional methods. However, understanding how features evolve and are distributed across incremental tasks remains an open challenge. In this paper, we propose a novel approach to modeling feature evolution in PTM-based CIL through the lens of neural collapse (NC), a striking phenomenon observed in the final phase of training, which leads to a well-separated, equiangular feature space. We explore the connection between NC and CIL effectiveness, showing that aligning feature distributions with the NC geometry enhances the ability to capture the dynamic behavior of continual learning. Based on this insight, we introduce Neural Collapse-inspired Pre-Trained Model-based CIL (NCPTM-CIL), a method that dynamically adjusts the feature space to conform to the elegant NC structure, thereby enhancing the continual learning process. Extensive experiments demonstrate that NCPTM-CIL outperforms state-of-the-art methods across four benchmark datasets. Notably, when initialized with ViT-B/16-IN1K, NCPTM-CIL surpasses the runner-up method by 6.73% on VTAB, 1.25% on CIFAR-100, and 2.5% on OmniBenchmark.
zh

[AI-7] LLMpatronous: Harnessing the Power of LLMs For Vulnerability Detection

【速读】:该论文试图解决传统静态和动态分析工具在网络安全领域因高误报率和浅层代码理解能力而存在的局限性,以及利用大语言模型(Large Language Models, LLMs)进行漏洞检测时面临的独特挑战,如幻觉现象、上下文长度限制和知识截止等问题。此外,以往基于机器学习的漏洞检测方法由于实际应用范围有限、特征工程难题、缺乏上下文理解和难以跟上威胁态势演变速度等原因未能取得理想效果。

论文的关键解决方案在于提出一种以人工智能驱动的稳健方法,通过结合检索增强生成(Retrieval-Augmented Generation, RAG)与多智能体混合(Mixture-of-Agents, MoA)等创新技术,旨在充分发挥LLMs的优势同时克服其固有缺陷,从而实现可靠且高效的AI赋能软件安全防护体系。

链接: https://arxiv.org/abs/2504.18423
作者: Rajesh Yarra
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the transformative impact of Artificial Intelligence (AI) across various sectors, cyber security continues to rely on traditional static and dynamic analysis tools, hampered by high false positive rates and superficial code comprehension. While generative AI offers promising automation capabilities for software development, leveraging Large Language Models (LLMs) for vulnerability detection presents unique challenges. This paper explores the potential and limitations of LLMs in identifying vulnerabilities, acknowledging inherent weaknesses such as hallucinations, limited context length, and knowledge cut-offs. Previous attempts employing machine learning models for vulnerability detection have proven ineffective due to limited real-world applicability, feature engineering challenges, lack of contextual understanding, and the complexities of training models to keep pace with the evolving threat landscape. Therefore, we propose a robust AI-driven approach focused on mitigating these limitations and ensuring the quality and reliability of LLM-based vulnerability detection. Through innovative methodologies combining Retrieval-Augmented Generation (RAG) and Mixture-of-Agents (MoA), this research seeks to leverage the strengths of LLMs while addressing their weaknesses, ultimately paving the way for dependable and efficient AI-powered solutions in securing the ever-evolving software landscape.
zh

[AI-8] Paradigm shift on Coding Productivity Using GenAI

【速读】:该论文试图解决工业环境中生成式 AI (GenAI) 编码助手在提升生产力方面的实证证据不足的问题。研究聚焦于电信和金融科技领域的 GenAI 工具应用,通过调查与专家访谈识别影响生产力的关键因素,包括任务复杂度、编码技能、领域知识及工具集成方式。论文指出,尽管 GenAI 工具在常规编码任务(如重构与 Javadoc 生成)中能够提高效率,但在复杂的领域特定任务中面临挑战,主要由于代码库上下文感知能力有限以及对定制化设计规则的支持不足。解决方案的关键在于引入新的编码范式,强调迭代提示优化、沉浸式开发环境以及自动化代码评估,以实现 GenAI 的有效应用。

链接: https://arxiv.org/abs/2504.18404
作者: Liang Yu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative AI (GenAI) applications are transforming software engineering by enabling automated code co-creation. However, empirical evidence on GenAI’s productivity effects in industrial settings remains limited. This paper investigates the adoption of GenAI coding assistants (e.g., Codeium, Amazon Q) within telecommunications and FinTech domains. Through surveys and interviews with industrial domain-experts, we identify primary productivity-influencing factors, including task complexity, coding skills, domain knowledge, and GenAI integration. Our findings indicate that GenAI tools enhance productivity in routine coding tasks (e.g., refactoring and Javadoc generation) but face challenges in complex, domain-specific activities due to limited context-awareness of codebases and insufficient support for customized design rules. We highlight new paradigms for coding transfer, emphasizing iterative prompt refinement, immersive development environment, and automated code evaluation as essential for effective GenAI usage.
zh

[AI-9] Bridge the Domains: Large Language Models Enhanced Cross-domain Sequential Recommendation SIGIR’25

【速读】:该论文旨在解决跨域顺序推荐(Cross-domain Sequential Recommendation, CDSR)中的两个核心问题:重叠困境(overlap dilemma)和过渡复杂性(transition complexity)。重叠困境指的是现有方法严重依赖于在所有领域都有交互记录的用户,从而限制了其实用性;过渡复杂性则涉及从混合行为序列中学习复杂的转换模式的困难。为了解决这些问题,论文提出了一个基于大语言模型(Large Language Models, LLMs)增强的跨域顺序推荐模型(LLM4CDSR)。该方案的关键在于首先通过基于LLMs的统一表示模块捕捉语义层面的物品关系,然后设计了一个可训练的适配器结合对比正则化来适应CDSR任务,并进一步利用分层LLMs画像模块总结用户的跨域偏好。最终,这些模块被整合进提出的三线程框架中以生成推荐结果。实验验证了LLM4CDSR的有效性,并已公开代码。

链接: https://arxiv.org/abs/2504.18383
作者: Qidong Liu,Xiangyu Zhao,Yejing Wang,Zijian Zhang,Howard Zhong,Chong Chen,Xiang Li,Wei Huang,Feng Tian
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: accepted by SIGIR’25

点击查看摘要

Abstract:Cross-domain Sequential Recommendation (CDSR) aims to extract the preference from the user’s historical interactions across various domains. Despite some progress in CDSR, two problems set the barrier for further advancements, i.e., overlap dilemma and transition complexity. The former means existing CDSR methods severely rely on users who own interactions on all domains to learn cross-domain item relationships, compromising the practicability. The latter refers to the difficulties in learning the complex transition patterns from the mixed behavior sequences. With powerful representation and reasoning abilities, Large Language Models (LLMs) are promising to address these two problems by bridging the items and capturing the user’s preferences from a semantic view. Therefore, we propose an LLMs Enhanced Cross-domain Sequential Recommendation model (LLM4CDSR). To obtain the semantic item relationships, we first propose an LLM-based unified representation module to represent items. Then, a trainable adapter with contrastive regularization is designed to adapt the CDSR task. Besides, a hierarchical LLMs profiling module is designed to summarize user cross-domain preferences. Finally, these two modules are integrated into the proposed tri-thread framework to derive recommendations. We have conducted extensive experiments on three public cross-domain datasets, validating the effectiveness of LLM4CDSR. We have released the code online.
zh

[AI-10] Spatial Reasoner: A 3D Inference Pipeline for XR Applications

【速读】:该论文致力于解决现代扩展现实(XR)系统中语义理解与空间推理能力不足的问题,旨在支持能够以符号化方式推理三维场景的应用需求。论文提出了一种空间推理框架,通过将几何事实与符号谓词和关系相结合,处理诸如确定三维物体间相对位置(如“在……之上”、“在……之后”、“靠近”等)的关键任务。其核心解决方案在于基于定向三维包围盒表示法,并结合一套全面的空间谓词集(涵盖拓扑、连通性、方向性和方向),这些谓词以接近自然语言的形式表达。由此产生的谓词构成空间知识图谱,结合基于流水线的推理模型,实现了空间查询和动态规则评估。客户端和服务端实现展示了该框架能够高效地将几何数据转化为可操作的知识,确保复杂三维环境中的可扩展且技术无关的空间推理能力。关键之处在于其创新性的空间表示方法及其与机器学习、自然语言处理及规则系统的无缝集成,从而促进了XR应用中空间本体论的构建。
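
下面用轴对齐包围盒(AABB)对论文中定向包围盒上的空间谓词做一个极简示意,实现 'on'(在……之上)与 'near'(靠近)两个谓词;阈值与盒子坐标均为假设,间隙采用各轴最大间隙的简化度量:

```python
def near(a, b, eps=0.2):
    """若两个 axis-aligned 包围盒的最大单轴间隙小于 eps,
    判定 'near'。盒子表示为 (min_xyz, max_xyz) 两个三元组。"""
    gap = 0.0
    for i in range(3):
        d = max(a[0][i] - b[1][i], b[0][i] - a[1][i], 0.0)
        gap = max(gap, d)
    return gap < eps

def on_top_of(a, b, eps=0.05):
    """a 的底面贴近 b 的顶面(z 轴),且二者水平投影重叠。"""
    vertical = abs(a[0][2] - b[1][2]) < eps
    overlap = all(a[0][i] < b[1][i] and b[0][i] < a[1][i] for i in range(2))
    return vertical and overlap

table = ((0.0, 0.0, 0.0), (1.0, 1.0, 0.8))
cup = ((0.4, 0.4, 0.8), (0.5, 0.5, 0.9))
print(on_top_of(cup, table), near(cup, table))  # True True
```

由此类谓词得到的 (主体, 关系, 客体) 三元组即可汇入空间知识图谱,支撑后续的查询与规则推理。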

链接: https://arxiv.org/abs/2504.18380
作者: Steven Häsler,Philipp Ackermann
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: 11 pages, preprint of ICVARS 2025 paper

点击查看摘要

Abstract:Modern extended reality (XR) systems provide rich analysis of image data and fusion of sensor input, and demand AR/VR applications that can reason about 3D scenes in a semantic manner. We present a spatial reasoning framework that bridges geometric facts with symbolic predicates and relations to handle key tasks such as determining how 3D objects are arranged among each other (‘on’, ‘behind’, ‘near’, etc.). Its foundation relies on oriented 3D bounding box representations, enhanced by a comprehensive set of spatial predicates, ranging from topology and connectivity to directionality and orientation, expressed in a formalism related to natural language. The derived predicates form a spatial knowledge graph and, in combination with a pipeline-based inference model, enable spatial queries and dynamic rule evaluation. Implementations for client- and server-side processing demonstrate the framework’s capability to efficiently translate geometric data into actionable knowledge, ensuring scalable and technology-independent spatial reasoning in complex 3D environments. The Spatial Reasoner framework is fostering the creation of spatial ontologies, and seamlessly integrates with and therefore enriches machine learning, natural language processing, and rule systems in XR applications.
zh

[AI-11] Testing Individual Fairness in Graph Neural Networks

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)中个体公平性(individual fairness)的问题。尽管已有大量研究关注于诊断和缓解各类人工智能模型中的偏差,但针对GNNs的个体公平性研究相对匮乏。与传统模型不同,GNNs通过捕捉节点间的关系来建模数据的图结构,这种特性虽然能够建模复杂依赖关系,但也导致偏差可能通过节点间的连接传播,从而增加检测和缓解个体公平性违规的难度。
论文的关键解决方案在于开发一个测试框架,用于评估和保障GNNs中的个体公平性。具体而言,该框架首先系统性回顾个体公平性的相关文献,分类现有定义、度量、测试及缓解模型偏差的方法,并构建个体公平性的分类学;其次,通过调整和扩展现有的公平性测试与缓解技术,设计适用于GNNs的公平性测试与保障方法,并通过工业案例研究进行验证,重点关注基于图的大规模语言模型。

链接: https://arxiv.org/abs/2504.18353
作者: Roya Nasiri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages

点击查看摘要

Abstract:The biases in artificial intelligence (AI) models can lead to automated decision-making processes that discriminate against groups and/or individuals based on sensitive properties such as gender and race. While there are many studies on diagnosing and mitigating biases in various AI models, there is little research on individual fairness in Graph Neural Networks (GNNs). Unlike traditional models, which treat data features independently and overlook their inter-relationships, GNNs are designed to capture graph-based structure where nodes are interconnected. This relational approach enables GNNs to model complex dependencies, but it also means that biases can propagate through these connections, complicating the detection and mitigation of individual fairness violations. This PhD project aims to develop a testing framework to assess and ensure individual fairness in GNNs. It first systematically reviews the literature on individual fairness, categorizing existing approaches to define, measure, test, and mitigate model biases, creating a taxonomy of individual fairness. Next, the project will develop a framework for testing and ensuring fairness in GNNs by adapting and extending current fairness testing and mitigation techniques. The framework will be evaluated through industrial case studies, focusing on graph-based large language models.
zh

[AI-12] PHEATPRUNER: Interpretable Data-centric Feature Selection for Multivariate Time Series Classification through Persistent Homology

【速读】:该论文旨在解决多变量时间序列分类中模型性能与可解释性之间的平衡难题,特别是在数据复杂性和高维特性下。论文提出了一种名为PHeatPruner的方法,其关键在于结合持久同调(Persistent Homology)与层论(Sheaf Theory)。持久同调通过修剪高达45%的变量来简化数据,同时保持或提升多种模型(如Random Forest、CatBoost、XGBoost和LightGBM)的准确性,且无需依赖后验概率或监督优化算法;层论则提供了解释向量,揭示数据结构的深层次特征。这种方法在UEA Archive及奶牛乳腺炎检测数据集上的验证表明,PHeatPruner不仅有效维持了模型精度,还实现了复杂性的降低与可解释性的增强,为多个领域的应用提供了潜在价值。
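
下面示意持久同调中最基础的 0 维条形码计算:基于最小生成树(单链聚类),各连通分量在尺度 0 出生、在对应 MST 边长处死亡。这只演示持久同调本身,论文中据此修剪变量的具体准则并未在摘要中给出:

```python
import numpy as np

def zero_dim_persistence(points):
    """0 维持久同调条形码(单链/最小生成树视角):
    所有连通分量在尺度 0 出生,死亡时刻为合并发生时的
    MST 边长;最终存活的一个分量寿命无穷,此处省略。"""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # 路径折半
            i = parent[i]
        return i

    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    deaths = []
    for w, i, j in edges:                  # Kruskal:每次合并记录一条 bar 的死亡
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(w)
    return deaths

pts = np.array([[0.0], [0.1], [5.0], [5.1]])
bars = zero_dim_persistence(pts)
print(bars)  # 两条短 bar(约 0.1)与一条在 4.9 处才死亡的长 bar
```

长寿 bar 对应稳定的拓扑结构,短寿 bar 则多为噪声,这正是利用持久性区分“保留”与“修剪”的直观依据。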

链接: https://arxiv.org/abs/2504.18329
作者: Anh-Duy Pham,Olivier Basole Kashongwe,Martin Atzmueller,Tim Römer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Balancing performance and interpretability in multivariate time series classification is a significant challenge due to data complexity and high dimensionality. This paper introduces PHeatPruner, a method integrating persistent homology and sheaf theory to address these challenges. Persistent homology facilitates the pruning of up to 45% of the applied variables while maintaining or enhancing the accuracy of models such as Random Forest, CatBoost, XGBoost, and LightGBM, all without depending on posterior probabilities or supervised optimization algorithms. Concurrently, sheaf theory contributes explanatory vectors that provide deeper insights into the data’s structural nuances. The approach was validated using the UEA Archive and a mastitis detection dataset for dairy cows. The results demonstrate that PHeatPruner effectively preserves model accuracy. Furthermore, our results highlight PHeatPruner’s key features, i.e. simplifying complex data and offering actionable insights without increasing processing time or complexity. This method bridges the gap between complexity reduction and interpretability, suggesting promising applications in various fields.
zh

[AI-13] Towards Adaptive Software Agents for Debugging

【速读】:该论文试图解决在利用多个大型语言模型(Large Language Models, LLM)代理进行错误调试时,随着代理数量增加所带来的运行成本上升以及注意力分散风险加剧的问题。论文的关键解决方案在于提出了一种自适应代理设计,即通过动态分析任务特性来确定代理的数量及其角色,而非预先定义代理的角色。这种设计使得生成的代理数量能够根据待修复代码的复杂度自动调整,在简单语法问题上通常仅需一个代理即可解决问题,而在更复杂的场景下则会生成更多代理。实验结果显示,与一次性提示方法相比,该方法的修复效果平均提升了11%,从而验证了自适应设计的有效性。

链接: https://arxiv.org/abs/2504.18316
作者: Yacine Majdoub,Eya Ben Charrada,Haifa Touati
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, FSE2025

点击查看摘要

Abstract:Using multiple agents was found to improve the debugging capabilities of Large Language Models. However, increasing the number of LLM-agents has several drawbacks such as increasing the running costs and rising the risk for the agents to lose focus. In this work, we propose an adaptive agentic design, where the number of agents and their roles are determined dynamically based on the characteristics of the task to be achieved. In this design, the agents roles are not predefined, but are generated after analyzing the problem to be solved. Our initial evaluation shows that, with the adaptive design, the number of agents that are generated depends on the complexity of the buggy code. In fact, for simple code with mere syntax issues, the problem was usually fixed using one agent only. However, for more complex problems, we noticed the creation of a higher number of agents. Regarding the effectiveness of the fix, we noticed an average improvement of 11% compared to the one-shot prompting. Given these promising results, we outline future research directions to improve our design for adaptive software agents that can autonomously plan and conduct their software goals.
zh

[AI-14] LEAM: A Prompt-only Large Language Model-enabled Antenna Modeling Method

【速读】:该论文试图解决天线建模耗时且复杂的问题,这限制了天线分析与设计的速度。论文提出的解决方案是基于大型语言模型(Large Language Model, LLM)的天线建模方法LEAM。LEAM的关键在于通过提示输入(prompt input)、图像以及从学术论文、专利和技术报告中提取的描述(可以是单一来源或多源融合),实现基于自然语言描述的天线模型自动生成功能。实验结果表明,LEAM能够在几分钟内生成正确的天线模型,从而显著提高天线建模的效率。

链接: https://arxiv.org/abs/2504.18271
作者: Tao Wu,Kexue Fu,Qiang Hua,Xinxin Liu,Muhammad Ali Imran,Bo Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
备注: Code are available: this https URL

点击查看摘要

Abstract:Antenna modeling is a time-consuming and complex process, decreasing the speed of antenna analysis and design. In this paper, a large language model (LLM)- enabled antenna modeling method, called LEAM, is presented to address this challenge. LEAM enables automatic antenna model generation based on language descriptions via prompt input, images, descriptions from academic papers, patents, and technical reports (either one or multiple). The effectiveness of LEAM is demonstrated by three examples: a Vivaldi antenna generated from a complete user description, a slotted patch antenna generated from an incomplete user description and the operating frequency, and a monopole slotted antenna generated from images and descriptions scanned from the literature. For all the examples, correct antenna models are generated in a few minutes. The code can be accessed via this https URL.
zh

[AI-15] Neural operators struggle to learn complex PDEs in pedestrian mobility: Hughes model case study

【速读】:该论文旨在研究神经算子(neural operators)在学习Hughes模型解方面的局限性,该模型是一个用于人群动力学的一阶双曲守恒律系统,结合了描述行人密度的Fokker-Planck方程与Hamilton-Jacobi型(Eikonal)方程。论文的关键在于评估三种最先进的神经算子(傅里叶神经算子、小波神经算子和多小波神经算子)在处理具有挑战性场景时的表现,包括不连续和高斯初始条件以及多样化的边界条件,并考察不同数值格式的影响。研究发现,尽管这些神经算子在初始条件较少不连续的情况下表现良好,但在存在多个初始不连续点和动态边界条件的复杂场景中,即使经过专门训练,它们仍表现出困难。预测结果通常更为平滑,导致总变差减少且丢失重要的物理特征,这类似于Daganzo(1995)讨论的问题,即引入人工扩散的模型会遗漏如双曲系统中的冲击波等重要特性。这些结果表明,当前的神经算子架构可能无意中引入了正则化效应,限制了其捕捉由不连续性驱动的传输动力学的能力,同时也引发了对其在交通应用中推广使用的担忧,特别是在冲击波保持至关重要的情况下。

链接: https://arxiv.org/abs/2504.18267
作者: Prajwal Chauhan,Salah Eddine Choutri,Mohamed Ghattassi,Nader Masmoudi,Saif Eddin Jabari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 15 figures, 6 tables, under review at Artificial Intelligence for Transportation | Journal

点击查看摘要

Abstract:This paper investigates the limitations of neural operators in learning solutions for a Hughes model, a first-order hyperbolic conservation law system for crowd dynamics. The model couples a Fokker-Planck equation representing pedestrian density with a Hamilton-Jacobi-type (eikonal) equation. This Hughes model belongs to the class of nonlinear hyperbolic systems that often exhibit complex solution structures, including shocks and discontinuities. In this study, we assess the performance of three state-of-the-art neural operators (Fourier Neural Operator, Wavelet Neural Operator, and Multiwavelet Neural Operator) in various challenging scenarios. Specifically, we consider (1) discontinuous and Gaussian initial conditions and (2) diverse boundary conditions, while also examining the impact of different numerical schemes. Our results show that these neural operators perform well in easy scenarios with fewer discontinuities in the initial condition, yet they struggle in complex scenarios with multiple initial discontinuities and dynamic boundary conditions, even when trained specifically on such complex samples. The predicted solutions often appear smoother, resulting in a reduction in total variation and a loss of important physical features. This smoothing behavior is similar to issues discussed by Daganzo (1995), where models that introduce artificial diffusion were shown to miss essential features such as shock waves in hyperbolic systems. These results suggest that current neural operator architectures may introduce unintended regularization effects that limit their ability to capture transport dynamics governed by discontinuities. They also raise concerns about generalizing these methods to traffic applications where shock preservation is essential. 
zh

[AI-16] Depth-Constrained ASV Navigation with Deep RL and Limited Sensing

【速读】:该论文旨在解决自主水面艇(ASVs)在浅水环境中因动态干扰和深度限制而面临的导航挑战,特别是传统导航策略在有限传感器信息下的安全与高效运行难题。论文的关键解决方案是提出了一种结合高斯过程(Gaussian Process, GP)回归的强化学习(Reinforcement Learning, RL)框架。该框架允许ASV仅基于每次时间步长从下视单波束回声测深仪(Single Beam Echosounder, SBES)获取的一个深度测量值,逐步构建海底地形图,并通过增强环境感知能力优化决策过程,从而实现目标到达的同时避免危险区域,同时确保在模拟到现实迁移中的泛化性能。

链接: https://arxiv.org/abs/2504.18253
作者: Amirhossein Zhalehmehrabi,Daniele Meli,Francesco Dal Santo,Francesco Trotti,Alessandro Farinelli
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures

点击查看摘要

Abstract:Autonomous Surface Vehicles (ASVs) play a crucial role in maritime operations, yet their navigation in shallow-water environments remains challenging due to dynamic disturbances and depth constraints. Traditional navigation strategies struggle with limited sensor information, making safe and efficient operation difficult. In this paper, we propose a reinforcement learning (RL) framework for ASV navigation under depth constraints, where the vehicle must reach a target while avoiding unsafe areas with only a single depth measurement per timestep from a downward-facing Single Beam Echosounder (SBES). To enhance environmental awareness, we integrate Gaussian Process (GP) regression into the RL framework, enabling the agent to progressively estimate a bathymetric depth map from sparse sonar readings. This approach improves decision-making by providing a richer representation of the environment. Furthermore, we demonstrate effective sim-to-real transfer, ensuring that trained policies generalize well to real-world aquatic conditions. Experimental results validate our method’s capability to improve ASV navigation performance while maintaining safety in challenging shallow-water environments.
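摘要中用 GP 回归从稀疏声呐读数逐步估计水深图的思路,可以用如下一维简化示意(假设性示例,并非论文实现;观测位置与深度均为虚构,RL 部分省略):

```python
import numpy as np

def rbf(a, b, length=2.0):
    # RBF(平方指数)核:两组一维坐标之间的协方差
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length) ** 2)

# 稀疏的单波束测深读数(位置 -> 深度),数值为演示假设
x_obs = np.array([0.0, 2.0, 5.0, 9.0])
y_obs = np.array([3.0, 2.5, 1.0, 4.0])
x_query = np.linspace(0.0, 9.0, 10)       # 需要估计深度的网格点

K = rbf(x_obs, x_obs) + 1e-6 * np.eye(len(x_obs))     # 加抖动保证数值稳定
mu = rbf(x_query, x_obs) @ np.linalg.solve(K, y_obs)  # GP 后验均值深度图
print(np.round(mu, 2))                    # 观测点处还原测量值,其余位置平滑插值
```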
zh

[AI-17] Time and Frequency Domain-based Anomaly Detection in Smart Meter Data for Distribution Network Studies

【速读】:本文旨在解决现有基于数据驱动的算法未能充分考虑智能电表采集数据质量的问题,特别是缺乏对异常值的有效检测与区分能力,无法区分基于数值偏离或上下文异常的数据偏差。论文的关键在于提出了一种基于Isolation Forest机器学习算法和快速傅里叶变换(Fast Fourier Transform, FFT)滤波的异常检测框架,该框架能够在时域和频域同时工作,并且不受点异常值或上下文异常值的影响,从而有效识别并减轻异常对有功和无功功率数据集的影响。通过这种方法,论文强调了在高比例智能电表的配电网中集成异常检测方法的重要性。

链接: https://arxiv.org/abs/2504.18231
作者: Petar Labura,Tomislav Antic,Tomislav Capuder
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The widespread integration of new technologies in low-voltage distribution networks on the consumer side creates the need for distribution system operators to perform advanced real-time calculations to estimate network conditions. In recent years, data-driven models based on machine learning and big data analysis have emerged for calculation purposes, leveraging the information available in large datasets obtained from smart meters and other advanced measurement infrastructure. However, existing data-driven algorithms do not take into account the quality of data collected from smart meters. They lack built-in anomaly detection mechanisms and fail to differentiate anomalies based on whether the value or context of anomalous data instances deviates from the norm. This paper focuses on methods for detecting and mitigating the impact of anomalies on the consumption of active and reactive power datasets. It proposes an anomaly detection framework based on the Isolation Forest machine learning algorithm and Fast Fourier Transform filtering that works in both the time and frequency domain and is unaffected by point anomalies or contextual anomalies of the power consumption data. The importance of integrating anomaly detection methods is demonstrated in the analysis important for distribution networks with a high share of smart meters.
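摘要所述"时域 + 频域"框架中的频域滤波一步可按下述方式示意(假设性草图,并非论文实现;Isolation Forest 部分省略,数据与阈值均为虚构):

```python
import numpy as np

# 构造一条含周期成分的用电曲线,并注入一个点异常(数值为演示假设)
t = np.arange(256)
load = 1.0 + 0.3 * np.sin(2 * np.pi * t / 64)
load[100] += 2.0                          # 点异常

# 频域滤波:仅保留低频分量,得到平滑基线
spectrum = np.fft.rfft(load)
spectrum[10:] = 0.0                       # 截断高频
smooth = np.fft.irfft(spectrum, n=len(load))

# 时域残差超过 3 sigma 即判为异常
residual = np.abs(load - smooth)
threshold = residual.mean() + 3 * residual.std()
anomalies = np.where(residual > threshold)[0]
print(anomalies)                          # 注入的异常点 100 被检出
```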
zh

[AI-18] Learning to fuse: dynamic integration of multi-source data for accurate battery lifespan prediction

【速读】:该论文旨在解决锂离子电池寿命精确预测的问题,这对于确保电动汽车和智能电网等应用中的运行可靠性和降低维护成本至关重要。论文提出了一种结合动态多源数据融合与堆叠集成(Stacked Ensemble, SE)建模方法的混合学习框架。解决方案的关键在于采用熵基动态加权机制来减轻异构数据集之间的变异性,并通过将Ridge回归、长短期记忆网络(LSTM)和极端梯度提升(XGBoost)相结合的SE模型,有效捕捉时间依赖性和非线性退化模式,最终实现高精度预测(MAE=0.0058,RMSE=0.0092,R²=0.9839)。此外,基于Shapley加性解释(SHAP)的分析进一步确认了放电容量差(Qdlin)和测量温度(Temp_m)作为关键老化指标的重要性。

链接: https://arxiv.org/abs/2504.18230
作者: He Shanxuan,Lin Zuhong,Yu Bolun,Gao Xu,Long Biao,Yao Jingjing
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate prediction of lithium-ion battery lifespan is vital for ensuring operational reliability and reducing maintenance costs in applications like electric vehicles and smart grids. This study presents a hybrid learning framework for precise battery lifespan prediction, integrating dynamic multi-source data fusion with a stacked ensemble (SE) modeling approach. By leveraging heterogeneous datasets from the National Aeronautics and Space Administration (NASA), Center for Advanced Life Cycle Engineering (CALCE), MIT-Stanford-Toyota Research Institute (TRC), and nickel cobalt aluminum (NCA) chemistries, an entropy-based dynamic weighting mechanism mitigates variability across heterogeneous datasets. The SE model combines Ridge regression, long short-term memory (LSTM) networks, and eXtreme Gradient Boosting (XGBoost), effectively capturing temporal dependencies and nonlinear degradation patterns. It achieves a mean absolute error (MAE) of 0.0058, root mean square error (RMSE) of 0.0092, and coefficient of determination (R2) of 0.9839, outperforming established baseline models with a 46.2% improvement in R2 and an 83.2% reduction in RMSE. Shapley additive explanations (SHAP) analysis identifies differential discharge capacity (Qdlin) and temperature of measurement (Temp_m) as critical aging indicators. This scalable, interpretable framework enhances battery health management, supporting optimized maintenance and safety across diverse energy storage systems, thereby contributing to improved battery health management in energy storage systems.
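摘要提到的"基于熵的动态加权"未给出公式,以下为基于经典熵权法的假设性示意(并非论文公开的具体机制,曲线数据为虚构):

```python
import numpy as np

def entropy_weights(sources):
    # 经典熵权法:归一化为分布后计算香农熵,熵越低(信息量越大)权重越高
    weights = []
    for s in sources:
        p = np.abs(s) / np.abs(s).sum()
        h = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))  # 归一化熵,落在 [0, 1]
        weights.append(1.0 - h)
    w = np.array(weights)
    return w / w.sum()

# 两条示意性的容量衰减曲线(并非 NASA/CALCE 真实数据)
nasa = np.linspace(1.0, 0.8, 50)
calce = np.linspace(1.0, 0.8, 50) + 0.05 * np.random.default_rng(2).standard_normal(50)
w = entropy_weights([nasa, calce])
print(np.round(w, 3))                     # 权重非负且和为 1
```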
zh

[AI-19] Offline Learning of Controllable Diverse Behaviors ICLR2025

【速读】:该论文旨在解决传统模仿学习(Imitation Learning, IL)方法在处理多样化行为数据集时存在的两个主要问题:1) 无法充分再现演示的实际多样性;2) 缺乏对轨迹生成的有效控制。为克服这些局限性,论文提出了一种新的方法,其关键是结合时序一致性(Temporal Consistency)与可控性(Controllability)两个核心特性。时序一致性确保在整个情节(episode)中行为的一致性,而不仅仅是过渡层面的多样性;可控性通过构建行为潜在空间实现,使用户能够根据需求选择性激活特定行为,从而实现对轨迹生成的精确控制。

链接: https://arxiv.org/abs/2504.18160
作者: Mathieu Petitbois,Rémy Portelas,Sylvain Lamprier,Ludovic Denoyer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Generative Models for Robot Learning Workshop at ICLR 2025

点击查看摘要

Abstract:Imitation Learning (IL) techniques aim to replicate human behaviors in specific tasks. While IL has gained prominence due to its effectiveness and efficiency, traditional methods often focus on datasets collected from experts to produce a single efficient policy. Recently, extensions have been proposed to handle datasets of diverse behaviors by mainly focusing on learning transition-level diverse policies or on performing entropy maximization at the trajectory level. While these methods may lead to diverse behaviors, they may not be sufficient to reproduce the actual diversity of demonstrations or to allow controlled trajectory generation. To overcome these drawbacks, we propose a different method based on two key features: a) Temporal Consistency that ensures consistent behaviors across entire episodes and not just at the transition level as well as b) Controllability obtained by constructing a latent space of behaviors that allows users to selectively activate specific behaviors based on their requirements. We compare our approach to state-of-the-art methods over a diverse set of tasks and environments. Project page: this https URL
zh

[AI-20] Learning from Less: SINDy Surrogates in RL ICLR2025

【速读】:该论文旨在解决在强化学习(Reinforcement Learning, RL)中构建代理环境(surrogate environment)的问题。传统方法通常需要较高的计算成本,而该研究提出利用稀疏非线性动力学识别(Sparse Identification of Nonlinear Dynamics, SINDy)算法来高效构建代理环境。关键在于通过SINDy算法能够以较低的计算开销(减少20%-35%)准确捕捉原始环境的动力学特性,同时实现极高的状态相关性(超过0.997)和极低的均方误差(如Mountain Car速度的3.11e-06和Lunar Lander位置的1.42e-06)。这种高效的代理环境构建方法显著减少了RL智能体所需的训练步数,同时保持了与原始环境相当的性能和收敛特性。

链接: https://arxiv.org/abs/2504.18113
作者: Aniket Dixit,Muhammad Ibrahim Khan,Faizan Ahmed,James Brusey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: World Models @ ICLR 2025

点击查看摘要

Abstract:This paper introduces an approach for developing surrogate environments in reinforcement learning (RL) using the Sparse Identification of Nonlinear Dynamics (SINDy) algorithm. We demonstrate the effectiveness of our approach through extensive experiments in OpenAI Gym environments, particularly Mountain Car and Lunar Lander. Our results show that SINDy-based surrogate models can accurately capture the underlying dynamics of these environments while reducing computational costs by 20-35%. With only 75 interactions for Mountain Car and 1000 for Lunar Lander, we achieve state-wise correlations exceeding 0.997, with mean squared errors as low as 3.11e-06 for Mountain Car velocity and 1.42e-06 for LunarLander position. RL agents trained in these surrogate environments require fewer total steps (65,075 vs. 100,000 for Mountain Car and 801,000 vs. 1,000,000 for Lunar Lander) while achieving comparable performance to those trained in the original environments, exhibiting similar convergence patterns and final performance metrics. This work contributes to the field of model-based RL by providing an efficient method for generating accurate, interpretable surrogate environments.
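SINDy 的核心是在候选函数库上做稀疏回归以选出主导动力学项,可用如下最小示例说明(玩具系统 dx/dt = -2x,并非论文的 Gym 实验设置):

```python
import numpy as np

# 从轨迹数据中恢复稀疏动力学 dx/dt = -2x
x = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)
dxdt = -2.0 * x

# 候选函数库:[1, x, x^2, x^3]
theta = np.hstack([np.ones_like(x), x, x**2, x**3])

# 序贯阈值最小二乘(STLSQ),SINDy 的核心迭代
xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
for _ in range(10):
    small = np.abs(xi) < 0.1
    xi[small] = 0.0                       # 硬阈值保证稀疏性
    big = ~small.flatten()
    xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
print(np.round(xi.flatten(), 3))          # 仅 x 项存活,系数约为 -2
```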
zh

[AI-21] Combating the Bucket Effect: Multi-Knowledge Alignment for Medication Recommendation

【速读】:该论文旨在解决药物推荐中的“桶效应”(bucket effect)问题,即由于不同药物的数据模态(如文本描述与结构化数据)分布不均衡,导致现有模型性能受限。论文的关键解决方案在于提出了一种跨模态药物编码器(cross-modal medication encoder),能够将来自不同模态的知识数据无缝对齐到统一空间,并进一步结合多模态知识表示与患者电子健康记录(EHR)进行推荐。具体而言,通过在五种知识模态上利用对比学习预训练编码器实现模态对齐,然后将多模态药物表示与患者数据相结合,从而有效缓解“桶效应”,显著提升了推荐的准确性和安全性。

链接: https://arxiv.org/abs/2504.18096
作者: Xiang Li,Haixu Ma,Guanyong Wu,Shi Mu,Chen Li,Shunpan Liang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 5 figures

点击查看摘要

Abstract:Medication recommendation is crucial in healthcare, offering effective treatments based on patient’s electronic health records (EHR). Previous studies show that integrating more medication-related knowledge improves medication representation accuracy. However, not all medications encompass multiple types of knowledge data simultaneously. For instance, some medications provide only textual descriptions without structured data. This imbalance in data availability limits the performance of existing models, a challenge we term the “bucket effect” in medication recommendation. Our data analysis uncovers the severity of the “bucket effect” in medication recommendation. To fill this gap, we introduce a cross-modal medication encoder capable of seamlessly aligning data from different modalities and propose a medication recommendation framework to integrate Multiple types of Knowledge, named MKMed. Specifically, we first pre-train a cross-modal encoder with contrastive learning on five knowledge modalities, aligning them into a unified space. Then, we combine the multi-knowledge medication representations with patient records for recommendations. Extensive experiments on the MIMIC-III and MIMIC-IV datasets demonstrate that MKMed mitigates the “bucket effect” in data, and significantly outperforms state-of-the-art baselines in recommendation accuracy and safety.
zh

[AI-22] Efficient GNN Training Through Structure-Aware Randomized Mini-Batching

【速读】:该论文旨在解决现有图神经网络(Graph Neural Networks, GNNs)小批量训练方法中存在的效率与准确性之间的权衡问题。当前的小批量构建策略大多忽视了GNN训练的效率考量,随机化方案虽能提高精度和收敛性,但忽略了图的结构特性(如社区结构),导致内存访问模式不规则且GPU缓存利用率低;而仅基于图结构的确定性小批量方法虽然运行速度快,但缺乏随机性会影响模型最终精度和训练收敛速度。论文提出了一种名为Community-structure-aware Randomized Mini-batching(COMM-RAND)的新方法,其关键在于通过结合图的社区结构特性和适度的随机性,在小批量构建过程中探索纯随机性和纯结构感知之间的空间,从而在保持相似精度的同时显著提升GNN训练效率。实验结果显示,COMM-RAND将GNN训练时间减少了多达2.76倍(平均1.8倍),同时达到的精度与主流随机小批量方法相比仅相差1.79个百分点(平均0.42个百分点)。

链接: https://arxiv.org/abs/2504.18082
作者: Vignesh Balaji,Christos Kozyrakis,Gal Chechik,Haggai Maron
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) enable learning on realworld graphs and mini-batch training has emerged as the de facto standard for training GNNs because it can scale to very large graphs and improve convergence. Current mini-batch construction policies largely ignore efficiency considerations of GNN training. Specifically, existing mini-batching techniques employ randomization schemes to improve accuracy and convergence. However, these randomization schemes are often agnostic to the structural properties of the graph (for eg. community structure), resulting in highly irregular memory access patterns during GNN training that make suboptimal use of on-chip GPU caches. On the other hand, while deterministic mini-batching based solely on graph structure delivers fast runtime performance, the lack of randomness compromises both the final model accuracy and training convergence speed. In this paper, we present Community-structure-aware Randomized Mini-batching (COMM-RAND), a novel methodology that bridges the gap between the above extremes. COMM-RAND allows practitioners to explore the space between pure randomness and pure graph structural awareness during mini-batch construction, leading to significantly more efficient GNN training with similar accuracy. We evaluated COMM-RAND across four popular graph learning benchmarks. COMM-RAND cuts down GNN training time by up to 2.76x (1.8x on average) while achieving an accuracy that is within 1.79% points (0.42% on average) compared to popular random mini-batching approaches.
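COMM-RAND 在"纯随机"与"纯结构感知"之间取折中的思路,可用如下简化示意(假设性草图,真实实现中社区划分与批构造策略可调且更复杂):

```python
import random

def comm_rand_batches(communities, batch_size, seed=0):
    # 社区间与社区内均引入随机性,但批内节点尽量来自同一社区,保持访存局部性
    rng = random.Random(seed)
    comms = [list(c) for c in communities]
    rng.shuffle(comms)                    # 随机化社区顺序
    for c in comms:
        rng.shuffle(c)                    # 随机化社区内节点顺序
    stream = [n for c in comms for n in c]
    return [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]

batches = comm_rand_batches([[0, 1, 2, 3], [4, 5, 6, 7]], batch_size=4)
print(batches)                            # 每个批次恰好覆盖一个社区
```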
zh

[AI-23] Privacy-Preserving Personalized Federated Learning for Distributed Photovoltaic Disaggregation under Statistical Heterogeneity

【速读】:该论文旨在解决分布式光伏(Distributed Photovoltaic, PV)发电量估算的问题,特别是针对“表后”系统难以直接观测的挑战。随着全球范围内分布式光伏系统的快速扩展,净负荷(Net Load)中的光伏成分分离(即PV disaggregation)变得尤为重要,以维持电网的供需平衡并优化能源管理。然而,由于隐私保护需求以及统计异质性(Statistical Heterogeneity)的增加,传统方法面临困难。统计异质性源于不同产消者(Prosumer)在地理位置和行为模式上的差异。

为了解决这些问题,论文提出了一种基于个性化联邦学习(Personalized Federated Learning, PFL)的隐私保护分布式光伏分解框架。该方案的关键在于结合本地模型与全局模型的双层架构:本地层面采用基于Transformer的模型生成太阳能辐照嵌入(Solar Irradiance Embeddings),用于表示局部光伏条件,并通过自适应本地聚合机制缓解统计异质性的影响;全局层面通过中央服务器实现跨数据中心的知识共享,同时确保数据隐私。实验结果表明,该框架相比基准方法在准确性和鲁棒性方面均有所提升。

链接: https://arxiv.org/abs/2504.18078
作者: Xiaolu Chen,Chenghao Huang,Yanru Zhang,Hao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:The rapid expansion of distributed photovoltaic (PV) installations worldwide, many being behind-the-meter systems, has significantly challenged energy management and grid operations, as unobservable PV generation further complicates the supply-demand balance. Therefore, estimating this generation from net load, known as PV disaggregation, is critical. Given privacy concerns and the need for large training datasets, federated learning becomes a promising approach, but statistical heterogeneity, arising from geographical and behavioral variations among prosumers, poses new challenges to PV disaggregation. To overcome these challenges, a privacy-preserving distributed PV disaggregation framework is proposed using Personalized Federated Learning (PFL). The proposed method employs a two-level framework that combines local and global modeling. At the local level, a transformer-based PV disaggregation model is designed to generate solar irradiance embeddings for representing local PV conditions. A novel adaptive local aggregation mechanism is adopted to mitigate the impact of statistical heterogeneity on the local model, extracting a portion of global information that benefits the local model. At the global level, a central server aggregates information uploaded from multiple data centers, preserving privacy while enabling cross-center knowledge sharing. Experiments on real-world data demonstrate the effectiveness of this proposed framework, showing improved accuracy and robustness compared to benchmark methods.
zh

[AI-24] LLM-Guided Open RAN: Empowering Hierarchical RAN Intelligent Control

【速读】:该论文试图解决无线通信网络中传统资源管理方法在灵活性和效率方面的不足问题,特别是在开放无线电接入网络(O-RAN)架构下,通过引入生成式AI(Generative AI)与强化学习(Reinforcement Learning, RL)的协同机制提升分布式无线资源管理能力。论文的关键解决方案在于提出了一种基于大语言模型(Large Language Models, LLMs)赋能的分层智能无线接入网控制器(hierarchical Radio Access Network Intelligent Controller, hRIC)框架,即LLM-hRIC。该框架通过将LLMs与RL相结合,使非实时(non-real-time, non-RT)RIC提供基于环境上下文的战略性指导和高层策略,而强化学习赋能的近实时(near-real-time, near-RT)RIC则执行低延迟任务。这种分层协作机制显著提升了跨时间尺度的资源调度效率与系统性能。

链接: https://arxiv.org/abs/2504.18062
作者: Lingyan Bao,Sinwoong Yun,Jemin Lee,Tony Q.S. Quek
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have led to a significant interest in deploying LLMempowered algorithms for wireless communication networks. Meanwhile, open radio access network (O-RAN) techniques offer unprecedented flexibility, with the non-real-time (non-RT) radio access network (RAN) intelligent controller (RIC) (non-RT RIC) and near-real-time (near-RT) RIC (near-RT RIC) components enabling intelligent resource management across different time scales. In this paper, we propose the LLM empowered hierarchical RIC (LLM-hRIC) framework to improve the collaboration between RICs. This framework integrates LLMs with reinforcement learning (RL) for efficient network resource management. In this framework, LLMs-empowered non-RT RICs provide strategic guidance and high-level policies based on environmental context. Concurrently, RL-empowered near-RT RICs perform low-latency tasks based on strategic guidance and local near-RT observation. We evaluate the LLM-hRIC framework in an integrated access and backhaul (IAB) network setting. Simulation results demonstrate that the proposed framework achieves superior performance. Finally, we discuss the key future challenges in applying LLMs to O-RAN.
zh

[AI-25] Opportunistic Collaborative Planning with Large Vision Model Guided Control and Joint Query-Service Optimization

【速读】:该论文旨在解决在开放场景中自动驾驶车辆导航时处理未知对象所面临的挑战。现有方法要么依赖于泛化能力较弱的小型模型,要么依赖于资源消耗大的大型模型,而两者的协作虽提供了一种有前景的解决方案,但关键在于如何决定何时以及如何启用大型模型。为此,论文提出了一种机会性协同规划(Opportunistic Collaborative Planning, OCP)方法,通过两项关键创新将高效的本地模型与强大的云模型无缝集成。首先,提出基于大型视觉模型引导的模型预测控制(Large Vision Model-Guided Model Predictive Control, LVM-MPC),利用云端进行感知和决策,并将云端输出作为本地模型预测控制(MPC)的全局指导,形成闭环的感知到控制系统。其次,为了优化大型模型查询和服务的最佳时机,提出了协作时机优化(Collaboration Timing Optimization, CTO),包括目标检测置信度阈值设定(Object Detection Confidence Thresholding, ODCT)和云端前向仿真(Cloud Forward Simulation, CFS),以决定何时请求云端帮助及何时提供云端服务。大量实验表明,所提出的OCP方法在导航时间和成功率方面均优于现有方法。

链接: https://arxiv.org/abs/2504.18057
作者: Jiayi Chen,Shuai Wang,Guoliang Li,Wei Xu,Guangxu Zhu,Derrick Wing Kwan Ng,Chengzhong Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Navigating autonomous vehicles in open scenarios is a challenge due to the difficulties in handling unseen objects. Existing solutions either rely on small models that struggle with generalization or large models that are resource-intensive. While collaboration between the two offers a promising solution, the key challenge is deciding when and how to engage the large model. To address this issue, this paper proposes opportunistic collaborative planning (OCP), which seamlessly integrates efficient local models with powerful cloud models through two key innovations. First, we propose large vision model guided model predictive control (LVM-MPC), which leverages the cloud for LVM perception and decision making. The cloud output serves as a global guidance for a local MPC, thereby forming a closed-loop perception-to-control system. Second, to determine the best timing for large model query and service, we propose collaboration timing optimization (CTO), including object detection confidence thresholding (ODCT) and cloud forward simulation (CFS), to decide when to seek cloud assistance and when to offer cloud service. Extensive experiments show that the proposed OCP outperforms existing methods in terms of both navigation time and success rate.
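其中 ODCT(目标检测置信度阈值判定)的决策逻辑可以示意如下(假设性草图,阈值、标签与云端接口均为虚构,并非论文实现):

```python
def plan_step(local_detections, cloud_query, conf_threshold=0.6):
    # 若任一本地检测的置信度低于阈值(或没有检测结果),则向云端大模型求助
    if not local_detections or min(c for _, c in local_detections) < conf_threshold:
        return "cloud", cloud_query()
    return "local", [label for label, _ in local_detections]

fake_cloud = lambda: ["unknown_obstacle"]            # 云端推理的占位实现
source, labels = plan_step([("car", 0.95), ("debris", 0.31)], fake_cloud)
print(source, labels)                                # 低置信度检测触发云端查询
```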
zh

[AI-26] Validating Network Protocol Parsers with Traceable RFC Document Interpretation

【速读】:该论文旨在解决网络协议实现正确性验证中的两个核心挑战:oracle问题和可追溯性问题。oracle问题在于确定何时可以将协议实现视为存在错误,尤其是在错误未导致任何可观察症状的情况下;可追溯性问题则允许开发者理解实现如何违反协议规范,从而促进错误修复。论文的关键创新在于利用大型语言模型(Large Language Models, LLMs)的最新进展,同时考虑这两个问题并提供有效解决方案。其核心方法是通过LLMs系统地将结构化的RFC文档翻译成形式化的协议消息规范,这些规范作为准oracle用于验证协议解析器,同时验证结果逐步完善oracle。由于oracle源自文档,发现的任何错误都可以追溯到文档本身,从而解决了可追溯性问题。实验评估表明,该方法在九种不同编程语言实现的网络协议中检测到69个错误,其中36个已被确认,显著优于现有技术。

链接: https://arxiv.org/abs/2504.18050
作者: Mingwei Zheng,Danning Xie,Qingkai Shi,Chengpeng Wang,Xiangyu Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Validating the correctness of network protocol implementations is highly challenging due to the oracle and traceability problems. The former determines when a protocol implementation can be considered buggy, especially when the bugs do not cause any observable symptoms. The latter allows developers to understand how an implementation violates the protocol specification, thereby facilitating bug fixes. Unlike existing works that rarely take both problems into account, this work considers both and provides an effective solution using recent advances in large language models (LLMs). Our key observation is that network protocols are often released with structured specification documents, a.k.a. RFC documents, which can be systematically translated to formal protocol message specifications via LLMs. Such specifications, which may contain errors due to the hallucination of LLMs, are used as a quasi-oracle to validate protocol parsers, while the validation results in return gradually refine the oracle. Since the oracle is derived from the document, any bugs we find in a protocol implementation can be traced back to the document, thus addressing the traceability problem. We have extensively evaluated our approach using nine network protocols and their implementations written in C, Python, and Go. The results show that our approach outperforms the state-of-the-art and has detected 69 bugs, with 36 confirmed. The project also demonstrates the potential for fully automating software validation based on natural language specifications, a process previously considered predominantly manual due to the need to understand specification documents and derive expected outputs for test inputs.
zh

[AI-27] AI Ethics and Social Norms: Exploring ChatGPT's Capabilities From What to How

【速读】:该论文试图解决的问题是如何确保大型语言模型(LLMs)在医疗、计算机支持的协同工作以及社会计算领域的安全应用,具体聚焦于ChatGPT作为日常工具时是否遵循伦理和社会规范。论文的关键在于通过混合研究方法(包括问卷调查与专家访谈),从六个重要方面(偏见、可信度、安全性、毒性、社会规范及伦理数据)评估ChatGPT的伦理表现,并识别出无监督数据收集方法中透明性和偏见带来的显著障碍,以此推动实现机器伦理的理解与实践。

链接: https://arxiv.org/abs/2504.18044
作者: Omid Veisi,Sasan Bahrami,Roman Englert,Claudia Müller
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Theory (cs.IT)
备注: Accepted for presentation at the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW) 2025. To appear in Proceedings of the ACM on Human-Computer Interaction (PACM HCI)

点击查看摘要

Abstract:Using LLMs in healthcare, Computer-Supported Cooperative Work, and Social Computing requires the examination of ethical and social norms to ensure safe incorporation into human life. We conducted a mixed-method study, including an online survey with 111 participants and an interview study with 38 experts, to investigate the AI ethics and social norms in ChatGPT as everyday life tools. This study aims to evaluate whether ChatGPT in an empirical context operates following ethics and social norms, which is critical for understanding actions in industrial and academic research and achieving machine ethics. The findings of this study provide initial insights into six important aspects of AI ethics, including bias, trustworthiness, security, toxicology, social norms, and ethical data. Significant obstacles related to transparency and bias in unsupervised data collection methods are identified as ChatGPT’s ethical concerns.
zh

[AI-28] MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind

【速读】:该论文旨在解决现有基于大型语言模型(LLM)的社交推理游戏(Social Deduction Game, SDG)代理在处理多模态信息方面的局限性。具体而言,当前方法主要依赖文本信息,忽略了人类自然沟通中至关重要的面部表情和语音语调等多模态线索,并且现有代理大多仅关注推断其他玩家的身份,而未建模玩家如何感知自己及他人。为了解决这些问题,论文提出MultiMind框架,其关键在于将多模态信息整合到SDG代理中,通过处理面部表情、语音语调以及言语内容,并结合心理理论(Theory of Mind, ToM)模型来表示每位玩家对他人的怀疑程度,再利用蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)识别最小化自身被怀疑的沟通策略,从而实现更接近人类的社会推理能力。

链接: https://arxiv.org/abs/2504.18039
作者: Zheng Zhang,Nuoqian Xiao,Qi Chai,Deheng Ye,Hao Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents have demonstrated impressive capabilities in social deduction games (SDGs) like Werewolf, where strategic reasoning and social deception are essential. However, current approaches remain limited to textual information, ignoring crucial multimodal cues such as facial expressions and tone of voice that humans naturally use to communicate. Moreover, existing SDG agents primarily focus on inferring other players’ identities without modeling how others perceive themselves or fellow players. To address these limitations, we use One Night Ultimate Werewolf (ONUW) as a testbed and present MultiMind, the first framework integrating multimodal information into SDG agents. MultiMind processes facial expressions and vocal tones alongside verbal content, while employing a Theory of Mind (ToM) model to represent each player’s suspicion levels toward others. By combining this ToM model with Monte Carlo Tree Search (MCTS), our agent identifies communication strategies that minimize suspicion directed at itself. Through comprehensive evaluation in both agent-versus-agent simulations and studies with human players, we demonstrate MultiMind’s superior performance in gameplay. Our work presents a significant advancement toward LLM agents capable of human-like social reasoning across multimodal domains.
zh

[AI-29] Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization

【速读】:该论文旨在解决概念瓶颈模型(CBMs)在实际应用中因数据集中概念标签存在误标而显著影响性能的问题。论文的关键解决方案是引入了一种名为概念偏好优化(CPO)的新目标函数,该函数基于直接偏好优化(DPO),通过一种新的损失函数有效减轻概念误标对CBMs性能的负面影响。分析表明,CPO能够直接优化概念后验分布,并且相较于二元交叉熵(BCE),CPO对概念噪声具有固有的较低敏感性。实证结果验证了这一方法的有效性,在带噪和无噪的三个真实世界数据集上,CPO始终优于BCE。
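结合上面的描述,CPO 的核心可以理解为把 DPO 的偏好对数几率差形式套用到概念标注上。下面是一个最简化的示意草图(假设性实现:`p_pos`/`p_neg` 分别表示模型赋予"偏好"与"非偏好(可能误标)"概念标注的概率,函数名与参数均为本文为说明而设,并非论文官方代码):

```python
import math

def cpo_loss(p_pos, p_neg, beta=1.0):
    # p_pos: 模型对"偏好"概念标注赋予的概率
    # p_neg: 模型对"非偏好"(可能误标)概念标注赋予的概率
    # 仿照 DPO: 损失为对数几率差的负对数 sigmoid
    logit = beta * (math.log(p_pos) - math.log(p_neg))
    return -math.log(1.0 / (1.0 + math.exp(-logit)))
```

当模型对偏好标注的置信度越高,损失越小;两者无差别时损失退化为 log 2。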

链接: https://arxiv.org/abs/2504.18026
作者: Emiliano Penaloza,Tianyue H. Zhan,Laurent Charlin,Mateo Espinosa Zarlenga
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels, an assumption often violated in practice, which we show can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective, showing it directly optimizes for the concept’s posterior distribution, and contrast it against Binary Cross Entropy (BCE), where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE in three real-world datasets with and without added label noise.
zh

[AI-30] Sky-Drive: A Distributed Multi-Agent Simulation Platform for Socially-Aware and Human-AI Collaborative Future Transportation

【速读】:该论文旨在解决现有自主系统仿真平台在建模社会感知驾驶代理以及实现有效的人机协作方面未能充分满足未来交通研究需求的问题。为了解决这些问题,论文提出的关键方案包括:(a) 分布式架构以实现多终端同步仿真;(b) 多模态人机交互框架集成多样化传感器以采集丰富的行为数据;(c) 支持连续和自适应知识交换的人机协作机制;以及 (d) 数字孪生 (Digital Twin, DT) 框架用于构建高保真的虚拟世界交通环境复刻。这些创新共同构成了Sky-Drive平台,使其能够支持多样化的应用,如自动驾驶车辆与弱势道路使用者的交互建模、人机协同训练、社会感知强化学习、个性化驾驶策略及定制化场景生成。未来还将扩展上下文感知决策支持的基础模型和硬件在环测试功能。通过整合场景生成、数据采集、算法训练与硬件集成,Sky-Drive有望成为下一代社会感知和以人为中心的自主交通研究的基石平台。

链接: https://arxiv.org/abs/2504.18010
作者: Zilin Huang,Zihao Sheng,Zhengyang Wan,Yansong Qu,Yuhao Luo,Boyue Wang,Pei Li,Yen-Jung Chen,Jiancong Chen,Keke Long,Jiayi Meng,Yue Leng,Sikai Chen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:Recent advances in autonomous system simulation platforms have significantly enhanced the safe and scalable testing of driving policies. However, existing simulators do not yet fully meet the needs of future transportation research, particularly in modeling socially-aware driving agents and enabling effective human-AI collaboration. This paper introduces Sky-Drive, a novel distributed multi-agent simulation platform that addresses these limitations through four key innovations: (a) a distributed architecture for synchronized simulation across multiple terminals; (b) a multi-modal human-in-the-loop framework integrating diverse sensors to collect rich behavioral data; (c) a human-AI collaboration mechanism supporting continuous and adaptive knowledge exchange; and (d) a digital twin (DT) framework for constructing high-fidelity virtual replicas of real-world transportation environments. Sky-Drive supports diverse applications such as autonomous vehicle (AV)-vulnerable road user (VRU) interaction modeling, human-in-the-loop training, socially-aware reinforcement learning, personalized driving policy, and customized scenario generation. Future extensions will incorporate foundation models for context-aware decision support and hardware-in-the-loop (HIL) testing for real-world validation. By bridging scenario generation, data collection, algorithm training, and hardware integration, Sky-Drive has the potential to become a foundational platform for the next generation of socially-aware and human-centered autonomous transportation research. The demo video and code are available at: this https URL
zh

[AI-31] Differential Privacy-Driven Framework for Enhancing Heart Disease Prediction

【速读】:该论文旨在解决在医疗系统数字化进程中因大量私密健康数据生成与共享所带来的隐私保护挑战,同时确保机器学习技术能够有效应用于医疗场景以提取有价值的信息。论文的关键解决方案在于结合差分隐私(Differential Privacy)和联邦学习(Federated Learning)方法,构建既能保障患者隐私又能实现高效分析的隐私保护模型。差分隐私通过向数据引入噪声来保证统计意义上的隐私性,而联邦学习则允许在分布式数据集上进行协作式的模型训练。通过将这些技术应用于心脏病数据集,论文展示了如何在不泄露个体隐私的前提下提供有价值的洞察和全面分析。最终结果表明,采用融合差分隐私的联邦学习模型实现了85%的测试准确率,确保了患者数据在整个过程中的安全性与隐私性。
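差分隐私中最基础的做法是向统计量加入 Laplace 噪声。下面给出一个独立于论文实现的标准 Laplace 机制草图(`sensitivity` 为查询敏感度,`epsilon` 为隐私预算,均为差分隐私的通用概念;采样方式为逆 CDF 法,接口命名为本文假设):

```python
import random
import math

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    # 向真实统计量加入尺度为 sensitivity/epsilon 的 Laplace 噪声,
    # 满足 epsilon-差分隐私(逆 CDF 采样)
    scale = sensitivity / epsilon
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise
```

在联邦学习场景中,类似的噪声通常加在每轮上传的梯度或模型参数上,而非最终统计量。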

链接: https://arxiv.org/abs/2504.18007
作者: Yazan Otoum,Amiya Nayak
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: © 2025 IEEE. Accepted to IEEE International Conference on Communications ICC 2025. Final version to appear in IEEE Xplore

点击查看摘要

Abstract:With the rapid digitalization of healthcare systems, there has been a substantial increase in the generation and sharing of private health data. Safeguarding patient information is essential for maintaining consumer trust and ensuring compliance with legal data protection regulations. Machine learning is critical in healthcare, supporting personalized treatment, early disease detection, predictive analytics, image interpretation, drug discovery, efficient operations, and patient monitoring. It enhances decision-making, accelerates research, reduces errors, and improves patient outcomes. In this paper, we utilize machine learning methodologies, including differential privacy and federated learning, to develop privacy-preserving models that enable healthcare stakeholders to extract insights without compromising individual privacy. Differential privacy introduces noise to data to guarantee statistical privacy, while federated learning enables collaborative model training across decentralized datasets. We explore applying these technologies to Heart Disease Data, demonstrating how they preserve privacy while delivering valuable insights and comprehensive analysis. Our results show that using a federated learning model with differential privacy achieved a test accuracy of 85%, ensuring patient data remained secure and private throughout the process.
zh

[AI-32] Fuzzy-RRT for Obstacle Avoidance in a 2-DOF Semi-Autonomous Surgical Robotic Arm

【速读】:该论文旨在解决在长期星际任务中因有限乘组规模和通信延迟导致的传统外科手术方法不可行的问题。为应对这一挑战,论文提出了一种将模糊快速探索随机树(Fuzzy Rapidly-exploring Random Tree, F-RRT)算法应用于两自由度机械臂的新方法,该机械臂基于迷你机器人辅助外科系统进行建模。这种方法的关键在于通过引入模糊逻辑优化快速探索随机树算法,实现障碍物规避与人机协作控制,从而显著提升了路径搜索效率(提高743%)和降低了路径成本(降低43%)。
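作为参考,下面是一个极简的 2D RRT 骨架,仅演示"随机采样-最近邻-定步扩展-碰撞检测"的基本流程;论文中的模糊逻辑用于自适应调整扩展策略,此处未实现,采样空间、函数名与参数均为示意:

```python
import random
import math

def rrt(start, goal, is_free, step=0.5, iters=2000, goal_tol=0.5, rng=None):
    # 极简 2D RRT: 在 [0,10]x[0,10] 空间内扩展树, 10% 概率朝目标采样
    rng = rng or random.Random(1)
    nodes = [start]
    parent = {0: None}
    for _ in range(iters):
        sample = goal if rng.random() < 0.1 else (rng.uniform(0, 10), rng.uniform(0, 10))
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
        nx, ny = nodes[i]
        d = math.dist((nx, ny), sample)
        if d == 0:
            continue
        # 从最近邻朝采样点前进固定步长
        new = (nx + step * (sample[0] - nx) / d, ny + step * (sample[1] - ny) / d)
        if not is_free(new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if math.dist(new, goal) < goal_tol:
            path, k = [], len(nodes) - 1
            while k is not None:
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None
```

论文的 F-RRT 在此骨架上用模糊推理调节采样与扩展行为,从而获得摘要中报告的搜索时间与路径成本改进。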

链接: https://arxiv.org/abs/2504.17979
作者: Kaaustaaub Shankar,Wilhelm Louw,Bharadwaj Dogga,Nick Ernest,Tim Arnett,Kelly Cohen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures. Submitted to NAFIPS 2025 Conference (North American Fuzzy Information Processing Society). Includes results on Fuzzy-RRT performance in surgical robotics path planning

点击查看摘要

Abstract:AI-driven semi-autonomous robotic surgery is essential for addressing the medical challenges of long-duration interplanetary missions, where limited crew sizes and communication delays restrict traditional surgical approaches. Current robotic surgery systems require full surgeon control, demanding extensive expertise and limiting feasibility in space. We propose a novel adaptation of the Fuzzy Rapidly-exploring Random Tree algorithm for obstacle avoidance and collaborative control in a two-degree-of-freedom robotic arm modeled on the Miniaturized Robotic-Assisted surgical system. It was found that the Fuzzy Rapidly-exploring Random Tree algorithm resulted in a 743 percent improvement in path search time and a 43 percent improvement in path cost.
zh

[AI-33] LLM Agent Swarm for Hypothesis-Driven Drug Discovery

【速读】:该论文试图解决药物发现领域中高失败率(超过90%的候选分子在临床评估中失败)和高昂研发成本(每种获批疗法往往超过十亿美元)的问题。传统方法因数据流的多样性(如基因组学、转录组学、化学库及临床记录等)难以形成连贯的机制洞见,而大型语言模型虽擅长推理与工具整合,但缺乏受监管的假设驱动工作流程所需的模块化特化和迭代记忆能力。论文的关键解决方案是引入PharmaSwarm,这是一种统一的多智能体框架,通过协调专门的大语言模型“智能体”提出、验证并优化新药靶点和先导化合物的假设。其核心在于每个智能体访问专用功能(如自动化基因与表达分析、生物医学知识图谱、通路富集与网络模拟、可解释的结合亲和力预测),同时中央评估智能体持续根据生物学合理性、新颖性、计算机模拟效能与安全性对提案进行排名。共享记忆层捕获经验证的洞见,并随着时间推移微调底层子模型,从而实现自我改进系统。这一方案的关键在于将大语言模型的专业化与模块化相结合,形成高效的AI辅助药物发现流程。

链接: https://arxiv.org/abs/2504.17967
作者: Kevin Song,Andrew Trotter,Jake Y. Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Drug discovery remains a formidable challenge: more than 90 percent of candidate molecules fail in clinical evaluation, and development costs often exceed one billion dollars per approved therapy. Disparate data streams, from genomics and transcriptomics to chemical libraries and clinical records, hinder coherent mechanistic insight and slow progress. Meanwhile, large language models excel at reasoning and tool integration but lack the modular specialization and iterative memory required for regulated, hypothesis-driven workflows. We introduce PharmaSwarm, a unified multi-agent framework that orchestrates specialized LLM “agents” to propose, validate, and refine hypotheses for novel drug targets and lead compounds. Each agent accesses dedicated functionality–automated genomic and expression analysis; a curated biomedical knowledge graph; pathway enrichment and network simulation; interpretable binding affinity prediction–while a central Evaluator LLM continuously ranks proposals by biological plausibility, novelty, in silico efficacy, and safety. A shared memory layer captures validated insights and fine-tunes underlying submodels over time, yielding a self-improving system. Deployable on low-code platforms or Kubernetes-based microservices, PharmaSwarm supports literature-driven discovery, omics-guided target identification, and market-informed repurposing. We also describe a rigorous four-tier validation pipeline spanning retrospective benchmarking, independent computational assays, experimental testing, and expert user studies to ensure transparency, reproducibility, and real-world impact. By acting as an AI copilot, PharmaSwarm can accelerate translational research and deliver high-confidence hypotheses more efficiently than traditional pipelines.
zh

[AI-34] Evaluating Machine Expertise: How Graduate Students Develop Frameworks for Assessing GenAI Content

【速读】:该论文试图解决的问题是如何帮助研究生发展评估网络环境中大型语言模型(LLMs)生成的专业能力的框架。论文通过结合问卷调查、LLM交互记录以及对14名研究生的深入访谈,识别出这些新兴专业人士在评估和参与AI生成内容时的行为模式。研究的关键在于揭示学生构建评价框架所受的主要影响因素,包括职业身份、验证能力和系统导航经验。论文指出,学生们并非简单地接受或拒绝LLM输出,而是根据职业身份保护特定领域,同时将其他领域委托给他人管理。此外,学生的验证能力和系统操作经验也进一步塑造了他们的评价框架。这一研究为网络科学贡献了关于人与生成式AI(GenAI)互动的新见解,并提出了平台如何更好地支持用户建立有效评估机制的建议。

链接: https://arxiv.org/abs/2504.17964
作者: Celia Chen,Alex Leitch
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Under review at ACM Web Science Conference 2025’s Human-GenAI Interactions Workshop, 4 pages

点击查看摘要

Abstract:This paper examines how graduate students develop frameworks for evaluating machine-generated expertise in web-based interactions with large language models (LLMs). Through a qualitative study combining surveys, LLM interaction transcripts, and in-depth interviews with 14 graduate students, we identify patterns in how these emerging professionals assess and engage with AI-generated content. Our findings reveal that students construct evaluation frameworks shaped by three main factors: professional identity, verification capabilities, and system navigation experience. Rather than uniformly accepting or rejecting LLM outputs, students protect domains central to their professional identities while delegating others–with managers preserving conceptual work, designers safeguarding creative processes, and programmers maintaining control over core technical expertise. These evaluation frameworks are further influenced by students’ ability to verify different types of content and their experience navigating complex systems. This research contributes to web science by highlighting emerging human-genAI interaction patterns and suggesting how platforms might better support users in developing effective frameworks for evaluating machine-generated expertise signals in AI-mediated web environments.
zh

[AI-35] ApproXAI: Energy-Efficient Hardware Acceleration of Explainable AI using Approximate Computing IJCNN

【速读】:该论文试图解决在实时场景中基于硬件加速的可解释人工智能(XAI)方法因高能耗而受限的问题。解决方案的关键在于提出了一种名为XAIedge的新框架,它将近似计算技术引入到XAI算法(如集成梯度、模型蒸馏和Shapley分析)中,并将其转化为近似矩阵计算,同时利用卷积、傅里叶变换与近似计算范式的协同效应,从而实现在基于TPU的边缘设备上的高效硬件加速,显著提升了能量效率,同时保持了与现有精确XAI硬件加速方法相当的准确性。

链接: https://arxiv.org/abs/2504.17929
作者: Ayesha Siddique,Khurram Khalil,Khaza Anuarul Hoque
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: Accepted at the International Joint Conference on Neural Networks (IJCNN), June 30th - July 5th, 2025 in Rome, Italy

点击查看摘要

Abstract:Explainable artificial intelligence (XAI) enhances AI system transparency by framing interpretability as an optimization problem. However, this approach often necessitates numerous iterations of computationally intensive operations, limiting its applicability in real-time scenarios. While recent research has focused on XAI hardware acceleration on FPGAs and TPUs, these methods do not fully address energy efficiency in real-time settings. To address this limitation, we propose XAIedge, a novel framework that leverages approximate computing techniques into XAI algorithms, including integrated gradients, model distillation, and Shapley analysis. XAIedge translates these algorithms into approximate matrix computations and exploits the synergy between convolution, Fourier transform, and approximate computing paradigms. This approach enables efficient hardware acceleration on TPU-based edge devices, facilitating faster real-time outcome interpretations. Our comprehensive evaluation demonstrates that XAIedge achieves a 2× improvement in energy efficiency compared to existing accurate XAI hardware acceleration techniques while maintaining comparable accuracy. These results highlight the potential of XAIedge to significantly advance the deployment of explainable AI in energy-constrained real-time applications.
zh

[AI-36] Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts

【速读】:本文旨在解决基于概念的模型(Concept-based Models, CMs)在处理分布外(Out-of-Distribution, OOD)输入时的表现问题,特别是通过概念干预(Concept Interventions)提升其任务预测准确性时存在的局限性。研究发现当前最先进的CMs存在一种称为“泄漏中毒”(Leakage Poisoning)的现象,即当模型在OOD样本上接受人为修正的概念干预时,无法有效提高其准确性。为了解决这一问题,论文提出了一种名为MixCEM的新方法,其关键在于通过动态学习仅在信息处于分布内(In-distribution)时利用概念中缺失的泄露信息,从而克服泄漏中毒现象。实验结果表明,MixCEM在有无完整概念注释的任务中均显著优于强基准模型,在有或无概念干预的情况下,无论是分布内还是OOD样本的准确性均有大幅提升。

链接: https://arxiv.org/abs/2504.17921
作者: Mateo Espinosa Zarlenga,Gabriele Dominici,Pietro Barbiero,Zohreh Shams,Mateja Jamnik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level concepts (e.g., stripes, black) and then predict a task label from those concepts. In particular, we study the impact of concept interventions (i.e., operations where a human expert corrects a CM’s mispredicted concepts at test time) on CMs’ task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions.
zh

[AI-37] Beyond Task and Motion Planning: Hierarchical Robot Planning with General-Purpose Policies

【速读】:该论文致力于解决传统任务与运动规划方法在处理包含闭环电机控制器的复杂技能时的局限性,这些技能无法仅通过运动学规划实现。论文的关键创新在于提出了一种基于可组合交互基元(Composable Interaction Primitives, CIPs)的新方法,将闭环电机控制器集成到运动规划中,从而支持在分层机器人规划中使用多样化的非组合预学习技能。这种方案的核心在于利用CIPs实现了技能的灵活组合与应用,验证了任务与技能规划(Task and Skill Planning, TASP)方法的有效性,使移动操作机器人能够在真实场景中有效结合运动规划与通用技能以完成复杂任务。

链接: https://arxiv.org/abs/2504.17901
作者: Benned Hedegaard,Ziyi Yang,Yichen Wei,Ahmed Jaafar,Stefanie Tellex,George Konidaris,Naman Shah
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Task and motion planning is a well-established approach for solving long-horizon robot planning problems. However, traditional methods assume that each task-level robot action, or skill, can be reduced to kinematic motion planning. In this work, we address the challenge of planning with both kinematic skills and closed-loop motor controllers that go beyond kinematic considerations. We propose a novel method that integrates these controllers into motion planning using Composable Interaction Primitives (CIPs), enabling the use of diverse, non-composable pre-learned skills in hierarchical robot planning. Toward validating our Task and Skill Planning (TASP) approach, we describe ongoing robot experiments in real-world scenarios designed to demonstrate how CIPs can allow a mobile manipulator robot to effectively combine motion planning with general-purpose skills to accomplish complex tasks.
zh

[AI-38] Crypto-ncRNA: Non-coding RNA (ncRNA) Based Encryption Algorithm ICLR2025

【速读】:这篇论文旨在解决传统加密系统在后量子时代因面临量子计算攻击而逐渐失去安全性的问题。解决方案的关键在于提出了一种名为crypto-ncRNA的生物融合加密框架,它利用非编码RNA(non-coding RNA, ncRNA)动态折叠的特性生成高熵、抗量子的密钥,并产生不可预测的密文。该框架通过将明文编码为RNA序列、使用先进算法预测和操控RNA二级结构,以及借助RNA分子的物理不可克隆性提取加密密钥的多阶段过程实现其目标。实验评估表明,尽管crypto-ncRNA的加密速度略低于AES,但在效率和可扩展性方面显著优于RSA,并且在NIST SP 800-22随机性测试中达到了100%的通过率。这些结果证明crypto-ncRNA为保护数字基础设施免受量子计算威胁提供了有前景且稳健的方法。
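以"将明文编码为 RNA 序列"这一步为例,一个最朴素的做法是把每个字节按 2 bit 映射到 A/U/G/C 四种碱基。下面是纯示意性的编码草图,并非论文的实际方案(后者还涉及二级结构预测与物理不可克隆性):

```python
def text_to_rna(text):
    # 将 UTF-8 字节以 2 bit 为单位映射到四种碱基(示意性编码)
    bases = "AUGC"
    out = []
    for b in text.encode("utf-8"):
        for shift in (6, 4, 2, 0):
            out.append(bases[(b >> shift) & 0b11])
    return "".join(out)

def rna_to_text(seq):
    # 逆映射: 每 4 个碱基还原为 1 个字节
    idx = {c: i for i, c in enumerate("AUGC")}
    data = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for c in seq[i:i + 4]:
            b = (b << 2) | idx[c]
        data.append(b)
    return data.decode("utf-8")
```

这种映射本身不提供任何保密性;论文的安全性来自后续对 RNA 二级结构的预测、操控及由此派生的高熵密钥。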

链接: https://arxiv.org/abs/2504.17878
作者: Xu Wang,Yiquan Wang,Tin-yeh Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at the AI4NA workshop at ICLR 2025. 18pages, 4figures

点击查看摘要

Abstract:In the looming post-quantum era, traditional cryptographic systems are increasingly vulnerable to quantum computing attacks that can compromise their mathematical foundations. To address this critical challenge, we propose crypto-ncRNA, a bio-convergent cryptographic framework that leverages the dynamic folding properties of non-coding RNA (ncRNA) to generate high-entropy, quantum-resistant keys and produce unpredictable ciphertexts. The framework employs a novel, multi-stage process: encoding plaintext into RNA sequences, predicting and manipulating RNA secondary structures using advanced algorithms, and deriving cryptographic keys through the intrinsic physical unclonability of RNA molecules. Experimental evaluations indicate that, although crypto-ncRNA’s encryption speed is marginally lower than that of AES, it significantly outperforms RSA in terms of efficiency and scalability while achieving a 100% pass rate on the NIST SP 800-22 randomness tests. These results demonstrate that crypto-ncRNA offers a promising and robust approach for securing digital infrastructures against the evolving threats posed by quantum computing.
zh

[AI-39] Flow Matching Ergodic Coverage

【速读】:该论文旨在解决现有遍历覆盖(Ergodic Coverage)方法受限于有限的遍历度量集,从而限制其性能的问题。论文的关键解决方案是提出了一种基于流匹配(Flow Matching)的新方法,这是一种在生成推理中广泛用于高效和可扩展采样的技术。作者形式化推导了遍历覆盖的流匹配问题,并证明其等价于具有闭式解的线性二次调节器(Linear Quadratic Regulator, LQR)问题。这种方法允许使用生成推理中的替代遍历度量,克服了现有度量的局限性,这些度量之前因计算开销而不可用于控制综合。具体而言,结合Stein变分梯度流的流匹配可以直接作用于目标分布的得分函数,提高了对未归一化分布的鲁棒性;而结合Sinkhorn散度流的流匹配则实现了基于最优传输的遍历度量,提升了对非光滑分布且支持不规则的覆盖性能。论文通过数值基准测试和不同非线性动力学验证了方法的改进性能和竞争性的计算效率,并进一步通过Franka机器人上的绘图和擦除任务展示了其实用性。

链接: https://arxiv.org/abs/2504.17872
作者: Max Muchen Sun,Allison Pinosky,Todd Murphey
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 15 figures. Accepted to Robotics: Science and Systems (RSS) 2025. Project website: this https URL

点击查看摘要

Abstract:Ergodic coverage effectively generates exploratory behaviors for embodied agents by aligning the spatial distribution of the agent’s trajectory with a target distribution, where the difference between these two distributions is measured by the ergodic metric. However, existing ergodic coverage methods are constrained by the limited set of ergodic metrics available for control synthesis, fundamentally limiting their performance. In this work, we propose an alternative approach to ergodic coverage based on flow matching, a technique widely used in generative inference for efficient and scalable sampling. We formally derive the flow matching problem for ergodic coverage and show that it is equivalent to a linear quadratic regulator problem with a closed-form solution. Our formulation enables alternative ergodic metrics from generative inference that overcome the limitations of existing ones. These metrics were previously infeasible for control synthesis but can now be supported with no computational overhead. Specifically, flow matching with the Stein variational gradient flow enables control synthesis directly over the score function of the target distribution, improving robustness to the unnormalized distributions; on the other hand, flow matching with the Sinkhorn divergence flow enables an optimal transport-based ergodic metric, improving coverage performance on non-smooth distributions with irregular supports. We validate the improved performance and competitive computational efficiency of our method through comprehensive numerical benchmarks and across different nonlinear dynamics. We further demonstrate the practicality of our method through a series of drawing and erasing tasks on a Franka robot.
zh

[AI-40] CaRL: Learning Scalable Planning Policies with Simple Rewards

【速读】:该论文旨在解决自动驾驶领域中基于规则的方法在处理长尾场景时的局限性问题,同时克服当代基于强化学习(Reinforcement Learning, RL)方法中复杂奖励函数优化困难及可扩展性不足的问题。论文的关键创新在于提出了一种新的奖励设计,主要基于单一直观的奖励项——路线完成度(route completion),并通过终止 Episode 或乘法方式减少路线完成度来惩罚违规行为。这种方法使得 Proximal Policy Optimization (PPO) 在较大的 mini-batch 大小下能够有效优化,并通过分布式数据并行实现高效扩展,最终显著提升了模型性能。
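按摘要描述,该奖励设计可以概括为:每步奖励等于新增的路线完成度,轻度违规按乘法折减,严重违规直接终止 episode。下面是一个示意性草图(`soft_penalty`、`hard` 等名称与取值为本文假设,非论文原始实现):

```python
def step_reward(prev_rc, new_rc, infractions, soft_penalty=0.5, hard=("collision",)):
    # 奖励 = 本步新增的路线完成度 (route completion);
    # 轻度违规乘性折减奖励, 严重违规置 done=True 终止 episode
    r = new_rc - prev_rc
    done = False
    for inf in infractions:
        if inf in hard:
            done = True
        else:
            r *= soft_penalty
    return r, done
```

这种单一直观奖励项避免了多项加权奖励之间的权衡调参,也是摘要中 PPO 能随 mini-batch 扩展的关键。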

链接: https://arxiv.org/abs/2504.17838
作者: Bernhard Jaeger,Daniel Dauner,Jens Beißwenger,Simon Gerstenecker,Kashyap Chitta,Andreas Geiger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We investigate reinforcement learning (RL) for privileged planning in autonomous driving. State-of-the-art approaches for this task are rule-based, but these methods do not scale to the long tail. RL, on the other hand, is scalable and does not suffer from compounding errors like imitation learning. Contemporary RL approaches for driving use complex shaped rewards that sum multiple individual rewards, e.g., progress, position, or orientation rewards. We show that PPO fails to optimize a popular version of these rewards when the mini-batch size is increased, which limits the scalability of these approaches. Instead, we propose a new reward design based primarily on optimizing a single intuitive reward term: route completion. Infractions are penalized by terminating the episode or multiplicatively reducing route completion. We find that PPO scales well with higher mini-batch sizes when trained with our simple reward, even improving performance. Training with large mini-batch sizes enables efficient scaling via distributed data parallelism. We scale PPO to 300M samples in CARLA and 500M samples in nuPlan with a single 8-GPU node. The resulting model achieves 64 DS on the CARLA longest6 v2 benchmark, outperforming other RL methods with more complex rewards by a large margin. Requiring only minimal adaptations from its use in CARLA, the same method is the best learning-based approach on nuPlan. It scores 91.3 in non-reactive and 90.6 in reactive traffic on the Val14 benchmark while being an order of magnitude faster than prior work.
zh

[AI-41] The Role of Open-Source LLMs in Shaping the Future of GeoAI

【速读】:该论文旨在探讨开源范式在大型语言模型(Large Language Models, LLMs)推动地理空间人工智能(Geospatial Artificial Intelligence, GeoAI)转型中的关键作用。论文指出,尽管专有LLMs提供了易用性,但其定制化、互操作性和透明度的局限性往往难以满足特定地理空间任务的需求。相比之下,开源替代方案通过增强适应性、可重复性和社区驱动的创新,在地理信息科学(Geographic Information Science, GIScience)领域取得了显著进展。关键在于利用开源框架支持研究人员开发定制化解决方案、整合前沿方法(如强化学习、高级空间索引)以及遵循FAIR原则(可发现、可访问、可互操作、可重用)。然而,随着对LLMs依赖的增加,需谨慎考虑安全漏洞、伦理风险及AI生成地理空间输出的治理问题。论文强调,GIScience的最佳发展路径并非单一模型类型,而是构建一个包含开源基础、定制化地理空间模型和跨学科协作的多样化、互操作生态系统。通过批判性评估开源LLMs在GeoAI领域的机遇与挑战,本文为如何以公平、可持续且科学严谨的方式利用AI推进空间研究、政策制定和决策提供了深入见解。

链接: https://arxiv.org/abs/2504.17833
作者: Xiao Huang,Zhengzhong Tu,Xinyue Ye,Michael Goodchild
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are transforming geospatial artificial intelligence (GeoAI), offering new capabilities in data processing, spatial analysis, and decision support. This paper examines the open-source paradigm’s pivotal role in this transformation. While proprietary LLMs offer accessibility, they often limit the customization, interoperability, and transparency vital for specialized geospatial tasks. Conversely, open-source alternatives significantly advance Geographic Information Science (GIScience) by fostering greater adaptability, reproducibility, and community-driven innovation. Open frameworks empower researchers to tailor solutions, integrate cutting-edge methodologies (e.g., reinforcement learning, advanced spatial indexing), and align with FAIR principles. However, the growing reliance on any LLM necessitates careful consideration of security vulnerabilities, ethical risks, and robust governance for AI-generated geospatial outputs. Ongoing debates on accessibility, regulation, and misuse underscore the critical need for responsible AI development strategies. This paper argues that GIScience advances best not through a single model type, but by cultivating a diverse, interoperable ecosystem combining open-source foundations for innovation, bespoke geospatial models, and interdisciplinary collaboration. By critically evaluating the opportunities and challenges of open-source LLMs within the broader GeoAI landscape, this work contributes to a nuanced discourse on leveraging AI to effectively advance spatial research, policy, and decision-making in an equitable, sustainable, and scientifically rigorous manner.
zh

[AI-42] Evolution Meets Diffusion: Efficient Neural Architecture Generation

【速读】:该论文旨在解决神经架构搜索(NAS)中因搜索空间庞大且复杂而导致的高计算成本和长耗时问题。为应对这一挑战,主流方法如扩散模型虽有潜力,但仍受限于全局搜索能力不足以及高昂的计算和时间需求。论文提出了一种名为进化扩散神经架构生成(EDNAG)的新方法,其关键是结合进化算法与扩散模型的优势,通过模拟扩散模型的去噪过程,并利用适应度引导从随机高斯分布向最优架构分布的转变,实现高效且无需训练的架构生成。这种方法不仅在架构优化方面达到了最先进的性能,提升了高达10.45%的准确性,还大幅减少了训练时间,将推理速度平均提升了50倍,充分展示了其卓越的效率和有效性。
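EDNAG 的核心思想是用"适应度引导的进化步骤"模拟扩散模型的去噪过程。下面用一个一般化的玩具示例演示这一思路:对连续向量做适应度加权重采样并加高斯扰动,使种群从随机分布逐步向高适应度区域集中(概念性示意,与论文针对架构编码的具体实现无关):

```python
import random

def ednag_generation(pop, fitness, rng, noise=0.1):
    # 适应度加权重采样 + 高斯扰动: 用进化步骤模拟一步"去噪"
    weights = [fitness(x) for x in pop]
    new_pop = []
    for _ in pop:
        parent = rng.choices(pop, weights=weights)[0]
        new_pop.append([xi + rng.gauss(0, noise) for xi in parent])
    return new_pop
```

反复迭代后,种群分布即从初始的随机分布"去噪"到以高适应度解为中心的分布;论文在架构生成任务上以同样思路替代了扩散模型的训练过程。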

链接: https://arxiv.org/abs/2504.17827
作者: Bingye Zhou,Caiyang Yu
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neural Architecture Search (NAS) has gained widespread attention for its transformative potential in deep learning model design. However, the vast and complex search space of NAS leads to significant computational and time costs. Neural Architecture Generation (NAG) addresses this by reframing NAS as a generation problem, enabling the precise generation of optimal architectures for specific tasks. Despite its promise, mainstream methods like diffusion models face limitations in global search capabilities and are still hindered by high computational and time demands. To overcome these challenges, we propose Evolutionary Diffusion-based Neural Architecture Generation (EDNAG), a novel approach that achieves efficient and training-free architecture generation. EDNAG leverages evolutionary algorithms to simulate the denoising process in diffusion models, using fitness to guide the transition from random Gaussian distributions to optimal architecture distributions. This approach combines the strengths of evolutionary strategies and diffusion models, enabling rapid and effective architecture generation. Extensive experiments demonstrate that EDNAG achieves state-of-the-art (SOTA) performance in architecture optimization, with an improvement in accuracy of up to 10.45%. Furthermore, it eliminates the need for time-consuming training and boosts inference speed by an average of 50 times, showcasing its exceptional efficiency and effectiveness.
zh

[AI-43] EduBot – Can LLMs Solve Personalized Learning and Programming Assignments? AAAI2025

【速读】:该论文旨在解决综合性编程任务中递归请求和错误修复能力不足的问题,特别是现有大语言模型(Large Language Models, LLMs)在处理复杂编程任务时的局限性。论文提出EduBot,这是一种结合概念知识教学、端到端代码开发、个性化编程以及基于少量人工干预的递归提示驱动调试的智能自动化辅助系统。其关键在于利用预训练LLMs实现无需微调的递归自动提示驱动机制,从而解决从概念理解到代码实现再到调试的多步骤编程任务。通过设计包含算法、机器学习及现实问题的20个场景基准测试套件,验证了EduBot在高效完成复杂任务方面的性能,并进一步评估其在不同能力LLMs上的兼容性和鲁棒性。

链接: https://arxiv.org/abs/2504.17824
作者: Yibin Wang,Jiaxi Xie,Lakshminarayanan Subramanian
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Published at AAAI 2025 AI4EDU Workshop

点击查看摘要

Abstract:The prevalence of Large Language Models (LLMs) is revolutionizing the process of writing code. General and code LLMs have shown impressive performance in generating standalone functions and code-completion tasks with one-shot queries. However, the ability to solve comprehensive programming tasks with recursive requests and bug fixes remains questionable. In this paper, we propose EduBot, an intelligent automated assistant system that combines conceptual knowledge teaching, end-to-end code development, personalized programming through recursive prompt-driven methods, and debugging with limited human interventions powered by LLMs. We show that EduBot can solve complicated programming tasks consisting of sub-tasks with increasing difficulties ranging from conceptual to coding questions by recursive automatic prompt-driven systems without finetuning on LLMs themselves. To further evaluate EduBot’s performance, we design and conduct a benchmark suite consisting of 20 scenarios in algorithms, machine learning, and real-world problems. The result shows that EduBot can complete most scenarios in less than 20 minutes. Based on the benchmark suites, we perform a comparative study to take different LLMs as the backbone and to verify EduBot’s compatibility and robustness across LLMs with varying capabilities. We believe that EduBot is an exploratory approach to explore the potential of pre-trained LLMs in multi-step reasoning and code generation for solving personalized assignments with knowledge learning and code generation.
zh

[AI-44] he Cloud Weaving Model for AI development ALT

【速读】:该论文旨在解决在开发面向边缘化社区的人工智能(AI)试点项目中所面临的挑战难以通过常用范式有效表达的问题。为应对这一困境,论文提出了“云编织模型”(Cloud Weaving Model)这一替代性概念框架,其灵感来源于本土知识、自然图案以及东方传统,旨在将AI发展根植于社会结构之中。该模型的关键在于引入了“云”、“蜘蛛”、“线”、“蛛网”和“天气”等基本元素,并赋予它们在AI语境下的特定含义。通过应用此框架,论文揭示了与边缘化社区共创试点项目中的观察模式,强调了负责任AI发展中被忽视但至关重要的维度。

链接: https://arxiv.org/abs/2504.17823
作者: Darcy Kim,Aida Kalender,Sennay Ghebreab,Giovanni Sileno
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: presented at this http URL 2025, Yokohama

点击查看摘要

Abstract:While analysing challenges in pilot projects developing AI with marginalized communities, we found it difficult to express them within commonly used paradigms. We therefore constructed an alternative conceptual framework to ground AI development in the social fabric – the Cloud Weaving Model – inspired (amongst others) by indigenous knowledge, motifs from nature, and Eastern traditions. This paper introduces and elaborates on the fundamental elements of the model (clouds, spiders, threads, spiderwebs, and weather) and their interpretation in an AI context. The framework is then applied to comprehend patterns observed in co-creation pilots approaching marginalized communities, highlighting neglected yet relevant dimensions for responsible AI development.
zh

[AI-45] Research on Cloud Platform Network Traffic Monitoring and Anomaly Detection System based on Large Language Models

【Quick Read】: This paper aims to meet the need for network traffic monitoring and anomaly detection created by the rapid development of cloud platforms and the increasing complexity of network traffic, in order to ensure network security and performance. The key of the solution is a network traffic monitoring and anomaly detection system based on a large language model (LLM). Beyond traditional methods such as autoencoders and decision trees, the system incorporates the attention mechanism of the Transformer architecture into a supervised learning framework to better capture the complex patterns and subtle fluctuations in network traffic data. In addition, a transfer learning approach strengthens the model's ability to adapt quickly to unknown network structures and adversarial conditions without large amounts of labeled data, significantly improving detection accuracy while reducing the false positive rate.

Link: https://arxiv.org/abs/2504.17807
Authors: Ze Yang, Yihong Jin, Juntian Liu, Xinhe Xu, Yihan Zhang, Shuyang Ji
Affiliation: unknown
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Proceedings of 2025 IEEE 7th International Conference on Communications, Information System and Computer Engineering (CISCE 2025)

View abstract

Abstract:The rapidly evolving cloud platforms and the escalating complexity of network traffic demand proper network traffic monitoring and anomaly detection to ensure network security and performance. This paper introduces a large language model (LLM)-based network traffic monitoring and anomaly detection system. In addition to existing models such as autoencoders and decision trees, we harness the power of large language models for processing sequence data from network traffic, which allows us to better capture complex underlying patterns as well as slight fluctuations in the dataset. We show that, for a given detection task, a hybrid model that incorporates the attention mechanism of the transformer architecture into a supervised learning framework is needed to achieve better accuracy. A pre-trained large language model analyzes and predicts the probable network traffic, and an anomaly detection layer that considers temporality and context is added. Moreover, we present a novel transfer learning-based methodology to enhance the model's effectiveness in quickly adapting to unknown network structures and adversarial conditions without requiring extensive labeled datasets. Actual results show that the designed model outperforms traditional methods in detection accuracy and computational efficiency, effectively identifies various network anomalies such as zero-day attacks and traffic congestion patterns, and significantly reduces the false positive rate.

[AI-46] Fuzzy Logic-Based Scheduling System for Part-Time Workforce

【Quick Read】: This paper addresses the problem of optimizing schedules for part-time student workers at a university, with the goal of generating feasible schedules that meet operational requirements (such as the maximum weekly hours and the required number of workers on duty) while taking employee preferences (such as the desired number of weekly working hours) and availability into account. The key of the solution is a Genetic Fuzzy System (GFS): trained and tested on availability data collected from students at the University of Cincinnati, it achieves efficient and robust schedule generation and performs particularly well under understaffed conditions.

Link: https://arxiv.org/abs/2504.17805
Authors: Tri Nguyen, Kelly Cohen
Affiliation: unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:This paper explores the application of genetic fuzzy systems to efficiently generate schedules for a team of part-time student workers at a university. Given the preferred number of working hours and availability of employees, our model generates feasible solutions considering various factors, such as maximum weekly hours, required number of workers on duty, and the preferred number of working hours. The algorithm is trained and tested with availability data collected from students at the University of Cincinnati. The results demonstrate the algorithm’s efficiency in producing schedules that meet operational criteria and its robustness in understaffed conditions.
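The genetic part of such a system can be sketched in a few lines of stdlib Python. The toy instance below (4 workers, 7 weekly shifts, 2 workers per shift) and the crisp penalty function are invented for illustration; the paper scores candidates with a fuzzy inference system and real availability data instead.

```python
import random

random.seed(0)

# Hypothetical toy instance: 4 workers, 7 shifts per week, 2 workers per shift.
WORKERS, SHIFTS, NEEDED = 4, 7, 2
preferred_hours = [3, 4, 2, 5]                         # shifts each worker wants
available = [[True] * SHIFTS for _ in range(WORKERS)]  # everyone available here

def fitness(schedule):
    """Negative penalty: staffing gaps, preference deviations, unavailability.
    (The paper scores candidates with a fuzzy inference system; this crisp
    penalty stands in for it in this sketch.)"""
    penalty = 0
    for s in range(SHIFTS):
        on_duty = sum(schedule[w][s] for w in range(WORKERS))
        penalty += abs(on_duty - NEEDED)
    for w in range(WORKERS):
        penalty += abs(sum(schedule[w]) - preferred_hours[w])
        penalty += 10 * sum(1 for s in range(SHIFTS)
                            if schedule[w][s] and not available[w][s])
    return -penalty

def random_schedule():
    return [[random.random() < 0.5 for _ in range(SHIFTS)]
            for _ in range(WORKERS)]

def mutate(schedule):
    child = [row[:] for row in schedule]
    w, s = random.randrange(WORKERS), random.randrange(SHIFTS)
    child[w][s] = not child[w][s]      # flip one worker-shift assignment
    return child

# Simple elitist evolution loop: keep the 10 best, refill with mutants.
population = [random_schedule() for _ in range(30)]
for _ in range(300):
    population.sort(key=fitness, reverse=True)
    elites = population[:10]
    population = elites + [mutate(random.choice(elites)) for _ in range(20)]
best = max(population, key=fitness)
```

A fitness of 0 means every shift is exactly staffed and every preference is met; the fuzzy variant would replace these hard penalties with graded membership functions.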

[AI-47] Evolution of Optimization Algorithms for Global Placement via Large Language Models

【Quick Read】: This paper tackles the problem that the design of traditional global placement optimization algorithms relies on hand tuning and requires deep domain knowledge. The key of the solution is an automated framework based on large language models (LLMs): diverse candidate algorithms are generated through carefully crafted prompts, and the candidates are then further evolved with an LLM-based genetic flow. This approach delivers substantial average HPWL (Half-Perimeter Wirelength) improvements on several benchmarks, reaching 5.05%, 5.29%, and 8.30% on MMS, ISPD2005, and ISPD2019 respectively, with up to 17% improvement on some individual cases, while also showing good generalization ability and complementing existing parameter-tuning methods.

Link: https://arxiv.org/abs/2504.17801
Authors: Xufeng Yao, Jiaxi Jiang, Yuxuan Zhao, Peiyu Liao, Yibo Lin, Bei Yu
Affiliation: unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Optimization algorithms are widely employed to tackle complex problems, but designing them manually is often labor-intensive and requires significant expertise. Global placement is a fundamental step in electronic design automation (EDA). While analytical approaches represent the state-of-the-art (SOTA) in global placement, their core optimization algorithms remain heavily dependent on heuristics and customized components, such as initialization strategies, preconditioning methods, and line search techniques. This paper presents an automated framework that leverages large language models (LLM) to evolve optimization algorithms for global placement. We first generate diverse candidate algorithms using LLM through carefully crafted prompts. Then we introduce an LLM-based genetic flow to evolve selected candidate algorithms. The discovered optimization algorithms exhibit substantial performance improvements across many benchmarks. Specifically, our design-case-specific discovered algorithms achieve average HPWL improvements of 5.05%, 5.29% and 8.30% on the MMS, ISPD2005 and ISPD2019 benchmarks, and up to 17% improvements on individual cases. Additionally, the discovered algorithms demonstrate good generalization ability and are complementary to existing parameter-tuning methods.

[AI-48] Subfunction Structure Matters: A New Perspective on Local Optima Networks

【Quick Read】: This paper asks how information about problem structure can be used effectively in the construction and analysis of local optima networks (LONs). Current LON construction is typically black-box and does not exploit knowledge about problem structure, such as interactions between variables; the same holds for LON analysis. The key of the proposed solution is to improve LON analysis with subfunction-based information, which may be known a priori or learned during search. To this end, LONs are constructed for several benchmark pseudo-boolean problems with three approaches: the standard algorithm, an algorithm using deterministic grey-box crossover, and an algorithm that selects perturbations according to learned variable-interaction information. The paper also proposes metrics related to subfunction changes in a LON and compares them with metrics from previous literature that capture other aspects of a LON. Incorporating problem structure into LON construction and analysis yields richer insight into optimization dynamics, which is crucial for understanding how difficult a given problem is for state-of-the-art linkage learning optimizers. The authors suggest incorporating problem structure as an alternative paradigm in landscape analysis for problems with known or suspected subfunction structure.

Link: https://arxiv.org/abs/2504.17799
Authors: S. L. Thomson, M. W. Przewozniczek
Affiliation: unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Local optima networks (LONs) capture fitness landscape information. They are typically constructed in a black-box manner; information about the problem structure is not utilised. This also applies to the analysis of LONs: knowledge about the problem, such as interaction between variables, is not considered. We challenge this status quo with an alternative approach: we consider how LON analysis can be improved by incorporating subfunction-based information - this can either be known a priori or learned during search. To this end, LONs are constructed for several benchmark pseudo-boolean problems using three approaches: firstly, the standard algorithm; a second algorithm which uses deterministic grey-box crossover; and a third algorithm which selects perturbations based on learned information about variable interactions. Metrics related to subfunction changes in a LON are proposed and compared with metrics from previous literature which capture other aspects of a LON. Incorporating problem structure in LON construction and analysing it can bring enriched insight into optimisation dynamics. Such information may be crucial to understanding the difficulty of solving a given problem with state-of-the-art linkage learning optimisers. In light of the results, we suggest incorporation of problem structure as an alternative paradigm in landscape analysis for problems with known or suspected subfunction structure.

[AI-49] My Precious Crash Data: Barriers and Opportunities in Encouraging Autonomous Driving Companies to Share Safety-Critical Data

【Quick Read】: This paper investigates why autonomous vehicle (AV) companies are unwilling to share safety-critical data externally, with an eye on how these barriers can inspire new approaches to promote sharing. Interviews with twelve AV company employees who work with such data reveal two previously unknown key barriers: (1) the datasets themselves embed salient explicit knowledge that is key to improving AV safety and are resource-intensive, so data sharing is fraught with politics even within a company; (2) interviewees regard AV safety knowledge as the company's private knowledge that brings competitive advantage, rather than public knowledge for social good. The paper discusses the implications of these findings for incentivizing and enabling safety-critical AV data sharing, in particular approaches to (1) debating and stratifying public and private AV safety knowledge, (2) innovating data tools and sharing pipelines to simplify the sharing of public AV safety data and knowledge, and (3) offsetting the costs of curating safety-critical data and incentivizing data sharing.

Link: https://arxiv.org/abs/2504.17792
Authors: Hauke Sandhaus, Angel Hsing-Chi Hwang, Wendy Ju, Qian Yang
Affiliation: unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: To appear in Proc. ACM Hum.-Comput. Interact., Computer-Supported Cooperative Work & Social Computing (CSCW), 2025

View abstract

Abstract:Safety-critical data, such as crash and near-crash records, are crucial to improving autonomous vehicle (AV) design and development. Sharing such data across AV companies, academic researchers, regulators, and the public can help make all AVs safer. However, AV companies rarely share safety-critical data externally. This paper aims to pinpoint why AV companies are reluctant to share safety-critical data, with an eye on how these barriers can inform new approaches to promote sharing. We interviewed twelve AV company employees who actively work with such data in their day-to-day work. Findings suggest two key, previously unknown barriers to data sharing: (1) Datasets inherently embed salient knowledge that is key to improving AV safety and are resource-intensive. Therefore, data sharing, even within a company, is fraught with politics. (2) Interviewees believed AV safety knowledge is private knowledge that brings competitive edges to their companies, rather than public knowledge for social good. We discuss the implications of these findings for incentivizing and enabling safety-critical AV data sharing, specifically, implications for new approaches to (1) debating and stratifying public and private AV safety knowledge, (2) innovating data tools and data sharing pipelines that enable easier sharing of public AV safety data and knowledge, and (3) offsetting costs of curating safety-critical data and incentivizing data sharing.

[AI-50] Artificial Intelligence health advice accuracy varies across languages and contexts

【Quick Read】: This paper evaluates mainstream large language models (LLMs) on multilingual, multi-topic public-health statements and exposes their limitations in non-European languages and on particular topics and sources. Analyzing basic health claims authorized by UK and EU registers together with 9,100 journalist-vetted public-health assertions (covering abortion, COVID-19, politics, and more), the paper finds that although the models achieve high accuracy on English-centric, textbook-style statements, performance drops in multiple non-European languages and fluctuates across topics and sources. The core contribution is to highlight the urgency of comprehensive multilingual, domain-aware validation before applying artificial intelligence to global health communication.

Link: https://arxiv.org/abs/2504.18310
Authors: Prashant Garg, Thiemo Fetzer
Affiliation: unknown
Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages, 2 figures. All data, code and materials used are freely available in Zenodo (DOI: https://doi.org/10.5281/zenodo.15281282)

View abstract

Abstract:Using basic health statements authorized by UK and EU registers and 9,100 journalist-vetted public-health assertions on topics such as abortion, COVID-19 and politics from sources ranging from peer-reviewed journals and government advisories to social media and news across the political spectrum, we benchmark six leading large language models in 21 languages, finding that, despite high accuracy on English-centric textbook claims, performance falls in multiple non-European languages and fluctuates by topic and source, highlighting the urgency of comprehensive multilingual, domain-aware validation before deploying AI in global health communication.

Machine Learning

[LG-0] Intelligent Attacks and Defense Methods in Federated Learning-enabled Energy-Efficient Wireless Networks

Link: https://arxiv.org/abs/2504.18519
Authors: Han Zhang, Hao Zhou, Medhat Elsayed, Majid Bavand, Raimundas Gaigalas, Yigit Ozcan, Melike Erol-Kantarci
Subjects: Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Federated learning (FL) is a promising technique for learning-based functions in wireless networks, thanks to its distributed implementation capability. On the other hand, distributed learning may increase the risk of exposure to malicious attacks where attacks on a local model may spread to other models by parameter exchange. Meanwhile, such attacks can be hard to detect due to the dynamic wireless environment, especially considering local models can be heterogeneous with non-independent and identically distributed (non-IID) data. Therefore, it is critical to evaluate the effect of malicious attacks and develop advanced defense techniques for FL-enabled wireless networks. In this work, we introduce a federated deep reinforcement learning-based cell sleep control scenario that enhances the energy efficiency of the network. We propose multiple intelligent attacks targeting the learning-based approach and we propose defense methods to mitigate such attacks. In particular, we have designed two attack models, generative adversarial network (GAN)-enhanced model poisoning attack and regularization-based model poisoning attack. As a counteraction, we have proposed two defense schemes, autoencoder-based defense, and knowledge distillation (KD)-enabled defense. The autoencoder-based defense method leverages an autoencoder to identify the malicious participants and only aggregate the parameters of benign local models during the global aggregation, while KD-based defense protects the model from attacks by controlling the knowledge transferred between the global model and local models.
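The autoencoder-based defense boils down to scoring each client update and averaging only the updates that look benign. The sketch below substitutes a simple distance-to-coordinatewise-median score for the autoencoder's learned reconstruction error (an assumption of this sketch, not the paper's method); the filter-then-aggregate structure is the same.

```python
import statistics

def robust_aggregate(updates, tol=3.0):
    """Average client updates after dropping suspected-malicious ones.

    The paper flags malicious participants with an autoencoder's
    reconstruction error; here a distance-to-coordinatewise-median score
    stands in for that learned anomaly score (an assumption of this sketch).
    """
    dim = len(updates[0])
    median = [statistics.median(u[i] for u in updates) for i in range(dim)]
    scores = [sum((u[i] - median[i]) ** 2 for i in range(dim)) ** 0.5
              for u in updates]
    cutoff = tol * statistics.median(scores) + 1e-12
    benign = [u for u, s in zip(updates, scores) if s <= cutoff]
    # Global aggregation over the retained (presumed benign) models only.
    return [sum(u[i] for u in benign) / len(benign) for i in range(dim)]

# Nine honest 2-parameter updates near (1, -1), one poisoned update far away.
updates = [[1.0 + 0.01 * k, -1.0 - 0.01 * k] for k in range(9)] + [[50.0, 50.0]]
agg = robust_aggregate(updates)   # close to the honest mean (1.04, -1.04)
```

The poisoned update's score is orders of magnitude above the cutoff, so it is excluded before averaging, which is exactly the effect the autoencoder filter aims for.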

[LG-1] PODNO: Proper Orthogonal Decomposition Neural Operators

Link: https://arxiv.org/abs/2504.18513
Authors: Zilan Cheng, Zhongjian Wang, Li-Lian Wang, Mejdi Azaiez
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

View abstract

Abstract:In this paper, we introduce Proper Orthogonal Decomposition Neural Operators (PODNO) for solving partial differential equations (PDEs) dominated by high-frequency components. Building on the structure of Fourier Neural Operators (FNO), PODNO replaces the Fourier transform with (inverse) orthonormal transforms derived from the Proper Orthogonal Decomposition (POD) method to construct the integral kernel. Due to the optimality of the POD basis, PODNO has the potential to outperform FNO in both accuracy and computational efficiency for high-frequency problems. From an analysis point of view, we establish the universality of a generalization of PODNO, termed the Generalized Spectral Operator (GSO). In addition, we evaluate PODNO's performance numerically on dispersive equations such as the Nonlinear Schrodinger (NLS) equation and the Kadomtsev-Petviashvili (KP) equation.
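The POD basis underlying PODNO is, for a snapshot matrix A, the set of left singular vectors of A. A minimal stdlib sketch that extracts only the dominant mode via power iteration on the covariance A·Aᵀ (full POD keeps all modes, usually via an SVD; the toy snapshots here are invented):

```python
import math

def pod_dominant_mode(snapshots, iters=200):
    """Leading POD mode via power iteration on the covariance C = A A^T.

    `snapshots` holds the columns of the snapshot matrix A. Full POD keeps
    all left singular vectors of A (usually via an SVD); this sketch only
    extracts the dominant, energy-optimal mode.
    """
    n = len(snapshots[0])
    # Covariance matrix C[i][j] = sum over snapshots s of s_i * s_j.
    C = [[sum(s[i] * s[j] for s in snapshots) for j in range(n)]
         for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]      # renormalize each iteration
    return v

# Toy snapshots that are all multiples of one spatial pattern (rank one),
# so the dominant POD mode must be that pattern, normalized.
pattern = [1.0, 2.0, 3.0, 4.0]
snapshots = [[c * p for p in pattern] for c in (0.9, 1.1, 1.0, -0.5)]
mode = pod_dominant_mode(snapshots)
```

Because the POD basis maximizes captured energy per mode, a transform built from such modes is, in this least-squares sense, optimal for the snapshot data, which is the property the paper leverages.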

[LG-2] Action-Minimization Meets Generative Modeling: Efficient Transition Path Sampling with the Onsager-Machlup Functional

Link: https://arxiv.org/abs/2504.18506
Authors: Sanjeev Raja, Martin Šípka, Michael Psenka, Tobias Kreiman, Michal Pavelka, Aditi S. Krishnapriyan
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
Comments:

View abstract

Abstract:Transition path sampling (TPS), which involves finding probable paths connecting two points on an energy landscape, remains a challenge due to the complexity of real-world atomistic systems. Current machine learning approaches use expensive, task-specific, and data-free training procedures, limiting their ability to benefit from recent advances in atomistic machine learning, such as high-quality datasets and large-scale pre-trained models. In this work, we address TPS by interpreting candidate paths as trajectories sampled from stochastic dynamics induced by the learned score function of pre-trained generative models, specifically denoising diffusion and flow matching. Under these dynamics, finding high-likelihood transition paths becomes equivalent to minimizing the Onsager-Machlup (OM) action functional. This enables us to repurpose pre-trained generative models for TPS in a zero-shot manner, in contrast with bespoke, task-specific TPS models trained in previous work. We demonstrate our approach on varied molecular systems, obtaining diverse, physically realistic transition pathways and generalizing beyond the pre-trained model’s original training dataset. Our method can be easily incorporated into new generative models, making it practically relevant as models continue to scale and improve with increased data availability.
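For intuition, here is one common discretization of the Onsager-Machlup action for overdamped dynamics dx = b(x) dt + sqrt(2D) dW; the paper instead evaluates it under the stochastic dynamics induced by a pre-trained score function, and the additive divergence correction is omitted in this sketch. The double-well drift and the straight-line candidate path are illustrative only.

```python
def om_action(path, drift, dt, diffusion=1.0):
    """Discretized Onsager-Machlup action of a 1-D path.

    For dx = b(x) dt + sqrt(2 D) dW, one common discretization is
        S = sum_t (x_{t+1} - x_t - b(x_t) dt)^2 / (4 D dt),
    up to an additive divergence term omitted in this sketch.
    """
    S = 0.0
    for x0, x1 in zip(path, path[1:]):
        r = x1 - x0 - drift(x0) * dt      # deviation from the drift flow
        S += r * r / (4.0 * diffusion * dt)
    return S

def drift(x):
    # Gradient drift b(x) = -V'(x) for the double well V(x) = (x^2 - 1)^2.
    return -4.0 * x * (x * x - 1.0)

dt, n = 0.01, 100
# A straight-line candidate transition path between the wells x = -1 and x = 1.
straight = [-1.0 + 2.0 * k / n for k in range(n + 1)]
action = om_action(straight, drift, dt)   # lower action = more probable path
```

Minimizing this action over candidate paths is exactly the "high-likelihood transition path" criterion the abstract describes; a path resting at a fixed point of the drift has zero action.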

[LG-3] Discovering Governing Equations of Geomagnetic Storm Dynamics with Symbolic Regression

Link: https://arxiv.org/abs/2504.18461
Authors: Stefano Markidis, Jonah Ekelund, Luca Pennati, Andong Hu, Ivy Peng
Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Accepted for publication in the 25th International Conference on Computational Science proceedings

View abstract

Abstract:Geomagnetic storms are large-scale disturbances of the Earth's magnetosphere driven by solar wind interactions, posing significant risks to space-based and ground-based infrastructure. The Disturbance Storm Time (Dst) index quantifies geomagnetic storm intensity by measuring global magnetic field variations. This study applies symbolic regression to derive data-driven equations describing the temporal evolution of the Dst index. We use historical data from the NASA OMNIweb database, including solar wind density, bulk velocity, convective electric field, dynamic pressure, and magnetic pressure. The PySR framework, an evolutionary algorithm-based symbolic regression library, is used to identify mathematical expressions linking dDst/dt to key solar wind parameters. The resulting models include a hierarchy of complexity levels and enable a comparison with well-established empirical models such as the Burton-McPherron-Russell and O'Brien-McPherron models. The best-performing symbolic regression models demonstrate superior accuracy in most cases, particularly during moderate geomagnetic storms, while maintaining physical interpretability. Performance evaluation on historical storm events includes the 2003 Halloween Storm, the 2015 St. Patrick's Day Storm, and a 2017 moderate storm. The results provide interpretable, closed-form expressions that capture nonlinear dependencies and thresholding effects in Dst evolution.
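The paper relies on PySR's evolutionary search; the spirit of symbolic regression - score candidate expressions against data and keep the best - can be conveyed with a stdlib toy. The candidate set and the synthetic "driver → dDst/dt" data below are invented for illustration and are far simpler than PySR's expression trees.

```python
import math

def fit_scale(feature_vals, y):
    """Closed-form least squares for the one-parameter model y ≈ a * f(x)."""
    den = sum(f * f for f in feature_vals) or 1e-12
    a = sum(f * t for f, t in zip(feature_vals, y)) / den
    mse = sum((a * f - t) ** 2 for f, t in zip(feature_vals, y)) / len(y)
    return a, mse

def symbolic_search(xs, y, candidates):
    """Return the (name, coefficient, mse) of the best-fitting expression."""
    best = None
    for name, fn in candidates:
        a, mse = fit_scale([fn(x) for x in xs], y)
        if best is None or mse < best[2]:
            best = (name, a, mse)
    return best

# Synthetic data generated from a known law y = 2 x^2, standing in for
# dDst/dt as a function of a single solar-wind driver.
xs = [0.1 * k for k in range(1, 30)]
y = [2.0 * x * x for x in xs]
candidates = [
    ("x", lambda x: x),
    ("x^2", lambda x: x * x),
    ("sqrt(x)", lambda x: math.sqrt(x)),
    ("exp(-x)", lambda x: math.exp(-x)),
]
best = symbolic_search(xs, y, candidates)
```

PySR replaces this fixed candidate list with an evolutionary search over expression trees and trades off accuracy against expression complexity, which is how the paper obtains its hierarchy of models.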

[LG-4] Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training

Link: https://arxiv.org/abs/2504.18454
Authors: Hiroki Naganuma, Xinzhi Zhang, Man-Chung Yue, Ioannis Mitliagkas, Philipp A. Witte, Russell J. Hewett, Yin Tat Lee
Subjects: Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Following AI scaling trends, frontier models continue to grow in size and continue to be trained on larger datasets. Training these models requires huge investments in exascale computational resources, which has in turn driven development of distributed deep learning methods. Data parallelism is an essential approach to speed up training, but it requires frequent global communication between workers, which can bottleneck training at the largest scales. In this work, we propose a method called Pseudo-Asynchronous Local SGD (PALSGD) to improve the efficiency of data-parallel training. PALSGD is an extension of Local SGD (Stich, 2018) and DiLoCo (Douillard et al., 2023), designed to further reduce communication frequency by introducing a pseudo-synchronization mechanism. PALSGD allows the use of longer synchronization intervals compared to standard Local SGD. Despite the reduced communication frequency, the pseudo-synchronization approach ensures that model consistency is maintained, leading to performance results comparable to those achieved with more frequent synchronization. Furthermore, we provide a theoretical analysis of PALSGD, establishing its convergence and deriving its convergence rate. This analysis offers insights into the algorithm's behavior and performance guarantees. We evaluated PALSGD on image classification and language modeling tasks. Our results show that PALSGD achieves better performance in less time compared to existing methods like Distributed Data Parallel (DDP), and DiLoCo. Notably, PALSGD trains 18.4% faster than DDP on ImageNet-1K with ResNet-50, 24.4% faster than DDP on TinyStories with GPT-Neo-125M, and 21.1% faster than DDP on TinyStories with GPT-Neo-8M.
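PALSGD builds on Local SGD, in which each worker takes several local gradient steps between global averaging rounds. A stdlib sketch of that baseline on a toy quadratic objective (the pseudo-synchronization mechanism that distinguishes PALSGD is not reproduced here, and the targets are invented):

```python
def local_sgd(num_workers=4, sync_every=8, rounds=25, lr=0.1):
    """Local SGD on toy quadratics f_k(x) = 0.5 * (x - target_k)^2.

    Each worker runs `sync_every` local gradient steps before all models are
    averaged; that averaging is the global communication step whose frequency
    PALSGD further reduces via pseudo-synchronization (not reproduced here).
    """
    targets = [1.0, 2.0, 3.0, 4.0][:num_workers]  # heterogeneous local optima
    models = [0.0] * num_workers
    for _ in range(rounds):
        for k in range(num_workers):
            for _ in range(sync_every):
                grad = models[k] - targets[k]     # gradient of f_k
                models[k] -= lr * grad
        avg = sum(models) / num_workers           # synchronization step
        models = [avg] * num_workers
    return models[0]

x = local_sgd()   # converges to the average of the targets, 2.5
```

Raising `sync_every` cuts communication but lets worker models drift toward their own optima between averages; managing that drift with longer intervals is precisely the trade-off PALSGD targets.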

[LG-5] Enhancing Strawberry Yield Forecasting with Backcasted IoT Sensor Data and Machine Learning

Link: https://arxiv.org/abs/2504.18451
Authors: Tewodros Alemu Ayall, Andy Li, Matthew Beddows, Milan Markovic, Georgios Leontidis
Subjects: Machine Learning (cs.LG)
Comments: 20 pages, 11 figures

View abstract

Abstract:Due to rapid population growth globally, digitally-enabled agricultural sectors are crucial for sustainable food production and making informed decisions about resource management for farmers and various stakeholders. The deployment of Internet of Things (IoT) technologies that collect real-time observations of various environmental (e.g., temperature, humidity, etc.) and operational factors (e.g., irrigation) influencing production is often seen as a critical step to enable additional novel downstream tasks, such as AI-based yield forecasting. However, since AI models require large amounts of data, this creates practical challenges in a real-world dynamic farm setting where IoT observations would need to be collected over a number of seasons. In this study, we deployed IoT sensors in strawberry production polytunnels for two growing seasons to collect environmental data, including water usage, external and internal temperature, external and internal humidity, soil moisture, soil temperature, and photosynthetically active radiation. The sensor observations were combined with manually provided yield records spanning a period of four seasons. To bridge the gap of missing IoT observations for two additional seasons, we propose an AI-based backcasting approach to generate synthetic sensor observations using historical weather data from a nearby weather station and the existing polytunnel observations. We built an AI-based yield forecasting model to evaluate our approach using the combination of real and synthetic observations. Our results demonstrated that incorporating synthetic data improved yield forecasting accuracy, with models incorporating synthetic data outperforming those trained only on historical yield, weather records, and real sensor data.

[LG-6] Boosting-Enabled Robust System Identification of Partially Observed LTI Systems Under Heavy-Tailed Noise

Link: https://arxiv.org/abs/2504.18444
Authors: Vinay Kanakeri, Aritra Mitra
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

View abstract

Abstract:We consider the problem of system identification of partially observed linear time-invariant (LTI) systems. Given input-output data, we provide non-asymptotic guarantees for identifying the system parameters under general heavy-tailed noise processes. Unlike previous works that assume Gaussian or sub-Gaussian noise, we consider significantly broader noise distributions that are required to admit only up to the second moment. For this setting, we leverage tools from robust statistics to propose a novel system identification algorithm that exploits the idea of boosting. Despite the much weaker noise assumptions, we show that our proposed algorithm achieves sample complexity bounds that nearly match those derived under sub-Gaussian noise. In particular, we establish that our bounds retain a logarithmic dependence on the prescribed failure probability. Interestingly, we show that such bounds can be achieved by requiring just a finite fourth moment on the excitatory input process.
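The boosting idea - run a weak estimator on disjoint blocks of the data and aggregate with a median to boost its confidence under heavy tails - can be illustrated on scalar mean estimation. This is only an analogy for intuition; the paper applies the idea to system parameter estimates, and the data below are synthetic.

```python
import random
import statistics

def median_of_means(samples, num_blocks=9):
    """Boost a plain mean estimate: median of the means of disjoint blocks.

    Confidence boosting of this style keeps the estimator stable under
    heavy-tailed noise admitting only a few finite moments; the paper uses
    the idea for system parameters, here it is shown for a scalar mean.
    """
    block = max(1, len(samples) // num_blocks)
    means = [statistics.fmean(samples[i * block:(i + 1) * block])
             for i in range(num_blocks)]
    return statistics.median(means)

random.seed(1)
# Mostly well-behaved noise around 5.0, plus a few enormous outliers.
data = [5.0 + random.gauss(0, 1) for _ in range(900)] + [1e6] * 3
random.shuffle(data)
mom = median_of_means(data)       # stays near 5.0
naive = statistics.fmean(data)    # dragged far away by the outliers
```

At most three of the nine blocks can be contaminated, so the median of block means ignores the outliers entirely, while the plain mean is ruined by them; this is also why such estimators retain a logarithmic dependence on the failure probability.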

[LG-7] An Axiomatic Assessment of Entropy- and Variance-based Uncertainty Quantification in Regression

Link: https://arxiv.org/abs/2504.18433
Authors: Christopher Bülte, Yusuf Sale, Timo Löhr, Paul Hofman, Gitta Kutyniok, Eyke Hüllermeier
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

View abstract

Abstract:Uncertainty quantification (UQ) is crucial in machine learning, yet most (axiomatic) studies of uncertainty measures focus on classification, leaving a gap in regression settings with limited formal justification and evaluations. In this work, we introduce a set of axioms to rigorously assess measures of aleatoric, epistemic, and total uncertainty in supervised regression. By utilizing a predictive exponential family, we can generalize commonly used approaches for uncertainty representation and corresponding uncertainty measures. More specifically, we analyze the widely used entropy- and variance-based measures regarding limitations and challenges. Our findings provide a principled foundation for UQ in regression, offering theoretical insights and practical guidelines for reliable uncertainty assessment.

[LG-8] Online learning to accelerate nonlinear PDE solvers: applied to multiphase porous media flow

Link: https://arxiv.org/abs/2504.18414
Authors: Vinicius L S Silva, Pablo Salinas, Claire E Heaney, Matthew Jackson, Christopher C Pain
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

View abstract

Abstract:We propose a novel type of nonlinear solver acceleration for systems of nonlinear partial differential equations (PDEs) that is based on online/adaptive learning. It is applied in the context of multiphase flow in porous media. The proposed method relies on four pillars: (i) dimensionless numbers as input parameters for the machine learning model, (ii) a simplified numerical model (two-dimensional) for the offline training, (iii) dynamic control of a nonlinear solver tuning parameter (numerical relaxation), and (iv) online learning for real-time improvement of the machine learning model. This strategy decreases the number of nonlinear iterations by dynamically modifying a single global parameter, the relaxation factor, and by adaptively learning the attributes of each numerical model on-the-run. Furthermore, this work performs a sensitivity study on the dimensionless parameters (machine learning features), assesses the efficacy of various machine learning models, demonstrates a decrease in nonlinear iterations using our method in more intricate, realistic three-dimensional models, and fully couples a machine learning model into an open-source multiphase flow simulator, achieving up to an 85% reduction in computational time.

[LG-9] Three Types of Calibration with Properties and their Semantic and Formal Relationships

Link: https://arxiv.org/abs/2504.18395
Authors: Rabanus Derr, Jessie Finocchiaro, Robert C. Williamson
Subjects: Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Fueled by discussions around "trustworthiness" and algorithmic fairness, calibration of predictive systems has regained scholars' attention. The vanilla definition and understanding of calibration is, simply put: on all days on which the rain probability has been predicted to be p, the actual frequency of rain days was p. However, the increased attention has led to an immense variety of new notions of "calibration." Some of the notions are incomparable, serve different purposes, or imply each other. In this work, we provide two accounts which motivate calibration: self-realization of forecasted properties and precise estimation of the losses incurred by decision makers relying on forecasts. We substantiate the former via the reflection principle and the latter by actuarial fairness. For both accounts we formulate prototypical definitions via properties Γ of outcome distributions, e.g., the mean or median. The prototypical definition for self-realization, which we call Γ-calibration, is equivalent to a certain type of swap regret under certain conditions. These implications are strongly connected to the omniprediction learning paradigm. The prototypical definition for precise loss estimation is a modification of decision calibration adopted from Zhao et al. [73]. For binary outcome sets both prototypical definitions coincide under appropriate choices of reference properties. For higher-dimensional outcome sets, both prototypical definitions can be subsumed by a natural extension of the binary definition, called distribution calibration with respect to a property. We conclude by commenting on the role of groupings in both accounts of calibration, often used to obtain multicalibration. In sum, this work provides a semantic map of calibration in order to navigate a fragmented terrain of notions and definitions.

[LG-10] Machine Learning and Statistical Insights into Hospital Stay Durations: The Italian EHR Case

Link: https://arxiv.org/abs/2504.18393
Authors: Marina Andric, Mauro Dragoni
Subjects: Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Length of hospital stay is a critical metric for assessing healthcare quality and optimizing hospital resource management. This study aims to identify factors influencing LoS within the Italian healthcare context, using a dataset of hospitalization records from over 60 healthcare facilities in the Piedmont region, spanning from 2020 to 2023. We explored a variety of features, including patient characteristics, comorbidities, admission details, and hospital-specific factors. Significant correlations were found between LoS and features such as age group, comorbidity score, admission type, and the month of admission. Machine learning models, specifically CatBoost and Random Forest, were used to predict LoS. The highest R2 score, 0.49, was achieved with CatBoost, demonstrating good predictive performance.

[LG-11] Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels AISTATS2025

Link: https://arxiv.org/abs/2504.18385
Authors: Danial Dervovic, Michael Cashmore
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 9 pages, 4 figures. Accepted to AISTATS 2025

View abstract

Abstract:Missing data in supervised learning is well-studied, but the specific issue of missing labels during model evaluation has been overlooked. Ignoring samples with missing values, a common solution, can introduce bias, especially when data is Missing Not At Random (MNAR). We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC. This method not only offers point estimates but also a predictive distribution for these quantities when labels are missing. We empirically show that the predictive distribution’s location and shape are generally correct, even in the MNAR regime. Moreover, we establish that this distribution is approximately Gaussian and provide finite-sample convergence bounds. Additionally, a robustness proof is presented, confirming the validity of the approximation under a realistic error model.
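The multiple-imputation scheme can be sketched as: fill each missing label by sampling from the classifier's predicted probability, compute the metric, and repeat to obtain a predictive distribution rather than a point estimate. The toy evaluation set below is invented, and plain accuracy stands in for the paper's precision/recall/ROC-AUC.

```python
import random
import statistics

def imputed_metric_distribution(y_true, p_pos, y_pred, draws=200, seed=0):
    """Multiple imputation for accuracy when some labels are missing.

    Missing entries of `y_true` (None) are filled by sampling from the
    model's positive-class probability `p_pos`; repeating the draws yields a
    predictive distribution of the metric instead of a single point value.
    """
    rng = random.Random(seed)
    accs = []
    for _ in range(draws):
        hits = 0
        for yt, p, yp in zip(y_true, p_pos, y_pred):
            label = yt if yt is not None else int(rng.random() < p)
            hits += int(label == yp)
        accs.append(hits / len(y_true))
    return statistics.fmean(accs), statistics.pstdev(accs)

# Hypothetical evaluation set: six observed labels, two missing.
y_true = [1, 0, 1, 1, 0, 0, None, None]
p_pos = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.95, 0.05]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
mean_acc, sd_acc = imputed_metric_distribution(y_true, p_pos, y_pred)
```

Simply discarding the two unlabeled samples would give a single biased number; the imputed distribution instead reports both a location (near the true expected accuracy) and a spread reflecting the label uncertainty.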

[LG-12] Explainable AI for UAV Mobility Management: A Deep Q-Network Approach for Handover Minimization

Link: https://arxiv.org/abs/2504.18371
Authors: Irshad A. Meer, Bruno Hörmann, Mustafa Ozger, Fabien Geyer, Alberto Viseras, Dominic Schupke, Cicek Cavdar
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Submitted to IEEE PIMRC 2025

View abstract

Abstract:The integration of unmanned aerial vehicles (UAVs) into cellular networks presents significant mobility management challenges, primarily due to frequent handovers caused by probabilistic line-of-sight conditions with multiple ground base stations (BSs). To tackle these challenges, reinforcement learning (RL)-based methods, particularly deep Q-networks (DQN), have been employed to optimize handover decisions dynamically. However, a major drawback of these learning-based approaches is their black-box nature, which limits interpretability in the decision-making process. This paper introduces an explainable AI (XAI) framework that incorporates Shapley Additive Explanations (SHAP) to provide deeper insights into how various state parameters influence handover decisions in a DQN-based mobility management system. By quantifying the impact of key features such as reference signal received power (RSRP), reference signal received quality (RSRQ), buffer status, and UAV position, our approach enhances the interpretability and reliability of RL-based handover solutions. To validate and compare our framework, we utilize real-world network performance data collected from UAV flight trials. Simulation results show that our method provides intuitive explanations for policy decisions, effectively bridging the gap between AI-driven models and human decision-makers.

[LG-13] SSA-UNet: Advanced Precipitation Nowcasting via Channel Shuffling

Link: https://arxiv.org/abs/2504.18309
Authors: Marco Turzi, Siamak Mehrkanoon
Subjects: Machine Learning (cs.LG)
Comments: 8 pages, 8 figures

View abstract

Abstract:Weather forecasting is essential for facilitating diverse socio-economic activity and environmental conservation initiatives. Deep learning techniques are increasingly being explored as complementary approaches to Numerical Weather Prediction (NWP) models, offering potential benefits such as reduced complexity and enhanced adaptability in specific applications. This work presents a novel design, Small Shuffled Attention UNet (SSA-UNet), which enhances SmaAt-UNet's architecture by including a shuffle channeling mechanism to optimize performance and diminish complexity. To assess its efficacy, this architecture and its reduced variant are examined and trained on two datasets: a Dutch precipitation dataset from 2016 to 2019, and a French cloud cover dataset containing radar images from 2017 to 2018. Three output configurations of the proposed architecture are evaluated, yielding outputs of 1, 6, and 12 precipitation maps, respectively. To better understand how this model operates and produces its predictions, a gradient-based approach called Grad-CAM is used to analyze the outputs generated. The analysis of the heatmaps generated by Grad-CAM facilitated the identification of regions within the input maps that the model considers most informative for generating its predictions. The implementation of SSA-UNet can be found on our GitHub: this https URL
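The shuffle channeling mechanism presumably follows the ShuffleNet-style reshape-transpose permutation of channels (an assumption here; the abstract does not spell out the operator). A minimal sketch of that index permutation, applied to channel identifiers rather than a full (N, C, H, W) tensor:

```python
def channel_shuffle(channels, groups):
    """ShuffleNet-style channel shuffle as an index permutation.

    Reshape the channel axis to (groups, channels_per_group), transpose, and
    flatten, so subsequent group convolutions see channels from every group.
    In a network the same permutation is applied to the C axis of an
    (N, C, H, W) tensor; here `channels` is just a list of identifiers.
    """
    c = len(channels)
    assert c % groups == 0, "channel count must be divisible by groups"
    per = c // groups
    # Element (g, p) of the reshaped view moves to output position (p, g).
    return [channels[g * per + p] for p in range(per) for g in range(groups)]

shuffled = channel_shuffle(list(range(6)), groups=2)   # [0, 3, 1, 4, 2, 5]
```

The permutation adds no parameters and negligible compute, which is how such a mechanism can "optimize performance and diminish complexity" at the same time.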

[LG-14] Deep Reinforcement Learning Based Navigation with Macro Actions and Topological Maps

链接: https://arxiv.org/abs/2504.18300
作者: Simon Hakenes,Tobias Glasmachers
类目: Machine Learning (cs.LG)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:This paper addresses the challenge of navigation in large, visually complex environments with sparse rewards. We propose a method that uses object-oriented macro actions grounded in a topological map, allowing a simple Deep Q-Network (DQN) to learn effective navigation policies. The agent builds a map by detecting objects from RGBD input and selecting discrete macro actions that correspond to navigating to these objects. This abstraction drastically reduces the complexity of the underlying reinforcement learning problem and enables generalization to unseen environments. We evaluate our approach in a photorealistic 3D simulation and show that it significantly outperforms a random baseline under both immediate and terminal reward conditions. Our results demonstrate that topological structure and macro-level abstraction can enable sample-efficient learning even from pixel data.
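The paper trains a DQN over object-oriented macro actions; as a much-simplified sketch of the same idea, tabular Q-learning over a toy topological map (nodes = detected objects, macro action = "navigate to a neighboring object") already learns the object chain to the goal. The graph, rewards, and hyperparameters below are illustrative, not from the paper:

```python
import random

# Toy topological map: nodes are detected objects, edges are navigable links.
EDGES = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
GOAL, GAMMA, ALPHA, EPS = 3, 0.9, 0.5, 0.2

def step(node, target):
    """A macro action = 'navigate to neighboring object `target`'."""
    reward = 1.0 if target == GOAL else -0.01
    return target, reward, target == GOAL

random.seed(0)
Q = {(s, a): 0.0 for s in EDGES for a in EDGES[s]}
for _ in range(500):                       # episodes
    s = 0
    for _ in range(20):                    # macro steps per episode
        acts = EDGES[s]
        a = (random.choice(acts) if random.random() < EPS
             else max(acts, key=lambda x: Q[(s, x)]))
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in EDGES[s2])
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2
        if done:
            break

# Greedy rollout from the start node follows the shortest object chain.
path, s = [0], 0
for _ in range(10):
    if s == GOAL:
        break
    s = max(EDGES[s], key=lambda a: Q[(s, a)])
    path.append(s)
```

The abstraction that makes this tractable is exactly the one the abstract describes: the agent chooses among a handful of object-level macro actions instead of low-level motor commands.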

[LG-15] A comprehensive review of classifier probability calibration metrics

链接: https://arxiv.org/abs/2504.18278
作者: Richard Oliver Lane
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 60 pages, 7 figures

点击查看摘要

Abstract:Probabilities or confidence values produced by artificial intelligence (AI) and machine learning (ML) models often do not reflect their true accuracy, with some models being under or over confident in their predictions. For example, if a model is 80% sure of an outcome, is it correct 80% of the time? Probability calibration metrics measure the discrepancy between confidence and accuracy, providing an independent assessment of model calibration performance that complements traditional accuracy metrics. Understanding calibration is important when the outputs of multiple systems are combined, for assurance in safety or business-critical contexts, and for building user trust in models. This paper provides a comprehensive review of probability calibration metrics for classifier and object detection models, organising them according to a number of different categorisations to highlight their relationships. We identify 82 major metrics, which can be grouped into four classifier families (point-based, bin-based, kernel or curve-based, and cumulative) and an object detection family. For each metric, we provide equations where available, facilitating implementation and comparison by future researchers.
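A representative member of the bin-based family surveyed here is the Expected Calibration Error (ECE): the bin-weighted gap between average confidence and empirical accuracy. A minimal sketch (standard definition, not taken from the paper):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-based ECE: weighted mean |accuracy - confidence| over bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 -> last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Two predictions at 95% confidence, only one correct: ECE = |0.5 - 0.95| = 0.45
ece = expected_calibration_error([0.95, 0.95], [1, 0])
```

This captures the abstract's motivating question directly: a model that is 95% sure but right only half the time has a large calibration gap despite looking confident.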

[LG-16] Studying Small Language Models with Susceptibilities

链接: https://arxiv.org/abs/2504.18274
作者: Garrett Baker,George Wang,Jesse Hoogland,Daniel Murfet
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small, controlled perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. Building a set of perturbations (probes) yields a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer. Susceptibilities link local learning coefficients from singular learning theory with linear-response theory, and quantify how local loss landscape geometry deforms under shifts in the data distribution.

[LG-17] Efficient Learning on Large Graphs using a Densifying Regularity Lemma

链接: https://arxiv.org/abs/2504.18273
作者: Jonathan Kouchly,Ben Finkelshtein,Michael Bronstein,Ron Levie
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning on large graphs presents significant challenges, with traditional Message Passing Neural Networks suffering from computational and memory costs scaling linearly with the number of edges. We introduce the Intersecting Block Graph (IBG), a low-rank factorization of large directed graphs based on combinations of intersecting bipartite components, each consisting of a pair of communities, for source and target nodes. By giving less weight to non-edges, we show how to efficiently approximate any graph, sparse or dense, by a dense IBG. Specifically, we prove a constructive version of the weak regularity lemma, showing that for any chosen accuracy, every graph, regardless of its size or sparsity, can be approximated by a dense IBG whose rank depends only on the accuracy. This dependence of the rank solely on the accuracy, and not on the sparsity level, is in contrast to previous forms of the weak regularity lemma. We present a graph neural network architecture operating on the IBG representation of the graph and demonstrating competitive performance on node classification, spatio-temporal graph analysis, and knowledge graph completion, while having memory and computational complexity linear in the number of nodes rather than edges.

[LG-18] Local Statistical Parity for the Estimation of Fair Decision Trees

链接: https://arxiv.org/abs/2504.18262
作者: Andrea Quintanilla,Johan Van Horebeek
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Given the high computational complexity of decision tree estimation, classical methods construct a tree by adding one node at a time in a recursive way. To facilitate promoting fairness, we propose a fairness criterion local to the tree nodes. We prove how it is related to the Statistical Parity criterion, popular in the Algorithmic Fairness literature, and show how to incorporate it into standard recursive tree estimation algorithms. We present a tree estimation algorithm called Constrained Logistic Regression Tree (C-LRT), which is a modification of the standard CART algorithm using locally linear classifiers and imposing restrictions as done in Constrained Logistic Regression. Finally, we evaluate the performance of trees estimated with C-LRT on datasets commonly used in the Algorithmic Fairness literature, using various classification and fairness metrics. The results confirm that C-LRT successfully allows to control and balance accuracy and fairness.
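The Statistical Parity criterion that the local criterion relates to has a simple global form: the difference in positive-prediction rates between protected groups. A minimal sketch of that global metric (the paper's node-local variant is not reproduced here):

```python
def statistical_parity_difference(y_pred, group):
    """P(y_hat = 1 | group = 0) - P(y_hat = 1 | group = 1).

    A value of 0 means the positive-prediction rate is identical across
    the two protected groups (Statistical Parity holds)."""
    rates = {}
    for g in (0, 1):
        preds = [p for p, a in zip(y_pred, group) if a == g]
        rates[g] = sum(preds) / len(preds)
    return rates[0] - rates[1]

# Group 0 receives positives 3/4 of the time, group 1 only 1/4: gap = 0.5
spd = statistical_parity_difference(
    y_pred=[1, 1, 1, 0, 1, 0, 0, 0],
    group=[0, 0, 0, 0, 1, 1, 1, 1],
)
```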

[LG-19] DualRAG : A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering

链接: https://arxiv.org/abs/2504.18243
作者: Rong Cheng,Jinyi Liu,YAN ZHENG,Fei Ni,Jiazhen Du,Hangyu Mao,Fuzheng Zhang,Bo Wang,Jianye HAO
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-Hop Question Answering (MHQA) tasks permeate real-world applications, posing challenges in orchestrating multi-step reasoning across diverse knowledge domains. While existing approaches have been improved with iterative retrieval, they still struggle to identify and organize dynamic knowledge. To address this, we propose DualRAG, a synergistic dual-process framework that seamlessly integrates reasoning and retrieval. DualRAG operates through two tightly coupled processes: Reasoning-augmented Querying (RaQ) and progressive Knowledge Aggregation (pKA). They work in concert: as RaQ navigates the reasoning path and generates targeted queries, pKA ensures that newly acquired knowledge is systematically integrated to support coherent reasoning. This creates a virtuous cycle of knowledge enrichment and reasoning refinement. Through targeted fine-tuning, DualRAG preserves its sophisticated reasoning and retrieval capabilities even in smaller-scale models, demonstrating its versatility and core advantages across different scales. Extensive experiments demonstrate that this dual-process approach substantially improves answer accuracy and coherence, approaching, and in some cases surpassing, the performance achieved with oracle knowledge access. These results establish DualRAG as a robust and efficient solution for complex multi-hop reasoning tasks.

[LG-20] Switch-Based Multi-Part Neural Network

链接: https://arxiv.org/abs/2504.18241
作者: Surajit Majumder,Paritosh Ranjan,Prodip Roy,Bhuban Padhan
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 12 pages, 4 figures

点击查看摘要

Abstract:This paper introduces a decentralized and modular neural network framework designed to enhance the scalability, interpretability, and performance of artificial intelligence (AI) systems. At the heart of this framework is a dynamic switch mechanism that governs the selective activation and training of individual neurons based on input characteristics, allowing neurons to specialize in distinct segments of the data domain. This approach enables neurons to learn from disjoint subsets of data, mimicking biological brain function by promoting task specialization and improving the interpretability of neural network behavior. Furthermore, the paper explores the application of federated learning and decentralized training for real-world AI deployments, particularly in edge computing and distributed environments. By simulating localized training on non-overlapping data subsets, we demonstrate how modular networks can be efficiently trained and evaluated. The proposed framework also addresses scalability, enabling AI systems to handle large datasets and distributed processing while preserving model transparency and interpretability. Finally, we discuss the potential of this approach in advancing the design of scalable, privacy-preserving, and efficient AI systems for diverse applications.

[LG-21] Ultra-fast feature learning for the training of two-layer neural networks in the two-timescale regime

链接: https://arxiv.org/abs/2504.18208
作者: Raphaël Barboni(ÉNS-PSL),Gabriel Peyré(CNRS and ÉNS-PSL),François-Xavier Vialard(LIGM)
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We study the convergence of gradient methods for the training of mean-field single hidden layer neural networks with square loss. Observing this is a separable non-linear least-square problem which is linear w.r.t. the outer layer’s weights, we consider a Variable Projection (VarPro) or two-timescale learning algorithm, thereby eliminating the linear variables and reducing the learning problem to the training of the feature distribution. Whereas most convergence rates for the training of neural networks rely on a neural tangent kernel analysis where features are fixed, we show such a strategy enables provable convergence rates for the sampling of a teacher feature distribution. Precisely, in the limit where the regularization strength vanishes, we show that the dynamic of the feature distribution corresponds to a weighted ultra-fast diffusion equation. Relying on recent results on the asymptotic behavior of such PDEs, we obtain guarantees for the convergence of the trained feature distribution towards the teacher feature distribution in a teacher-student setup.
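The VarPro / two-timescale idea can be seen on a toy separable problem: the linear variable is solved in closed form at every step (fast timescale) while only the nonlinear variable is iterated (slow timescale). A hedged 1-D sketch, far simpler than the mean-field setting of the paper:

```python
import numpy as np

# Toy separable least squares: fit y = a * exp(b * x), where `a` (linear)
# is eliminated in closed form and only `b` (nonlinear) is iterated.
x = np.linspace(0.0, 2.0, 50)
y = 2.0 * np.exp(-0.5 * x)          # ground truth: a = 2, b = -0.5

def projected_loss(b):
    """Loss after variable projection: the optimal a is solved analytically."""
    phi = np.exp(b * x)
    a = phi @ y / (phi @ phi)        # closed-form linear solve (fast timescale)
    r = y - a * phi
    return r @ r, a

b, lr, h = -1.0, 0.005, 1e-6
for _ in range(2000):                # slow timescale: descend on b only
    g = (projected_loss(b + h)[0] - projected_loss(b - h)[0]) / (2 * h)
    b -= lr * g

loss, a = projected_loss(b)
```

Eliminating the linear variable shrinks the search space to the nonlinear parameters alone, which is the reduction the abstract describes at the level of the feature distribution.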

[LG-22] A Machine Learning Approach For Bitcoin Forecasting

链接: https://arxiv.org/abs/2504.18206
作者: Stefano Sossi-Rojas,Gissel Velarde,Damian Zieba
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 15 pages

点击查看摘要

Abstract:Bitcoin is one of the cryptocurrencies that is gaining more popularity in recent years. Previous studies have shown that closing price alone is not enough to forecast stock market series. We introduce a new set of time series and demonstrate that a subset is necessary to improve directional accuracy based on a machine learning ensemble. In our experiments, we study which time series and machine learning algorithms deliver the best results. We found that the most relevant time series that contribute to improving directional accuracy are Open, High and Low, with the largest contribution of Low in combination with an ensemble of Gated Recurrent Unit network and a baseline forecast. The relevance of other Bitcoin-related features that are not price-related is negligible. The proposed method delivers similar performance to the state-of-the-art when observing directional accuracy.

[LG-23] An Open-Source and Reproducible Implementation of LSTM and GRU Networks for Time Series Forecasting

链接: https://arxiv.org/abs/2504.18185
作者: Gissel Velarde,Pedro Branez,Alejandro Bueno,Rodrigo Heredia,Mateo Lopez-Ledezma
类目: Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:This paper introduces an open-source and reproducible implementation of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) Networks for time series forecasting. We evaluated LSTM and GRU networks because of their performance reported in related work. We describe our method and its results on two datasets. The first dataset is the S&P BSE BANKEX, composed of stock time series (closing prices) of ten financial institutions. The second dataset, called Activities, comprises ten synthetic time series resembling weekly activities with five days of high activity and two days of low activity. We report Root Mean Squared Error (RMSE) between actual and predicted values, as well as Directional Accuracy (DA). We show that a single time series from a dataset can be used to adequately train the networks if the sequences in the dataset contain patterns that repeat, even with certain variation, and are properly processed. For 1-step ahead and 20-step ahead forecasts, LSTM and GRU networks significantly outperform a baseline on the Activities dataset. The baseline simply repeats the last available value. On the stock market dataset, the networks perform just like the baseline, possibly due to the nature of these series. We release the datasets used as well as the implementation with all experiments performed to enable future comparisons and to make our research reproducible.
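The two evaluation metrics and the last-value baseline are easy to make concrete. Definitions of DA vary across the literature; the sketch below uses one common variant (sign agreement of predicted vs. actual change), and the "model" forecasts are hypothetical numbers, not results from the paper:

```python
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

def directional_accuracy(actual, predicted):
    """Fraction of steps where the predicted change from the last observed
    value has the same sign as the actual change (flat counts only if both
    directions are flat)."""
    hits = 0
    for t in range(1, len(actual)):
        true_dir = actual[t] - actual[t - 1]
        pred_dir = predicted[t] - actual[t - 1]
        hits += (true_dir > 0) == (pred_dir > 0) and (true_dir < 0) == (pred_dir < 0)
    return hits / (len(actual) - 1)

series = [10.0, 11.0, 10.5, 12.0, 11.5]
naive = [series[0]] + series[:-1]        # baseline: repeat the last value
model = [10.0, 10.8, 10.7, 11.2, 11.9]  # hypothetical forecasts
baseline_rmse = rmse(series, naive)
da_naive = directional_accuracy(series, naive)
da_model = directional_accuracy(series, model)
```

Note why the last-value baseline is hard to beat on RMSE for near-random-walk series like stock prices, yet trivially poor on DA: it never predicts a direction at all.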

[LG-24] Unveiling 3D Ocean Biogeochemical Provinces: A Machine Learning Approach for Systematic Clustering and Validation

链接: https://arxiv.org/abs/2504.18181
作者: Yvonne Jenniges,Maike Sonnewald,Sebastian Maneth,Are Olsen,Boris P. Koch
类目: Machine Learning (cs.LG)
*备注: Submitted to Ecological Informatics. Images in this preprint are of lower resolution than in the journal submission

点击查看摘要

Abstract:Defining ocean regions and water masses helps to understand marine processes and can serve downstream-tasks such as defining marine protected areas. However, such definitions are often a result of subjective decisions potentially producing misleading, unreproducible results. Here, the aim was to objectively define regions of the North Atlantic. For this, a data-driven, systematic machine learning approach was applied to generate and validate ocean clusters employing external, internal and relative validation techniques. About 300 million measured salinity, temperature, and oxygen, nitrate, phosphate and silicate concentration values served as input for various clustering methods (KMeans, agglomerative Ward, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN)). Uniform Manifold Approximation and Projection (UMAP) emphasised (dis-)similarities in the data while reducing dimensionality. Based on a systematic validation of the considered clustering methods and their hyperparameters, the results showed that UMAP-DBSCAN best represented the data. To address stochastic variability, 100 UMAP-DBSCAN clustering runs were conducted and aggregated using Native Emergent Manifold Interrogation (NEMI), producing a final set of 321 clusters. Reproducibility was evaluated by calculating the ensemble overlap (88.81 ± 1.8%) and the mean grid cell-wise uncertainty estimated by NEMI (15.49 ± 20%). The presented clustering results agreed very well with common water mass definitions. This study revealed a more detailed regionalization compared to previous concepts such as the Longhurst provinces. The applied method is objective, efficient and reproducible and will support future research focusing on biogeochemical differences and changes in oceanic regions.
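The ensemble-overlap idea behind the reproducibility check (how much do two clustering runs agree, regardless of label names?) can be illustrated with the classic Rand index, a simple pair-counting agreement score. This is a generic sketch, not NEMI's actual aggregation procedure:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree
    (same-cluster in both, or split in both); label names are irrelevant."""
    agree, total = 0, 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

# Two runs that find the same partition under permuted label names agree fully.
run1 = [0, 0, 1, 1, 2, 2]
run2 = [2, 2, 0, 0, 1, 1]
run3 = [0, 0, 0, 1, 1, 1]
perfect = rand_index(run1, run2)
partial = rand_index(run1, run3)
```

Pair-counting measures like this are what make "88.81% ensemble overlap" a meaningful statement across stochastic UMAP-DBSCAN runs whose cluster labels are arbitrary.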

[LG-25] A Generative Graph Contrastive Learning Model with Global Signal

链接: https://arxiv.org/abs/2504.18148
作者: Xiaofan Wei,Binyan Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph contrastive learning (GCL) has garnered significant attention recently since it learns complex structural information from graphs through self-supervised learning manner. However, prevalent GCL models may suffer from performance degradation due to inappropriate contrastive signals. Concretely, they commonly generate augmented views based on random perturbation, which leads to biased essential structures due to the introduction of noise. In addition, they assign equal weight to both hard and easy sample pairs, thereby ignoring the difference in importance of the sample pairs. To address these issues, this study proposes a novel Contrastive Signal Generative Framework for Accurate Graph Learning (CSG2L) with the following two-fold ideas: a) building a singular value decomposition (SVD)-directed augmented module (SVD-aug) to obtain the global interactions as well as avoiding the random noise perturbation; b) designing a local-global dependency learning module (LGDL) with an adaptive reweighting strategy which can differentiate the effects of hard and easy sample pairs. Extensive experiments on benchmark datasets demonstrate that the proposed CSG2L outperforms the state-of-art baselines. Moreover, CSG2L is compatible with a variety of GNNs.
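The SVD-directed augmentation can be sketched generically: instead of randomly dropping edges, build the contrastive view from a rank-truncated SVD of the adjacency matrix, which keeps global interaction structure and discards noise-like detail. The code below is an illustration of that principle, not the CSG2L module itself:

```python
import numpy as np

def svd_augmented_view(adj: np.ndarray, rank: int) -> np.ndarray:
    """Rank-truncated SVD reconstruction of an adjacency matrix.

    Keeping only the top singular components retains global interaction
    structure, avoiding the biased essential structures that random edge
    perturbation introduces."""
    u, s, vt = np.linalg.svd(adj, full_matrices=False)
    return u[:, :rank] @ np.diag(s[:rank]) @ vt[:rank, :]

# Two disjoint cliques: the graph's structure is essentially rank-2,
# so a rank-2 view reproduces it exactly while a rank-1 view cannot.
block = np.ones((3, 3))
adj = np.zeros((6, 6))
adj[:3, :3] = block
adj[3:, 3:] = block
view2 = svd_augmented_view(adj, rank=2)
view1 = svd_augmented_view(adj, rank=1)
```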

[LG-26] NoEsis: Differentially Private Knowledge Transfer in Modular LLM Adaptation ICLR2025

链接: https://arxiv.org/abs/2504.18147
作者: Rob Romijnders,Stefanos Laskaridis,Ali Shahin Shamsabadi,Hamed Haddadi
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: ICLR 2025 MCDC workshop

点击查看摘要

Abstract:Large Language Models (LLM) are typically trained on vast amounts of data from various sources. Even when designed modularly (e.g., Mixture-of-Experts), LLMs can leak privacy on their sources. Conversely, training such models in isolation arguably prohibits generalization. To this end, we propose a framework, NoEsis, which builds upon the desired properties of modularity, privacy, and knowledge transfer. NoEsis integrates differential privacy with a hybrid two-staged parameter-efficient fine-tuning that combines domain-specific low-rank adapters, acting as experts, with common prompt tokens, acting as a knowledge-sharing backbone. Results from our evaluation on CodeXGLUE showcase that NoEsis can achieve provable privacy guarantees with tangible knowledge transfer across domains, and empirically show protection against Membership Inference Attacks. Finally, on code completion tasks, NoEsis bridges at least 77% of the accuracy gap between the non-shared and the non-private baseline.

[LG-27] Tree Boosting Methods for Balanced and Imbalanced Classification and their Robustness Over Time in Risk Assessment

链接: https://arxiv.org/abs/2504.18133
作者: Gissel Velarde,Michael Weichert,Anuj Deshmunkh,Sanjay Deshmane,Anindya Sudhir,Khushboo Sharma,Vaibhav Joshi
类目: Machine Learning (cs.LG)
*备注: 14 pages. arXiv admin note: text overlap with arXiv:2303.15218

点击查看摘要

Abstract:Most real-world classification problems deal with imbalanced datasets, posing a challenge for Artificial Intelligence (AI), i.e., machine learning algorithms, because the minority class, which is of extreme interest, often proves difficult to be detected. This paper empirically evaluates tree boosting methods’ performance given different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. For tabular data, tree-based methods such as XGBoost, stand out in several benchmarks due to detection performance and speed. Therefore, XGBoost and Imbalance-XGBoost are evaluated. After introducing the motivation to address risk assessment with machine learning, the paper reviews evaluation metrics for detection systems or binary classifiers. It proposes a method for data preparation followed by tree boosting methods including hyper-parameter optimization. The method is evaluated on private datasets of 1 thousand (K), 10K and 100K samples on distributions with 50, 45, 25, and 5 percent positive samples. As expected, the developed method increases its recognition performance as more data is given for training and the F1 score decreases as the data distribution becomes more imbalanced, but it is still significantly superior to the baseline of precision-recall determined by the ratio of positives divided by positives and negatives. Sampling to balance the training set does not provide consistent improvement and deteriorates detection. In contrast, classifier hyper-parameter optimization improves recognition, but should be applied carefully depending on data volume and distribution. Finally, the developed method is robust to data variation over time up to some point. Retraining can be used when performance starts deteriorating.
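Two quantities from this setup are worth making explicit: the precision-recall baseline the paper compares against (a classifier that flags everything as positive), and the `scale_pos_weight` heuristic commonly used to reweight the positive class in XGBoost. The numbers below illustrate the 5%-positive regime from the abstract:

```python
def naive_baseline_f1(n_pos, n_neg):
    """F1 of a classifier that predicts 'positive' for everything.

    Its precision equals the positive ratio P / (P + N) and its recall is 1;
    this is the baseline any useful detector must clear."""
    precision = n_pos / (n_pos + n_neg)
    recall = 1.0
    return 2 * precision * recall / (precision + recall)

def scale_pos_weight(n_pos, n_neg):
    """Common class-weight heuristic (negatives / positives) for tree
    boosting on imbalanced data, e.g. XGBoost's scale_pos_weight."""
    return n_neg / n_pos

# At 5% positives, the naive F1 floor is already below 0.1.
f1_5pct = naive_baseline_f1(n_pos=50, n_neg=950)
w_5pct = scale_pos_weight(n_pos=50, n_neg=950)
```

This makes the abstract's point concrete: as the distribution shifts from 50% to 5% positives, the floor that "significantly superior to the baseline" refers to drops sharply, so F1 comparisons only make sense relative to it.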

[LG-28] Score-Based Deterministic Density Sampling

链接: https://arxiv.org/abs/2504.18130
作者: Vasily Ilin,Bamdad Hosseini,Jingwei Hu
类目: Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We propose and analyze a deterministic sampling framework using Score-Based Transport Modeling (SBTM) for sampling an unnormalized target density \pi . While diffusion generative modeling relies on pre-training the score function \nabla \log f_t using samples from \pi , SBTM addresses the more general and challenging setting where only \nabla \log\pi is known. SBTM approximates the Wasserstein gradient flow on KL (f_t|\pi) by learning the time-varying score \nabla \log f_t on the fly using score matching. The learned score gives immediate access to relative Fisher information, providing a built-in convergence diagnostic. The deterministic trajectories are smooth, interpretable, and free of Brownian-motion noise, while having the same distribution as ULA. We prove that SBTM dissipates relative entropy at the same rate as the exact gradient flow, provided sufficient training. We further extend our framework to annealed dynamics, to handle non log-concave targets. Numerical experiments validate our theoretical findings: SBTM converges at the optimal rate, has smooth trajectories, and is easily integrated with annealed dynamics. We compare to the baselines of ULA and annealed ULA.
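The stochastic baseline that SBTM's deterministic trajectories are compared against, the Unadjusted Langevin Algorithm (ULA), fits in a few lines. A minimal sketch for a standard normal target, where the score \nabla\log\pi(x) = -x is known in closed form (parameters illustrative):

```python
import numpy as np

# ULA: x_{k+1} = x_k + eps * score(x_k) + sqrt(2 * eps) * noise.
# Target: standard normal, so the score is simply -x.
rng = np.random.default_rng(42)
score = lambda x: -x                         # grad log pi for pi = N(0, 1)

eps, n_particles, n_steps = 0.01, 5000, 1000
x = rng.normal(3.0, 0.1, size=n_particles)   # start far from the target
for _ in range(n_steps):
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.normal(size=n_particles)

mean, var = x.mean(), x.var()
```

The contrast with SBTM is in the last line of the update: replacing the injected Brownian noise with a learned score term \nabla\log f_t yields the smooth, deterministic trajectories the abstract describes, while targeting the same distribution.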

[LG-29] Think, Prune, Train, Improve: Scaling Reasoning without Scaling Models

链接: https://arxiv.org/abs/2504.18116
作者: Caia Costello,Simon Guo,Anna Goldie,Azalia Mirhoseini
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by limited high-quality training data. Synthetic data can be leveraged to enhance fine-tuning outcomes, but several factors influence this process, including model size, synthetic data volume, pruning strategy, and number of fine-tuning rounds. We explore these axes and investigate which conditions enable model self-improvement. We introduce the Think, Prune, Train process, a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data. This approach yields improved performance: on GSM8K, Gemma2-2B achieves a Pass@1 of 57.6% (from 41.9%), Gemma2-9B reaches 82%, matching LLaMA-3.1-70B, and LLaMA-3.1-70B attains 91%, even surpassing GPT-4o, demonstrating the effectiveness of self-generated reasoning and systematic data selection for improving LLM capabilities.

[LG-30] Temperature Estimation in Induction Motors using Machine Learning

链接: https://arxiv.org/abs/2504.18105
作者: Dinan Li,Panagiotis Kakosimos
类目: Machine Learning (cs.LG)
*备注: 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:The number of electrified powertrains is ever increasing today towards a more sustainable future; thus, it is essential that unwanted failures are prevented, and a reliable operation is secured. Monitoring the internal temperatures of motors and keeping them under their thresholds is an important first step. Conventional modeling methods require expert knowledge and complicated mathematical approaches. With all the data a modern electric drive collects nowadays during the system operation, it is feasible to apply data-driven approaches for estimating thermal behaviors. In this paper, multiple machine-learning methods are investigated on their capability to approximate the temperatures of the stator winding and bearing in induction motors. The explored algorithms vary from linear to neural networks. For this reason, experimental lab data have been captured from a powertrain under predetermined operating conditions. For each approach, a hyperparameter search is then performed to find the optimal configuration. All the models are evaluated by various metrics, and it has been found that neural networks perform satisfactorily even under transient conditions.

[LG-31] Subject-independent Classification of Meditative State from the Resting State using EEG

链接: https://arxiv.org/abs/2504.18095
作者: Jerrin Thomas Panachakel,Pradeep Kumar G.,Suryaa Seran,Kanishka Sharma,Ramakrishnan Angarai Ganesan
类目: Machine Learning (cs.LG)
*备注: © 2024 IEEE. Personal use of this material is permitted. 2024 IEEE 21st India Council International Conference (INDICON). IEEE, 2024

点击查看摘要

Abstract:While it is beneficial to objectively determine whether a subject is meditating, most research in the literature reports good results only in a subject-dependent manner. This study aims to distinguish the modified state of consciousness experienced during Rajyoga meditation from the resting state of the brain in a subject-independent manner using EEG data. Three architectures have been proposed and evaluated: The CSP-LDA Architecture utilizes common spatial pattern (CSP) for feature extraction and linear discriminant analysis (LDA) for classification. The CSP-LDA-LSTM Architecture employs CSP for feature extraction, LDA for dimensionality reduction, and long short-term memory (LSTM) networks for classification, modeling the binary classification problem as a sequence learning problem. The SVD-NN Architecture uses singular value decomposition (SVD) to select the most relevant components of the EEG signals and a shallow neural network (NN) for classification. The CSP-LDA-LSTM architecture gives the best performance with 98.2% accuracy for intra-subject classification. The SVD-NN architecture provides significant performance with 96.4% accuracy for inter-subject classification. This is comparable to the best-reported accuracies in the literature for intra-subject classification. Both architectures are capable of capturing subject-invariant EEG features for effectively classifying the meditative state from the resting state. The high intra-subject and inter-subject classification accuracies indicate these systems’ robustness and their ability to generalize across different subjects.

[LG-32] Reliable and Efficient Inverse Analysis using Physics-Informed Neural Networks with Distance Functions and Adaptive Weight Tuning

链接: https://arxiv.org/abs/2504.18091
作者: Shota Deguchi,Mitsuteru Asai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have attracted significant attention in scientific machine learning for their capability to solve forward and inverse problems governed by partial differential equations. However, the accuracy of PINN solutions is often limited by the treatment of boundary conditions. Conventional penalty-based methods, which incorporate boundary conditions as penalty terms in the loss function, cannot guarantee exact satisfaction of the given boundary conditions and are highly sensitive to the choice of penalty parameters. This paper demonstrates that distance functions, specifically R-functions, can be leveraged to enforce boundary conditions, overcoming these limitations. R-functions provide normalized distance fields, enabling accurate representation of boundary geometries, including non-convex domains, and facilitating various types of boundary conditions. We extend this distance function-based boundary condition imposition method to inverse problems using PINNs and introduce an adaptive weight tuning technique to ensure reliable and efficient inverse analysis. We demonstrate the efficacy of the method through several numerical experiments. Numerical results show that the proposed method solves inverse problems more accurately and efficiently than penalty-based methods, even in the presence of complex non-convex geometries. This approach offers a reliable and efficient framework for inverse analysis using PINNs, with potential applications across a wide range of engineering problems.
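The core trick of distance-function boundary imposition can be shown in one dimension: write the solution ansatz as u(x) = g(x) + phi(x) * N(x), where phi vanishes exactly on the boundary and g interpolates the prescribed values. A hedged sketch with a stand-in "network" N (the paper's R-functions handle far more general geometries):

```python
import numpy as np

# Exact Dirichlet boundary conditions on the domain [0, 1] via
# u(x) = g(x) + phi(x) * N(x), with phi = 0 on the boundary.
u0, u1 = 1.0, 3.0                          # prescribed boundary values

g = lambda x: u0 * (1 - x) + u1 * x        # boundary interpolant
phi = lambda x: x * (1 - x)                # distance-like function, 0 at 0 and 1
N = lambda x: np.sin(7 * x) + 2.0          # arbitrary stand-in for a network

u = lambda x: g(x) + phi(x) * N(x)

# Unlike a penalty term, the boundary values hold to machine precision,
# no matter what N predicts.
left, right = u(0.0), u(1.0)
```

This is why the construction sidesteps penalty-parameter tuning entirely: the boundary condition is built into the ansatz rather than softly encouraged by the loss.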

[LG-33] A Model Zoo on Phase Transitions in Neural Networks

链接: https://arxiv.org/abs/2504.18072
作者: Konstantin Schürholt,Léo Meynent,Yefan Zhou,Haiquan Lu,Yaoqing Yang,Damian Borth
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Using the weights of trained Neural Network (NN) models as data modality has recently gained traction as a research field - dubbed Weight Space Learning (WSL). Multiple recent works propose WSL methods to analyze models, evaluate methods, or synthesize weights. Weight space learning methods require populations of trained models as datasets for development and evaluation. However, existing collections of models - called 'model zoos' - are unstructured or follow a rudimentary definition of diversity. In parallel, work rooted in statistical physics has identified phases and phase transitions in NN models. Models are homogeneous within the same phase but qualitatively differ from one phase to another. We combine the idea of 'model zoos' with phase information to create a controlled notion of diversity in populations. We introduce 12 large-scale zoos that systematically cover known phases and vary over model architecture, size, and datasets. These datasets cover different modalities, such as computer vision, natural language processing, and scientific ML. For every model, we compute loss landscape metrics and validate full coverage of the phases. With this dataset, we provide the community with a resource with a wide range of potential applications for WSL and beyond. Evidence suggests the loss landscape phase plays a role in applications such as model training, analysis, or sparsification. We demonstrate this in an exploratory study of the downstream methods like transfer learning or model weights averaging.

[LG-34] Modes of Sequence Models and Learning Coefficients

链接: https://arxiv.org/abs/2504.18048
作者: Zhongtian Chen,Daniel Murfet
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop a geometric account of sequence modelling that links patterns in the data to measurable properties of the loss landscape in transformer networks. First, we cast conditional sequence distributions into a Hilbert-space framework and apply tensor decompositions to identify their principal modes. Truncating the small-amplitude modes yields an effective data distribution that preserves dominant structure while discarding statistical detail. Second, we show theoretically that Local Learning Coefficient (LLC) estimates are insensitive to modes below a data-dependent threshold. Consequently, the LLC calculated in practice characterises the geometry of the effective rather than the true distribution. This insight clarifies why reliable LLC estimates can be obtained even when a network parameter is not a strict minimiser of the population loss, and it highlights how the inverse temperature in SGLD acts as a resolution dial on the landscape structure.

[LG-35] TGDT: A Temporal Graph-based Digital Twin for Urban Traffic Corridors

链接: https://arxiv.org/abs/2504.18008
作者: Nooshin Yousefzadeh,Rahul Sengupta,Sanjay Ranka
类目: Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 1 table

点击查看摘要

Abstract:Urban congestion at signalized intersections leads to significant delays, economic losses, and increased emissions. Existing deep learning models often lack spatial generalizability, rely on complex architectures, and struggle with real-time deployment. To address these limitations, we propose the Temporal Graph-based Digital Twin (TGDT), a scalable framework that integrates Temporal Convolutional Networks and Attentional Graph Neural Networks for dynamic, direction-aware traffic modeling and assessment at urban corridors. TGDT estimates key Measures of Effectiveness (MOEs) for traffic flow optimization at both the intersection level (e.g., queue length, waiting time) and the corridor level (e.g., traffic volume, travel time). Its modular architecture and sequential optimization scheme enable easy extension to any number of intersections and MOEs. The model outperforms state-of-the-art baselines by accurately producing high-dimensional, concurrent multi-output estimates. It also demonstrates high robustness and accuracy across diverse traffic conditions, including extreme scenarios, while relying on only a minimal set of traffic features. Fully parallelized, TGDT can simulate over a thousand scenarios within a matter of seconds, offering a cost-effective, interpretable, and real-time solution for traffic signal optimization.

[LG-36] Self-Balancing Memory Efficient Dynamic Metric Space Data Maintenance for Rapid Multi-Kernel Estimation

链接: https://arxiv.org/abs/2504.18003
作者: Aditya S Ellendula,Chandrajit Bajaj
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a dynamic self-balancing octree data structure that enables efficient neighborhood maintenance in evolving metric spaces, a key challenge in modern machine learning systems. Many learning and generative models operate as dynamical systems whose representations evolve during training, requiring fast, adaptive spatial organization. Our two-parameter octree supports logarithmic-time updates and queries, eliminating the need for costly full rebuilds as data distributions shift. We demonstrate its effectiveness in four areas: (1) accelerating Stein variational gradient descent by supporting more particles with lower overhead; (2) enabling real-time, incremental KNN classification with logarithmic complexity; (3) facilitating efficient, dynamic indexing and retrieval for retrieval-augmented generation; and (4) improving sample efficiency by jointly optimizing input and latent spaces. Across all applications, our approach yields exponential speedups while preserving accuracy, particularly in high-dimensional spaces where maintaining adaptive spatial structure is critical.
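The flavor of incremental metric-space indexing can be sketched with a plain k-d tree supporting insertion and best-first k-NN search (a hypothetical simplification: this sketch has no self-balancing and none of the paper's two-parameter octree machinery):

```python
import heapq

class KDNode:
    __slots__ = ("pt", "left", "right", "axis")
    def __init__(self, pt, axis):
        self.pt, self.axis = pt, axis
        self.left = self.right = None

class KDTree:
    """Incremental k-d tree: a simplified stand-in for the paper's
    self-balancing octree (illustrative sketch, no rebalancing)."""
    def __init__(self, dims=3):
        self.root, self.dims = None, dims

    def insert(self, pt):
        if self.root is None:
            self.root = KDNode(pt, 0)
            return
        node = self.root
        while True:
            axis = node.axis
            side = "left" if pt[axis] < node.pt[axis] else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, KDNode(pt, (axis + 1) % self.dims))
                return
            node = child

    def knn(self, query, k):
        best = []  # max-heap via negated squared distances
        def visit(node):
            if node is None:
                return
            d2 = sum((a - b) ** 2 for a, b in zip(query, node.pt))
            if len(best) < k:
                heapq.heappush(best, (-d2, node.pt))
            elif d2 < -best[0][0]:
                heapq.heapreplace(best, (-d2, node.pt))
            diff = query[node.axis] - node.pt[node.axis]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            visit(near)
            # only descend the far side if it could still hold a closer point
            if len(best) < k or diff * diff < -best[0][0]:
                visit(far)
        visit(self.root)
        return [pt for _, pt in sorted(best, reverse=True)]

tree = KDTree(3)
for p in [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0), (0.0, 0.0, 1.0)]:
    tree.insert(p)
nearest = tree.knn((0.0, 0.0, 0.0), 2)
```

The paper's contribution is keeping such a structure balanced as the point set drifts; this static-insert version only shows the query side of the idea.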

[LG-37] Streaming Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving

链接: https://arxiv.org/abs/2504.17999
作者: Chang Xiao,Brenda Yang
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative conversational interfaces powered by large language models (LLMs) typically stream output token-by-token at a rate determined by computational budget, often neglecting actual human reading speeds and the cognitive load associated with the content. This mismatch frequently leads to inefficient use of computational resources. For example, in cloud-based services, streaming content faster than users can read appears unnecessary, resulting in wasted computational resources and potential delays for other users, particularly during peak usage periods. To address this issue, we propose an adaptive streaming method that dynamically adjusts the pacing of LLM streaming output in real-time based on inferred cognitive load. Our approach estimates the cognitive load associated with streaming content and strategically slows down the stream during complex or information-rich segments, thereby freeing computational resources for other users. Our statistical analysis of computational savings, combined with crowdsourced user studies, provides insights into the trade-offs between service efficiency and user satisfaction, demonstrating that our method can significantly reduce computational consumption by up to 16.8%. This context-aware computational resource management strategy presents a practical framework for enhancing system efficiency in cloud-based conversational AI interfaces without compromising user experience.
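As a toy illustration of the pacing idea, the sketch below maps a crude per-token cognitive-load proxy (word length, an assumption of this sketch; the paper's load estimator is more involved) onto an inter-token streaming delay, slowing down on information-rich spans:

```python
def pacing_delays(tokens, base_delay=0.05, max_delay=0.25):
    """Map a crude cognitive-load proxy (word length, capped at 1.0)
    to an inter-token streaming delay in seconds."""
    loads = [min(len(t) / 10.0, 1.0) for t in tokens]
    return [base_delay + load * (max_delay - base_delay) for load in loads]

easy = pacing_delays(["the", "cat", "sat"])  # short, low-load tokens
hard = pacing_delays(["electroencephalogram", "quantification"])
```

The extra delay on the `hard` span is exactly the slack a serving scheduler could hand to other users during peak load.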

[LG-38] Plug-and-Play Physics-informed Learning using Uncertainty Quantified Port-Hamiltonian Models

链接: https://arxiv.org/abs/2504.17966
作者: Kaiyuan Tan,Peilun Li,Jun Wang,Thomas Beckers
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages, 6 figures

点击查看摘要

Abstract:The ability to predict trajectories of surrounding agents and obstacles is a crucial component in many robotic applications. Data-driven approaches are commonly adopted for state prediction in scenarios where the underlying dynamics are unknown. However, the performance, reliability, and uncertainty of data-driven predictors become compromised when encountering out-of-distribution observations relative to the training data. In this paper, we introduce a Plug-and-Play Physics-Informed Machine Learning (PnP-PIML) framework to address this challenge. Our method employs conformal prediction to identify outlier dynamics and, in that case, switches from a nominal predictor to a physics-consistent model, namely distributed Port-Hamiltonian systems (dPHS). We leverage Gaussian processes to model the energy function of the dPHS, enabling not only the learning of system dynamics but also the quantification of predictive uncertainty through its Bayesian nature. In this way, the proposed framework produces reliable physics-informed predictions even for the out-of-distribution scenarios.
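A minimal sketch of the switching mechanism (assumptions: scalar nonconformity scores and a standard split-conformal quantile; the paper's dPHS physics model is replaced here by a placeholder callable):

```python
import math

def conformal_threshold(calib_scores, alpha=0.1):
    """Split-conformal threshold: the (1 - alpha) empirical quantile of
    calibration nonconformity scores (here, absolute residuals),
    using the standard rank ceil((n + 1) * (1 - alpha))."""
    s = sorted(calib_scores)
    n = len(s)
    k = min(math.ceil((n + 1) * (1 - alpha)) - 1, n - 1)
    return s[k]

def predict_with_fallback(x, nominal, physics, threshold, score):
    """Switch from the nominal data-driven predictor to the
    physics-consistent model when the score flags an outlier."""
    return physics(x) if score(x) > threshold else nominal(x)

thr = conformal_threshold([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], alpha=0.1)
choice_in = predict_with_fallback(0.5, lambda x: "nominal", lambda x: "physics", thr, lambda x: x)
choice_out = predict_with_fallback(2.0, lambda x: "nominal", lambda x: "physics", thr, lambda x: x)
```

Any predictor pair and scoring function can be dropped in; the conformal quantile is what gives the outlier flag its coverage guarantee.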

[LG-39] Mathematics of Continual Learning

链接: https://arxiv.org/abs/2504.17963
作者: Liangzu Peng,René Vidal
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning is an emerging subject in machine learning that aims to solve multiple tasks presented sequentially to the learner without forgetting previously learned tasks. Recently, many deep learning based approaches have been proposed for continual learning; however, the mathematical foundations behind existing continual learning methods remain underdeveloped. On the other hand, adaptive filtering is a classic subject in signal processing with a rich history of mathematically principled methods. However, its role in understanding the foundations of continual learning has been underappreciated. In this tutorial, we review the basic principles behind both continual learning and adaptive filtering, and present a comparative analysis that highlights multiple connections between them. These connections allow us to enhance the mathematical foundations of continual learning based on existing results for adaptive filtering, extend adaptive filtering insights using existing continual learning methods, and discuss a few research directions for continual learning suggested by the historical developments in adaptive filtering.
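The adaptive-filtering side of the connection can be made concrete with the classic least-mean-squares (LMS) update, whose per-sample correction is the signal-processing ancestor of the SGD steps used in continual learning (a self-contained toy, not taken from the tutorial itself):

```python
import random

def lms_step(w, x, d, mu=0.1):
    """One LMS update: y = w . x, e = d - y, w <- w + mu * e * x."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    e = d - y
    return [wi + mu * e * xi for wi, xi in zip(w, x)], e

# track an unknown target filter h = [1.0, -0.5] from streaming samples
random.seed(0)
h = [1.0, -0.5]
w = [0.0, 0.0]
errs = []
for _ in range(500):
    x = [random.uniform(-1, 1) for _ in range(2)]
    d = sum(hi * xi for hi, xi in zip(h, x))  # noiseless desired signal
    w, e = lms_step(w, x, d)
    errs.append(abs(e))
```

The filter adapts online to each new sample without revisiting old ones, which is the behavior continual learning seeks at much larger scale.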

[LG-40] CIVIL: Causal and Intuitive Visual Imitation Learning

链接: https://arxiv.org/abs/2504.17959
作者: Yinlong Dai,Robert Ramirez Sanchez,Ryan Jeronimus,Shahabedin Sagheb,Cara M. Nunez,Heramb Nemlekar,Dylan P. Losey
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Today’s robots learn new tasks by imitating human examples. However, this standard approach to visual imitation learning is fundamentally limited: the robot observes what the human does, but not why the human chooses those behaviors. Without understanding the features that factor into the human’s decisions, robot learners often misinterpret the data and fail to perform the task when the environment changes. We therefore propose a shift in perspective: instead of asking human teachers just to show what actions the robot should take, we also enable humans to indicate task-relevant features using markers and language prompts. Our proposed algorithm, CIVIL, leverages this augmented data to filter the robot’s visual observations and extract a feature representation that causally informs human actions. CIVIL then applies these causal features to train a transformer-based policy that emulates human behaviors without being confused by visual distractors. Our simulations, real-world experiments, and user study demonstrate that robots trained with CIVIL can learn from fewer human demonstrations and perform better than state-of-the-art baselines, especially in previously unseen scenarios. See videos at our project website: this https URL

[LG-41] Fishing for Phishers: Learning-Based Phishing Detection in Ethereum Transactions

链接: https://arxiv.org/abs/2504.17953
作者: Ahod Alghuried,Abdulaziz Alghamdi,Ali Alkinoon,Soohyeon Choi,Manar Mohaisen,David Mohaisen
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 23 pages, 6 tables, 5 figures

点击查看摘要

Abstract:Phishing detection on Ethereum has increasingly leveraged advanced machine learning techniques to identify fraudulent transactions. However, limited attention has been given to understanding the effectiveness of feature selection strategies and the role of graph-based models in enhancing detection accuracy. In this paper, we systematically examine these issues by analyzing and contrasting explicit transactional features and implicit graph-based features, both experimentally and analytically. We explore how different feature sets impact the performance of phishing detection models, particularly in the context of Ethereum’s transactional network. Additionally, we address key challenges such as class imbalance and dataset composition and their influence on the robustness and precision of detection methods. Our findings demonstrate the advantages and limitations of each feature type, while also providing a clearer understanding of how features affect model resilience and generalization in adversarial environments.
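The distinction between explicit transactional features and implicit graph-based ones can be illustrated with a few lines that derive per-address graph features from a raw edge list (the feature names below are illustrative, not the paper's exact set):

```python
from collections import defaultdict

def graph_features(edges):
    """Per-address graph features from an Ethereum-style edge list of
    (sender, receiver, value) tuples: degrees and transferred value."""
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    out_val, in_val = defaultdict(float), defaultdict(float)
    for s, r, v in edges:
        out_deg[s] += 1
        in_deg[r] += 1
        out_val[s] += v
        in_val[r] += v
    nodes = set(out_deg) | set(in_deg)
    return {n: {"in_deg": in_deg[n], "out_deg": out_deg[n],
                "in_val": in_val[n], "out_val": out_val[n]} for n in nodes}

feats = graph_features([("a", "b", 1.0), ("a", "c", 2.0), ("b", "c", 0.5)])
```

Explicit features (per-transaction value, gas, timestamp) come straight from each row; the implicit features above only exist once the rows are assembled into a graph.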

[LG-42] Causality-Driven Neural Network Repair: Challenges and Opportunities

链接: https://arxiv.org/abs/2504.17946
作者: Fatemeh Vares,Brittany Johnson
类目: Machine Learning (cs.LG)
*备注: Causality in Software Engineering (CauSE) 2025 Workshop at ESEC/FSE

点击查看摘要

Abstract:Deep Neural Networks (DNNs) often rely on statistical correlations rather than causal reasoning, limiting their robustness and interpretability. While testing methods can identify failures, effective debugging and repair remain challenging. This paper explores causal inference as an approach primarily for DNN repair, leveraging causal debugging, counterfactual analysis, and structural causal models (SCMs) to identify and correct failures. We discuss in what ways these techniques support fairness, adversarial robustness, and backdoor mitigation by providing targeted interventions. Finally, we discuss key challenges, including scalability, generalization, and computational efficiency, and outline future directions for integrating causality-driven interventions to enhance DNN reliability.

[LG-43] Machine Learning-Based Prediction of Quality Shifts on Video Streaming Over 5G

链接: https://arxiv.org/abs/2504.17938
作者: Raza Ul Mustafa,Sesha Dassanayake
类目: Multimedia (cs.MM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quality of Experience (QoE) is the user’s satisfaction while streaming a video session over an over-the-top (OTT) platform like YouTube. YouTube’s QoE reflects a smooth streaming session without any buffering or quality-shift events. One of the most important factors affecting YouTube’s QoE today is frequent shifts from higher to lower resolutions and vice versa. These shifts ensure a smooth streaming session; however, they may receive a lower mean opinion score. For instance, dropping from 1080p to 480p during a video can preserve continuity but might reduce the viewer’s enjoyment. Over time, OTT platforms have looked for alternative ways to boost user experience instead of relying on traditional Quality of Service (QoS) metrics such as bandwidth, latency, and throughput. As a result, we look into the relationship between quality shifting in YouTube streaming sessions and the channel metrics RSRP, RSRQ, and SNR. Our findings show that these channel metrics positively correlate with shifts. Thus, in real time, OTT providers can rely on them alone to classify video streaming sessions into lower- and higher-resolution categories, thereby providing more resources to improve user experience. Using traditional Machine Learning (ML) classifiers, we achieved an accuracy of 77% while using only RSRP, RSRQ, and SNR. In the era of 5G and beyond, where ultra-reliable, low-latency networks promise enhanced streaming capabilities, the proposed methodology can be used to improve OTT services.
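A minimal stand-in for the classification step: a nearest-centroid model over (RSRP, RSRQ, SNR) triples (the paper evaluates standard ML classifiers; the centroid rule and the toy values below are assumptions of this sketch):

```python
def nearest_centroid_fit(X, y):
    """Fit one centroid per class over (RSRP, RSRQ, SNR) feature vectors."""
    cents = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        cents[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return cents

def nearest_centroid_predict(cents, x):
    """Assign the class whose centroid is closest in squared distance."""
    return min(cents, key=lambda l: sum((a - b) ** 2 for a, b in zip(cents[l], x)))

# toy samples: strong channel -> "higher" resolution, weak -> "lower"
X = [(-80.0, -10.0, 20.0), (-82.0, -11.0, 18.0),
     (-110.0, -20.0, 2.0), (-108.0, -19.0, 3.0)]
y = ["higher", "higher", "lower", "lower"]
cents = nearest_centroid_fit(X, y)
```

The point is only that three channel metrics already separate the two session categories; the paper's 77% figure comes from standard classifiers on real traces.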

[LG-44] Optimized Approaches to Malware Detection: A Study of Machine Learning and Deep Learning Techniques

链接: https://arxiv.org/abs/2504.17930
作者: Abrar Fahim,Shamik Dey,Md. Nurul Absur,Md Kamrul Siam,Md. Tahmidul Huque,Jafreen Jafor Godhuli
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Digital systems find it challenging to keep up with cybersecurity threats. The daily emergence of more than 560,000 new malware strains poses significant hazards to the digital ecosystem. Traditional malware detection methods fail to operate properly and yield high false positive rates with low accuracy of the protection system. This study explores how malware can be detected using machine learning (ML) and deep learning (DL) approaches to address these shortcomings. It also includes a systematic comparison of the performance of several widely used ML models, such as random forest, multi-layer perceptron (MLP), and deep neural network (DNN), to determine their effectiveness against modern malware threats. We use a sizable dataset from Kaggle, which has undergone optimized feature selection and preprocessing to improve model performance. Our findings suggest that the DNN model outperformed the other traditional models with the highest training accuracy of 99.92% and an almost perfect AUC score. Furthermore, feature selection and preprocessing can help improve detection capabilities. This research makes an important contribution by analyzing model performance across key metrics and providing insight into the effectiveness of advanced detection techniques for building more robust and reliable cybersecurity solutions against growing malware threats.

[LG-45] CANet: ChronoAdaptive Network for Enhanced Long-Term Time Series Forecasting under Non-Stationarity

链接: https://arxiv.org/abs/2504.17913
作者: Mert Sonmezer,Seyda Ertekin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-term time series forecasting plays a pivotal role in various real-world applications. Despite recent advancements and the success of different architectures, forecasting is often challenging due to the non-stationary nature of real-world data, which frequently exhibit distribution shifts and temporal changes in statistical properties like mean and variance over time. Previous studies suggest that this inherent variability complicates forecasting, limiting the performance of many models by leading to loss of non-stationarity and resulting in over-stationarization (Liu, Wu, Wang and Long, 2022). To address this challenge, we introduce a novel architecture, ChronoAdaptive Network (CANet), inspired by style-transfer techniques. The core of CANet is the Non-stationary Adaptive Normalization module, seamlessly integrating the Style Blending Gate and Adaptive Instance Normalization (AdaIN) (Huang and Belongie, 2017). The Style Blending Gate preserves and reintegrates non-stationary characteristics, such as mean and standard deviation, by blending internal and external statistics, preventing over-stationarization while maintaining essential temporal dependencies. Coupled with AdaIN, which dynamically adapts the model to statistical changes, this approach enhances predictive accuracy under non-stationary conditions. CANet also employs multi-resolution patching to handle short-term fluctuations and long-term trends, along with Fourier analysis-based adaptive thresholding to reduce noise. A Stacked Kronecker Product Layer further optimizes the model’s efficiency while maintaining high performance. Extensive experiments on real-world datasets validate CANet’s superiority over state-of-the-art methods, achieving a 42% reduction in MSE and a 22% reduction in MAE. The source code is publicly available at this https URL.
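The core normalization idea can be sketched in a few lines: normalize a window, then reinject a gated blend of internal (window) and external (global) statistics, in the spirit of AdaIN plus a style-blending gate (a simplified scalar-series sketch, not CANet's actual module):

```python
def adain_blend(x, ext_mean, ext_std, gate=0.5, eps=1e-8):
    """Normalize a window to zero mean / unit std, then restyle it with
    a gated blend of internal and external statistics.
    gate=1.0 keeps only the window's own stats; gate=0.0 uses only
    the external ones."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    std = (var + eps) ** 0.5
    normed = [(v - mu) / std for v in x]
    b_mean = gate * mu + (1 - gate) * ext_mean
    b_std = gate * std + (1 - gate) * ext_std
    return [v * b_std + b_mean for v in normed]

# restyle a window entirely with external (e.g., training-set) statistics
styled = adain_blend([1.0, 2.0, 3.0], ext_mean=10.0, ext_std=2.0, gate=0.0)
```

Blending rather than discarding the window statistics is what keeps the non-stationary signal from being "over-stationarized" away.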

[LG-46] The use of Multi-domain Electroencephalogram Representations in the building of Models based on Convolutional and Recurrent Neural Networks for Epilepsy Detection

链接: https://arxiv.org/abs/2504.17908
作者: Luiz Antonio Nicolau Anghinoni,Gustavo Weber Denardin,Jadson Castro Gertrudes,Dalcimar Casanova,Jefferson Tales Oliva
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Epilepsy, affecting approximately 50 million people globally, is characterized by abnormal brain activity and remains challenging to treat. The diagnosis of epilepsy relies heavily on electroencephalogram (EEG) data, where specialists manually analyze epileptiform patterns across pre-ictal, ictal, post-ictal, and interictal periods. However, the manual analysis of EEG signals is prone to variability between experts, emphasizing the need for automated solutions. Although previous studies have explored preprocessing techniques and machine learning approaches for seizure detection, there is a gap in understanding how the representation of EEG data (time, frequency, or time-frequency domains) impacts the predictive performance of deep learning models. This work addresses this gap by systematically comparing deep neural networks trained on EEG data in these three domains. Through the use of statistical tests, we identify the optimal data representation and model architecture for epileptic seizure detection. The results demonstrate that frequency-domain data achieves detection metrics exceeding 97%, providing a robust foundation for more accurate and reliable seizure detection systems.
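The frequency-domain representation that performed best can be illustrated with a naive DFT magnitude spectrum (O(n^2) and stdlib-only for clarity; real pipelines would use an FFT and derive band-power features):

```python
import math

def dft_magnitudes(signal):
    """Magnitude spectrum of a real signal via a naive DFT,
    returning bins 0..n//2 (the one-sided spectrum)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(signal[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(signal[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

# a pure 5 Hz tone sampled at 64 Hz for 1 s should peak at bin 5
sig = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
mags = dft_magnitudes(sig)
```

For EEG, the same transform turns an ictal-period window into a spectral feature vector for the convolutional or recurrent model.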

[LG-47] Do We Need Transformers to Play FPS Video Games?

链接: https://arxiv.org/abs/2504.17891
作者: Karmanbir Batth,Krish Sethi,Aly Shariff,Leo Shi,Hetul Patel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we explore Transformer-based architectures for reinforcement learning in both online and offline settings within the Doom game environment. Our investigation focuses on two primary approaches: Deep Transformer Q-learning Networks (DTQN) for online learning and Decision Transformers (DT) for offline reinforcement learning. DTQN leverages the sequential modelling capabilities of Transformers to enhance Q-learning in partially observable environments, while Decision Transformers repurpose sequence modelling techniques to enable offline agents to learn from past trajectories without direct interaction with the environment. We conclude that while Transformers may have performed well in Atari games, more traditional methods outperform Transformer-based methods in both settings in the VizDoom environment.

[LG-48] High-Performance Reinforcement Learning on Spot: Optimizing Simulation Parameters with Distributional Measures

链接: https://arxiv.org/abs/2504.17857
作者: A. J Miller,Fangzhou Yu,Michael Brauckmann,Farbod Farshidian
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This work presents an overview of the technical details behind a high-performance reinforcement learning policy deployment with the Spot RL Researcher Development Kit for low-level motor access on Boston Dynamics Spot. This represents the first public demonstration of an end-to-end reinforcement learning policy deployed on Spot hardware, with training code publicly available through Nvidia IsaacLab and deployment code available through Boston Dynamics. We utilize Wasserstein Distance and Maximum Mean Discrepancy to quantify the distributional dissimilarity of data collected on hardware and in simulation to measure our sim2real gap. We use these measures as a scoring function for the Covariance Matrix Adaptation Evolution Strategy to optimize simulated parameters that are unknown or difficult to measure from Spot. Our procedure for modeling and training produces high-quality reinforcement learning policies capable of multiple gaits, including a flight phase. We deploy policies capable of over 5.2 m/s locomotion, more than triple Spot’s default controller maximum speed, robustness to slippery surfaces, disturbance rejection, and overall agility previously unseen on Spot. We detail our method and release our code to support future work on Spot with the low-level API.
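One of the two gap measures, squared MMD with an RBF kernel, is easy to sketch for scalar samples (a 1-D, biased-estimator simplification of the kind of sim-to-real dissimilarity score the paper feeds to CMA-ES):

```python
import math

def rbf_mmd2(xs, ys, gamma=1.0):
    """Biased squared Maximum Mean Discrepancy between two scalar
    sample sets under an RBF kernel k(a, b) = exp(-gamma * (a - b)^2)."""
    k = lambda a, b: math.exp(-gamma * (a - b) ** 2)
    kxx = sum(k(a, b) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(k(a, b) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(k(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

sim = [0.0, 0.1, 0.2]      # e.g., simulated joint velocities
close = [0.05, 0.15, 0.25] # hardware data from a well-matched sim
far = [2.0, 2.1, 2.2]      # hardware data from a badly-matched sim
```

Simulation parameters that shrink this score (and the Wasserstein counterpart) are exactly what the evolution strategy searches for.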

[LG-49] OmniSage: Large Scale Multi-Entity Heterogeneous Graph Representation Learning

链接: https://arxiv.org/abs/2504.17811
作者: Anirudhan Badrinath,Alex Yang,Kousik Rajesh,Prabhat Agarwal,Jaewon Yang,Haoyu Chen,Jiajing Xu,Charles Rosenberg
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Representation learning, a task of learning latent vectors to represent entities, is a key task in improving search and recommender systems in web applications. Various representation learning methods have been developed, including graph-based approaches for relationships among entities, sequence-based methods for capturing the temporal evolution of user activities, and content-based models for leveraging text and visual content. However, the development of a unifying framework that integrates these diverse techniques to support multiple applications remains a significant challenge. This paper presents OmniSage, a large-scale representation framework that learns universal representations for a variety of applications at Pinterest. OmniSage integrates graph neural networks with content-based models and user sequence models by employing multiple contrastive learning tasks to effectively process graph data, user sequence data, and content signals. To support the training and inference of OmniSage, we developed an efficient infrastructure capable of supporting Pinterest graphs with billions of nodes. The universal representations generated by OmniSage have significantly enhanced user experiences on Pinterest, leading to an approximate 2.5% increase in sitewide repins (saves) across five applications. This paper highlights the impact of unifying representation learning methods, and we will open source the OmniSage code by the time of publication.

[LG-50] Near-Driven Autonomous Rover Navigation in Complex Environments: Extensions to Urban Search-and-Rescue and Industrial Inspection

链接: https://arxiv.org/abs/2504.17794
作者: Dhadkan Shrestha,Lincoln Bhattarai
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:This paper explores the use of an extended neuroevolutionary approach, based on NeuroEvolution of Augmenting Topologies (NEAT), for autonomous robots in dynamic environments associated with hazardous tasks like firefighting, urban search-and-rescue (USAR), and industrial inspections. Building on previous research, it expands the simulation environment to larger and more complex settings, demonstrating NEAT’s adaptability across different applications. By integrating recent advancements in NEAT and reinforcement learning, the study uses modern simulation frameworks for realism and hybrid algorithms for optimization. Experimental results show that NEAT-evolved controllers achieve success rates comparable to state-of-the-art deep reinforcement learning methods, with superior structural adaptability. The agents reached ~80% success in outdoor tests, surpassing baseline models. The paper also highlights the benefits of transfer learning among tasks and evaluates the effectiveness of NEAT in complex 3D navigation. Contributions include evaluating NEAT for diverse autonomous applications and discussing real-world deployment considerations, emphasizing the approach’s potential as an alternative or complement to deep reinforcement learning in autonomous navigation tasks.

[LG-51] Persistence of Backdoor-based Watermarks for Neural Networks: A Comprehensive Evaluation

链接: https://arxiv.org/abs/2501.02704
作者: Anh Tu Ngo,Chuan Song Heng,Nandish Chattopadhyay,Anupam Chattopadhyay
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Multimedia (cs.MM)
*备注: Preprint. Under Review

点击查看摘要

Abstract:Deep Neural Networks (DNNs) have gained considerable traction in recent years due to the unparalleled results they deliver. However, the cost of training such sophisticated models is resource intensive, leading many to consider DNNs intellectual property (IP) belonging to the model owners. In this era of cloud computing, high-performance DNNs are often deployed all over the internet so that people can access them publicly. As such, DNN watermarking schemes, especially backdoor-based watermarks, have been actively developed in recent years to preserve proprietary rights. Nonetheless, much uncertainty remains about the robustness of existing backdoor watermark schemes against both adversarial attacks and unintended modifications such as fine-tuning of neural network models. One reason for this is that no complete guarantee of robustness can be assured in the context of backdoor-based watermarks. In this paper, we extensively evaluate the persistence of recent backdoor-based watermarks within neural networks under fine-tuning, and we propose a novel data-driven approach to restore the watermark after fine-tuning without exposing the trigger set. Our empirical results show that by solely introducing training data after fine-tuning, the watermark can be restored if model parameters do not shift dramatically during fine-tuning. Depending on the types of trigger samples used, trigger accuracy can be reinstated to up to 100%. Our study further explores how the restoration process works using loss landscape visualization, as well as the idea of introducing training data in the fine-tuning stage to alleviate watermark vanishing.

[LG-52] Representation Learning for Distributional Perturbation Extrapolation ICLR

链接: https://arxiv.org/abs/2504.18522
作者: Julius von Kügelgen,Jakob Ketterer,Xinwei Shen,Nicolai Meinshausen,Jonas Peters
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint; work presented at the ICLR Workshop on Learning Meaningful Representations of Life

点击查看摘要

Abstract:We consider the problem of modelling the effects of unseen perturbations such as gene knockdowns or drug combinations on low-level measurements such as RNA sequencing data. Specifically, given data collected under some perturbations, we aim to predict the distribution of measurements for new perturbations. To address this challenging extrapolation task, we posit that perturbations act additively in a suitable, unknown embedding space. More precisely, we formulate the generative process underlying the observed data as a latent variable model, in which perturbations amount to mean shifts in latent space and can be combined additively. Unlike previous work, we prove that, given sufficiently diverse training perturbations, the representation and perturbation effects are identifiable up to affine transformation, and use this to characterize the class of unseen perturbations for which we obtain extrapolation guarantees. To estimate the model from data, we propose a new method, the perturbation distribution autoencoder (PDAE), which is trained by maximising the distributional similarity between true and predicted perturbation distributions. The trained model can then be used to predict previously unseen perturbation distributions. Empirical evidence suggests that PDAE compares favourably to existing methods and baselines at predicting the effects of unseen perturbations.
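The additive-shift assumption can be made concrete with a toy latent model: each perturbation shifts the latent mean, combinations add, and an affine decoder maps latent means to observation means (all names and numbers below are illustrative, not the PDAE itself):

```python
def predict_perturbed_mean(base, shifts, combo, decode):
    """Additive-shift latent model: start from the base latent mean,
    add one shift vector per applied perturbation, then decode."""
    z = list(base)
    for p in combo:
        z = [zi + si for zi, si in zip(z, shifts[p])]
    return decode(z)

# toy affine decoder: x = A z + b with A = diag(2, 3), b = (1, 1)
decode = lambda z: (2 * z[0] + 1, 3 * z[1] + 1)

# shifts learned from single-perturbation data (hypothetical values)
shifts = {"drugA": (1.0, 0.0), "drugB": (0.0, 1.0)}

# extrapolate to the unseen combination drugA + drugB
combo_mean = predict_perturbed_mean((0.0, 0.0), shifts, ["drugA", "drugB"], decode)
```

The paper's identifiability result is what licenses this step: under sufficiently diverse training perturbations, the latent shifts are pinned down up to an affine transformation, so unseen additive combinations can be decoded reliably.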

[LG-53] Enhancing Visual Interpretability and Explainability in Functional Survival Trees and Forests

链接: https://arxiv.org/abs/2504.18498
作者: Giuseppe Loffredo,Elvira Romano,Fabrizio MAturo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Functional survival models are key tools for analyzing time-to-event data with complex predictors, such as functional or high-dimensional inputs. Despite their predictive strength, these models often lack interpretability, which limits their value in practical decision-making and risk analysis. This study investigates two key survival models: the Functional Survival Tree (FST) and the Functional Random Survival Forest (FRSF). It introduces novel methods and tools to enhance the interpretability of FST models and improve the explainability of FRSF ensembles. Using both real and simulated datasets, the results demonstrate that the proposed approaches yield efficient, easy-to-understand decision trees that accurately capture the underlying decision-making processes of the model ensemble.

[LG-54] Generalization Guarantees for Multi-View Representation Learning and Application to Regularization via Gaussian Product Mixture Prior

链接: https://arxiv.org/abs/2504.18455
作者: Milad Sefidgaran,Abdellatif Zaidi,Piotr Krasnowski
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: arXiv admin note: substantial text overlap with arXiv:2502.15540

点击查看摘要

Abstract:We study the problem of distributed multi-view representation learning. In this problem, K agents each observe one distinct, possibly statistically correlated, view and independently extract from it a suitable representation in a manner that a decoder that gets all K representations estimates correctly the hidden label. In the absence of any explicit coordination between the agents, a central question is: what should each agent extract from its view that is necessary and sufficient for a correct estimation at the decoder? In this paper, we investigate this question from a generalization error perspective. First, we establish several generalization bounds in terms of the relative entropy between the distribution of the representations extracted from training and “test” datasets and a data-dependent symmetric prior, i.e., the Minimum Description Length (MDL) of the latent variables for all views and training and test datasets. Then, we use the obtained bounds to devise a regularizer; and investigate in depth the question of the selection of a suitable prior. In particular, we show and conduct experiments that illustrate that our data-dependent Gaussian mixture priors with judiciously chosen weights lead to good performance. For single-view settings (i.e., K=1 ), our experimental results are shown to outperform the existing prior-art Variational Information Bottleneck (VIB) and Category-Dependent VIB (CDVIB) approaches. Interestingly, we show that a weighted attention mechanism emerges naturally in this setting. Finally, for the multi-view setting, we show that the selection of the joint prior as a Gaussian product mixture induces a Gaussian mixture marginal prior for each marginal view and implicitly encourages the agents to extract and output redundant features, a finding which is somewhat counter-intuitive.

[LG-55] Enhanced Sampling Public Dataset and Generative Model for Drug-Protein Dissociation Dynamics

链接: https://arxiv.org/abs/2504.18367
作者: Maodong Li,Jiying Zhang,Bin Feng,Wenqi Zeng,Dechin Chen,Zhijun Pan,Yu Li,Zijing Liu,Yi Isaac Yang
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注: The code will be accessed from our GitHub repository this https URL

点击查看摘要

Abstract:Drug-protein binding and dissociation dynamics are fundamental to understanding molecular interactions in biological systems. While many tools for drug-protein interaction studies have emerged, especially artificial intelligence (AI)-based generative models, predictive tools on binding/dissociation kinetics and dynamics are still limited. We propose a novel research paradigm that combines molecular dynamics (MD) simulations, enhanced sampling, and AI generative models to address this issue. We propose an enhanced sampling strategy to efficiently implement the drug-protein dissociation process in MD simulations and estimate the free energy surface (FES). We constructed a program pipeline of MD simulations based on this sampling strategy, thus generating a dataset including 26,612 drug-protein dissociation trajectories containing about 13 million frames. We named this dissociation dynamics dataset DD-13M and used it to train a deep equivariant generative model UnbindingFlow, which can generate collision-free dissociation trajectories. The DD-13M database and UnbindingFlow model represent a significant advancement in computational structural biology, and we anticipate its broad applicability in machine learning studies of drug-protein interactions. Our ongoing efforts focus on expanding this methodology to encompass a broader spectrum of drug-protein complexes and exploring novel applications in pathway prediction.
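摘要中提到的自由能面(FES)估计,最简形式是从采样密度出发的教科书关系 F(x) = -kT ln p(x)。下面是一个基于合成一维高斯"集体变量"的最小 numpy 示意(论文的实际流程从增强采样 MD 轨迹中得到 p(x),并涉及此处省略的重加权细节):

```python
import numpy as np

# Toy 1-D "collective variable" sampled from a known equilibrium density;
# in the paper's setting p(x) would come from enhanced-sampling MD trajectories.
rng = np.random.default_rng(0)
kT = 1.0
samples = rng.normal(loc=0.0, scale=1.0, size=200_000)   # p(x) ∝ exp(-x^2 / 2)

# Free energy surface from the sampled density: F(x) = -kT * ln p(x).
hist, edges = np.histogram(samples, bins=60, range=(-3, 3), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
F = -kT * np.log(hist)
F -= F.min()                # conventional choice: zero at the free-energy minimum
```

对这个高斯玩具体系,恢复出的 F(x) 应接近 x²/2;真实的解离动力学需要在高维坐标上做增强采样才能得到可靠的 FES。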

[LG-56] Post-Transfer Learning Statistical Inference in High-Dimensional Regression

链接: https://arxiv.org/abs/2504.18212
作者: Nguyen Vu Khai Tam,Cao Huyen My,Vo Nguyen Le Duy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transfer learning (TL) for high-dimensional regression (HDR) is an important problem in machine learning, particularly when dealing with limited sample sizes in the target task. However, there is currently no method to quantify the statistical significance of the relationship between features and the response in TL-HDR settings. In this paper, we introduce a novel statistical inference framework for assessing the reliability of feature selection in TL-HDR, called PTL-SI (Post-TL Statistical Inference). The core contribution of PTL-SI is its ability to provide valid p-values for features selected in TL-HDR, thereby rigorously controlling the false positive rate (FPR) at a desired significance level \alpha (e.g., 0.05). Furthermore, we enhance statistical power by incorporating a strategic divide-and-conquer approach into our framework. We demonstrate the validity and effectiveness of the proposed PTL-SI through extensive experiments on both synthetic and real-world high-dimensional datasets, confirming its theoretical properties and utility in testing the reliability of feature selection in TL scenarios.

[LG-57] Learning Operators by Regularized Stochastic Gradient Descent with Operator-valued Kernels

链接: https://arxiv.org/abs/2504.18184
作者: Jia-Qi Yang,Lei Shi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA); Statistics Theory (math.ST)
*备注: 56 pages, 2 figures

点击查看摘要

Abstract:This paper investigates regularized stochastic gradient descent (SGD) algorithms for estimating nonlinear operators from a Polish space to a separable Hilbert space. We assume that the regression operator lies in a vector-valued reproducing kernel Hilbert space induced by an operator-valued kernel. Two significant settings are considered: an online setting with polynomially decaying step sizes and regularization parameters, and a finite-horizon setting with constant step sizes and regularization parameters. We introduce regularity conditions on the structure and smoothness of the target operator and the input random variables. Under these conditions, we provide a dimension-free convergence analysis for the prediction and estimation errors, deriving both expectation and high-probability error bounds. Our analysis demonstrates that these convergence rates are nearly optimal. Furthermore, we present a new technique for deriving bounds with high probability for general SGD schemes, which also ensures almost-sure convergence. Finally, we discuss potential extensions to more general operator-valued kernels and the encoder-decoder framework.
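摘要中的在线设定(步长与正则化参数多项式衰减)可以用标量核的特例来示意。注意:论文处理的是算子值核与向量值 RKHS,下面仅是标量 RBF 核下的草图,超参数均为示意性假设:

```python
import numpy as np

def rbf(c, x, gamma=2.0):
    return np.exp(-gamma * (c - x) ** 2)

def kernel_sgd(xs, ys, eta0=0.5, lam0=0.1, theta=0.5):
    """Online regularized kernel SGD (scalar special case): at step t,
    f <- f - eta_t * ((f(x_t) - y_t) * K(x_t, .) + lam_t * f),
    with polynomially decaying eta_t = eta0 / t**theta, lam_t = lam0 / t**theta."""
    centers = np.empty(0)
    coefs = np.empty(0)
    for t, (x, y) in enumerate(zip(xs, ys), start=1):
        eta, lam = eta0 / t ** theta, lam0 / t ** theta
        pred = np.dot(coefs, rbf(centers, x))     # f_t(x_t)
        coefs = (1.0 - eta * lam) * coefs          # Tikhonov shrinkage of the iterate
        centers = np.append(centers, x)
        coefs = np.append(coefs, -eta * (pred - y))
    return lambda z: np.dot(coefs, rbf(centers, z))

rng = np.random.default_rng(1)
xs = rng.uniform(-1.0, 1.0, size=1000)
ys = np.sin(3.0 * xs) + 0.1 * rng.normal(size=1000)
f = kernel_sgd(xs, ys)
grid = np.linspace(-1.0, 1.0, 50)
mse = float(np.mean((np.array([f(z) for z in grid]) - np.sin(3.0 * grid)) ** 2))
```

论文的贡献在于将此类一趟式(one-pass)方案推广到算子值核,并给出与维数无关的收敛速率;上述草图只展示迭代本身的形式。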

[LG-58] Lecture Notes on Normalizing Flows for Lattice Quantum Field Theories

链接: https://arxiv.org/abs/2504.18126
作者: Miranda C. N. Cheng,Niki Stratikopoulou
类目: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 70 pages

点击查看摘要

Abstract:Numerical simulations of quantum field theories on lattices serve as a fundamental tool for studying the non-perturbative regime of the theories, where analytic tools often fall short. Challenges arise when one takes the continuum limit or as the system approaches a critical point, especially in the presence of non-trivial topological structures in the theory. Rapid recent advances in machine learning provide a promising avenue for progress in this area. These lecture notes aim to give a brief account of lattice field theories, normalizing flows, and how the latter can be applied to study the former. The notes are based on the lectures given by the first author in various recent research schools.
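归一化流的基本构件可以用单个仿射耦合层来示意:它的逆变换和对数雅可比行列式都很廉价,这正是流模型易于处理的原因。下面是最小 numpy 草图(真实的格点场论流模型用神经网络替换这里的玩具条件函数,并需要尊重理论的对称性):

```python
import numpy as np

def coupling_forward(x, shift, log_scale):
    """One affine coupling layer: transform half of the variables conditioned
    on the other half; the log-det-Jacobian is just the sum of log scales."""
    x1, x2 = x[: len(x) // 2], x[len(x) // 2:]
    s, t = log_scale(x1), shift(x1)
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2]), float(np.sum(s))

def coupling_inverse(y, shift, log_scale):
    """Exact inverse: recompute s, t from the untouched half."""
    y1, y2 = y[: len(y) // 2], y[len(y) // 2:]
    s, t = log_scale(y1), shift(y1)
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

# Toy conditioners (a real flow uses neural networks here).
shift = lambda h: 0.5 * h
log_scale = lambda h: 0.1 * h

x = np.array([1.0, -2.0, 0.5, 3.0])
y, logdet = coupling_forward(x, shift, log_scale)
x_back = coupling_inverse(y, shift, log_scale)
```

堆叠多个这样的层(并在层间交换被变换的一半)即可从简单先验分布生成格点场构型,同时精确跟踪概率密度。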

[LG-59] Bayesian Quantum Orthogonal Neural Networks for Anomaly Detection

链接: https://arxiv.org/abs/2504.18103
作者: Natansh Mathur,Brian Coyle,Nishant Jain,Snehal Raj,Akshat Tandon,Jasper Simon Krauser,Rainer Stoessel
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 14 pages, 9 figures

点击查看摘要

Abstract:Identification of defects or anomalies in 3D objects is a crucial task to ensure correct functionality. In this work, we combine Bayesian learning with recent developments in quantum and quantum-inspired machine learning, specifically orthogonal neural networks, to tackle this anomaly detection problem for an industrially relevant use case. Bayesian learning enables uncertainty quantification of predictions, while orthogonality in weight matrices enables smooth training. We develop orthogonal (quantum) versions of 3D convolutional neural networks and show that these models can successfully detect anomalies in 3D objects. To test the feasibility of incorporating quantum computers into a quantum-enhanced anomaly detection pipeline, we perform hardware experiments with our models on IBM’s 127-qubit Brisbane device, testing the effect of noise and limited measurement shots.

[LG-60] Non-identifiability distinguishes Neural Networks among Parametric Models

链接: https://arxiv.org/abs/2504.18017
作者: Sourav Chatterjee,Timothy Sudijono
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 pages. Comments welcome

点击查看摘要

Abstract:One of the enduring problems surrounding neural networks is to identify the factors that differentiate them from traditional statistical models. We prove a pair of results which distinguish feedforward neural networks among parametric models at the population level, for regression tasks. Firstly, we prove that for any pair of random variables (X, Y), neural networks always learn a nontrivial relationship between X and Y, if one exists. Secondly, we prove that for reasonable smooth parametric models, under local and global identifiability conditions, there exists a nontrivial (X, Y) pair for which the parametric model learns the constant predictor \mathbb{E}[Y]. Together, our results suggest that a lack of identifiability distinguishes neural networks among the class of smooth parametric models.

[LG-61] A computational model of infant sensorimotor exploration in the mobile paradigm

链接: https://arxiv.org/abs/2504.17939
作者: Josua Spisak,Sergiu Tcaci Popescu,Stefan Wermter,Matej Hoffmann,J. Kevin O’Regan
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 16 pages, 16 figures

点击查看摘要

Abstract:We present a computational model of the mechanisms that may determine infants’ behavior in the “mobile paradigm”. This paradigm has been used in developmental psychology to explore how infants learn the sensory effects of their actions. In this paradigm, a mobile (an articulated and movable object hanging above an infant’s crib) is connected to one of the infant’s limbs, prompting the infant to preferentially move that “connected” limb. This ability to detect a “sensorimotor contingency” is considered to be a foundational cognitive ability in development. To understand how infants learn sensorimotor contingencies, we built a model that attempts to replicate infant behavior. Our model incorporates a neural network, action-outcome prediction, exploration, motor noise, preferred activity level, and biologically-inspired motor control. We find that simulations with our model replicate the classic findings in the literature showing preferential movement of the connected limb. An interesting observation is that the model sometimes exhibits a burst of movement after the mobile is disconnected, casting light on a similar occasional finding in infants. In addition to these general findings, the simulations also replicate data from two recent more detailed studies using a connection with the mobile that was either gradual or all-or-none. A series of ablation studies further shows that the inclusion of mechanisms of action-outcome prediction, exploration, motor noise, and biologically-inspired motor control was essential for the model to correctly replicate infant behavior. This suggests that these components are also involved in infants’ sensorimotor learning.

[LG-62] SOFARI-R: High-Dimensional Manifold-Based Inference for Latent Responses

链接: https://arxiv.org/abs/2504.17874
作者: Zemin Zheng,Xin Zhou,Jinchi Lv
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 90 pages, 2 figures

点击查看摘要

Abstract:Data reduction with uncertainty quantification plays a key role in various multi-task learning applications, where large numbers of responses and features are present. To this end, a general framework of high-dimensional manifold-based SOFAR inference (SOFARI) was introduced recently in Zheng, Zhou, Fan and Lv (2024) for interpretable multi-task learning inference focusing on the left factor vectors and singular values exploiting the latent singular value decomposition (SVD) structure. Yet, designing a valid inference procedure on the latent right factor vectors is not straightforward from that of the left ones and can be even more challenging due to asymmetry of left and right singular vectors in the response matrix. To tackle these issues, in this paper we suggest a new method of high-dimensional manifold-based SOFAR inference for latent responses (SOFARI-R), where two variants of SOFARI-R are introduced. The first variant deals with strongly orthogonal factors by coupling left singular vectors with the design matrix and then appropriately rescaling them to generate new Stiefel manifolds. The second variant handles the more general weakly orthogonal factors by employing the hard-thresholded SOFARI estimates and delicately incorporating approximation errors into the distribution. Both variants produce bias-corrected estimators for the latent right factor vectors that enjoy asymptotically normal distributions with justified asymptotic variance estimates. We demonstrate the effectiveness of the newly suggested method using extensive simulation studies and an economic application.

[LG-63] Learning Enhanced Ensemble Filters

链接: https://arxiv.org/abs/2504.17836
作者: Eviatar Bach,Ricardo Baptista,Edoardo Calvello,Bohan Chen,Andrew Stuart
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Computational Physics (physics.comp-ph)
*备注: Preprint submitted to Journal of Computational Physics

点击查看摘要

Abstract:The filtering distribution in hidden Markov models evolves according to the law of a mean-field model in state–observation space. The ensemble Kalman filter (EnKF) approximates this mean-field model with an ensemble of interacting particles, employing a Gaussian ansatz for the joint distribution of the state and observation at each observation time. These methods are robust, but the Gaussian ansatz limits accuracy. This shortcoming is addressed by approximating the mean-field evolution using a novel form of neural operator taking probability distributions as input: a Measure Neural Mapping (MNM). An MNM is used to design a novel approach to filtering, the MNM-enhanced ensemble filter (MNMEF), which is defined in both the mean-field limit and for interacting ensemble particle approximations. The ensemble approach uses empirical measures as input to the MNM and is implemented using the set transformer, which is invariant to ensemble permutation and allows for different ensemble sizes. The derivation of methods from a mean-field formulation allows a single parameterization of the algorithm to be deployed at different ensemble sizes. In practice, fine-tuning of a small number of parameters, for specific ensemble sizes, further enhances the accuracy of the scheme. The promise of the approach is demonstrated by its superior root-mean-square-error performance relative to leading methods in filtering the Lorenz 96 and Kuramoto-Sivashinsky models.
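作为背景,MNM 方法所推广的高斯拟设分析步,可以用随机(扰动观测)EnKF 更新来示意。下面是最小 numpy 草图,只展示经典 EnKF 基线,并非论文提出的 MNMEF:

```python
import numpy as np

def enkf_analysis(ensemble, y, H, R, rng):
    """Stochastic (perturbed-observation) EnKF analysis step under the
    Gaussian ansatz: sample covariances define a Kalman gain that maps
    each particle toward the observation."""
    N = ensemble.shape[0]
    Hx = ensemble @ H.T                                    # predicted observations
    y_pert = y + rng.multivariate_normal(np.zeros(len(y)), R, size=N)
    X = ensemble - ensemble.mean(axis=0)
    Y = Hx - Hx.mean(axis=0)
    Pxy = X.T @ Y / (N - 1)                                # state-obs covariance
    Pyy = Y.T @ Y / (N - 1) + R                            # obs covariance
    K = Pxy @ np.linalg.inv(Pyy)                           # Kalman gain
    return ensemble + (y_pert - Hx) @ K.T

rng = np.random.default_rng(0)
truth = np.array([1.0, -2.0])
H = np.eye(2)
R = 0.05 * np.eye(2)
ens = 2.0 * rng.normal(size=(200, 2))                      # broad prior ensemble
y = truth + rng.multivariate_normal(np.zeros(2), R)        # noisy observation
post = enkf_analysis(ens, y, H, R, rng)
```

当状态-观测联合分布非高斯时,这种基于样本协方差的增益会产生偏差;论文用以概率分布为输入的神经算子来逼近均值场演化,以突破该限制。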

信息检索

[IR-0] An Empirical Study of Evaluating Long-form Question Answering

链接: https://arxiv.org/abs/2504.18413
作者: Ning Xian,Yixing Fan,Ruqing Zhang,Maarten de Rijke,Jiafeng Guo
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Long-form question answering (LFQA) aims to generate lengthy answers to complex questions. This scenario presents great flexibility as well as significant challenges for evaluation. Most evaluations rely on deterministic metrics that depend on string or n-gram matching, while the reliability of large language model-based evaluations for long-form answers remains relatively unexplored. We address this gap by conducting an in-depth study of long-form answer evaluation with the following research questions: (i) To what extent do existing automatic evaluation metrics serve as a substitute for human evaluations? (ii) What are the limitations of existing evaluation metrics compared to human evaluations? (iii) How can the effectiveness and robustness of existing evaluation methods be improved? We collect 5,236 factoid and non-factoid long-form answers generated by different large language models and conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness. Subsequently, we investigate the performance of automatic evaluation metrics by evaluating these answers, analyzing the consistency between these metrics and human evaluations. We find that the style and length of the answers, as well as the category of questions, can bias the automatic evaluation metrics. However, fine-grained evaluation helps mitigate this issue on some metrics. Our findings have important implications for the use of large language models for evaluating long-form question answering. All code and datasets are available at this https URL.
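摘要中提到的"基于字符串匹配的确定性指标",可以用词元重叠 F1 来示意(这是一种常见基线;论文实际比较的指标集合可能不同):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1: a deterministic string-matching metric of the
    kind the study compares against human judgments."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("the capital of France is Paris",
                 "Paris is the capital city of France")
```

注意这类指标对答案长度天然敏感:冗长但正确的答案会因精确率下降而被压低分数,这正是摘要中所说"风格与长度会给自动指标带来偏差"的一个来源。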

[IR-1] Leveraging Decoder Architectures for Learned Sparse Retrieval

链接: https://arxiv.org/abs/2504.18151
作者: Jingfen Qiao,Thong Nguyen,Evangelos Kanoulas,Andrew Yates
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Learned Sparse Retrieval (LSR) has traditionally focused on small-scale encoder-only transformer architectures. With the advent of large-scale pre-trained language models, their capability to generate sparse representations for retrieval tasks across different transformer-based architectures, including encoder-only, decoder-only, and encoder-decoder models, remains largely unexplored. This study investigates the effectiveness of LSR across these architectures, exploring various sparse representation heads and model scales. Our results highlight the limitations of using large language models to create effective sparse representations in zero-shot settings, identifying challenges such as inappropriate term expansions and reduced performance due to the lack of expansion. We find that the encoder-decoder architecture with a multi-token decoding approach achieves the best performance among the three backbones. While the decoder-only model performs worse than the encoder-only model, it demonstrates the potential to outperform when scaled to a high number of parameters.
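LSR 模型产出的稀疏表示最终如何服务检索,可以用"倒排索引上的稀疏点积打分"来示意。下面的权重是手工设定的玩具数据,真实系统中它们由学习得到(并可能包含原文中不存在的扩展词):

```python
import numpy as np
from collections import defaultdict

def build_index(doc_vecs):
    """Inverted index: term -> list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(doc_vecs):
        for term, w in vec.items():
            if w > 0:
                index[term].append((doc_id, w))
    return index

def search(index, query_vec, n_docs):
    """Score documents by the sparse dot product, touching only the
    postings of terms that actually appear in the query."""
    scores = np.zeros(n_docs)
    for term, qw in query_vec.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw
    return np.argsort(-scores)

# Toy sparse representations (a learned model would produce these).
docs = [
    {"sparse": 1.2, "retrieval": 1.5, "index": 0.8},
    {"dense": 1.4, "embedding": 1.1, "retrieval": 0.9},
    {"tempo": 1.0, "music": 1.3},
]
index = build_index(docs)
query = {"sparse": 1.0, "retrieval": 0.7}
ranking = search(index, query, len(docs))
```

该论文研究的正是:不同骨干(encoder-only、decoder-only、encoder-decoder)在生成这类词项权重向量时的表现差异。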

[IR-2] Revisiting Algorithmic Audits of TikTok: Poor Reproducibility and Short-term Validity of Findings SIGIR2025

链接: https://arxiv.org/abs/2504.18140
作者: Matej Mosnar,Adam Skurla,Branislav Pecher,Matus Tibensky,Jan Jakubcik,Adrian Bindas,Peter Sakalik,Ivan Srba
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注: ACM SIGIR 2025. 10 pages

点击查看摘要

Abstract:Social media platforms are constantly shifting towards algorithmically curated content based on implicit or explicit user feedback. Regulators, as well as researchers, are calling for systematic social media algorithmic audits, as this shift encloses users in filter bubbles and leads them to more problematic content. An important aspect of such audits is the reproducibility and generalisability of their findings, as this allows verifiable conclusions to be drawn and potential changes in algorithms to be audited over time. In this work, we study the reproducibility of existing sockpuppeting audits of TikTok recommender systems, and the generalisability of their findings. In our efforts to reproduce the previous works, we find multiple challenges stemming from social media platform changes and content evolution, but also from the research works themselves. These drawbacks limit audit reproducibility and require extensive effort, along with inevitable adjustments to the auditing methodology. Our experiments also reveal that these one-shot audit findings often hold only in the short term, implying that the reproducibility and generalisability of audits heavily depend on the methodological choices and the state of algorithms and content on the platform. This highlights the importance of reproducible audits that allow us to determine how the situation changes over time.

[IR-3] SoK: Timeline based event reconstruction for digital forensics: Terminology methodology and current challenges

链接: https://arxiv.org/abs/2504.18131
作者: Frank Breitinger,Hudan Studiawan,Chris Hargreaves
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注: Accepted for publication at DFRWS USA

点击查看摘要

Abstract:Event reconstruction is a technique that examiners can use to attempt to infer past activities by analyzing digital artifacts. Despite its significance, the field suffers from fragmented research, with studies often focusing narrowly on aspects like timeline creation or tampering detection. This paper addresses the lack of a unified perspective by proposing a comprehensive framework for timeline-based event reconstruction, adapted from traditional forensic science models. We begin by harmonizing existing terminology and presenting a cohesive diagram that clarifies the relationships between key elements of the reconstruction process. Through a comprehensive literature survey, we classify and organize the main challenges, extending the discussion beyond common issues like data volume. Lastly, we highlight recent advancements and propose directions for future research, including specific research gaps. By providing a structured approach, key findings, and a clearer understanding of the underlying challenges, this work aims to strengthen the foundation of digital forensics.

[IR-4] FIM: Frequency-Aware Multi-View Interest Modeling for Local-Life Service Recommendation SIGIR'25

链接: https://arxiv.org/abs/2504.17814
作者: Guoquan Wang,Qiang Luo,Weisong Hu,Pengfei Yao,Wencong Zeng,Guorui Zhou,Kun Gai
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 5 figures, Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), July 13–18, 2025, Padua, Italy

点击查看摘要

Abstract:People’s daily lives involve numerous periodic behaviors, such as eating and traveling. Local-life platforms cater to these recurring needs by providing essential services tied to daily routines. Therefore, users’ periodic intentions are reflected in their interactions with the platforms. There are two main challenges in modeling users’ periodic behaviors in local-life service recommendation systems: 1) the diverse demands of users exhibit varying periodicities, which are difficult to distinguish as they are mixed in the behavior sequences; 2) the periodic behaviors of users are subject to dynamic changes due to factors such as holidays and promotional events. Existing methods struggle to distinguish the periodicities of diverse demands and overlook the importance of dynamically capturing changes in users’ periodic behaviors. To this end, we employ a Frequency-Aware Multi-View Interest Modeling framework (FIM). Specifically, we propose a multi-view search strategy that decomposes users’ demands from different perspectives to separate their various periodic intentions. This allows the model to extract periodic features more comprehensively than category-search-only methods. Moreover, we propose a frequency-domain perception and evolution module. This module uses the Fourier Transform to convert users’ temporal behaviors into the frequency domain, enabling the model to dynamically perceive their periodic features. Extensive offline experiments demonstrate that FIM achieves significant improvements on public and industrial datasets, showing its capability to effectively model users’ periodic intentions. Furthermore, the model has been deployed on the Kuaishou local-life service platform. Through online A/B experiments, the transaction volume has been significantly improved.
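"频域感知"的核心想法可以用对合成行为序列做一次 FFT 来示意(纯玩具数据;FIM 是把傅里叶变换嵌入到可学习的多视角模型中,远不止这一步):

```python
import numpy as np

# 16 weeks of daily engagement with a weekly rhythm plus noise.
rng = np.random.default_rng(0)
days = np.arange(112)
signal = 1.0 + np.sin(2 * np.pi * days / 7) + 0.1 * rng.normal(size=days.size)

# Frequency-domain view of the behavior sequence.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(days.size, d=1.0)       # cycles per day

peak = int(np.argmax(spectrum[1:]) + 1)          # skip the DC bin
period = 1.0 / freqs[peak]                       # dominant period, in days
```

频谱在 1/7 cycles/day 处出现主峰,即恢复出 7 天的周期;真实行为序列中往往混叠多种周期,这正是摘要中多视角分解策略要解决的问题。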

[IR-5] Music Tempo Estimation on Solo Instrumental Performance

链接: https://arxiv.org/abs/2504.18502
作者: Zhanhong He,Roberto Togneri,Xiangyu Zhang
类目: Audio and Speech Processing (eess.AS); Information Retrieval (cs.IR)
*备注: 4 pages, rejected paper by WASPAA2023

点击查看摘要

Abstract:Recently, automatic music transcription has made it possible to convert musical audio into accurate MIDI. However, the resulting MIDI lacks music notations such as tempo, which hinders its conversion into sheet music. In this paper, we investigate state-of-the-art tempo estimation techniques and evaluate their performance on solo instrumental music. These include temporal convolutional network (TCN) and recurrent neural network (RNN) models pretrained on massive datasets of mixed vocal and instrumental music, as well as TCN models trained specifically on solo instrumental performances. Through evaluations on drum, guitar, and classical piano datasets, our TCN models with the new training scheme achieved the best performance. Our newly trained TCN model increases the Acc1 metric by 38.6% for guitar tempo estimation, compared to the pretrained TCN model with an Acc1 of 61.1%. Although our trained TCN model is twice as accurate as the pretrained TCN model in estimating classical piano tempo, its Acc1 is only 50.9%. To improve the performance of the deep learning models, we investigate their combination with various post-processing methods. These post-processing techniques effectively enhance the performance of the deep learning models when they struggle to estimate the tempo of specific instruments.
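作为与文中 TCN/RNN 模型的对照,经典的节奏估计基线是对起音强度包络做自相关、在合理 BPM 区间内取峰值。下面是合成输入上的最小草图(并非论文中的任何模型):

```python
import numpy as np

def estimate_tempo(onset_env, fs, bpm_range=(60, 200)):
    """Autocorrelation tempo baseline: pick the beat period that maximizes
    the autocorrelation of the onset-strength envelope, restricted to a
    plausible BPM range."""
    env = onset_env - onset_env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    lo = int(round(fs * 60.0 / bpm_range[1]))   # shortest beat period (samples)
    hi = int(round(fs * 60.0 / bpm_range[0]))   # longest beat period (samples)
    best = lo + int(np.argmax(ac[lo:hi + 1]))
    return 60.0 * fs / best

# Synthetic onset envelope at 100 Hz: a click every 0.5 s -> 120 BPM.
fs = 100
env = np.zeros(10 * fs)
env[::fs // 2] = 1.0
tempo = estimate_tempo(env, fs)
```

真实音频上需要先从波形计算起音包络,且这类基线容易犯摘要所述的倍频/半频错误(Acc1 指标惩罚的正是这种错误),这也是文中引入后处理方法的动机之一。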

附件下载

点击下载今日全部论文列表