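This post lists the latest papers fetched from arXiv.org on 2025-05-07. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.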

Note: The daily paper data is fetched from arXiv.org and updated automatically at around 12:00 every day.

Reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.

Overview (2025-05-07)

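440 papers were updated today, including:

  • Natural language processing: 46 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 119 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 102 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 107 papers (Machine Learning (cs.LG))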

Natural Language Processing

[NLP-0] VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

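Quick read: This paper addresses the high latency that speech models incur when generating the first audio token during streaming, a bottleneck that seriously hinders real-time deployment of speech systems. The key to its solution is VITA-Audio, an end-to-end large speech model with a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens in a single forward pass, significantly reducing the latency of the first audio output in streaming scenarios. A four-stage progressive training strategy additionally accelerates the model while preserving speech quality.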

Link: https://arxiv.org/abs/2505.03739
Authors: Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun
Affiliations: Tencent Youtu Lab; Nanjing University; Xiamen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Training and Inference Codes: this https URL

Abstract: With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.

[NLP-1] WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

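Quick read: This paper addresses how to evaluate the ability of large language model (LLM)-based agents to generate multi-file website codebases from scratch. The key to its solution is WebGen-Bench, a new benchmark containing diverse website-generation instructions created jointly by human annotators and GPT-4o, together with 647 test cases that assess the functionality of the generated websites. The paper also employs a powerful web-navigation agent to automate testing and improve reproducibility, and constructs the WebGen-Instruct training set to improve model performance on website generation.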

Link: https://arxiv.org/abs/2505.03733
Authors: Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li
Affiliations: Multimedia Laboratory (MMLab), The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments:

Abstract:LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent’s ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, this http URL, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, this http URL powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on this http URL trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.

[NLP-2] NBF at SemEval-2025 Task 5: Light-Burst Attention Enhanced System for Multilingual Subject Recommendation

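Quick read: This paper addresses cross-lingual subject classification, specifically subject retrieval in the English and German academic domains. The key to its solution is training on bilingual data with negative sampling and a margin-based retrieval objective, combined with a dimension-as-token self-attention mechanism whose significantly reduced internal dimensions still encode sentence embeddings effectively for subject retrieval.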

Link: https://arxiv.org/abs/2505.03711
Authors: Baharul Islam, Nasim Ahmad, Ferdous Ahmed Barbhuiya, Kuntal Dey
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract: We present our system submission for SemEval 2025 Task 5, which focuses on cross-lingual subject classification in the English and German academic domains. Our approach leverages bilingual data during training, employing negative sampling and a margin-based retrieval objective. We demonstrate that a dimension-as-token self-attention mechanism designed with significantly reduced internal dimensions can effectively encode sentence embeddings for subject retrieval. In quantitative evaluation, our system achieved an average recall rate of 32.24% in the general quantitative setting (all subjects), and 43.16% and 31.53% in the general qualitative evaluation methods, with minimal GPU usage, highlighting its competitive performance. Our results demonstrate that our approach is effective in capturing relevant subject information under resource constraints, although there is still room for improvement.
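
The abstract above mentions negative sampling and a margin-based retrieval objective without giving the formula. As a rough illustration only, the sketch below uses a standard hinge (triplet-style) loss over dot-product scores. The function names, the margin value, and the toy 2-d embeddings are all assumptions, not the authors' implementation.

```python
import random


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin_loss(query, positive, negatives, margin=0.2):
    """Hinge loss: every negative should score at least `margin` below the positive."""
    pos_score = dot(query, positive)
    losses = [max(0.0, margin - pos_score + dot(query, neg)) for neg in negatives]
    return sum(losses) / len(losses)


def sample_negatives(subject_ids, positive_id, k, rng):
    """Draw k subject ids other than the gold subject (uniform sampling is an assumption)."""
    candidates = [s for s in subject_ids if s != positive_id]
    return rng.sample(candidates, k)


# Toy embeddings with made-up numbers, purely for illustration.
subject_vecs = {"physics": [1.0, 0.0], "history": [0.0, 1.0], "law": [-1.0, 0.0]}
query_vec = [0.9, 0.1]

rng = random.Random(0)
neg_ids = sample_negatives(list(subject_vecs), "physics", k=2, rng=rng)
loss = margin_loss(query_vec, subject_vecs["physics"], [subject_vecs[n] for n in neg_ids])
print(neg_ids, loss)
```

In this toy case the gold subject already outscores both negatives by more than the margin, so the loss is zero; a query that scored a negative above the positive would be penalized.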

[NLP-3] IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages

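Quick read: This paper addresses the underdevelopment of question-answering (QA) systems for low-resource languages, in particular the underrepresentation of Indic languages despite their vast native speaker base. The key to its solution is IndicSQuAD, a comprehensive multilingual extractive QA dataset covering nine major Indic languages, systematically derived from SQuAD by adapting and extending translation techniques to maintain high linguistic fidelity and accurate answer-span alignment.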

Link: https://arxiv.org/abs/2505.03688
Authors: Sharvi Endait, Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil, Raviraj Joshi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The rapid progress in question-answering (QA) systems has predominantly benefited high-resource languages, leaving Indic languages largely underrepresented despite their vast native speaker base. In this paper, we present IndicSQuAD, a comprehensive multi-lingual extractive QA dataset covering nine major Indic languages, systematically derived from the SQuAD dataset. Building on previous work with MahaSQuAD for Marathi, our approach adapts and extends translation techniques to maintain high linguistic fidelity and accurate answer-span alignment across diverse languages. IndicSQuAD comprises extensive training, validation, and test sets for each language, providing a robust foundation for model development. We evaluate baseline performances using language-specific monolingual BERT models and the multilingual MuRIL-BERT. The results indicate some challenges inherent in low-resource settings. Moreover, our experiments suggest potential directions for future work, including expanding to additional languages, developing domain-specific datasets, and incorporating multimodal data. The dataset and models are publicly shared at this https URL

[NLP-4] Rational Retrieval Acts: Leveraging Pragmatic Reasoning to Improve Sparse Retrieval SIGIR2025

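Quick read: This paper addresses the fact that current sparse neural information retrieval methods, as well as traditional models such as BM25, do not adequately account for the document collection and the complex interplay between term weights when representing a document. The key to its solution is adapting the Rational Speech Acts (RSA) linguistics framework to information retrieval: RSA dynamically modulates token-document interactions by considering the influence of the other documents in the dataset, which better contrasts document representations. Experiments show that incorporating RSA consistently improves multiple sparse retrieval models and achieves state-of-the-art performance on out-of-domain datasets from the BEIR benchmark.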

Link: https://arxiv.org/abs/2505.03676
Authors: Arthur Satouf, Gabriel Ben Zenou, Benjamin Piwowarski, Habiboulaye Amadou Boubacar, Pablo Piantanida
Affiliations: Université Paris-Saclay; ILLS; ISIR; Air Liquide France; CentraleSupélec; AMIADFrance; CNRS; MILA - Quebec AI Institute
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 6 pages - 2 figures - conference: accepted at SIGIR 2025

Abstract:Current sparse neural information retrieval (IR) methods, and to a lesser extent more traditional models such as BM25, do not take into account the document collection and the complex interplay between different term weights when representing a single document. In this paper, we show how the Rational Speech Acts (RSA), a linguistics framework used to minimize the number of features to be communicated when identifying an object in a set, can be adapted to the IR case – and in particular to the high number of potential features (here, tokens). RSA dynamically modulates token-document interactions by considering the influence of other documents in the dataset, better contrasting document representations. Experiments show that incorporating RSA consistently improves multiple sparse retrieval models and achieves state-of-the-art performance on out-of-domain datasets from the BEIR benchmark. this https URL
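
To make the idea of modulating token-document weights by the rest of the collection concrete, here is a minimal sketch of the textbook RSA recursion (literal listener, then pragmatic speaker) applied to a toy token-by-document weight table. It assumes a uniform prior over documents; the function names, the `alpha` parameter, and the toy weights are illustrative, and the paper's actual formulation may differ.

```python
def normalize(values):
    """Scale non-negative values so they sum to 1 (all zeros stay zeros)."""
    total = sum(values)
    return [v / total if total > 0 else 0.0 for v in values]


def literal_listener(weights):
    """L0(d | t): for each token, normalize its weights across documents."""
    return {tok: normalize(row) for tok, row in weights.items()}


def pragmatic_speaker(weights, alpha=1.0):
    """S1(t | d): for each document, normalize L0(d | t) ** alpha across tokens."""
    l0 = literal_listener(weights)
    tokens = list(weights)
    n_docs = len(next(iter(weights.values())))
    s1 = {tok: [0.0] * n_docs for tok in tokens}
    for d in range(n_docs):
        column = normalize([l0[tok][d] ** alpha for tok in tokens])
        for tok, p in zip(tokens, column):
            s1[tok][d] = p
    return s1


# Rows are tokens, columns are three toy documents; every raw weight is 0 or 1.
weights = {
    "the": [1.0, 1.0, 1.0],     # appears in every document
    "neural": [1.0, 1.0, 0.0],  # appears in two documents
    "bm25": [1.0, 0.0, 0.0],    # appears only in document 0
}
s1 = pragmatic_speaker(weights)
print({tok: round(row[0], 3) for tok, row in s1.items()})
```

Although all three tokens have the same raw weight in document 0, the token that appears only there ends up with the largest share, because the other documents compete for the shared tokens.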

[NLP-5] Towards conversational assistants for health applications: using ChatGPT to generate conversations about heart failure

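Quick read: This paper addresses the generation of conversations focused on self-care strategies for African-American heart failure patients, a domain with limited specialized datasets. The key to its solution is effective prompt design, using four prompting strategies to improve dialogue quality: domain, African American Vernacular English (AAVE), Social Determinants of Health (SDOH), and SDOH-informed reasoning.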

Link: https://arxiv.org/abs/2505.03675
Authors: Anuja Tayal, Devika Salunke, Barbara Di Eugenio, Paula G Allen-Meares, Eulalia P Abril, Olga Garcia-Bedoya, Carolyn A Dickens, Andrew D. Boyd
Affiliations: University of Illinois at Chicago
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We explore the potential of ChatGPT (3.5-turbo and 4) to generate conversations focused on self-care strategies for African-American heart failure patients – a domain with limited specialized datasets. To simulate patient-health educator dialogues, we employed four prompting strategies: domain, African American Vernacular English (AAVE), Social Determinants of Health (SDOH), and SDOH-informed reasoning. Conversations were generated across key self-care domains of food, exercise, and fluid intake, with varying turn lengths (5, 10, 15) and incorporated patient-specific SDOH attributes such as age, gender, neighborhood, and socioeconomic status. Our findings show that effective prompt design is essential. While incorporating SDOH and reasoning improves dialogue quality, ChatGPT still lacks the empathy and engagement needed for meaningful healthcare communication.

[NLP-6] Say It Another Way: A Framework for User-Grounded Paraphrasing

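Quick read: This paper addresses the observation that small changes in prompt wording can lead to meaningful differences in the behavior of large language models (LLMs), which raises concerns about the stability and reliability of their evaluations. The key to its solution is a controlled paraphrasing framework, based on a taxonomy of minimal linguistic transformations, that systematically generates natural prompt variations reflecting real-world language use. With it, the authors analyze how LLM responses change under paraphrased prompts in stereotype evaluation tasks.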

Link: https://arxiv.org/abs/2505.03563
Authors: Cléa Chataigner, Rebecca Ma, Prakhar Ganesh, Afaf Taïk, Elliot Creager, Golnoosh Farnadi
Affiliations: Mila; Quebec AI Institute; McGill University; University of Waterloo; Vector Institute; Université de Montréal
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Small changes in how a prompt is worded can lead to meaningful differences in the behavior of large language models (LLMs), raising concerns about the stability and reliability of their evaluations. While prior work has explored simple formatting changes, these rarely capture the kinds of natural variation seen in real-world language use. We propose a controlled paraphrasing framework based on a taxonomy of minimal linguistic transformations to systematically generate natural prompt variations. Using the BBQ dataset, we validate our method with both human annotations and automated checks, then use it to study how LLMs respond to paraphrased prompts in stereotype evaluation tasks. Our analysis shows that even subtle prompt modifications can lead to substantial changes in model behavior. These results highlight the need for robust, paraphrase-aware evaluation protocols.

[NLP-7] Faster MoE LLM Inference for Extremely Large Models

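Quick read: This paper studies the efficiency dynamics of sparse Mixture of Experts (MoE) models under different service loads, and how reducing the number of routed experts affects the trade-off between MoE efficiency and performance. The key is to vary both the number of activated experts and the total number of experts: reducing the number of activated experts can yield substantial efficiency gains in certain scenarios with only minor performance degradation, whereas reducing the total number of experts yields limited efficiency gains but severe performance degradation. The proposed method increases throughput by at least 10% without performance degradation.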

Link: https://arxiv.org/abs/2505.03531
Authors: Haoqi Yang, Luohe Shi, Qiwei Li, Zuchao Li, Ping Wang, Bo Du, Mengjia Shen, Hai Zhao
Affiliations: Wuhan University; Wuhan Second Ship Design And Research Institute; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Sparse Mixture of Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE architectures. With the emergence of DeepSeek Models, fine-grained MoE models are gaining popularity, yet research on them remains limited. Therefore, we want to discuss the efficiency dynamic under different service loads. Additionally, fine-grained models allow deployers to reduce the number of routed experts, both activated counts and total counts, raising the question of how this reduction affects the trade-off between MoE efficiency and performance. Our findings indicate that while deploying MoE models presents greater challenges, it also offers significant optimization opportunities. Reducing the number of activated experts can lead to substantial efficiency improvements in certain scenarios, with only minor performance degradation. Reducing the total number of experts provides limited efficiency gains but results in severe performance degradation. Our method can increase throughput by at least 10% without any performance degradation. Overall, we conclude that MoE inference optimization remains an area with substantial potential for exploration and improvement.
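
For readers unfamiliar with what "reducing the number of activated experts" means mechanically, the sketch below shows generic top-k routing in a toy MoE layer. It is not the paper's method: renormalizing the selected weights to sum to 1 is one common convention and an assumption here, and the scalar "experts" and router logits are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


def moe_output(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k activated experts only."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Four toy "experts" that just scale their input; the logits are made up.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 0.3, 1.5]

full = route_top_k(logits, k=4)     # all four experts run
reduced = route_top_k(logits, k=2)  # only experts 1 and 3 run
print(reduced, moe_output(1.0, experts, logits, k=2))
```

Lowering `k` at deployment time cuts the expert computation per token, at the cost of dropping the lower-weighted experts' contributions, which is the kind of trade-off the abstract describes.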

[NLP-8] BadLingual: A Novel Lingual-Backdoor Attack against Large Language Models

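Quick read: This paper studies lingual-backdoor attacks on large language models (LLMs), in which the language itself serves as the trigger. Such attacks can precisely target a specific language-speaking group, exacerbating racial discrimination. The key contribution is BadLingual, a task-agnostic lingual-backdoor attack that uses adversarial training based on PPL-constrained Greedy Coordinate Gradient-based Search (PGCG) to expand the decision boundary of the lingual backdoor, improving its generalization across downstream tasks.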

Link: https://arxiv.org/abs/2505.03501
Authors: Zihan Wang, Hongwei Li, Rui Zhang, Wenbo Jiang, Kangjie Chen, Tianwei Zhang, Qingchuan Zhao, Guowen Xu
Affiliations: University of Electronic Science and Technology of China; Nanyang Technological University; City University of Hong Kong
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:In this paper, we present a new form of backdoor attack against Large Language Models (LLMs): lingual-backdoor attacks. The key novelty of lingual-backdoor attacks is that the language itself serves as the trigger to hijack the infected LLMs to generate inflammatory speech. They enable the precise targeting of a specific language-speaking group, exacerbating racial discrimination by malicious entities. We first implement a baseline lingual-backdoor attack, which is carried out by poisoning a set of training data for specific downstream tasks through translation into the trigger language. However, this baseline attack suffers from poor task generalization and is impractical in real-world settings. To address this challenge, we design BadLingual, a novel task-agnostic lingual-backdoor, capable of triggering any downstream tasks within the chat LLMs, regardless of the specific questions of these tasks. We design a new approach using PPL-constrained Greedy Coordinate Gradient-based Search (PGCG) based adversarial training to expand the decision boundary of lingual-backdoor, thereby enhancing the generalization ability of lingual-backdoor across various tasks. We perform extensive experiments to validate the effectiveness of our proposed attacks. Specifically, the baseline attack achieves an ASR of over 90% on the specified tasks. However, its ASR reaches only 37.61% across six tasks in the task-agnostic scenario. In contrast, BadLingual brings up to 37.35% improvement over the baseline. Our study sheds light on a new perspective of vulnerabilities in LLMs with multilingual capabilities and is expected to promote future research on the potential defenses to enhance the LLMs’ robustness

[NLP-9] Sentence Embeddings as an intermediate target in end-to-end summarisation

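Quick read: This paper addresses the poor performance of neural network-based methods on document summarisation tasks with large inputs. The key to its solution is combining an extractive approach with externally pre-trained sentence-level embeddings, in addition to an abstractive summarisation model, which outperforms existing methods on end-to-end summarisation of user reviews of accommodations. The study also shows that predicting the sentence-level embedding of a summary yields a higher-quality end-to-end system for loosely aligned source-to-target corpora than the common approach of predicting probability distributions over sentence selection.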

Link: https://arxiv.org/abs/2505.03481
Authors: Maciej Zembrzuski, Saad Mahamood
Affiliations: trivago N.V., Düsseldorf, Germany
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 1 figure, Year: 2019

Abstract: Current neural network-based methods for document summarisation struggle when applied to datasets containing large inputs. In this paper we propose a new approach to the challenge of content selection when dealing with end-to-end summarisation of user reviews of accommodations. We show that by combining an extractive approach with externally pre-trained sentence-level embeddings, in addition to an abstractive summarisation model, we can outperform existing methods when this is applied to the task of summarising a large input dataset. We also prove that predicting the sentence-level embedding of a summary increases the quality of an end-to-end system for loosely aligned source-to-target corpora, compared with the common practice of predicting probability distributions of sentence selection.

[NLP-10] Evaluation of LLMs on Long-tail Entity Linking in Historical Documents

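Quick read: This paper addresses long-tail entity linking, that is, accurately linking less popular entities that are underrepresented in training data and knowledge bases. Traditional methods handle such entities poorly, and although large language models (LLMs) have strong contextual understanding, their performance on long-tail entity linking has received little systematic study. The key is an evaluation of two popular LLMs (GPT and LLama3) in a long-tail entity linking scenario, compared against the state-of-the-art ReLiK framework. The preliminary experiments indicate that LLMs perform encouragingly well, suggesting they can help close the gap between head and long-tail entity linking.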

Link: https://arxiv.org/abs/2505.03473
Authors: Marta Boscariol, Luana Bulla, Lia Draetta, Beatrice Fiumanò, Emanuele Lenzi, Leonardo Piano
Affiliations: University of Turin; University of Catania; University of Bologna; University of Pisa; University of Cagliari; Institute of Information Science and Technologies (ISTI), National Research Council of Italy (CNR)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Entity Linking (EL) plays a crucial role in Natural Language Processing (NLP) applications, enabling the disambiguation of entity mentions by linking them to their corresponding entries in a reference knowledge base (KB). Thanks to their deep contextual understanding capabilities, LLMs offer a new perspective to tackle EL, promising better results than traditional methods. Despite the impressive generalization capabilities of LLMs, linking less popular, long-tail entities remains challenging as these entities are often underrepresented in training data and knowledge bases. Furthermore, the long-tail EL task is an understudied problem, and limited studies address it with LLMs. In the present work, we assess the performance of two popular LLMs, GPT and LLama3, in a long-tail entity linking scenario. Using MHERCL v0.1, a manually annotated benchmark of sentences from domain-specific historical texts, we quantitatively compare the performance of LLMs in identifying and linking entities to their corresponding Wikidata entries against that of ReLiK, a state-of-the-art Entity Linking and Relation Extraction framework. Our preliminary experiments reveal that LLMs perform encouragingly well in long-tail EL, indicating that this technology can be a valuable adjunct in filling the gap between head and long-tail EL.

[NLP-11] Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models

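Quick read: This paper addresses the "overthinking" problem that arises when supervised fine-tuning (SFT) transfers reasoning capabilities to non-reasoning models: the fine-tuned models produce verbose and redundant reasoning chains. The key to its solution is Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), which combines a long chain-of-thought dataset with short counterparts obtained through structure-preserved rewriting, improving reasoning accuracy while substantially reducing response length.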

Link: https://arxiv.org/abs/2505.03469
Authors: Bin Yu, Hang Yuan, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 2 figures

Abstract: Recent advances in large language models have demonstrated that Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning data distilled from large reasoning models (e.g., DeepSeek R1) can effectively transfer reasoning capabilities to non-reasoning models. However, models fine-tuned with this approach inherit the “overthinking” problem from teacher models, producing verbose and redundant reasoning chains during inference. To address this challenge, we propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), which combines a long CoT reasoning dataset with its short counterparts obtained through structure-preserved rewriting. Our experiments demonstrate that models trained using the LS-Mixture SFT method, compared to those trained with direct SFT, achieved an average accuracy improvement of 2.3% across various benchmarks while substantially reducing model response length by approximately 47.61%. This work offers an approach to endow non-reasoning models with reasoning capabilities through supervised fine-tuning while avoiding the inherent overthinking problems inherited from teacher models, thereby enabling efficient reasoning in the fine-tuned models.
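
The data-mixing step described above can be pictured with the small sketch below. It assumes a 1:1 mix in which every long-CoT example is paired with a shortened rewrite; the abstract does not state the ratio or how the rewriting is performed, so `keep_first_and_last_step` is a hypothetical placeholder rewriter, and all field names are illustrative.

```python
import random


def keep_first_and_last_step(response):
    """Placeholder rewriter (hypothetical): keep only the first and last reasoning lines."""
    steps = response.split("\n")
    return "\n".join([steps[0], steps[-1]]) if len(steps) > 2 else response


def build_ls_mixture(long_examples, rewrite_short, seed=0):
    """Pair each long-CoT example with its short rewrite, then shuffle the union."""
    mixture = []
    for ex in long_examples:
        mixture.append({"question": ex["question"], "response": ex["response"], "style": "long"})
        mixture.append(
            {"question": ex["question"], "response": rewrite_short(ex["response"]), "style": "short"}
        )
    random.Random(seed).shuffle(mixture)
    return mixture


long_data = [
    {"question": "2 + 3 * 4 = ?", "response": "First 3 * 4 = 12.\nWait, check: 12.\nThen 2 + 12 = 14."},
    {"question": "10 / 4 = ?", "response": "10 / 4 = 2.5."},
]
mixture = build_ls_mixture(long_data, keep_first_and_last_step)
print(len(mixture), sorted(item["style"] for item in mixture))
```

The resulting list would then be fed to an ordinary SFT loop; nothing about the training objective itself changes in this sketch.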

[NLP-12] Uncertainty-Aware Large Language Models for Explainable Disease Diagnosis

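Quick read: This paper addresses diagnostic uncertainty that arises in clinical diagnosis when evidence is insufficient, which increases the risk of misdiagnosis and adverse outcomes. Identifying and explaining diagnostic uncertainty remains under-explored in existing diagnostic systems. The key to its solution is ConfiDx, an uncertainty-aware model built by fine-tuning open-source large language models (LLMs) with diagnostic criteria; it identifies diagnostic uncertainty and generates trustworthy explanations, improving the reliability of automatic diagnostic systems.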

Link: https://arxiv.org/abs/2505.03467
Authors: Shuang Zhou, Jiashuo Wang, Zidu Xu, Song Wang, David Brauer, Lindsay Welton, Jacob Cogan, Yuen-Hei Chung, Lei Tian, Zaifu Zhan, Yu Hou, Mingquan Lin, Genevieve B. Melton, Rui Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 8 figures

Abstract:Explainable disease diagnosis, which leverages patient information (e.g., signs and symptoms) and computational models to generate probable diagnoses and reasonings, offers clear clinical values. However, when clinical notes encompass insufficient evidence for a definite diagnosis, such as the absence of definitive symptoms, diagnostic uncertainty usually arises, increasing the risk of misdiagnosis and adverse outcomes. Although explicitly identifying and explaining diagnostic uncertainties is essential for trustworthy diagnostic systems, it remains under-explored. To fill this gap, we introduce ConfiDx, an uncertainty-aware large language model (LLM) created by fine-tuning open-source LLMs with diagnostic criteria. We formalized the task and assembled richly annotated datasets that capture varying degrees of diagnostic ambiguity. Evaluating ConfiDx on real-world datasets demonstrated that it excelled in identifying diagnostic uncertainties, achieving superior diagnostic performance, and generating trustworthy explanations for diagnoses and uncertainties. To our knowledge, this is the first study to jointly address diagnostic uncertainty recognition and explanation, substantially enhancing the reliability of automatic diagnostic systems.

[NLP-13] An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation

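Quick read: This paper addresses the complex and costly problem of finding the optimal Retrieval-Augmented Generation (RAG) configuration for a given use case. It rigorously benchmarks RAG hyper-parameter optimization (HPO), evaluating 5 HPO algorithms over 5 datasets from diverse domains. The key is exploring a large HPO search space with two optimized evaluation metrics; the results show that RAG HPO can be done efficiently, either greedily or with iterative random search, and that it significantly boosts RAG performance on all datasets. For greedy HPO, optimizing models first is preferable to optimizing sequentially in RAG pipeline order.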

Link: https://arxiv.org/abs/2505.03452
Authors: Matan Orbach, Ohad Eytan, Benjamin Sznajder, Ariel Gera, Odellia Boni, Yoav Kantor, Gal Bloch, Omri Levy, Hadas Abraham, Nitzan Barzilay, Eyal Shnarch, Michael E. Factor, Shila Ofek-Koifman, Paula Ta-Shma, Assaf Toledo
Affiliations: IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Finding the optimal Retrieval-Augmented Generation (RAG) configuration for a given use case can be complex and expensive. Motivated by this challenge, frameworks for RAG hyper-parameter optimization (HPO) have recently emerged, yet their effectiveness has not been rigorously benchmarked. To address this gap, we present a comprehensive study involving 5 HPO algorithms over 5 datasets from diverse domains, including a new one collected for this work on real-world product documentation. Our study explores the largest HPO search space considered to date, with two optimized evaluation metrics. Analysis of the results shows that RAG HPO can be done efficiently, either greedily or with iterative random search, and that it significantly boosts RAG performance for all datasets. For greedy HPO approaches, we show that optimizing models first is preferable to the prevalent practice of optimizing sequentially according to the RAG pipeline order.
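
The two search styles named in the abstract, greedy and random search, can be sketched over a toy RAG configuration space as follows. Everything here is an assumption for illustration: the search space, the parameter names, and especially `toy_score`, which stands in for actually running a RAG pipeline and scoring its answers. The plain random search shown is also simpler than whatever iterative variant the study evaluates.

```python
import random

# Hypothetical search space; real RAG pipelines expose many more knobs.
SEARCH_SPACE = {
    "chunk_size": [256, 512, 1024],
    "top_k": [3, 5, 10],
    "generator": ["model-a", "model-b"],
}


def toy_score(config):
    """Made-up stand-in for evaluating one RAG configuration on a dev set."""
    score = 1.0 if config["generator"] == "model-b" else 0.5
    score += 0.3 if config["top_k"] == 5 else 0.1
    score -= abs(config["chunk_size"] - 512) / 2048
    return score


def random_search(space, score, n_trials, seed=0):
    """Sample n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        s = score(config)
        if s > best_score:
            best_config, best_score = config, s
    return best_config, best_score


def greedy_search(space, score, order, defaults):
    """Tune one hyper-parameter at a time in `order`, holding the others fixed."""
    config = dict(defaults)
    for name in order:
        config[name] = max(space[name], key=lambda v: score({**config, name: v}))
    return config, score(config)


defaults = {name: values[0] for name, values in SEARCH_SPACE.items()}
# "Models first" ordering: tune the generator before the retrieval parameters.
greedy_config, greedy_score = greedy_search(
    SEARCH_SPACE, toy_score, ["generator", "top_k", "chunk_size"], defaults
)
rs_config, rs_score = random_search(SEARCH_SPACE, toy_score, n_trials=10)
print(greedy_config, greedy_score, rs_config, rs_score)
```

Because the toy score is separable across parameters, greedy search finds the optimum here regardless of order; on a real pipeline the parameters interact, which is where the ordering question raised in the abstract becomes relevant.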

[NLP-14] Elevating Semantic Exploration: A Novel Approach Utilizing Distributed Repositories

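Quick read: This paper addresses the limited scalability and fault tolerance of traditional centralized systems and the management complexity of distributed systems. The key to its solution is a distributed document repository system that uses edge repositories to analyze textual data and metadata, enhancing semantic exploration capabilities for large-scale applications that require high availability and performance.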

Link: https://arxiv.org/abs/2505.03443
Authors: Valerio Bellandi
Affiliations: Università Degli Studi di Milano
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This paper has been accepted at the 6th International Conference on Recent Trends and Applications in Computer Science. It will appear in the proceedings

Abstract:Centralized and distributed systems are two main approaches to organizing ICT infrastructure, each with its pros and cons. Centralized systems concentrate resources in one location, making management easier but creating single points of failure. Distributed systems, on the other hand, spread resources across multiple nodes, offering better scalability and fault tolerance, but requiring more complex management. The choice between them depends on factors like application needs, scalability, and data sensitivity. Centralized systems suit applications with limited scalability and centralized control, while distributed systems excel in large-scale environments requiring high availability and performance. This paper explores a distributed document repository system developed for the Italian Ministry of Justice, using edge repositories to analyze textual data and metadata, enhancing semantic exploration capabilities.

[NLP-15] MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks

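Quick read: This paper addresses the efficacy of generative AI in the Arabic medical domain, which remains unexplored because high-quality domain-specific datasets and benchmarks are lacking. The key to its solution is MedArabiQ, a new benchmark dataset of seven Arabic medical tasks that covers multiple specialties and includes multiple-choice questions, fill-in-the-blank, and patient-doctor question answering. The dataset was constructed from past medical exams and publicly available datasets, then modified in several ways to evaluate various large language model (LLM) capabilities, including bias mitigation, providing a foundation for future research.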

Link: https://arxiv.org/abs/2505.03427
Authors: Mouath Abu Daoud, Chaimae Abouzahir, Leen Kharouf, Walid Al-Eisawi, Nizar Habash, Farah E. Shamout
Affiliations: New York University Abu Dhabi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 21 pages

Abstract:Large Language Models (LLMs) have demonstrated significant promise for various applications in healthcare. However, their efficacy in the Arabic medical domain remains unexplored due to the lack of high-quality domain-specific datasets and benchmarks. This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks, covering multiple specialties and including multiple choice questions, fill-in-the-blank, and patient-doctor question answering. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation. We conducted an extensive evaluation with five state-of-the-art open-source and proprietary LLMs, including GPT-4o, Claude 3.5-Sonnet, and Gemini 1.5. Our findings highlight the need for the creation of new high-quality benchmarks that span different languages to ensure fair deployment and scalability of LLMs in healthcare. By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.

[NLP-16] Enhancing Target-unspecific Tasks through a Features Matrix ICML2025

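Quick read: This paper addresses the weak performance of large vision-language models on target-unspecific (generalizable) tasks, which may stem from overfitting during training causing the model to forget general knowledge that benefits those tasks. The key to its solution is Features Matrix (FM) regularization, which extracts and leverages general knowledge to build a features matrix that captures the semantics of diverse inputs from a deep and fine-grained perspective, preserving essential general knowledge and reducing the risk of overfitting.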

Link: https://arxiv.org/abs/2505.03414
Authors: Fangming Cui, Yonggang Zhang, Xuan Wang, Xinmei Tian, Jun Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: ICML 2025

Abstract: Recent developments in prompt learning of large vision-language models have significantly improved performance in target-specific tasks. However, these prompt optimizing methods often struggle to tackle the target-unspecific or generalizable tasks effectively. This may be because overfitting during training causes the model to forget general knowledge that strongly benefits target-unspecific tasks. To alleviate this issue, we propose a novel Features Matrix (FM) regularization approach designed to enhance these models on target-unspecific tasks. Our method extracts and leverages general knowledge, shaping a Features Matrix (FM). Specifically, the FM captures the semantics of diverse inputs from a deep and fine perspective, preserving essential general knowledge, which mitigates the risk of overfitting. Representative evaluations demonstrate that: 1) the FM is compatible with existing frameworks as a generic and flexible module, and 2) the FM is significantly effective in enhancing target-unspecific tasks, achieving state-of-the-art performance.

[NLP-17] Lightweight Clinical Decision Support System using QLoRA-Fine-Tuned LLMs and Retrieval-Augmented Generation

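Quick read: This paper addresses how to use large language models (LLMs) to make medical decision support more accurate and practical, especially by integrating hospital-specific data and improving model efficiency. The key to its solution is combining Retrieval-Augmented Generation (RAG) with fine-tuning via Quantized Low-Rank Adaptation (QLoRA) to improve response accuracy and parameter efficiency in medical settings, with specialized quantization techniques preserving the integrity of medical information.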

Link: https://arxiv.org/abs/2505.03406
Authors: Mohammad Shoaib Ansari, Mohd Sohail Ali Khan, Shubham Revankar, Aditya Varma, Anil S. Mokhade
Affiliations: Visvesvaraya National Institute of Technology (VNIT), Nagpur
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages

Abstract: This research paper investigates the application of Large Language Models (LLMs) in healthcare, specifically focusing on enhancing medical decision support through Retrieval-Augmented Generation (RAG) integrated with hospital-specific data and fine-tuning using Quantized Low-Rank Adaptation (QLoRA). The system utilizes Llama 3.2-3B-Instruct as its foundation model. By embedding and retrieving context-relevant healthcare information, the system significantly improves response accuracy. QLoRA facilitates notable parameter efficiency and memory optimization, preserving the integrity of medical information through specialized quantization techniques. Our research also shows that our model performs relatively well on various medical benchmarks, indicating that it can be used to make basic medical suggestions. This paper details the system’s technical components, including its architecture, quantization methods, and key healthcare applications such as enhanced disease prediction from patient symptoms and medical history, treatment suggestions, and efficient summarization of complex medical reports. We touch on the ethical considerations (patient privacy, data security, and the need for rigorous clinical validation) as well as the practical challenges of integrating such systems into real-world healthcare workflows. Furthermore, the lightweight quantized weights ensure scalability and ease of deployment even in low-resource hospital environments. Finally, the paper concludes with an analysis of the broader impact of LLMs on healthcare and outlines future directions for LLMs in medical settings.

[NLP-18] Absolute Zero: Reinforced Self-play Reasoning with Zero Data

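[Quick Read]: This paper addresses the dependence of existing reinforcement learning with verifiable rewards (RLVR) methods on human-curated data: even in the zero setting, training still relies on manually curated collections of questions and answers, which limits long-term scalability. The key to its solution is a new RLVR paradigm called Absolute Zero, in which a single model proposes tasks that maximize its own learning progress and improves its reasoning by solving them, without any external data. Under this paradigm, the authors introduce the Absolute Zero Reasoner (AZR), which uses a code executor both to validate proposed code reasoning tasks and to verify answers, serving as a unified source of verifiable reward for open-ended yet grounded learning.

Link: https://arxiv.org/abs/2505.03335
Authors: Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: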

Abstract:Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as an unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
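
The role of the code executor described above can be illustrated with a toy sketch. It assumes, purely for illustration, that a proposed task is a Python source string defining a function named `f` plus one input, and that the reward is binary; the names `run_program`, `validate_task`, and `reward` are hypothetical and are not taken from the paper, and the `exec` call here is not sandboxed.

```python
def run_program(source, arg):
    """Execute a proposed program and apply its function `f` to `arg`."""
    namespace = {}
    exec(source, namespace)  # toy executor only: no sandboxing
    return namespace["f"](arg)


def validate_task(source, arg):
    """A proposed task counts as valid if it runs without raising."""
    try:
        return True, run_program(source, arg)
    except Exception:
        return False, None


def reward(source, arg, predicted_output):
    """Binary reward: 1.0 only if the task is valid and the answer matches."""
    valid, gold = validate_task(source, arg)
    return 1.0 if valid and predicted_output == gold else 0.0


proposed = "def f(x):\n    return x * 2 + 1"
correct = reward(proposed, 3, 7)
wrong = reward(proposed, 3, 8)
```

The same executor serves both purposes in this sketch: it rejects proposals that fail to run, and it produces the reference output against which a solver's answer is checked.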

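[NLP-19] Recall with Reasoning: Chain-of-Thought Distillation for Mamba’s Long-Context Memory and Extrapolation

[Quick Read]: This paper addresses the problem that Mamba’s theoretically unbounded context capacity is not realized in practice when sequences far exceed the training length. The key to its solution is a simple yet effective method, Recall with Reasoning (RwR), which distills chain-of-thought (CoT) summarization from a teacher model and prepends these summaries as CoT prompts during fine-tuning, teaching Mamba to actively recall and reason over long contexts.

Link: https://arxiv.org/abs/2505.03320
Authors: Junyu Ma, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu
Affiliations: Tencent AI Lab
Categories: Computation and Language (cs.CL)
Comments: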

Abstract:Mamba’s theoretical infinite-context potential is limited in practice when sequences far exceed training lengths. This work explores unlocking Mamba’s long-context memory ability by a simple-yet-effective method, Recall with Reasoning (RwR), by distilling chain-of-thought (CoT) summarization from a teacher model. Specifically, RwR prepends these summarization as CoT prompts during fine-tuning, teaching Mamba to actively recall and reason over long contexts. Experiments on LONGMEMEVAL and HELMET show RwR boosts Mamba’s long-context performance against comparable Transformer/hybrid baselines under similar pretraining conditions, while preserving short-context capabilities, all without architectural changes.

[NLP-20] Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback

[Quick Read]: This paper addresses the limitations of current evaluations of the counseling capability of LLM-based systems: static assessment, a single perspective, and an open-loop framework. The key to its solution is the Ψ-Arena framework, which has three core characteristics: (1) realistic arena interactions that simulate real-world counseling through multi-stage dialogues, (2) tripartite evaluation that integrates assessments from the client, counselor, and supervisor perspectives, and (3) closed-loop optimization that iteratively improves LLM counselors using diagnostic feedback.

Link: https://arxiv.org/abs/2505.03293
Authors: Shijing Zhu, Zhuang Chen, Guanqun Bi, Binghang Li, Yaxi Deng, Dazhen Wan, Libiao Peng, Xiyao Xiao, Rongsheng Zhang, Tangjie Lv, Zhipeng Hu, FangFang Li, Minlie Huang
Affiliations: Central South University; CoAI Group, DCST, IAI, BNRIST, Tsinghua University; Lingxin AI; Fuxi AI Lab, NetEase Inc.
Categories: Computation and Language (cs.CL)
Comments: in progress

Abstract:Large language models (LLMs) have shown promise in providing scalable mental health support, while evaluating their counseling capability remains crucial to ensure both efficacy and safety. Existing evaluations are limited by the static assessment that focuses on knowledge tests, the single perspective that centers on user experience, and the open-loop framework that lacks actionable feedback. To address these issues, we propose Ψ-Arena, an interactive framework for comprehensive assessment and optimization of LLM-based counselors, featuring three key characteristics: (1) Realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients, (2) Tripartite evaluation that integrates assessments from the client, counselor, and supervisor perspectives, and (3) Closed-loop optimization that iteratively improves LLM counselors using diagnostic feedback. Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives. Moreover, reflection-based optimization results in up to a 141% improvement in counseling performance. We hope PsychoArena provides a foundational resource for advancing reliable and human-aligned LLM applications in mental healthcare.

[NLP-21] SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation IJCAI2025

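[Quick Read]: This paper addresses the tendency of existing speech separation techniques to produce artifacts or distortions in complex real-world environments such as noisy and reverberant settings. The key to its solution is SepALM, which uses an audio language model (ALM) to rectify and re-synthesize preliminarily separated speech within the text domain. By integrating an end-to-end error correction mechanism, it mitigates error accumulation and avoids the optimization difficulties of conventional methods that combine automatic speech recognition (ASR) with large language models (LLMs).

Link: https://arxiv.org/abs/2505.03273
Authors: Zhaoxi Mu, Xinyu Yang, Gang Wang
Affiliations: Xi’an Jiaotong University
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Appears in IJCAI 2025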

Abstract:While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. By integrating an ALM-based end-to-end error correction mechanism, we mitigate the risk of error accumulation and circumvent the optimization hurdles typically encountered in conventional methods that amalgamate automatic speech recognition (ASR) with large language models (LLMs). Additionally, we have developed Chain-of-Thought (CoT) prompting and knowledge distillation techniques to facilitate the reasoning and training processes of the ALM. Our experiments substantiate that SepALM not only elevates the precision of speech separation but also markedly bolsters adaptability in novel acoustic environments.

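[NLP-22] Survey of Abstract Meaning Representation: Then, Now, Future

[Quick Read]: This survey examines Abstract Meaning Representation (AMR), a semantic representation framework, and its applications in natural language processing, focusing on its capabilities, existing approaches to parsing (text-to-AMR) and generation (AMR-to-text), and future directions. AMR models the meaning of a sentence as a rooted, directed acyclic graph whose nodes are concepts and whose edges are relations, which captures the semantics of complex sentences and supports applications such as text generation, text classification, and information extraction.

Link: https://arxiv.org/abs/2505.03229
Authors: Behrooz Mansouri
Affiliations: AIIR Lab, University of Southern Maine
Categories: Computation and Language (cs.CL)
Comments: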

Abstract:This paper presents a survey of Abstract Meaning Representation (AMR), a semantic representation framework that captures the meaning of sentences through a graph-based structure. AMR represents sentences as rooted, directed acyclic graphs, where nodes correspond to concepts and edges denote relationships, effectively encoding the meaning of complex sentences. This survey investigates AMR and its extensions, focusing on AMR capabilities. It then explores the parsing (text-to-AMR) and generation (AMR-to-text) tasks by showing traditional, current, and possible futures approaches. It also reviews various applications of AMR including text generation, text classification, and information extraction and information seeking. By analyzing recent developments and challenges in the field, this survey provides insights into future directions for research and the potential impact of AMR on enhancing machine understanding of human language.
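
The graph structure described above can be made concrete with the standard textbook AMR for "The boy wants to go", written in PENMAN notation as `(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))`. The sketch below stores that graph as plain Python data and checks the "rooted, directed acyclic" property; the dictionary layout and the helper `is_rooted_dag` are illustrative choices, not something taken from the survey.

```python
# Nodes are variables mapped to concepts; edges are (source, role, target).
concepts = {"w": "want-01", "b": "boy", "g": "go-01"}
edges = [("w", "ARG0", "b"), ("w", "ARG1", "g"), ("g", "ARG0", "b")]
root = "w"


def is_rooted_dag(root, concepts, edges):
    """True if every node is reachable from `root` and no cycle exists."""
    children = {v: [] for v in concepts}
    for src, _role, dst in edges:
        children[src].append(dst)
    state = {}  # 1 = currently on the DFS stack, 2 = fully explored

    def visit(v):
        if state.get(v) == 1:  # back edge: a cycle
            return False
        if state.get(v) == 2:  # already explored (a re-entrant node)
            return True
        state[v] = 1
        ok = all(visit(c) for c in children[v])
        state[v] = 2
        return ok

    return visit(root) and len(state) == len(concepts)


valid = is_rooted_dag(root, concepts, edges)
```

Note that `b` has two incoming edges: the boy is both the one who wants and the one who goes. This re-entrancy is why AMR is a graph rather than a tree.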

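[NLP-23] Improving Model Alignment Through Collective Intelligence of Open-Source LLMs ICML2025

[Quick Read]: This paper addresses the reliance on high-quality human-labeled data when building helpful and harmless large language models (LLMs), which makes dataset construction expensive and hard to scale and may limit diversity and generalization. The key to its solution is Mixture of Agents Alignment (MoAA), which leverages the collective strengths of multiple language models to generate high-quality synthetic data for alignment, improving both supervised fine-tuning and preference optimization. In the reported evaluation, MoAA raises the win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2, and the authors present evidence that it can advance open-source LLMs without relying on stronger external supervision.

Link: https://arxiv.org/abs/2505.03059
Authors: Junlin Wang, Roy Xie, Shang Zhu, Jue Wang, Ben Athiwaratkun, Bhuwan Dhingra, Shuaiwen Leon Song, Ce Zhang, James Zou
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ICML 2025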

Abstract:Building helpful and harmless large language models (LLMs) requires effective model alignment approach based on human instructions and feedback, which necessitates high-quality human-labeled data. Constructing such datasets is often expensive and hard to scale, and may face potential limitations on diversity and generalization. To address these challenges, we introduce Mixture of Agents Alignment (MoAA), that leverages the collective strengths of various language models to provide high-quality data for model alignment. By employing MoAA, we enhance both supervised fine-tuning and preference optimization, leading to improved performance compared to using a single model alone to generate alignment data (e.g. using GPT-4o alone). Evaluation results show that our approach can improve win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2, highlighting a promising direction for model alignment through this new scalable and diverse synthetic data recipe. Furthermore, we demonstrate that MoAA enables a self-improvement pipeline, where models finetuned on MoA-generated data surpass their own initial capabilities, providing evidence that our approach can push the frontier of open-source LLMs without reliance on stronger external supervision. Data and code will be released.

[NLP-24] BLAB: Brutally Long Audio Bench

[Quick Read]: This paper addresses the poor performance of current large audio language models (audio LMs) on long-form conversational speech, particularly on localization, duration estimation, emotion, and counting tasks. Prior work has mostly evaluated short audio segments (typically under 30 seconds) and has not adequately explored long-form speech that more closely reflects natural user interactions. The key to its solution is Brutally Long Audio Bench (BLAB), a benchmark that uses audio segments averaging 51 minutes in length, comprises 833+ hours of diverse audio, and pairs each clip with human-annotated text questions and answers to evaluate long-form audio understanding.

Link: https://arxiv.org/abs/2505.03054
Authors: Orevaoghene Ahia, Martijn Bartelds, Kabir Ahuja, Hila Gonen, Valentin Hofmann, Siddhant Arora, Shuyue Stella Li, Vishal Puttagunta, Mofetoluwa Adeyemi, Charishma Buchireddy, Ben Walls, Noah Bennett, Shinji Watanabe, Noah A. Smith, Yulia Tsvetkov, Sachin Kumar
Affiliations: University of Washington; Stanford University; The Ohio State University; Allen Institute for AI; Carnegie Mellon University; University of Waterloo
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, counting, and struggle to understand non-phonemic information, relying more on prompts than audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities.

[NLP-25] Developing A Framework to Support Human Evaluation of Bias in Generated Free Response Text

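[Quick Read]: This paper addresses bias evaluation of large language models (LLMs) in real-world deployments, where the interplay of task-specific prompts and experiential context can invalidate conventional evaluations based on short-context, fixed-choice benchmarks. The key to its solution is a semi-automated bias evaluation framework for free-text responses with human insights at its core: the authors develop an operational definition of bias that helps automate the pipeline, propose a methodology for classifying bias beyond multiple choice, and use human evaluation to uncover problematic templates in a bias benchmark.

Link: https://arxiv.org/abs/2505.03053
Authors: Jennifer Healey, Laurie Byrum, Md Nadeem Akhtar, Surabhi Bhargava, Moumita Sinha
Affiliations: Adobe
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, no figures, presented at CHI 2025 workshop for Human Evaluation and Auditing of Language Models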

Abstract:LLM evaluation is challenging even the case of base models. In real world deployments, evaluation is further complicated by the interplay of task specific prompts and experiential context. At scale, bias evaluation is often based on short context, fixed choice benchmarks that can be rapidly evaluated, however, these can lose validity when the LLMs’ deployed context differs. Large scale human evaluation is often seen as too intractable and costly. Here we present our journey towards developing a semi-automated bias evaluation framework for free text responses that has human insights at its core. We discuss how we developed an operational definition of bias that helped us automate our pipeline and a methodology for classifying bias beyond multiple choice. We additionally comment on how human evaluation helped us uncover problematic templates in a bias benchmark.

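[NLP-26] Teaching Models to Understand (but not Generate) High-risk Data

[Quick Read]: This paper addresses the problem that filtering high-risk content (such as toxic or copyrighted text) out of pre-training data limits a language model's ability to recognize and appropriately respond to harmful or sensitive content. The key to its solution is Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm that selectively avoids incentivizing the generation of high-risk tokens while keeping them in the model's context window, so that the model comes to understand high-risk content as it learns to predict the low-risk tokens that follow.

Link: https://arxiv.org/abs/2505.03052
Authors: Ryan Wang, Matthew Finlayson, Luca Soldaini, Swabha Swayamdipta, Robin Jia
Affiliations: University of Southern California; Allen Institute for AI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: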

Abstract:Language model developers typically filter out high-risk content – such as toxic or copyrighted text – from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models’ ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model’s context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models’ understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
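
The selective-loss idea can be sketched in a few lines. The abstract does not give the exact loss, so the version below takes the simplest reading as an assumption: target tokens flagged as high-risk are masked out of the next-token loss, while low-risk targets are trained as usual. The function name `selective_nll` and the per-token risk mask are hypothetical.

```python
import math


def selective_nll(token_probs, is_high_risk):
    """Average negative log-likelihood over low-risk target tokens only.

    token_probs[i]  : probability the model assigned to the i-th target token
    is_high_risk[i] : True if the i-th target token is flagged as high-risk

    High-risk tokens would still appear in the context used to predict later
    tokens; in this assumed simplification they just add no loss term.
    """
    kept = [-math.log(p) for p, risky in zip(token_probs, is_high_risk) if not risky]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


probs = [0.5, 0.1, 0.25]
mask = [False, True, False]
loss = selective_nll(probs, mask)  # averages -log(0.5) and -log(0.25) only
```

Because the middle token contributes nothing to the loss, the model is never rewarded for producing it, yet the third token's prediction still conditions on it.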

[NLP-27] Radio: Rate-Distortion Optimization for Large Language Model Compression ICML2025

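[Quick Read]: This paper addresses the deployment of large language models (LLMs) on resource-limited devices and the reduction of compute costs and environmental footprint through model compression. The key to its solution is to establish the foundations of LLM quantization from a rate-distortion theory perspective and to propose a quantization technique based on simple rate-distortion optimization, which scales to models with hundreds of billions of parameters and lets users compress models post-training to a specified model size or accuracy.

Link: https://arxiv.org/abs/2505.03031
Authors: Sean I. Young
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to ICML 2025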

Abstract:In recent years, the compression of large language models (LLMs) has emerged as a key problem in facilitating LLM deployment on resource-limited devices, reducing compute costs, and mitigating the environmental footprint due to large-scale AI infrastructure. Here, we establish the foundations of LLM quantization from a rate-distortion theory perspective and propose a quantization technique based on simple rate-distortion optimization. Our technique scales to models containing hundreds of billions of weight parameters and offers users the flexibility to compress models, post-training, to a model size or accuracy specified by the user.

[NLP-28] UCSC at SemEval-2025 Task 3: Context Models and Prompt Optimization for Automated Hallucination Detection in LLM Output

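[Quick Read]: This paper addresses hallucinations produced by large language models (LLMs) when answering knowledge-intensive queries, and in particular how to pinpoint exactly where in the LLM output they occur. The key to its solution is a framework that first retrieves relevant context, then identifies false content in the answer, and finally maps it back to spans in the LLM output; automatic prompt optimization further improves the process.

Link: https://arxiv.org/abs/2505.03030
Authors: Sicong Huang, Jincheng He, Shiyuan Huang, Karthik Raja Anandan, Arkajyoti Chakraborty, Ian Lane
Affiliations: University of California, Santa Cruz
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 1 figure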

Abstract:Hallucinations pose a significant challenge for large language models when answering knowledge-intensive queries. As LLMs become more widely adopted, it is crucial not only to detect if hallucinations occur but also to pinpoint exactly where in the LLM output they occur. SemEval 2025 Task 3, Mu-SHROOM: Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, is a recent effort in this direction. This paper describes the UCSC system submission to the shared Mu-SHROOM task. We introduce a framework that first retrieves relevant context, next identifies false content from the answer, and finally maps them back to spans in the LLM output. The process is further enhanced by automatically optimizing prompts. Our system achieves the highest overall performance, ranking #1 in average position across all languages. We release our code and experiment results.

[NLP-29] A Typology of Synthetic Datasets for Dialogue Processing in Clinical Contexts

[Quick Read]: This paper addresses the lack of theory to guide how synthetic datasets for clinical dialogue tasks are best used and generalized to new applications. The key to its solution is a novel typology for classifying the types and degrees of data synthesis, which facilitates comparison and evaluation across synthetic datasets.

Link: https://arxiv.org/abs/2505.03025
Authors: Steven Bedrick, A. Seza Doğruöz, Sergiu Nisioi
Affiliations: Oregon Health and Science University; IDLab; Universiteit Gent; HLT Research Center; University of Bucharest
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Synthetic data sets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is that of clinical (healthcare) contexts, where there exist significant and long-standing challenges (e.g., privacy, anonymization, and data governance) which have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is that of clinical dialogues which are especially sensitive and difficult to collect, and as such are commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may be best used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated and being used for dialogue related tasks in the medical domain. Additionally, we propose a novel typology for use in classifying types and degrees of data synthesis, to facilitate comparison and evaluation.

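[NLP-30] Memorization or Interpolation? Detecting LLM Memorization through Input Perturbation Analysis

[Quick Read]: This paper addresses memorization in large language models (LLMs), where a model reproduces training data verbatim rather than truly generalizing, which raises concerns about data privacy, intellectual property, and the reliability of model evaluations. The key to its solution is PEARL, which detects memorization by assessing how sensitive the model's performance is to input perturbations, without requiring access to the model's internals, and thereby distinguishes true generalization from memorization.

Link: https://arxiv.org/abs/2505.03019
Authors: Albérick Euraste Djiré, Abdoul Kader Kaboré, Earl T. Barr, Jacques Klein, Tegawendé F. Bissyandé
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: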

Abstract:While Large Language Models (LLMs) achieve remarkable performance through training on massive datasets, they can exhibit concerning behaviors such as verbatim reproduction of training data rather than true generalization. This memorization phenomenon raises significant concerns about data privacy, intellectual property rights, and the reliability of model evaluations. This paper introduces PEARL, a novel approach for detecting memorization in LLMs. PEARL assesses how sensitive an LLM’s performance is to input perturbations, enabling memorization detection without requiring access to the model’s internals. We investigate how input perturbations affect the consistency of outputs, enabling us to distinguish between true generalization and memorization. Our findings, following extensive experiments on the Pythia open model, provide a robust framework for identifying when the model simply regurgitates learned information. Applied on the GPT 4o models, the PEARL framework not only identified cases of memorization of classic texts from the Bible or common code from HumanEval but also demonstrated that it can provide supporting evidence that some data, such as from the New York Times news articles, were likely part of the training data of a given model.

[NLP-31] RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

[Quick Read]: This paper addresses how to efficiently convert softmax-attention transformers into linear attention decoder models. The key to its solution is RADLADS, a protocol that converts large pretrained models (such as Qwen2.5) into linear attention models using only 350-700M tokens; converting the 72B model costs less than $2,000 USD, and quality at inference remains close to that of the original transformer.

Link: https://arxiv.org/abs/2505.03005
Authors: Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
Affiliations: Recursal AI; EleutherAI; Dalle Molle Institute for Artificial Intelligence USI-SUPSI; George Mason University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today’s prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the Qwen License Agreement. Models at this https URL Training Code at this https URL
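
To put the token budget in perspective, the arithmetic below works out what the abstract's two figures imply when taken together. It is only a back-of-the-envelope check that pairs the upper bound of 700M conversion tokens with the stated ceiling of 0.005%; that pairing is an assumption, the variable names are illustrative, and no figure beyond those two is used.

```python
# Figures taken from the abstract: at most 700M conversion tokens, stated to
# be less than 0.005% of the token count used to train the teacher model.
conversion_tokens = 700e6
max_fraction = 0.005 / 100  # 0.005% expressed as a fraction

# If 700M tokens is under 0.005% of the total, the teacher must have been
# trained on more than this many tokens.
implied_min_teacher_tokens = conversion_tokens / max_fraction
print(f"{implied_min_teacher_tokens:.2e}")  # prints 1.40e+13
```

In other words, the two figures are consistent only with a teacher trained on more than roughly 14 trillion tokens.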

[NLP-32] Logits-Constrained Framework with RoBERTa for Ancient Chinese NER

[Quick Read]: This paper addresses the enforcement of label-transition constraints in Ancient Chinese Named Entity Recognition (NER), with the aim of improving performance in high-label or large-data settings. The key to its solution is a Logits-Constrained (LC) framework that combines GujiRoBERTa for contextual encoding with a differentiable decoding mechanism that enforces valid BMES label transitions.

Link: https://arxiv.org/abs/2505.02983
Authors: Wenjie Hua, Shenghan Xu
Affiliations: Wuhan University; Peking University
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 2 figures, 6 tables. Accepted to EvaHan 2025 shared task on Ancient Chinese NLP

Abstract:This paper presents a Logits-Constrained (LC) framework for Ancient Chinese Named Entity Recognition (NER), evaluated on the EvaHan 2025 benchmark. Our two-stage model integrates GujiRoBERTa for contextual encoding and a differentiable decoding mechanism to enforce valid BMES label transitions. Experiments demonstrate that LC improves performance over traditional CRF and BiLSTM-based approaches, especially in high-label or large-data settings. We also propose a model selection criterion balancing label complexity and dataset size, providing practical guidance for real-world Ancient Chinese NLP tasks.
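
To illustrate what "valid BMES label transitions" means, the sketch below decodes per-position scores with a plain Viterbi search in which invalid transitions are forbidden. This is a stand-in technique, not the paper's differentiable decoding mechanism, and it uses untyped B/M/E/S tags only (no entity types or outside tag); the names `ALLOWED` and `constrained_decode` are hypothetical.

```python
TAGS = ["B", "M", "E", "S"]  # Begin, Middle, End, Single
# Which tags may follow each tag: a span opened by B must be closed by E.
ALLOWED = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}
START_OK = {"B", "S"}
END_OK = {"E", "S"}
NEG_INF = float("-inf")


def constrained_decode(logits):
    """Best-scoring valid tag sequence; logits[t][j] scores TAGS[j] at step t."""
    if not logits:
        return []
    score = [logits[0][j] if TAGS[j] in START_OK else NEG_INF for j in range(4)]
    back = []
    for t in range(1, len(logits)):
        new_score, ptr = [], []
        for j, tag in enumerate(TAGS):
            # Best predecessor among tags that are allowed to precede `tag`.
            best_i = max(
                (i for i in range(4) if tag in ALLOWED[TAGS[i]]),
                key=lambda i: score[i],
            )
            new_score.append(score[best_i] + logits[t][j])
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    last = max((j for j in range(4) if TAGS[j] in END_OK), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return [TAGS[j] for j in reversed(path)]


# Per-position argmax would give B, B, E, which is not a valid sequence.
tags = constrained_decode([[2, 0, 0, 1], [2, 1.5, 0, 0], [0, 0, 2, 0]])
```

Here the constraint changes the second tag from `B` to `M`, turning an invalid sequence into a single three-character entity.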

[NLP-33] Iterative Resolution of Prompt Ambiguities Using a Progressive Cutting-Search Approach

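[Quick Read]: This paper addresses the imprecision caused by the inherent ambiguity of natural language instructions given to generative AI, which typically forces users to iteratively test, correct, and resubmit their prompts. The key to its solution is an iterative approach that systematically narrows down these ambiguities through a structured series of clarification questions and alternative solution proposals until every uncertainty is resolved, at which point a final, precise solution is generated.

Link: https://arxiv.org/abs/2505.02952
Authors: Fabrizio Marozzo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: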

Abstract:Generative AI systems have revolutionized human interaction by enabling natural language-based coding and problem solving. However, the inherent ambiguity of natural language often leads to imprecise instructions, forcing users to iteratively test, correct, and resubmit their prompts. We propose an iterative approach that systematically narrows down these ambiguities through a structured series of clarification questions and alternative solution proposals, illustrated with input/output examples as well. Once every uncertainty is resolved, a final, precise solution is generated. Evaluated on a diverse dataset spanning coding, data analysis, and creative writing, our method demonstrates superior accuracy, competitive resolution times, and higher user satisfaction compared to conventional one-shot solutions, which typically require multiple manual iterations to achieve a correct output.

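[NLP-34] The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models

[Quick Read]: This paper addresses how to balance generating multiple patches against multiple rounds of iterative refinement in automatic program repair (APR). Earlier approaches generated large numbers of patches to improve repair results, while recent LLM-based approaches rely on self-iteration over multiple rounds. The key to its solution is an APR pipeline that combines multi-output generation with multi-round iteration under a limit of 10 total patches per bug. The study also shows that a small fraction (1%) of the fine-tuning dataset can improve the number of plausible patches by up to 78%, and that the benefit of iterative generation is more pronounced on complex benchmarks.

Link: https://arxiv.org/abs/2505.02931
Authors: Fernando Vallecillos Ruiz, Max Hort, Leon Moonen
Affiliations: Simula Research Laboratory
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for publication in the research track of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE), 17-20 June 2025, Istanbul, Türkiye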

Abstract:Automatic program repair (APR) aims to reduce the manual efforts required to identify and fix errors in source code. Before the rise of LLM-based agents, a common strategy was to increase the number of generated patches, sometimes to the thousands, to achieve better repair results on benchmarks. More recently, self-iterative capabilities enabled LLMs to refine patches over multiple rounds guided by feedback. However, literature often focuses on many iterations and disregards different numbers of outputs. We investigate an APR pipeline that balances these two approaches, the generation of multiple outputs and multiple rounds of iteration, while imposing a limit of 10 total patches per bug. We apply three SOTA instruction-tuned LLMs - DeepSeekCoder-Instruct, Codellama-Instruct, Llama3.1-Instruct - to the APR task. We further fine-tune each model on an APR dataset with three sizes (1K, 30K, 65K) and two techniques (Full Fine-Tuning and LoRA), allowing us to assess their repair capabilities on two APR benchmarks: HumanEval-Java and Defects4J. Our results show that by using only a fraction (1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of plausible patches generated, challenging prior studies that reported limited gains using Full Fine-Tuning. However, we find that exceeding certain thresholds leads to diminishing outcomes, likely due to overfitting. Moreover, we show that base models greatly benefit from creating patches in an iterative fashion rather than generating them all at once. In addition, the benefit of iterative strategies becomes more pronounced in complex benchmarks. Even fine-tuned models, while benefiting less from iterations, still gain advantages, particularly on complex benchmarks. The research underscores the need for balanced APR strategies that combine multi-output generation and iterative refinement. 
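
One way to picture the trade-off is to list the ways a fixed budget of 10 patches can be split between outputs per round and rounds of iteration. The sketch below assumes equal-sized rounds that use the whole budget, which is a simplification; the abstract does not state which splits the authors evaluated, and `budget_splits` is a hypothetical helper.

```python
def budget_splits(total_patches=10):
    """All (outputs_per_round, rounds) pairs whose product is the budget."""
    return [
        (outputs, total_patches // outputs)
        for outputs in range(1, total_patches + 1)
        if total_patches % outputs == 0
    ]


splits = budget_splits(10)
for outputs, rounds in splits:
    print(f"{outputs} patch(es) per round x {rounds} round(s)")
```

The two extremes correspond to the strategies contrasted in the abstract: ten rounds of one patch each is purely iterative, while one round of ten patches generates everything at once.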

[NLP-35] When Your Own Output Becomes Your Training Data: Noise-to-Meaning Loops and a Formal RSI Trigger

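[Quick Read]: This paper addresses unbounded growth of an AI agent's internal complexity under self-feedback by presenting a minimal formal model, Noise-to-Meaning Recursive Self-Improvement (N2M-RSI). Its central result is that once an agent feeds its own outputs back as inputs and crosses an explicit information-integration threshold, its internal complexity grows without bound under the authors' assumptions. The framework unifies earlier ideas on self-prompting large language models, Gödelian self-reference, and AutoML while remaining implementation-agnostic, and it hints at super-linear effects once communication among instances is permitted.

Link: https://arxiv.org/abs/2505.02888
Authors: Rintaro Ando
Affiliations: The University of Tokyo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, 4 figures, 3 tables. Code: this http URL (v1.0)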

Abstract:We present Noise-to-Meaning Recursive Self-Improvement (N2M-RSI), a minimal formal model showing that once an AI agent feeds its own outputs back as inputs and crosses an explicit information-integration threshold, its internal complexity will grow without bound under our assumptions. The framework unifies earlier ideas on self-prompting large language models, Gödelian self-reference, and AutoML, yet remains implementation-agnostic. The model furthermore scales naturally to interacting swarms of agents, hinting at super-linear effects once communication among instances is permitted. For safety reasons, we omit system-specific implementation details and release only a brief, model-agnostic toy prototype in Appendix C.

[NLP-36] Decoding Open-Ended Information Seeking Goals from Eye Movements in Reading

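[Quick read]: This paper asks whether a reader's open-ended reading goals can be automatically decoded from eye movements during reading. The key is to introduce goal classification and goal reconstruction tasks with accompanying evaluation frameworks, and to use large-scale English eye-tracking data to develop and compare several discriminative and generative multimodal LLMs that combine eye movements with text, so that readers' text-specific goals can be identified and reconstructed.

Link: https://arxiv.org/abs/2505.02872
Authors: Cfir Avraham Hadar,Omer Shubi,Yoav Meiri,Yevgeni Berzak
Affiliations: Technion - Israel Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: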

Abstract:When reading, we often have specific information that interests us in a text. For example, you might be reading this paper because you are curious about LLMs for eye movements in reading, the experimental design, or perhaps you only care about the question “but does it work?”. More broadly, in daily life, people approach texts with any number of text-specific goals that guide their reading behavior. In this work, we ask, for the first time, whether open-ended reading goals can be automatically decoded from eye movements in reading. To address this question, we introduce goal classification and goal reconstruction tasks and evaluation frameworks, and use large-scale eye tracking for reading data in English with hundreds of text-specific information seeking tasks. We develop and compare several discriminative and generative multimodal LLMs that combine eye movements and text for goal classification and goal reconstruction. Our experiments show considerable success on both tasks, suggesting that LLMs can extract valuable information about the readers’ text-specific goals from eye movements.

[NLP-37] Accelerating Large Language Model Reasoning via Speculative Search ICML2025

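[Quick read]: This paper targets the substantial inference latency of tree-search-based reasoning in large language models (LLMs), which stems from generating many intermediate reasoning steps ("thoughts") and limits practical use. The key is Speculative Search (SpecSearch), a framework that accelerates LLM reasoning by optimizing thought generation. SpecSearch has a small model collaborate with a large model at both the thought and token levels to generate high-quality reasoning steps efficiently, and it introduces a quality-preserving rejection mechanism that filters out thoughts whose quality falls below that of the large model's outputs.

Link: https://arxiv.org/abs/2505.02865
Authors: Zhihai Wang,Jie Wang,Jilai Pan,Xilin Xia,Huiling Zhen,Mingxuan Yuan,Jianye Hao,Feng Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML2025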

Abstract:Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model’s outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12× speedup with comparable reasoning quality.
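
The draft-then-reject idea can be illustrated with a toy sketch. This is not the SpecSearch algorithm: the fixed `threshold`, the length-based `score`, and the `fallback` callable are all hypothetical stand-ins, whereas the paper ties the acceptance bar to the quality of the large model's own outputs.

```python
def speculative_step(draft_thoughts, score, threshold, fallback):
    """Keep drafted thoughts scoring at least `threshold`; regenerate the rest."""
    accepted = []
    for thought in draft_thoughts:
        if score(thought) >= threshold:
            accepted.append(thought)            # cheap draft is good enough
        else:
            accepted.append(fallback(thought))  # stand-in for the large model
    return accepted


# Toy stand-ins: quality is string length, the "large model" appends a marker.
drafts = ["x = 2", "so", "therefore y = 4"]
result = speculative_step(drafts, score=len, threshold=5,
                          fallback=lambda t: t + " [revised]")
print(result)
```

The speedup in such a scheme comes from how often the cheap draft is accepted: only rejected thoughts pay the cost of the expensive model.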

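[NLP-38] Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs

[Quick read]: This paper studies the safety of large language models (LLMs) under jailbreak attacks, which can bypass a model's safety mechanisms. Existing studies mostly rely on brute-force optimization or manual design and struggle to uncover potential risks in real-world scenarios. The key is ICRT, a jailbreak attack framework inspired by heuristics and biases in human cognition. It uses the simplicity effect to perform cognitive decomposition that reduces the complexity of malicious prompts, and uses relevance bias to reorganize prompts, enhancing semantic alignment and inducing harmful outputs. The paper also introduces a ranking-based harmfulness metric that uses ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to quantify the harmfulness of generated content more comprehensively.

Link: https://arxiv.org/abs/2505.02862
Authors: Haoming Yang,Ke Ma,Xiaojun Jia,Yingfei Sun,Qianqian Xu,Qingming Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: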

Abstract:Despite the remarkable performance of Large Language Models (LLMs), they remain vulnerable to jailbreak attacks, which can compromise their safety mechanisms. Existing studies often rely on brute-force optimization or manual design, failing to uncover potential risks in real-world scenarios. To address this, we propose a novel jailbreak attack framework, ICRT, inspired by heuristics and biases in human cognition. Leveraging the simplicity effect, we employ cognitive decomposition to reduce the complexity of malicious prompts. Simultaneously, relevance bias is utilized to reorganize prompts, enhancing semantic alignment and inducing harmful outputs effectively. Furthermore, we introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm by employing ranking aggregation methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify the harmfulness of generated content. Experimental results show that our approach consistently bypasses mainstream LLMs’ safety mechanisms and generates high-risk content, providing insights into jailbreak attack risks and contributing to stronger defense strategies.
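
Of the ranking aggregation methods named above, Elo is the simplest to write down. The sketch below shows the standard Elo update applied to a list of pairwise comparisons; how the paper obtains those comparisons and which constants it uses are not stated in the abstract, so `k=32`, the starting rating of 1000, and the `aggregate` helper are assumptions.

```python
def elo_update(r_a, r_b, a_wins, k=32.0):
    """Return updated ratings after one comparison between items A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


def aggregate(items, comparisons, start=1000.0):
    """Fold a list of (winner, loser) pairs into one rating per item."""
    ratings = {item: start for item in items}
    for winner, loser in comparisons:
        ratings[winner], ratings[loser] = elo_update(
            ratings[winner], ratings[loser], True)
    return ratings


ratings = aggregate(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")])
print(sorted(ratings, key=ratings.get, reverse=True))
```

Unlike a binary success-or-failure label, the resulting ratings place every item on a common scale, which is what makes a graded comparison possible.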

[NLP-39] Enhancing ML Model Interpretability: Leveraging Fine-Tuned Large Language Models for Better Understanding of AI

[Quick read]: This paper addresses the interpretability problem caused by the increasingly black-box nature of machine learning (ML) models, in particular helping users understand model decisions in battery State-of-Health (SoH) prediction. The key is an interactive chatbot built on a fine-tuned large language model (LLM), which serves as a reference architecture for explainable AI (XAI) and improves the human interpretability of ML models, especially for users with little XAI experience.

Link: https://arxiv.org/abs/2505.02859
Authors: Jonas Bokstaller,Julia Altheimer,Julian Dormehl,Alina Buss,Jasper Wiltfang,Johannes Schneider,Maximilian Röglinger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Across various sectors, applications of eXplainable AI (XAI) gained momentum as the increasing black-boxedness of prevailing Machine Learning (ML) models became apparent. In parallel, Large Language Models (LLMs) significantly developed in their abilities to understand human language and complex patterns. By combining both, this paper presents a novel reference architecture for the interpretation of XAI through an interactive chatbot powered by a fine-tuned LLM. We instantiate the reference architecture in the context of State-of-Health (SoH) prediction for batteries and validate its design in multiple evaluation and demonstration rounds. The evaluation indicates that the implemented prototype enhances the human interpretability of ML, especially for users with less experience with XAI.

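[NLP-40] Towards High-Fidelity Synthetic Multi-platform Social Media Datasets via Large Language Models

[Quick read]: This paper addresses the difficulty of obtaining social media datasets, especially multi-platform ones, where costs and platform restrictions make high-quality cross-platform data hard to acquire. The key is to use large language models to generate lexically and semantically relevant synthetic social media data: a multi-platform topic-based prompting method generates synthetic data from real datasets, and the synthetic data is then compared with the real data, with the aim of approaching real-data quality.

Link: https://arxiv.org/abs/2505.02858
Authors: Henry Tari,Nojus Sereiva,Rishabh Kaushal,Thales Bertaglia,Adriana Iamnitchi
Affiliations: Maastricht University; Indira Gandhi Delhi Technical University for Women; Utrecht University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: arXiv admin note: text overlap with arXiv:2407.08323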

Abstract:Social media datasets are essential for research on a variety of topics, such as disinformation, influence operations, hate speech detection, or influencer marketing practices. However, access to social media datasets is often constrained due to costs and platform restrictions. Acquiring datasets that span multiple platforms, which is crucial for understanding the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real data. We propose multi-platform topic-based prompting and employ various language models to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings show that using large language models to generate synthetic multi-platform social media data is promising, different language models perform differently in terms of fidelity, and a post-processing approach might be needed for generating high-fidelity synthetic datasets for research. In addition to the empirical evaluation of three state of the art large language models, our contributions include new fidelity metrics specific to multi-platform social media datasets.

[NLP-41] Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets

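[Quick read]: This paper addresses the lack of reproducibility and reliability in generative AI systems whose behavior can drift after model updates or prompt revisions. The key is GPR-bench, a lightweight, extensible benchmark framework that combines a bilingual dataset spanning multiple tasks and scenarios with an automated evaluation pipeline to enable regression testing for general-purpose use cases. The framework uses "LLM-as-a-Judge" scoring of correctness and conciseness, providing a standardized tool for continuously monitoring and comparing model performance.

Link: https://arxiv.org/abs/2505.02854
Authors: Masumi Morishige,Ryo Koshihara
Affiliations: Galirage Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 10 figures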

Abstract:Reproducibility and reliability remain pressing challenges for generative AI systems whose behavior can drift with each model update or prompt revision. We introduce GPR-bench, a lightweight, extensible benchmark that operationalizes regression testing for general purpose use cases. GPR-bench couples an open, bilingual (English and Japanese) dataset covering eight task categories (e.g., text generation, code generation, and information retrieval) and 10 scenarios in each task category (80 total test cases for each language) with an automated evaluation pipeline that employs “LLM-as-a-Judge” scoring of correctness and conciseness. Experiments across three recent model versions - gpt-4o-mini, o3-mini, and o4-mini - and two prompt configurations (default versus concise-writing instruction) reveal heterogeneous quality. Our results show that newer models generally improve correctness, but the differences are modest and not statistically significant, suggesting that GPR-bench may not be sufficiently challenging to differentiate between recent model versions. In contrast, the concise-writing instruction significantly enhances conciseness (+12.37 pp, Mann-Whitney U test: p < 0.001, effect size r = 0.2995) with minimal degradations on accuracy (-1.7 pp), demonstrating the effectiveness of prompt engineering. Released under the MIT License, GPR-bench lowers the barrier to initiating reproducibility monitoring and provides a foundation for community-driven extensions, while also raising important considerations about benchmark design for rapidly evolving language models.
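
The comparison of two prompt configurations with a Mann-Whitney U test can be sketched in plain Python. The sketch uses the normal approximation without a tie correction in the variance, and it computes the effect size as r = |z| / sqrt(N), which is one common convention; the abstract does not say which formula the authors used, and `mann_whitney_r` is a hypothetical helper rather than part of GPR-bench.

```python
import math


def mann_whitney_r(xs, ys):
    """Return (U, z, r) for two samples using the normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((xs, ys)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    n1, n2 = len(xs), len(ys)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # no tie correction
    z = (u - mean_u) / sd_u
    return u, z, abs(z) / math.sqrt(n1 + n2)


# Hypothetical conciseness scores for a default and a concise-writing prompt.
default_scores = [1, 2, 3]
concise_scores = [4, 5, 6]
print(mann_whitney_r(default_scores, concise_scores))
```

In a regression-testing workflow, the same function could be run on the score lists from two model versions, with a large |z| flagging a shift worth inspecting.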

[NLP-42] 30DayGen: Leveraging LLMs to Create a Content Corpus for Habit Formation ACL

[Quick read]: This paper addresses the difficulty users have in breaking goals into actionable steps and tracking progress during habit formation. The key is the 30DAYGEN system in the 30 Day Me app, which uses large language models (LLMs) to generate 3,531 unique 30-day challenges from a large set of webpages and supports runtime search aligned with user-defined goals, enabling rapid construction of domain-specific content with semantic deduplication.

Link: https://arxiv.org/abs/2505.02851
Authors: Franklin Zhang,Sonya Zhang,Alon Halevy
Affiliations: Bellevue College; Eastside Preparatory School; Google Cloud
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 8 pages (main content), 4 figures. Submitted to ACL BEA2025

Abstract:In this paper, we present 30 Day Me, a habit formation application that leverages Large Language Models (LLMs) to help users break down their goals into manageable, actionable steps and track their progress. Central to the app is the 30DAYGEN system, which generates 3,531 unique 30-day challenges sourced from over 15K webpages, and enables runtime search of challenge ideas aligned with user-defined goals. We showcase how LLMs can be harnessed to rapidly construct domain specific content corpora for behavioral and educational purposes, and propose a practical pipeline that incorporates effective LLM enhanced approaches for content generation and semantic deduplication.
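
The deduplication step can be sketched as a greedy filter over a similarity score. The paper describes LLM-enhanced semantic deduplication; the sketch below swaps in plain token-set Jaccard similarity as a much cruder stand-in, and the threshold of 0.6 and the example challenges are invented for illustration.

```python
def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def deduplicate(challenges, threshold=0.6):
    """Keep each challenge unless it is too similar to one already kept."""
    kept = []
    for text in challenges:
        if all(jaccard(text, prior) < threshold for prior in kept):
            kept.append(text)
    return kept


ideas = ["Drink eight glasses of water daily",
         "Drink eight glasses of water every day",
         "Read ten pages daily"]
print(deduplicate(ideas))
```

Replacing `jaccard` with a similarity computed from sentence embeddings would keep the same greedy structure while catching paraphrases that share few words.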

[NLP-43] Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors

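[Quick read]: This paper addresses the difficulty of generating high-quality multiple choice questions (MCQs), particularly ones that target different cognitive levels and use common misconceptions as distractors. Doing so is time-consuming and requires expertise, which makes manual creation at scale impractical. Existing automated methods typically generate only low-cognitive-level questions and fail to incorporate domain-specific misconceptions. The key is a hierarchical concept map-based framework that supplies structured knowledge to guide a large language model (LLM) in generating MCQs with distractors. The framework first builds a hierarchical concept map covering major physics topics and their interconnections, then uses an automated pipeline to retrieve the relevant sections as structured context for the LLM to generate questions and distractors, and finally runs automated validation to ensure the generated MCQs meet the requirements.

Link: https://arxiv.org/abs/2505.02850
Authors: Nicy Scaria,Silvester John Joseph Kennedy,Diksha Seth,Ananya Thakur,Deepak Subramani
Affiliations: Indian Institute of Science, Bengaluru
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB)
Comments: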

Abstract:Generating high-quality MCQs, especially those targeting diverse cognitive levels and incorporating common misconceptions into distractor design, is time-consuming and expertise-intensive, making manual creation impractical at scale. Current automated approaches typically generate questions at lower cognitive levels and fail to incorporate domain-specific misconceptions. This paper presents a hierarchical concept map-based framework that provides structured knowledge to guide LLMs in generating MCQs with distractors. We chose high-school physics as our test domain and began by developing a hierarchical concept map covering major Physics topics and their interconnections with an efficient database design. Next, through an automated pipeline, topic-relevant sections of these concept maps are retrieved to serve as a structured context for the LLM to generate questions and distractors that specifically target common misconceptions. Lastly, an automated validation is completed to ensure that the generated MCQs meet the requirements provided. We evaluate our framework against two baseline approaches: a base LLM and a RAG-based generation. We conducted expert evaluations and student assessments of the generated MCQs. Expert evaluation shows that our method significantly outperforms the baseline approaches, achieving a success rate of 75.20% in meeting all quality criteria compared to approximately 37% for both baseline methods. Student assessment data reveal that our concept map-driven approach achieved a significantly lower guess success rate of 28.05% compared to 37.10% for the baselines, indicating a more effective assessment of conceptual understanding. The results demonstrate that our concept map-based approach enables robust assessment across cognitive levels and instant identification of conceptual gaps, facilitating faster feedback loops and targeted interventions at scale.
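
Retrieving a topic-relevant section of a hierarchical concept map and rendering it as prompt context might look like the sketch below. The nested-dictionary layout, the tiny physics map, and the helper names are all hypothetical; the paper stores its concept map in a database whose schema the abstract does not describe.

```python
# Hypothetical toy concept map: each key is a topic, each value its subtopics.
CONCEPT_MAP = {
    "Mechanics": {"Kinematics": {"Velocity": {}, "Acceleration": {}},
                  "Dynamics": {"Newton's laws": {}}},
    "Optics": {"Refraction": {}},
}


def find_subtree(tree, topic):
    """Depth-first search for `topic`; return its subtree or None."""
    for name, children in tree.items():
        if name == topic:
            return children
        found = find_subtree(children, topic)
        if found is not None:
            return found
    return None


def as_context(topic, subtree, depth=0):
    """Render a topic and its descendants as indented lines for a prompt."""
    lines = ["  " * depth + "- " + topic]
    for child, grandchildren in subtree.items():
        lines.extend(as_context(child, grandchildren, depth + 1))
    return lines


print("\n".join(as_context("Kinematics", find_subtree(CONCEPT_MAP, "Kinematics"))))
```

The rendered lines would then be placed in the LLM prompt so that the generated question and its distractors stay within the retrieved part of the map.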

[NLP-44] Aligning Large Language Models with Healthcare Stakeholders: A Pathway to Trustworthy AI Integration

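[Quick read]: This paper addresses the alignment between the outputs of large language models (LLMs) and the preferences of healthcare stakeholders, which is a crucial foundation for empowering healthcare workflows effectively, safely, and responsibly. The key is active stakeholder participation throughout the life cycle of LLM adoption, including training data curation, model training, and inference. By enhancing healthcare knowledge integration, task understanding, and human guidance, LLMs can better follow human values, which improves human-AI alignment and supports trustworthy real-world healthcare applications.

Link: https://arxiv.org/abs/2505.02848
Authors: Kexin Ding,Mu Zhou,Akshay Chaudhari,Shaoting Zhang,Dimitris N. Metaxas
Affiliations: Rutgers University; Stanford University; Shanghai Artificial Intelligence Laboratory
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: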

Abstract:The wide exploration of large language models (LLMs) raises the awareness of alignment between healthcare stakeholder preferences and model outputs. This alignment becomes a crucial foundation to empower the healthcare workflow effectively, safely, and responsibly. Yet the varying behaviors of LLMs may not always match with healthcare stakeholders’ knowledge, demands, and values. To enable a human-AI alignment, healthcare stakeholders will need to perform essential roles in guiding and enhancing the performance of LLMs. Human professionals must participate in the entire life cycle of adopting LLM in healthcare, including training data curation, model training, and inference. In this review, we discuss the approaches, tools, and applications of alignments between healthcare stakeholders and LLMs. We demonstrate that LLMs can better follow human values by properly enhancing healthcare knowledge integration, task understanding, and human guidance. We provide outlooks on enhancing the alignment between humans and LLMs to build trustworthy real-world healthcare applications.

[NLP-45] Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models

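[Quick read]: This paper asks how to evaluate a large language model's (LLM's) understanding of human emotion and social cognition rather than its text-processing ability alone. Existing evaluation methods do not adequately capture a model's ability to track emotional changes and inner mental states over multi-turn conversations. The key is SAGE (Sentient Agent as a Judge), a framework that builds a "Sentient Agent" simulating human-like emotional changes and inner thoughts. During interaction, the agent reasons about how its emotion changes, how it feels, and how it should reply, producing an interpretable emotion trajectory and dialogue content for a more realistic evaluation of LLM social cognition.

Link: https://arxiv.org/abs/2505.02847
Authors: Bang Zhang,Ruotian Ma,Qingxuan Jiang,Peisong Wang,Jiaqi Chen,Zheng Xie,Xingyu Chen,Yue Wang,Fanghua Ye,Jian Li,Yifan Yang,Zhaopeng Tu,Xiaolong Li
Affiliations: Tencent
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: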

Abstract:Assessing how well a large language model (LLM) understands human, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM’s higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.

Computer Vision

[CV-0] Multi-Agent System for Comprehensive Soccer Understanding

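[Quick read]: This paper addresses three limitations of current AI-driven soccer understanding research: isolated tasks, insufficient domain knowledge, and limited systematic reasoning. The key is a comprehensive framework consisting of: (i) SoccerWiki, the first large-scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues to support knowledge-driven reasoning; (ii) SoccerBench, the largest and most comprehensive soccer-specific benchmark, with around 10K standardized multimodal (text, image, video) multi-choice QA pairs across 13 understanding tasks; (iii) SoccerAgent, a multi-agent system that decomposes complex soccer questions through collaborative reasoning and draws on domain knowledge from SoccerWiki to achieve robust performance.

Link: https://arxiv.org/abs/2505.03735
Authors: Jiayuan Rao,Zifeng Li,Haoning Wu,Ya Zhang,Yanfeng Wang,Weidi Xie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report; Project Page: this https URL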

Abstract:Recent advancements in AI-driven soccer understanding have demonstrated rapid progress, yet existing research predominantly focuses on isolated or narrow tasks. To bridge this gap, we propose a comprehensive framework for holistic soccer understanding. Specifically, we make the following contributions in this paper: (i) we construct SoccerWiki, the first large-scale multimodal soccer knowledge base, integrating rich domain knowledge about players, teams, referees, and venues to enable knowledge-driven reasoning; (ii) we present SoccerBench, the largest and most comprehensive soccer-specific benchmark, featuring around 10K standardized multimodal (text, image, video) multi-choice QA pairs across 13 distinct understanding tasks, curated through automated pipelines and manual verification; (iii) we introduce SoccerAgent, a novel multi-agent system that decomposes complex soccer questions via collaborative reasoning, leveraging domain expertise from SoccerWiki and achieving robust performance; (iv) extensive evaluations and ablations that benchmark state-of-the-art MLLMs on SoccerBench, highlighting the superiority of our proposed agentic system. All data and code are publicly available at: this https URL.

[CV-1] FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios SIGGRAPH2025

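[Quick read]: This paper addresses the limited adaptability of action customization caused by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency. The key is FlexiAct, which uses RefAdapter to adapt to the spatial structure of the target image while preserving consistency, and FAE (Frequency-aware Action Extraction) to extract action information directly during denoising. Together these allow the reference video and the target image to differ in layout, viewpoint, and skeletal structure while maintaining identity consistency.

Link: https://arxiv.org/abs/2505.03730
Authors: Shiyi Zhang,Junhao Zhuang,Zhaoyang Zhang,Ying Shan,Yansong Tang
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Tencent ARC Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted by Siggraph2025, Project Page: this https URL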

Abstract:Action customization involves generating videos where the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, based on our observations, the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. So we propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, directly achieves action extraction during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at this https URL

[CV-2] Visual Imitation Enables Contextual Humanoid Control

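[Quick read]: This paper asks how to teach humanoid robots complex skills such as climbing stairs and sitting on chairs using the surrounding environment context. The key is VIDEOMIMIC, an end-to-end real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies that let humanoid robots perform the corresponding skills. A single policy, conditioned on the environment and global root commands, covers multiple dynamic whole-body skills, offering a scalable path toward humanoids that operate in diverse real-world environments.

Link: https://arxiv.org/abs/2505.03729
Authors: Arthur Allshire,Hongsuk Choi,Junyi Zhang,David McAllister,Anthony Zhang,Chung Min Kim,Trevor Darrell,Pieter Abbeel,Jitendra Malik,Angjoo Kanazawa
Affiliations: UC Berkeley
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL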

Abstract:How can we teach humanoids to climb staircases and sit on chairs using the surrounding environment context? Arguably, the simplest way is to just show them: casually capture a human motion video and feed it to humanoids. We introduce VIDEOMIMIC, a real-to-sim-to-real pipeline that mines everyday videos, jointly reconstructs the humans and the environment, and produces whole-body control policies for humanoid robots that perform the corresponding skills. We demonstrate the results of our pipeline on real humanoid robots, showing robust, repeatable contextual control such as staircase ascents and descents, sitting and standing from chairs and benches, as well as other dynamic whole-body skills, all from a single policy, conditioned on the environment and global root commands. VIDEOMIMIC offers a scalable path towards teaching humanoids to operate in diverse real-world environments.

[CV-3] DISARM: Beyond scanner-free harmonization

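[Quick read]: This paper addresses the consistency of multi-scanner T1-weighted magnetic resonance (MR) images in neuroimaging studies, so that downstream analysis is reliable. The key is a direct image harmonization method that either maps images to a scanner-free space or transforms them into the domain of a specific scanner, giving a uniform appearance across scanners while preserving features. The method generalizes well, even to scanners not included in training.

Link: https://arxiv.org/abs/2505.03715
Authors: Luca Caldera,Lara Cavinato,Alessio Cirone,Isabella Cama,Sara Garbarino,Raffaele Lodi,Fabrizio Tagliavini,Anna Nigri,Silvia De Francesco,Andrea Cappozzo,Michele Piana,Francesca Ieva
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: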

Abstract:Harmonization of T1-weighted MR images across different scanners is crucial for ensuring consistency in neuroimaging studies. This study introduces a novel approach to direct image harmonization, moving beyond feature standardization to ensure that extracted features remain inherently reliable for downstream analysis. Our method enables image transfer in two ways: (1) mapping images to a scanner-free space for uniform appearance across all scanners, and (2) transforming images into the domain of a specific scanner used in model training, embedding its unique characteristics. Our approach presents strong generalization capability, even for unseen scanners not included in the training phase. We validated our method using MR images from diverse cohorts, including healthy controls, traveling subjects, and individuals with Alzheimer’s disease (AD). The model’s effectiveness is tested in multiple applications, such as brain age prediction (R² = 0.60 ± 0.05), biomarker extraction, AD classification (Test Accuracy = 0.86 ± 0.03), and diagnosis prediction (AUC = 0.95). In all cases, our harmonization technique outperforms state-of-the-art methods, showing improvements in both reliability and predictive accuracy. Moreover, our approach eliminates the need for extensive preprocessing steps, such as skull-stripping, which can introduce errors by misclassifying brain and non-brain structures. This makes our method particularly suitable for applications that require full-head analysis, including research on head trauma and cranial deformities. Additionally, our harmonization model does not require retraining for new datasets, allowing smooth integration into various neuroimaging workflows. By ensuring scanner-invariant image quality, our approach provides a robust and efficient solution for improving neuroimaging studies across diverse settings. The code is available at this link.

[CV-4] Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning

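[Quick read]: This paper addresses the modality gap in vision-language models (VLMs): text and image embeddings are clearly separated in the shared representation space, which hurts downstream tasks such as multimodal retrieval, multimodal clustering, and zero-shot classification. The key is a set of novel measures and effective techniques, including spectral and optimal transport-based methods, to assess the modality gap precisely and reduce it.

Link: https://arxiv.org/abs/2505.03703
Authors: François Role,Sébastien Meyer,Victor Amblard
Affiliations: Université Paris-Cité; Pôle d’Expertise de la Régulation Numérique (PEReN)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: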

Abstract:Vision-language models (VLMs) allow texts and images to be embedded in a shared representation space. However, it has been shown that these models are subject to a modality gap phenomenon, meaning there exists a clear separation between the embeddings from one modality and another in the embedding space. While this misalignment is detrimental for downstream tasks such as multimodal retrieval, multimodal clustering, or zero-shot classification, no generic and practical methods have so far been proposed to assess it precisely and even reduce it. We therefore propose novel measures and effective techniques (spectral- and optimal transport-based methods) to achieve this goal. Extensive experiments conducted on several image-text datasets and models demonstrate their effectiveness and beneficial effects on downstream tasks. Our code is available at the URL provided in the paper’s abstract.
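
To make the notion of a modality gap concrete, the sketch below measures it with one simple proxy, the distance between the centroids of the image and text embeddings, and shrinks it by centering each modality on its own mean. Both choices are assumptions made for illustration; they are not the spectral or optimal transport-based measures and techniques the paper proposes, and the 2-D embeddings are invented.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def centroid_gap(image_embs, text_embs):
    """Euclidean distance between the two modality centroids."""
    ci, ct = centroid(image_embs), centroid(text_embs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, ct)))


def center(vectors):
    """Subtract the modality's own centroid from every vector."""
    c = centroid(vectors)
    return [[v - m for v, m in zip(vec, c)] for vec in vectors]


images = [[1.0, 0.0], [0.8, 0.2]]
texts = [[0.0, 1.0], [0.2, 0.8]]
print(centroid_gap(images, texts))                  # gap before centering
print(centroid_gap(center(images), center(texts)))  # gap after centering
```

Centering removes only the offset between the two clouds of points; differences in their shape remain, which is one reason richer measures are of interest.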

[CV-5] Self-Supervised Learning for Robotic Leaf Manipulation: A Hybrid Geometric-Neural Approach

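[Quick read]: This paper addresses the challenges of automating leaf manipulation in agricultural settings, in particular the variability of plant morphology and the deformability of leaves. The key is a hybrid geometric-neural approach that combines traditional computer vision with neural networks through self-supervised learning. It uses YOLOv8 for instance segmentation and RAFT-Stereo for 3D depth estimation to build rich leaf representations, which feed a geometric feature scoring pipeline and a neural refinement module (GraspPointCNN). The core innovation is a confidence-weighted fusion mechanism that dynamically balances the contribution of each approach according to prediction certainty, improving autonomous leaf grasping.

Link: https://arxiv.org/abs/2505.03702
Authors: Srecharan Selvam,Abhishesh Silwal,George Kanter
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 9 figures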

Abstract:Automating leaf manipulation in agricultural settings faces significant challenges, including the variability of plant morphologies and deformable leaves. We propose a novel hybrid geometric-neural approach for autonomous leaf grasping that combines traditional computer vision with neural networks through self-supervised learning. Our method integrates YOLOv8 for instance segmentation and RAFT-Stereo for 3D depth estimation to build rich leaf representations, which feed into both a geometric feature scoring pipeline and a neural refinement module (GraspPointCNN). The key innovation is our confidence-weighted fusion mechanism that dynamically balances the contribution of each approach based on prediction certainty. Our self-supervised framework uses the geometric pipeline as an expert teacher to automatically generate training data. Experiments demonstrate that our approach achieves an 88.0% success rate in controlled environments and 84.7% in real greenhouse conditions, significantly outperforming both purely geometric (75.3%) and neural (60.2%) methods. This work establishes a new paradigm for agricultural robotics where domain expertise is seamlessly integrated with machine learning capabilities, providing a foundation for fully automated crop monitoring systems.
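
A confidence-weighted fusion of a geometric score and a neural score can be sketched as a convex combination whose weight is the neural module's confidence. The linear form, the clamping, and the candidate values below are assumptions for illustration; the abstract does not give the paper's actual fusion rule.

```python
def fuse_scores(geometric, neural, confidence):
    """Blend two grasp scores; `confidence` in [0, 1] is the neural weight."""
    w = min(max(confidence, 0.0), 1.0)  # clamp to a valid weight
    return (1.0 - w) * geometric + w * neural


def best_grasp(candidates):
    """Pick the candidate point with the highest fused score."""
    return max(candidates,
               key=lambda c: fuse_scores(c["geo"], c["net"], c["conf"]))


# Hypothetical candidates: pixel location, geometric score, neural score, confidence.
candidates = [
    {"point": (12, 40), "geo": 0.70, "net": 0.90, "conf": 0.2},
    {"point": (30, 22), "geo": 0.60, "net": 0.95, "conf": 0.9},
]
print(best_grasp(candidates)["point"])
```

When the neural module is unsure, the fused score falls back toward the geometric pipeline, which matches the stated goal of balancing each approach by prediction certainty.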

[CV-6] Matching Distance and Geometric Distribution Aided Learning Multiview Point Cloud Registration

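[Quick read]: This paper addresses pose graph construction and motion synchronization in multiview point cloud registration. For pose graph construction, traditional methods often prune fully connected graphs or build sparse graphs from global features aggregated from local descriptors, which can give unreliable results. For motion synchronization, they rely on inaccurate handcrafted loss functions. The key is a network model that extracts information from the matching distance between point cloud pairs to identify reliable pairs for the pose graph, plus a second neural network that computes absolute poses in a data-driven manner, takes geometric distribution information into account, and uses a modified attention mechanism for flexible and reliable feature interaction.

Link: https://arxiv.org/abs/2505.03692
Authors: Shiqi Li,Jihua Zhu,Yifan Xie,Naiwen Hu,Di Wang
Affiliations: Xi’an Jiaotong University; Shaanxi Joint Key Laboratory for Artifact Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: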

Abstract:Multiview point cloud registration plays a crucial role in robotics, automation, and computer vision fields. This paper concentrates on pose graph construction and motion synchronization within multiview registration. Previous methods for pose graph construction often pruned fully connected graphs or constructed sparse graph using global feature aggregated from local descriptors, which may not consistently yield reliable results. To identify dependable pairs for pose graph construction, we design a network model that extracts information from the matching distance between point cloud pairs. For motion synchronization, we propose another neural network model to calculate the absolute pose in a data-driven manner, rather than optimizing inaccurate handcrafted loss functions. Our model takes into account geometric distribution information and employs a modified attention mechanism to facilitate flexible and reliable feature interaction. Experimental results on diverse indoor and outdoor datasets confirm the effectiveness and generalizability of our approach. The source code is available at this https URL.

[CV-7] CaRaFFusion: Improving 2D Semantic Segmentation with Camera-Radar Point Cloud Fusion and Zero-Shot Image Inpainting

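[Quick read]: This paper addresses the performance drop of camera-only semantic segmentation under adverse weather, where visual information is limited, along with the challenge that radar data are sparse and noisy. The key to the solution is integrating a diffusion model into a camera-radar fusion architecture: radar point features are used to generate pseudo-masks, and a noise reduction unit denoises them to improve segmentation accuracy, which fills in information missing from the original images.

Link: https://arxiv.org/abs/2505.03679
Authors: Huawei Sun, Bora Kunter Sahin, Georg Stettinger, Maximilian Bernhard, Matthias Schubert, Robert Wille
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at RA-L 2025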

Abstract:Segmenting objects in an environment is a crucial task for autonomous driving and robotics, as it enables a better understanding of the surroundings of each agent. Although camera sensors provide rich visual details, they are vulnerable to adverse weather conditions. In contrast, radar sensors remain robust under such conditions, but often produce sparse and noisy data. Therefore, a promising approach is to fuse information from both sensors. In this work, we propose a novel framework to enhance camera-only baselines by integrating a diffusion model into a camera-radar fusion architecture. We leverage radar point features to create pseudo-masks using the Segment-Anything model, treating the projected radar points as point prompts. Additionally, we propose a noise reduction unit to denoise these pseudo-masks, which are further used to generate inpainted images that complete the missing information in the original images. Our method improves the camera-only segmentation baseline by 2.63% in mIoU and enhances our camera-radar fusion architecture by 1.48% in mIoU on the Waterscenes dataset. This demonstrates the effectiveness of our approach for semantic segmentation using camera-radar fusion under adverse weather conditions.
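The reported gains are measured in mIoU (mean intersection over union). As a reminder of what that metric computes, here is a minimal sketch over flat lists of per-pixel class labels. The function name `mean_iou` and the convention of skipping classes absent from both prediction and ground truth are assumptions; benchmark toolkits may handle empty classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two equal-length sequences of class labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels where both agree on class c, and pixels where either says c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumed convention: ignore classes that never appear
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.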

[CV-8] Distribution-Conditional Generation: From Class Distribution to Creative Generation

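[Quick read]: This paper addresses the limitation of text-to-image (T2I) diffusion models in generating truly novel, out-of-distribution concepts, since their reliance on the training data distribution limits creativity. The key to the solution is DisTok, an encoder-decoder framework that conditions image generation on class distributions, enabling semantically unconstrained creative generation. DisTok maintains a dynamic concept pool and iteratively samples and fuses concept pairs; latent vectors sampled from a Gaussian prior are decoded into tokens and rendered into images, and class distributions predicted by a vision-language model supervise the consistency between the generated tokens and their visual semantics, yielding efficient token-level generation.

Link: https://arxiv.org/abs/2505.03667
Authors: Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng
Affiliations: Southeast University; Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: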

Abstract:Text-to-image (T2I) diffusion models are effective at producing semantically aligned images, but their reliance on training data distributions limits their ability to synthesize truly novel, out-of-distribution concepts. Existing methods typically enhance creativity by combining pairs of known concepts, yielding compositions that, while out-of-distribution, remain linguistically describable and bounded within the existing semantic space. Inspired by the soft probabilistic outputs of classifiers on ambiguous inputs, we propose Distribution-Conditional Generation, a novel formulation that models creativity as image synthesis conditioned on class distributions, enabling semantically unconstrained creative generation. Building on this, we propose DisTok, an encoder-decoder framework that maps class distributions into a latent space and decodes them into tokens of creative concept. DisTok maintains a dynamic concept pool and iteratively samples and fuses concept pairs, enabling the generation of tokens aligned with increasingly complex class distributions. To enforce distributional consistency, latent vectors sampled from a Gaussian prior are decoded into tokens and rendered into images, whose class distributions (predicted by a vision-language model) supervise the alignment between input distributions and the visual semantics of generated tokens. The resulting tokens are added to the concept pool for subsequent composition. Extensive experiments demonstrate that DisTok, by unifying distribution-conditioned fusion and sampling-based synthesis, enables efficient and flexible token-level generation, achieving state-of-the-art performance with superior text-image alignment and human preference scores.

[CV-9] Revolutionizing Brain Tumor Imaging: Generating Synthetic 3D FA Maps from T1-Weighted MRI using CycleGAN Models

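[Quick read]: This paper addresses the spatial misalignment between fractional anisotropy (FA) maps and tractography atlases, which limits their effective integration into predictive models. The key to the solution is a CycleGAN-based method that generates FA maps directly from T1-weighted MRI scans, the first application of this technique to both healthy and tumour-affected tissue. Trained on unpaired data, it produces high-fidelity FA maps and performs particularly well in tumour regions.

Link: https://arxiv.org/abs/2505.03662
Authors: Xin Du, Francesca M. Cozzi, Rajesh Jena
Affiliations: University of Cambridge; Cambridge Brain Tumour Imaging Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages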

Abstract:Fractional anisotropy (FA) and directionally encoded colour (DEC) maps are essential for evaluating white matter integrity and structural connectivity in neuroimaging. However, the spatial misalignment between FA maps and tractography atlases hinders their effective integration into predictive models. To address this issue, we propose a CycleGAN based approach for generating FA maps directly from T1-weighted MRI scans, representing the first application of this technique to both healthy and tumour-affected tissues. Our model, trained on unpaired data, produces high fidelity maps, which have been rigorously evaluated using Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR), demonstrating particularly robust performance in tumour regions. Radiological assessments further underscore the model’s potential to enhance clinical workflows by providing an AI-driven alternative that reduces the necessity for additional scans.
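One of the two metrics named above, PSNR, has a simple closed form: PSNR = 10 · log10(MAX² / MSE). The sketch below computes it for two flattened intensity lists. The function name `psnr`, the flat-list input, and the default intensity range of 1.0 are assumptions for illustration; the paper's exact evaluation setup is not described in the abstract.

```python
import math


def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all voxels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the synthetic map is closer to the reference; because PSNR depends on `max_value`, scores are only comparable when the same intensity range is used.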

[CV-10] ReGraP-LLaVA: Reasoning enabled Graph-based Personalized Large Language and Vision Assistant

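[Quick read]: This paper addresses the weak relational reasoning and knowledge-connection abilities of personalized Multimodal Large Language Models (MLLMs): training data lack learnable multi-object relations, models fail to capture relations between different personalized concepts, and experiments are limited to recognition and captioning of a single personalized concept. The key to the solution is a new dataset, ReGraP, with 120 sets of personalized knowledge, each containing images, Knowledge Graphs (KGs), and Chain-of-Thought (CoT) QA pairs, supporting more structured and sophisticated reasoning pathways. The paper also proposes ReGraP-LLaVA, which uses soft and hard graph prompting to align KGs with the model's semantic space, improving relational reasoning and knowledge connection.

Link: https://arxiv.org/abs/2505.03654
Authors: Yifan Xiang, Zhenxi Zhang, Bin Li, Yixuan Weng, Shoujun Zhou, Yangfan He, Keqin Li
Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; School of Engineering, Westlake University; University of Minnesota – Twin Cities; University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Work in progress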

Abstract:Recent advances in personalized MLLMs enable effective capture of user-specific concepts, supporting both recognition of personalized concepts and contextual captioning. However, humans typically explore and reason over relations among objects and individuals, transcending surface-level information to achieve more personalized and contextual understanding. To this end, existing methods may face three main limitations: Their training data lacks multi-object sets in which relations among objects are learnable. Building on the limited training data, their models overlook the relations between different personalized concepts and fail to reason over them. Their experiments mainly focus on a single personalized concept, where evaluations are limited to recognition and captioning tasks. To address the limitations, we present a new dataset named ReGraP, consisting of 120 sets of personalized knowledge. Each set includes images, KGs, and CoT QA pairs derived from the KGs, enabling more structured and sophisticated reasoning pathways. We propose ReGraP-LLaVA, an MLLM trained with the corresponding KGs and CoT QA pairs, where soft and hard graph prompting methods are designed to align KGs within the model’s semantic space. We establish the ReGraP Benchmark, which contains diverse task types: multiple-choice, fill-in-the-blank, True/False, and descriptive questions in both open- and closed-ended settings. The proposed benchmark is designed to evaluate the relational reasoning and knowledge-connection capability of personalized MLLMs. We conduct experiments on the proposed ReGraP-LLaVA and other competitive MLLMs. Results show that the proposed model not only learns personalized knowledge but also performs relational reasoning in responses, achieving the SoTA performance compared with the competitive methods. All the codes and datasets are released at: this https URL.

[CV-11] ALMA: Aggregated Lipschitz Maximization Attack on Auto-encoders

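[Quick read]: This paper addresses the insufficient adversarial robustness of deep autoencoders (AEs), in particular the vulnerabilities introduced by ill-conditioned intermediate layers. Existing robustness evaluation frameworks based on white-box attacks do not fully exploit these layers: when optimizing imperceptible norm-bounded additive perturbations to maximize output damage, they struggle to propagate adversarial loss gradients and converge to weaker perturbations. The key to the solution is a layer-conditioning-based adversarial optimization objective that improves the propagation of loss gradient information during attack optimization and guides the adversarial map toward regions of local Lipschitz bounds, producing stronger adversarial examples.

Link: https://arxiv.org/abs/2505.03646
Authors: Chethan Krishnamurthy Ramanaik, Arjun Roy, Eirini Ntoutsi
Affiliations: University of the Bundeswehr Munich
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: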

Abstract:Despite the extensive use of deep autoencoders (AEs) in critical applications, their adversarial robustness remains relatively underexplored compared to classification models. AE robustness is characterized by the Lipschitz bounds of its components. Existing robustness evaluation frameworks based on white-box attacks do not fully exploit the vulnerabilities of intermediate ill-conditioned layers in AEs. In the context of optimizing imperceptible norm-bounded additive perturbations to maximize output damage, existing methods struggle to effectively propagate adversarial loss gradients throughout the network, often converging to less effective perturbations. To address this, we propose a novel layer-conditioning-based adversarial optimization objective that effectively guides the adversarial map toward regions of local Lipschitz bounds by enhancing loss gradient information propagation during attack optimization. We demonstrate through extensive experiments on state-of-the-art AEs that our adversarial objective results in stronger attacks, outperforming existing methods in both universal and sample-specific scenarios. As a defense method against this attack, we introduce an inference-time adversarially trained defense plugin that mitigates the effects of adversarial examples.

[CV-12] Towards Smart Point-and-Shoot Photography CVPR2025

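[Quick read]: This paper tries to address the lack of composition skills among ordinary users who use smartphones as point-and-shoot (PAS) cameras. Traditional PAS cameras ensure focus and brightness but cannot tell users how to improve composition. The key to the solution is a first-of-its-kind smart point-and-shoot (SPAS) system that automatically guides users to adjust the camera pose for better composition. Its core components are a CLIP-based Composition Quality Assessment (CCQA) model and a camera pose adjustment model (CPAM): CCQA uses a learnable text embedding technique to assign pseudo labels to images, and CPAM is trained end to end as a mixture-of-experts model with a gated loss function to give live camera pose adjustment suggestions.

Link: https://arxiv.org/abs/2505.03638
Authors: Jiawan Li, Fei Zhou, Zhipeng Zhong, Jiongzhi Lin, Guoping Qiu
Affiliations: Shenzhen University; Guangdong Provincial Key Laboratory of Intelligent Information Processing; Guangdong-Hong Kong Joint Laboratory for Big Data Imaging and Communication; Shenzhen Key Laboratory of Digital Creative Technology; Loughborough University; University of Nottingham; Everimaging Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025 Accepted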

Abstract:Hundreds of millions of people routinely take photos using their smartphones as point and shoot (PAS) cameras, yet very few have the photography skills to compose a good shot of a scene. While traditional PAS cameras have built-in functions to ensure a photo is well focused and has the right brightness, they cannot tell users how to compose the best shot of a scene. In this paper, we present a first-of-its-kind smart point and shoot (SPAS) system to help users take good photos. Our SPAS helps users compose a good shot of a scene by automatically guiding them to adjust the camera pose live on the scene. We first constructed a large dataset containing 320K images with camera pose information from 4000 scenes. We then developed an innovative CLIP-based Composition Quality Assessment (CCQA) model to assign pseudo labels to these images. The CCQA introduces a unique learnable text embedding technique to learn continuous word embeddings capable of discerning subtle visual quality differences in the range covered by five levels of quality description words: bad, poor, fair, good, perfect. Finally, we developed a camera pose adjustment model (CPAM), which first determines whether the current view can be further improved and, if so, outputs the adjustment suggestion in the form of two camera pose adjustment angles. Because the two tasks of CPAM make decisions sequentially and each involves a different set of training samples, we developed a mixture-of-experts model with a gated loss function to train the CPAM end to end. We will present extensive results to demonstrate the performance of our SPAS system using publicly available image composition datasets.

[CV-13] Breaking Annotation Barriers: Generalized Video Quality Assessment via Ranking-based Self-Supervision

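[Quick read]: This paper addresses the reliance of conventional video quality assessment (VQA) models on manually annotated datasets, which are labor-intensive, costly, and hard to scale, limiting generalization to unseen video content and distortion types. The key to the solution is a self-supervised learning framework trained on large-scale unlabeled web videos. Using a learning-to-rank paradigm, it builds training data automatically from pseudo-labels produced by existing VQA models and from relative quality rankings based on synthetic distortion simulation, and it adds an iterative self-improvement training strategy in which the model progressively refines annotation quality, improving generalization and performance.

Link: https://arxiv.org/abs/2505.03631
Authors: Linhan Cao, Wei Sun, Kaiwei Zhang, Yicong Peng, Guangtao Zhai, Xiongkuo Min
Affiliations: Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: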

Abstract:Video quality assessment (VQA) is essential for quantifying perceptual quality in various video processing workflows, spanning from camera capture systems to over-the-top streaming platforms. While recent supervised VQA models have made substantial progress, the reliance on manually annotated datasets, a process that is labor-intensive, costly, and difficult to scale up, has hindered further optimization of their generalization to unseen video content and distortions. To bridge this gap, we introduce a self-supervised learning framework for VQA to learn quality assessment capabilities from large-scale, unlabeled web videos. Our approach leverages a learning-to-rank paradigm to train a large multimodal model (LMM) on video pairs automatically labeled in two ways: quality pseudo-labeling by existing VQA models and relative quality ranking based on synthetic distortion simulations. Furthermore, we introduce a novel iterative self-improvement training strategy, where the trained model acts as an improved annotator to iteratively refine the annotation quality of training data. By training on a dataset 10× larger than the existing VQA benchmarks, our model: (1) achieves zero-shot performance on in-domain VQA benchmarks that matches or surpasses supervised models; (2) demonstrates superior out-of-distribution (OOD) generalization across diverse video content and distortions; and (3) sets a new state-of-the-art when fine-tuned on human-labeled datasets. Extensive experimental results validate the effectiveness of our self-supervised approach in training generalized VQA models. The datasets and code will be publicly released to facilitate future research.
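To make the learning-to-rank idea concrete, here is a minimal sketch of a generic pairwise logistic (RankNet-style) loss on two predicted quality scores. This is a standard textbook formulation swapped in for illustration, not the training objective of the paper, which trains a large multimodal model; the function name `pairwise_rank_loss` and its parameters are hypothetical.

```python
import math


def pairwise_rank_loss(score_a, score_b, a_is_better):
    """Logistic loss that is small when the better video gets the higher score."""
    # Signed margin: positive when the predicted order matches the label.
    margin = score_a - score_b if a_is_better else score_b - score_a
    # log(1 + exp(-margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The pair label (`a_is_better`) is all the loss needs, which is why relative rankings from synthetic distortions or pseudo-labels can stand in for absolute human quality scores.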

[CV-14] Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map CVPR2025

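[Quick read]: This paper addresses the high cost and time needed to obtain high-quality annotated data for defect segmentation in industrial settings. The key to the solution is a diffusion-based data generation pipeline that needs minimal supervision: it conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks, yielding realistic and accurately localized defect synthesis. Compared with existing layout-conditioned generative methods, it improves defect consistency and spatial accuracy.

Link: https://arxiv.org/abs/2505.03623
Authors: Alessandro Simoni, Francesco Pelosin
Affiliations: Covision Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Synthetic Data for Computer Vision Workshop - CVPR 2025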

Abstract:Synthetic dataset generation in Computer Vision, particularly for industrial applications, is still underexplored. Industrial defect segmentation, for instance, requires highly accurate labels, yet acquiring such data is costly and time-consuming. To address this challenge, we propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks, ensuring realistic and accurately localized defect synthesis. Compared to existing layout-conditioned generative methods, our approach improves defect consistency and spatial accuracy. We introduce two quantitative metrics to evaluate the effectiveness of our method and assess its impact on a downstream segmentation task trained on real and synthetic data. Our results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data, fostering more reliable and cost-efficient segmentation models. The code is publicly available at this https URL.

[CV-15] PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing

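[Quick read]: This paper addresses the susceptibility of remote photoplethysmography (rPPG) to illumination changes, motion artifacts, and limited temporal modeling. The key to the solution is PhysLLM, a framework that collaboratively optimizes large language models (LLMs) with domain-specific rPPG components to achieve cross-modal alignment and improved signal stability. Specifically, the Text Prototype Guidance (TPG) strategy projects hemodynamic features into an LLM-interpretable semantic space, bridging the representational gap between physiological signals and linguistic tokens; the Dual-Domain Stationary (DDS) algorithm improves signal stability through adaptive time-frequency domain feature re-weighting; and rPPG task-specific cues inject physiological priors, enabling dynamic adaptation to challenging scenarios.

Link: https://arxiv.org/abs/2505.03621
Authors: Yiping Xie, Bo Zhao, Mingtong Dai, Jian-Ping Zhou, Yue Sun, Tao Tan, Weicheng Xie, Linlin Shen, Zitong Yu
Affiliations: Shenzhen University; Great Bay University; National Engineering Laboratory for Big Data System Computing Technology; Dongguan Key Laboratory for Intelligence and Information Technology; Guangdong Medical University; Southern Medical University; Macao Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: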

Abstract:Remote photoplethysmography (rPPG) enables non-contact physiological measurement but remains highly susceptible to illumination changes, motion artifacts, and limited temporal modeling. Large Language Models (LLMs) excel at capturing long-range dependencies, offering a potential solution, but they struggle with the continuous, noise-sensitive nature of rPPG signals due to their text-centric design. To bridge this gap, we introduce PhysLLM, a collaborative optimization framework that synergizes LLMs with domain-specific rPPG components. Specifically, the Text Prototype Guidance (TPG) strategy is proposed to establish cross-modal alignment by projecting hemodynamic features into LLM-interpretable semantic space, effectively bridging the representational gap between physiological signals and linguistic tokens. In addition, a novel Dual-Domain Stationary (DDS) Algorithm is proposed for resolving signal instability through adaptive time-frequency domain feature re-weighting. Finally, rPPG task-specific cues systematically inject physiological priors through physiological statistics, environmental contextual answering, and task description, leveraging cross-modal learning to integrate both visual and textual information, enabling dynamic adaptation to challenging scenarios like variable illumination and subject movements. Evaluated on four benchmark datasets, PhysLLM achieves state-of-the-art accuracy and robustness, demonstrating superior generalization across lighting variations and motion scenarios.

[CV-16] Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images

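[Quick read]: This paper addresses the limited generalization of face anti-spoofing across scenarios. The core challenges are covariate shift, caused by external variations in data collection, and semantic shift, caused by substantial differences among emerging attack types. The key to the solution is a new method that learns unknown spoof prompts using only real face images from a single source domain: it uses the general knowledge in vision-language models to generate textual prompts for real faces and potential unknown spoof attacks, improving generalization to unseen target domains. A diverse spoof prompt optimization framework constrains the unknown spoof prompts within a relaxed prior knowledge space, maximizes their distance from real face images, and enforces semantic independence among spoof prompts to capture a broad range of spoof patterns.

Link: https://arxiv.org/abs/2505.03611
Authors: Fangling Jiang, Qi Li, Weining Wang, Wei Shen, Bing Liu, Zhenan Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: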

Abstract:Face anti-spoofing is a critical technology for ensuring the security of face recognition systems. However, its ability to generalize across diverse scenarios remains a significant challenge. In this paper, we attribute the limited generalization ability to two key factors: covariate shift, which arises from external data collection variations, and semantic shift, which results from substantial differences in emerging attack types. To address both challenges, we propose a novel approach for learning unknown spoof prompts, relying solely on real face images from a single source domain. Our method generates textual prompts for real faces and potential unknown spoof attacks by leveraging the general knowledge embedded in vision-language models, thereby enhancing the model’s ability to generalize to unseen target domains. Specifically, we introduce a diverse spoof prompt optimization framework to learn effective prompts. This framework constrains unknown spoof prompts within a relaxed prior knowledge space while maximizing their distance from real face images. Moreover, it enforces semantic independence among different spoof prompts to capture a broad range of spoof patterns. Experimental results on nine datasets demonstrate that the learned prompts effectively transfer the knowledge of vision-language models, enabling state-of-the-art generalization ability against diverse unknown attack types across unseen target domains without using any spoof face images.

[CV-17] Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection

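[Quick read]: This paper addresses 3D mask presentation attack detection, that is, how to distinguish real faces from 3D masks to protect face recognition systems. Existing methods rely on multimodal features or remote photoplethysmography (rPPG) signals but face high sensor costs and limited generalization. The key to the solution is a knowledge-based prompt learning framework that incorporates entities and triples from knowledge graphs into prompt learning, generating fine-grained, task-specific explicit prompts that make full use of the knowledge in pre-trained vision-language models. The paper also introduces an attention-based visual-specific knowledge filter and draws on causal graph theory to improve generalization.

Link: https://arxiv.org/abs/2505.03610
Authors: Fangling Jiang, Qi Li, Bing Liu, Weining Wang, Caifeng Shan, Zhenan Sun, Ming-Hsuan Yang
Affiliations: University of South China; Chinese Academy of Sciences; Nanjing University; University of California, Merced; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: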

Abstract:3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features or remote photoplethysmography (rPPG) signals to distinguish between real faces and 3D masks, they face significant challenges, such as the high costs associated with multimodal sensors and limited generalization ability. Detection-related text descriptions offer concise, universal information and are cost-effective to obtain. However, the potential of vision-language multimodal features for 3D mask presentation attack detection remains unexplored. In this paper, we propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Specifically, our approach incorporates entities and triples from knowledge graphs into the prompt learning process, generating fine-grained, task-specific explicit prompts that effectively harness the knowledge embedded in pre-trained vision-language models. Furthermore, considering different input images may emphasize distinct knowledge graph elements, we introduce a visual-specific knowledge filter based on an attention mechanism to refine relevant elements according to the visual context. Additionally, we leverage causal graph theory insights into the prompt learning process to further enhance the generalization ability of our method. During training, a spurious correlation elimination paradigm is employed, which removes category-irrelevant local image patches using guidance from knowledge-based text features, fostering the learning of generalized causal prompts that align with category-relevant local patches. Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance on benchmark datasets.

[CV-18] PAHA: Parts-Aware Audio-Driven Human Animation with Diffusion Model

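[Quick read]: This paper addresses poor generation quality, weak results in specific foreground regions, and insufficient audio-motion consistency in audio-driven human animation, problems that stem mainly from the lack of localized fine-grained supervised guidance. The key to the solution is PAHA, a diffusion-based framework for end-to-end upper-body human animation with two key methods: Parts-Aware Re-weighting (PAR), which dynamically adjusts regional training loss weights, and Parts Consistency Enhancement (PCE), which builds diffusion-based regional audio-visual classifiers. These improve visual quality and audio-motion consistency respectively. Two new inference guidance methods are also designed to balance efficiency and quality.

Link: https://arxiv.org/abs/2505.03603
Authors: Y.B. Wang, S.Z. Zhou, J.F. Wu, T. Hu, J.N. Zhang, Y. Liu
Affiliations: Zhejiang University; Fudan University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: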

Abstract:Audio-driven human animation technology is widely used in human-computer interaction, and the emergence of diffusion models has further advanced its development. Currently, most methods rely on multi-stage generation and intermediate representations, resulting in long inference time and issues with generation quality in specific foreground regions and audio-motion consistency. These shortcomings are primarily due to the lack of localized fine-grained supervised guidance. To address the above challenges, we propose PAHA, an end-to-end audio-driven upper-body human animation framework with a diffusion model. We introduce two key methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). PAR dynamically adjusts regional training loss weights based on pose confidence scores, effectively improving visual quality. PCE constructs and trains diffusion-based regional audio-visual classifiers to improve the consistency of motion and co-speech audio. Afterwards, we design two novel inference guidance methods for the foregoing classifiers, Sequential Guidance (SG) and Differential Guidance (DG), to balance efficiency and quality respectively. Additionally, we build CNAS, the first public Chinese News Anchor Speech dataset, to advance research and validation in this field. Extensive experimental results and user studies demonstrate that PAHA significantly outperforms existing methods in audio-motion alignment and video-related evaluations. The codes and CNAS dataset will be released upon acceptance.

[CV-19] From Pixels to Polygons: A Survey of Deep Learning Approaches for Medical Image-to-Mesh Reconstruction

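[Quick read]: This paper addresses medical image-to-mesh reconstruction, that is, how to generate accurate, topologically correct 3D geometric models from medical imaging data to support computational medicine and in silico trials. The key contribution is a systematic categorization and analysis of existing methods (template models, statistical models, generative models, and implicit models), examining each category's methodological foundations, strengths, limitations, and applicability to different anatomical structures and imaging modalities. Through quantitative comparisons and an analysis of public datasets, evaluation metrics, and loss functions, the survey offers a comprehensive technical reference for the field.

Link: https://arxiv.org/abs/2505.03599
Authors: Fengming Lin, Arezoo Zakeri, Yidan Xue, Michael MacRaild, Haoran Dou, Zherui Zhou, Ziwei Zou, Ali Sarrami-Foroushani, Jinming Duan, Alejandro F. Frangi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: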

Abstract:Deep learning-based medical image-to-mesh reconstruction has rapidly evolved, enabling the transformation of medical imaging data into three-dimensional mesh models that are critical in computational medicine and in silico trials for advancing our understanding of disease mechanisms, and diagnostic and therapeutic techniques in modern medicine. This survey systematically categorizes existing approaches into four main categories: template models, statistical models, generative models, and implicit models. Each category is analysed in detail, examining their methodological foundations, strengths, limitations, and applicability to different anatomical structures and imaging modalities. We provide an extensive evaluation of these methods across various anatomical applications, from cardiac imaging to neurological studies, supported by quantitative comparisons using standard metrics. Additionally, we compile and analyze major public datasets available for medical mesh reconstruction tasks and discuss commonly used evaluation metrics and loss functions. The survey identifies current challenges in the field, including requirements for topological correctness, geometric accuracy, and multi-modality integration. Finally, we present promising future research directions in this domain. This systematic review aims to serve as a comprehensive reference for researchers and practitioners in medical image analysis and computational medicine.

[CV-20] Fixed-Length Dense Fingerprint Representation

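[Quick read]: This paper addresses the limited robustness of fixed-length fingerprint representations to diverse fingerprint modalities, pose variations, and noise. The key to the solution is a fixed-length fingerprint representation based on a three-dimensional dense descriptor, together with the FLARE framework, which combines pose-based Alignment and Robust Enhancement strategies to keep the dense feature space consistent and improve matching accuracy.

Link: https://arxiv.org/abs/2505.03597
Authors: Zhiyu Pan, Xiongjun Guan, Yongjie Duan, Jianjiang Feng, Jie Zhou
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at IEEE Transactions on Information Forensics and Security (TIFS)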

Abstract:Fixed-length fingerprint representations, which map each fingerprint to a compact and fixed-size feature vector, are computationally efficient and well-suited for large-scale matching. However, designing a robust representation that effectively handles diverse fingerprint modalities, pose variations, and noise interference remains a significant challenge. In this work, we propose a fixed-length dense descriptor of fingerprints, and introduce FLARE, a fingerprint matching framework that integrates the Fixed-Length dense descriptor with pose-based Alignment and Robust Enhancement. This fixed-length representation employs a three-dimensional dense descriptor to effectively capture spatial relationships among fingerprint ridge structures, enabling robust and locally discriminative representations. To ensure consistency within this dense feature space, FLARE incorporates pose-based alignment using complementary estimation methods, along with dual enhancement strategies that refine ridge clarity while preserving the original fingerprint modality. The proposed dense descriptor supports fixed-length representation while maintaining spatial correspondence, enabling fast and accurate similarity computation. Extensive experiments demonstrate that FLARE achieves superior performance across rolled, plain, latent, and contactless fingerprints, significantly outperforming existing methods in cross-modality and low-quality scenarios. Further analysis validates the effectiveness of the dense descriptor design, as well as the impact of alignment and enhancement modules on the accuracy of dense descriptor matching. Experimental results highlight the effectiveness and generalizability of FLARE as a unified and scalable solution for robust fingerprint representation and matching. The implementation and code will be publicly available at this https URL.

[CV-21] DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes

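[Quick read]: This paper addresses the analysis of events in dynamic environments, a fundamental challenge in developing intelligent agents and robots that interact with humans. Existing methods rely mainly on visual models, which usually capture information implicitly from images and lack interpretable spatial-temporal object representations. The key to the solution is DyGEnc, a new method for encoding a dynamic graph that combines a compressed spatial-temporal structural representation of observations with the cognitive capabilities of large language models to enable advanced question answering over a sequence of textual scene graphs.

Link: https://arxiv.org/abs/2505.03581
Authors: Sergey Linok, Vadim Semenov, Anastasia Trunova, Oleg Bulichev, Dmitry Yudin
Affiliations: Moscow Institute of Physics and Technology; Innopolis University; AIRI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, 6 tables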

Abstract:The analysis of events in dynamic environments poses a fundamental challenge in the development of intelligent agents and robots capable of interacting with humans. Current approaches predominantly utilize visual models. However, these methods often capture information implicitly from images, lacking interpretable spatial-temporal object representations. To address this issue we introduce DyGEnc - a novel method for Encoding a Dynamic Graph. This method integrates compressed spatial-temporal structural observation representation with the cognitive capabilities of large language models. The purpose of this integration is to enable advanced question answering based on a sequence of textual scene graphs. Extended evaluations on the STAR and AGQA datasets indicate that DyGEnc outperforms existing visual methods by a large margin of 15-25% in addressing queries regarding the history of human-to-object interactions. Furthermore, the proposed method can be seamlessly extended to process raw input images utilizing foundational models for extracting explicit textual scene graphs, as substantiated by the results of a robotic experiment conducted with a wheeled manipulator platform. We hope that these findings will contribute to the implementation of robust and compressed graph-based robotic memory for long-horizon reasoning. Code is available at this http URL.

[CV-22] Supervised and Unsupervised Textile Classification via Near-Infrared Hyperspectral Imaging and Deep Learning

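Quick read: This paper tackles efficient classification and sorting of textile fibers for recycling, with the goal of reducing the textile industry's environmental impact. The key to the solution is combining hyperspectral near-infrared (NIR) imaging with deep learning, specifically optimized convolutional neural networks (CNNs) and autoencoder networks, which generalize robustly across different textile structures and thereby achieve accurate and robust fiber classification.

Link: https://arxiv.org/abs/2505.03575
Authors: Maria Kainz, Johannes K. Krondorfer, Malte Jaschik, Maria Jernej, Harald Ganster
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph)
Comments: Accepted at: Proceedings of OCM 2025 - 7th International Conference on Optical Characterization of Materials, March 26-27, 2025, Karlsruhe, Germany, pp. 319-328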


Abstract:Recycling textile fibers is critical to reducing the environmental impact of the textile industry. Hyperspectral near-infrared (NIR) imaging combined with advanced deep learning algorithms offers a promising solution for efficient fiber classification and sorting. In this study, we investigate supervised and unsupervised deep learning models and test their generalization capabilities on different textile structures. We show that optimized convolutional neural networks (CNNs) and autoencoder networks achieve robust generalization under varying conditions. These results highlight the potential of hyperspectral imaging and deep learning to advance sustainable textile recycling through accurate and robust classification.

[CV-23] Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models

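Quick read: This paper studies how spurious correlations in image backgrounds affect model predictions, in particular how per-class biases in object position and object size make models depend on spurious background features. The key to the solution is a synthetic dataset, Hard-Spurious-ImageNet, built on ImageNet1k with varied backgrounds, object positions, and object sizes, which allows a systematic analysis of how much models rely on spurious features under different conditions and exposes the shortcomings of existing mitigation methods with respect to these factors.

Link: https://arxiv.org/abs/2505.03569
Authors: Mishal Fatima, Steffen Jung, Margret Keuper
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: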


Abstract:Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to aesthetic preferences of humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating different pretrained models on the dataset, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we show that current methods for mitigating harmful spurious features do not take these factors into account and hence fail to achieve considerable gains in worst-group accuracy when the size and location of core features in an image change.
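
The two quantities the abstract varies, the ROI-to-image ratio and how far the object sits from the image center, are easy to compute from a bounding box. The sketch below is one plausible way to do so; the function name `roi_stats` and the half-diagonal normalization are assumptions, not the paper's definitions.

```python
import math


def roi_stats(box, image_size):
    """box = (x0, y0, x1, y1) in pixels; image_size = (width, height)."""
    x0, y0, x1, y1 = box
    w, h = image_size
    # Fraction of the image area covered by the object box
    roi_ratio = ((x1 - x0) * (y1 - y0)) / (w * h)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    # Distance of the box centre from the image centre,
    # normalised by the half-diagonal (assumed convention)
    offset = math.hypot(cx - w / 2, cy - h / 2) / math.hypot(w / 2, h / 2)
    return roi_ratio, offset


# A small object in the top-left corner of a 224x224 image
ratio, offset = roi_stats((0, 0, 56, 56), (224, 224))
```

A small `ratio` combined with a large `offset` corresponds to the regime in which the abstract reports the heaviest reliance on background features.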

[CV-24] Uncertainty-Aware Prototype Semantic Decoupling for Text-Based Person Search in Full Images

Quick read: This paper addresses the performance degradation caused by detection and matching uncertainty in text-based pedestrian search (TBPS) in complex scenes. The key to the solution is the UPD-TBPS framework, which comprises three core modules: Multi-granularity Uncertainty Estimation (MUE), Prototype-based Uncertainty Decoupling (PUD), and Cross-modal Re-identification (ReID). Together they reduce early-stage uncertainty, decouple pedestrian prototype representations, and evaluate candidates more precisely, which improves retrieval.

Link: https://arxiv.org/abs/2505.03567
Authors: Zengli Luo, Canlong Zhang, Xiaochun Lu, Zhixin Li, Zhiwen Wang
Affiliations: Guangxi Normal University; Guangxi University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures


Abstract:Text-based pedestrian search (TBPS) in full images aims to locate a target pedestrian in untrimmed images using natural language descriptions. However, in complex scenes with multiple pedestrians, existing methods are limited by uncertainties in detection and matching, leading to degraded performance. To address this, we propose UPD-TBPS, a novel framework comprising three modules: Multi-granularity Uncertainty Estimation (MUE), Prototype-based Uncertainty Decoupling (PUD), and Cross-modal Re-identification (ReID). MUE conducts multi-granularity queries to identify potential targets and assigns confidence scores to reduce early-stage uncertainty. PUD leverages visual context decoupling and prototype mining to extract features of the target pedestrian described in the query. It separates and learns pedestrian prototype representations at both the coarse-grained cluster level and the fine-grained individual level, thereby reducing matching uncertainty. ReID evaluates candidates with varying confidence levels, improving detection and retrieval accuracy. Experiments on CUHK-SYSU-TBPS and PRW-TBPS datasets validate the effectiveness of our framework.

[CV-25] Real-Time Person Image Synthesis Using a Flow Matching Model

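Quick read: This paper addresses the lack of real-time performance in Pose-Guided Person Image Synthesis (PGPIS), especially in applications that need high-quality images quickly, such as sign language video generation, AR/VR, gaming, and live streaming. Existing diffusion-based methods produce excellent image quality, but their slow sampling cannot meet real-time requirements. The key to the solution is a generative model based on flow matching (FM), which enables faster, more stable, and more efficient training and sampling, supports conditional generation and operation in latent space, and reaches near-real-time speed while keeping image quality comparable to state-of-the-art models.

Link: https://arxiv.org/abs/2505.03562
Authors: Jiwoo Jeong, Kirok Kim, Wooju Kim, Nam-Joon Kim
Affiliations: Yonsei University; Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: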


Abstract:Pose-Guided Person Image Synthesis (PGPIS) generates realistic person images conditioned on a target pose and a source image. This task plays a key role in various real-world applications, such as sign language video generation, AR/VR, gaming, and live streaming. In these scenarios, real-time PGPIS is critical for providing immediate visual feedback and maintaining user […]. However, achieving real-time performance remains a significant challenge due to the complexity of synthesizing high-fidelity images from diverse and dynamic human poses. Recent diffusion-based methods have shown impressive image quality in PGPIS, but their slow sampling speeds hinder deployment in time-sensitive applications. This latency is particularly problematic in tasks like generating sign language videos during live broadcasts, where rapid image updates are required. Therefore, developing a fast and reliable PGPIS model is a crucial step toward enabling real-time interactive systems. To address this challenge, we propose a generative model based on flow matching (FM). Our approach enables faster, more stable, and more efficient training and sampling. Furthermore, the proposed model supports conditional generation and can operate in latent space, making it especially suitable for real-time PGPIS applications where both speed and quality are critical. We evaluate our proposed method, Real-Time Person Image Synthesis Using a Flow Matching Model (RPFM), on the widely used DeepFashion dataset for PGPIS tasks. Our results show that RPFM achieves near-real-time sampling speeds while maintaining performance comparable to the state-of-the-art models. Our methodology trades off a slight, acceptable decrease in generated-image accuracy for over a twofold increase in generation speed, thereby ensuring real-time performance.
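
The speed advantage of flow matching comes from sampling with a handful of ODE steps instead of a long denoising chain. The toy sketch below integrates a velocity field with fixed Euler steps on a two-number "latent". The straight-line path, the closed-form `toy_velocity`, and the step count are assumptions for illustration; in RPFM a trained, pose-conditioned network would supply the velocity, and the abstract does not state its path or solver.

```python
def euler_sample(velocity, x0, steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [2.0, -1.0]  # toy "data" point standing in for an image latent


def toy_velocity(x, t):
    # Closed-form velocity of the straight path toward `target`;
    # a trained network would replace this function.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


sample = euler_sample(toy_velocity, x0=[0.5, 0.5], steps=4)
```

Because the path here is a straight line, four Euler steps land on the target; learned velocity fields are only approximately straight, which is where the trade-off between step count and quality arises.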

[CV-26] Generating Synthetic Data via Augmentations for Improved Facial Resemblance in DreamBooth and InstantID CVPR2025

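Quick read: This paper asks how augmentation strategies can improve the facial resemblance between generated images and the original subject when Stable Diffusion (SD) is used to generate professional portraits. The key to the solution is evaluating how different augmentation strategies affect the fidelity of generated headshots, and introducing FaceDistance, a wrapper around FaceNet, to quantify and rank the facial similarity of the generations, which informs strategies for downstream applications.

Link: https://arxiv.org/abs/2505.03557
Authors: Koray Ulusan, Benjamin Kiefer
Affiliations: University of Tuebingen; LOOKOUT
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2025 Workshop “Synthetic Data for Computer Vision Workshop”, this https URL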


Abstract:The personalization of Stable Diffusion for generating professional portraits from amateur photographs is a burgeoning area, with applications in various downstream contexts. This paper investigates the impact of augmentations on improving facial resemblance when using two prominent personalization techniques: DreamBooth and InstantID. Through a series of experiments with diverse subject datasets, we assessed the effectiveness of various augmentation strategies on the generated headshots’ fidelity to the original subject. We introduce FaceDistance, a wrapper around FaceNet, to rank the generations based on facial similarity, which aided in our assessment. Ultimately, this research provides insights into the role of augmentations in enhancing facial resemblance in SDXL-generated portraits, informing strategies for their effective deployment in downstream applications.
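
Ranking generations by facial similarity boils down to comparing embedding vectors. The sketch below shows that idea with plain Euclidean distance; the function names, the three-dimensional embeddings, and the choice of distance are hypothetical and do not reproduce the FaceDistance wrapper or the FaceNet API.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_face_distance(reference, candidates):
    """Return candidate names sorted from most to least similar to the reference."""
    return sorted(candidates, key=lambda name: euclidean(reference, candidates[name]))


# Hypothetical 3-D embeddings; a real face-embedding model outputs many more dimensions.
reference = [0.1, 0.9, 0.2]
candidates = {
    "gen_a": [0.1, 0.8, 0.2],
    "gen_b": [0.9, 0.1, 0.5],
    "gen_c": [0.2, 0.7, 0.3],
}
ranking = rank_by_face_distance(reference, candidates)
```

In practice the embeddings would come from a face-recognition network applied to the reference photo and to each generated headshot.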

[CV-27] Read My Ears! Horse Ear Movement Detection for Equine Affective State Assessment

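Quick read: This paper addresses the difficulty of automated facial Action Unit (AU) detection for equine affective state assessment, which stems from the scarcity of annotated data. The key to the solution is a method for detecting and localizing specific ear AUs in horse videos that combines deep learning-based video feature extraction, recurrent neural networks, and a classic optical flow approach; it reaches 87.5% classification accuracy for the presence of ear movement.

Link: https://arxiv.org/abs/2505.03554
Authors: João Alves, Pia Haubro Andersen, Rikke Gade
Affiliations: Aalborg University; University of Copenhagen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: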


Abstract:The Equine Facial Action Coding System (EquiFACS) enables the systematic annotation of facial movements through distinct Action Units (AUs). It serves as a crucial tool for assessing affective states in horses by identifying subtle facial expressions associated with discomfort. However, the field of horse affective state assessment is constrained by the scarcity of annotated data, as manually labelling facial AUs is both time-consuming and costly. To address this challenge, automated annotation systems are essential for leveraging existing datasets and improving affective-state detection tools. In this work, we study different methods for specific ear AU detection and localization from horse videos. We build on past work on deep learning-based video feature extraction combined with recurrent neural networks for the video classification task, as well as a classic optical flow-based approach. We achieve 87.5% classification accuracy of ear movement presence on a public horse video dataset, demonstrating the potential of our approach. We discuss future directions to develop these systems, with the aim of bridging the gap between automated AU detection and practical applications in equine welfare and veterinary diagnostics. Our code will be made publicly available at this https URL.

[CV-28] Panoramic Out-of-Distribution Segmentation

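Quick read: This paper tackles Out-of-distribution Segmentation (OoS) in panoramic images. Current panoramic semantic segmentation methods cannot identify outliers, and pinhole-camera OoS models perform poorly in the panoramic domain because of background clutter and pixel distortions. The key to the solution is a new task, Panoramic Out-of-distribution Segmentation (PanOoS), together with POS, the first solution for it. POS adapts to the characteristics of panoramic images through text-guided prompt distribution learning and combines a disentanglement strategy, Prompt-based Restoration Attention (PRA), and Bilevel Prompt Distribution Learning (BPDL) to improve cross-domain generalization and segmentation accuracy.

Link: https://arxiv.org/abs/2505.03539
Authors: Mengfei Duan, Kailun Yang, Yuheng Zhang, Yihong Cao, Fei Teng, Kai Luo, Jiaming Zhang, Zhiyong Li, Shutao Li
Affiliations: Hunan University; Karlsruhe Institute of Technology; ETH Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: Code and datasets will be available at this https URL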


Abstract:Panoramic imaging enables capturing 360° images with an ultra-wide Field-of-View (FoV) for dense omnidirectional perception. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out-of-distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to background clutter and pixel distortions. To address these issues, we introduce a new task, Panoramic Out-of-distribution Segmentation (PanOoS), achieving OoS for panoramas. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text-guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross-domain generalization capability of CLIP. The proposed Prompt-based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self-adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per-pixel mask embeddings via semantic prototype supervision. Besides, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state-of-the-art pinhole-OoS methods. Moreover, POS achieves leading closed-set segmentation capabilities. Code and datasets will be available at this https URL.
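
FPR95, one of the two metrics quoted above, is the false positive rate among inlier pixels at the score threshold that recovers 95% of the outlier pixels. A minimal sketch under that common definition follows; the function name `fpr_at_tpr` and the tie handling are assumptions, and the benchmark's own evaluation code may differ in detail.

```python
import math


def fpr_at_tpr(scores, labels, tpr_target=0.95):
    """scores: higher means more likely outlier; labels: 1 = outlier, 0 = inlier."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Smallest number of outliers that must be accepted to reach the target TPR
    k = math.ceil(tpr_target * len(pos))
    threshold = pos[k - 1]
    false_pos = sum(1 for s in neg if s >= threshold)
    return false_pos / len(neg)


# Toy per-pixel anomaly scores: four outlier pixels, four inlier pixels
scores = [0.9, 0.8, 0.7, 0.4, 0.1, 0.2, 0.5, 0.3]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
fpr95 = fpr_at_tpr(scores, labels)
```

Lower is better: a decrease in FPR95 means fewer ordinary pixels are flagged as outliers at the same detection rate.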

[CV-29] RAIL: Region-Aware Instructive Learning for Semi-Supervised Tooth Segmentation in CBCT

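Quick read: This paper addresses semi-supervised tooth segmentation in 3D dental cone-beam computed tomography (CBCT) images, where labeled data is scarce. Two problems stand out: insufficient supervision in structurally ambiguous or mislabeled regions, and performance degradation caused by unreliable pseudo-labels on unlabeled data. The key to the solution is Region-Aware Instructive Learning (RAIL), a dual-group, dual-student semi-supervised framework that alternates training between the two groups to promote inter-group knowledge transfer and collaborative region-aware instruction while reducing overfitting to the characteristics of any single model. RAIL introduces two instructive mechanisms. The Disagreement-Focused Supervision (DFS) Controller, used in the supervised phase, instructs predictions only in regions where student outputs differ from both the ground truth and the best student, which concentrates supervision where it is most needed. The Confidence-Aware Learning (CAL) Modulator, used in the unsupervised phase, reinforces agreement in high-confidence regions and reduces the effect of low-confidence predictions, which improves the overall reliability of the pseudo-labels.

Link: https://arxiv.org/abs/2505.03538
Authors: Chuyu Zhao, Hao Huang, Jiashuo Guo, Ziyu Shen, Zhongwei Zhou, Jie Liu, Zekuan Yu
Affiliations: School of Computer Science & Technology, Beijing Jiaotong University; Academy for Engineering and Technology, Fudan University; Department of Oral and Maxillofacial Surgery, General Hospital of Ningxia Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: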


Abstract:Semi-supervised learning has become a compelling approach for 3D tooth segmentation from CBCT scans, where labeled data is minimal. However, existing methods still face two persistent challenges: limited corrective supervision in structurally ambiguous or mislabeled regions during supervised training and performance degradation caused by unreliable pseudo-labels on unlabeled data. To address these problems, we propose Region-Aware Instructive Learning (RAIL), a dual-group dual-student, semi-supervised framework. Each group contains two student models guided by a shared teacher network. By alternating training between the two groups, RAIL promotes intergroup knowledge transfer and collaborative region-aware instruction while reducing overfitting to the characteristics of any single model. Specifically, RAIL introduces two instructive mechanisms. Disagreement-Focused Supervision (DFS) Controller improves supervised learning by instructing predictions only within areas where student outputs diverge from both ground truth and the best student, thereby concentrating supervision on structurally ambiguous or mislabeled areas. In the unsupervised phase, Confidence-Aware Learning (CAL) Modulator reinforces agreement in regions with high model certainty while reducing the effect of low-confidence predictions during training. This helps prevent our model from learning unstable patterns and improves the overall reliability of pseudo-labels. Extensive experiments on four CBCT tooth segmentation datasets show that RAIL surpasses state-of-the-art methods under limited annotation. Our code will be available at this https URL.
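
The two mechanisms can be pictured as per-voxel masks. The sketch below is a loose reading of the abstract, not the authors' implementation: `disagreement_mask` marks voxels where a student differs from both the label and the best student, and `confidence_weights` zeroes out voxels whose top class probability falls below a threshold. The function names and the threshold of 0.8 are assumptions.

```python
def disagreement_mask(student, best_student, ground_truth):
    """1 where the student differs from both the label and the best student, else 0."""
    return [
        int(s != g and s != b)
        for s, b, g in zip(student, best_student, ground_truth)
    ]


def confidence_weights(probs, threshold=0.8):
    """Keep the top class probability as a weight only when it clears `threshold`."""
    return [max(p) if max(p) >= threshold else 0.0 for p in probs]


# Toy flattened label maps with four voxels (class ids are illustrative)
mask = disagreement_mask(
    student=[1, 2, 0, 3],
    best_student=[1, 1, 0, 2],
    ground_truth=[1, 1, 0, 3],
)
weights = confidence_weights([[0.9, 0.1], [0.6, 0.4]])
```

A supervised loss would then be multiplied by `mask`, and an unsupervised consistency loss by `weights`, so that each phase concentrates on the voxels it is meant to target.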

[CV-30] Coop-WD: Cooperative Perception with Weighting and Denoising for Robust V2V Communication

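Quick read: This paper addresses the impact of vehicle-to-vehicle (V2V) channel impairments on cooperative perception, and in particular the poor generalization of existing work across impairment levels. The key to the solution is Coop-WD, a joint weighting and denoising framework that applies a self-supervised contrastive model and a conditional diffusion probabilistic model hierarchically for vehicle-level and pixel-level feature enhancement. An efficient variant, Coop-WD-eco, selectively deactivates denoising to reduce processing overhead; it cuts computational cost by up to 50% under severe distortion while maintaining comparable accuracy as channel conditions improve.

Link: https://arxiv.org/abs/2505.03528
Authors: Chenguang Liu, Jianjun Chen, Yunfei Chen, Yubei He, Zhuangkun Wei, Hongjian Sun, Haiyan Lu, Qi Hao
Affiliations: Durham University; University of Technology, Sydney; Southern University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: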


Abstract:Cooperative perception, leveraging shared information from multiple vehicles via vehicle-to-vehicle (V2V) communication, plays a vital role in autonomous driving to alleviate the limitation of single-vehicle perception. Existing works have explored the effects of V2V communication impairments on perception precision, but they lack generalization to different levels of impairments. In this work, we propose a joint weighting and denoising framework, Coop-WD, to enhance cooperative perception subject to V2V channel impairments. In this framework, the self-supervised contrastive model and the conditional diffusion probabilistic model are adopted hierarchically for vehicle-level and pixel-level feature enhancement. An efficient variant model, Coop-WD-eco, is proposed to selectively deactivate denoising to reduce processing overhead. Rician fading, non-stationarity, and time-varying distortion are considered. Simulation results demonstrate that the proposed Coop-WD outperforms conventional benchmarks in all types of channels. Qualitative analysis with visual examples further proves the superiority of our proposed method. The proposed Coop-WD-eco achieves up to 50% reduction in computational cost under severe distortion while maintaining comparable accuracy as channel conditions improve.

[CV-31] Optimization of Module Transferability in Single Image Super-Resolution: Universality Assessment and Cycle Residual Blocks

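Quick read: This paper observes that deep learning research on Single Image Super-Resolution (SISR) has focused mainly on performance gains and has neglected the transferability of model components. The key to the solution is the concept of "Universality" and its associated definitions, which extend the traditional notion of "Generalization" to cover how easily modules transfer, and the Universality Assessment Equation (UAE), a metric for how readily a module can be transplanted across models. Guided by the UAE results, the authors design two optimized modules, the Cycle Residual Block (CRB) and the Depth-Wise Cycle Residual Block (DCRB), which perform well across multiple benchmark datasets and application scenarios.

Link: https://arxiv.org/abs/2505.03522
Authors: Haotong Cheng, Zhiqi Zhang, Hao Li, Xinshang Zhang
Affiliations: Jilin University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: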


Abstract:Deep learning has substantially advanced Single Image Super-Resolution (SISR). However, existing research has predominantly focused on raw performance gains, with little attention paid to quantifying the transferability of architectural components. In this paper, we introduce the concept of “Universality” and its associated definitions, which extend the traditional notion of “Generalization” to encompass the modules’ ease of transferability, thus revealing the relationships between module universality and model generalizability. Then we propose the Universality Assessment Equation (UAE), a metric for quantifying how readily a given module could be transplanted across models. Guided by the UAE results of standard residual blocks and other plug-and-play modules, we further design two optimized modules, Cycle Residual Block (CRB) and Depth-Wise Cycle Residual Block (DCRB). Through comprehensive experiments on natural-scene benchmarks, remote-sensing datasets, extreme-industrial imagery and on-device deployments, we demonstrate that networks embedded with the proposed plug-and-play modules outperform several state-of-the-art methods, reaching a PSNR enhancement of up to 0.83 dB or enabling a 71.3% reduction in parameters with negligible loss in reconstruction fidelity.

[CV-32] From Neurons to Computation: Biological Reservoir Computing for Pattern Recognition

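Quick read: This paper proposes a new reservoir computing (RC) paradigm built on biological neural networks and examines whether it can perform pattern recognition tasks traditionally handled by artificial neural networks. The key to the solution is using cultured biological neurons as the reservoir substrate to build a biological reservoir computing (BRC) system: neuronal activity is recorded with a multi-electrode array (MEA), and input data is mapped into a high-dimensional biological feature space in which a simple linear classifier can perform pattern recognition.

Link: https://arxiv.org/abs/2505.03510
Authors: Ludovico Iannello, Luca Ciampi, Gabriele Lagani, Fabrizio Tonelli, Eleonora Crocco, Lucio Maria Calcagnile, Angelo Di Garbo, Federico Cremisi, Giuseppe Amato
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: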


Abstract:In this paper, we introduce a novel paradigm for reservoir computing (RC) that leverages a pool of cultured biological neurons as the reservoir substrate, creating a biological reservoir computing (BRC) system. This system operates similarly to an echo state network (ESN), with the key distinction that the neural activity is generated by a network of cultured neurons, rather than being modeled by traditional artificial computational units. The neuronal activity is recorded using a multi-electrode array (MEA), which enables high-throughput recording of neural signals. In our approach, inputs are introduced into the network through a subset of the MEA electrodes, while the remaining electrodes capture the resulting neural activity. This generates a nonlinear mapping of the input data to a high-dimensional biological feature space, where distinguishing between data becomes more efficient and straightforward, allowing a simple linear classifier to perform pattern recognition tasks effectively. To evaluate the performance of our proposed system, we present an experimental study that includes various input patterns, such as positional codes, bars with different orientations, and a digit recognition task. The results demonstrate the feasibility of using biological neural networks to perform tasks traditionally handled by artificial neural networks, paving the way for further exploration of biologically-inspired computing systems, with potential applications in neuromorphic engineering and bio-hybrid computing.
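
For readers unfamiliar with the artificial counterpart, the sketch below shows the state update of a small echo state network: a fixed random recurrent layer turns a scalar input sequence into high-dimensional states, and only a linear readout on those states would be trained. In the paper the cultured neurons play the role of this fixed layer. The reservoir size, weight scaling, and function name are assumptions chosen for a compact example.

```python
import math
import random


def run_reservoir(inputs, n_units=20, seed=0):
    """Drive a fixed random tanh reservoir with a scalar input sequence."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1, 1) for _ in range(n_units)]
    # Small recurrent weights keep the dynamics stable (a common ESN heuristic)
    w = [
        [rng.uniform(-1, 1) * 0.9 / n_units for _ in range(n_units)]
        for _ in range(n_units)
    ]
    state = [0.0] * n_units
    states = []
    for u in inputs:
        pre = [
            w_in[i] * u + sum(w[i][j] * state[j] for j in range(n_units))
            for i in range(n_units)
        ]
        state = [math.tanh(p) for p in pre]
        states.append(state)
    return states


# Each input step yields a 20-dimensional feature vector for a linear classifier
states = run_reservoir([0.0, 1.0, 0.5, -1.0])
```

The reservoir weights are never trained, which is what makes it possible to swap the simulated layer for a physical or biological substrate.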

[CV-33] Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking IJCAI2025

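Quick read: This paper addresses two problems in self-supervised RGB-T tracking: modality fusion is degraded when erroneous pseudo-labels omit the object region or introduce background noise, and pseudo-label noise triggered by similar objects further hurts tracking performance. The key to the solution is GDSTrack, which introduces dynamic graph fusion and temporal diffusion: it builds a dynamic graph structure over neighboring frames and uses the denoising capability of a generative model to focus on the object's coherent regions and improve robustness against similar-object noise.

Link: https://arxiv.org/abs/2505.03507
Authors: Shenglan Li, Rui Yao, Yong Zhou, Hancheng Zhu, Kunyang Sun, Bing Liu, Zhiwen Shao, Jiaqi Zhao
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by the 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)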


Abstract:To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region by an erroneous pseudo-label or the introduction of background noise affects the efficiency of modality fusion, while pseudo-label noise triggered by similar-object noise can further affect the tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats them as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object’s coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, thus improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods. The source code is available at this https URL.

[CV-34] MRI motion correction via efficient residual-guided denoising diffusion probabilistic models

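Quick read: This paper addresses the image quality degradation and impaired quantitative analysis caused by motion artifacts in magnetic resonance imaging (MRI). Conventional strategies such as repeated acquisitions or motion tracking are effective but costly and workflow-intensive. The proposed Res-MoCoDiff is an efficient denoising diffusion probabilistic model tailored to MRI motion artifact correction. Its key idea is a novel residual error shifting mechanism in the forward diffusion process, which aligns the noise distribution with motion-corrupted data and enables an efficient four-step reverse diffusion. A U-Net backbone enhanced with Swin-Transformer blocks improves adaptability across resolutions.

Link: https://arxiv.org/abs/2505.03498
Authors: Mojtaba Safari, Shansong Wang, Qiang Li, Zach Eidex, Richard L.J. Qiu, Chih-Wei Chang, Hui Mao, Xiaofeng Yang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: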


Abstract:Purpose: Motion artifacts in magnetic resonance imaging (MRI) significantly degrade image quality and impair quantitative analysis. Conventional mitigation strategies, such as repeated acquisitions or motion tracking, are costly and workflow-intensive. This study introduces Res-MoCoDiff, an efficient denoising diffusion probabilistic model tailored for MRI motion artifact correction. Methods: Res-MoCoDiff incorporates a novel residual error shifting mechanism in the forward diffusion process, aligning the noise distribution with motion-corrupted data and enabling an efficient four-step reverse diffusion. A U-Net backbone in which Swin-Transformer blocks replace conventional attention layers improves adaptability across resolutions. Training employs a combined l1+l2 loss, which promotes image sharpness and reduces pixel-level errors. Res-MoCoDiff was evaluated on a synthetic dataset generated using a realistic motion simulation framework and on an in-vivo dataset. Comparative analyses were conducted against established methods, including CycleGAN, Pix2pix, and MT-DDPM, using quantitative metrics such as peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and normalized mean squared error (NMSE). Results: The proposed method demonstrated superior performance in removing motion artifacts across all motion severity levels. Res-MoCoDiff consistently achieved the highest SSIM and the lowest NMSE values, with a PSNR of up to 41.91±2.94 dB for minor distortions. Notably, the average sampling time was reduced to 0.37 seconds per batch of two image slices, compared with 101.74 seconds for conventional approaches.
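
Two of the reported metrics, PSNR and NMSE, have simple closed forms. The sketch below uses the usual textbook definitions on flattened pixel lists; the `data_range` of 1.0 and the normalization of NMSE by the reference energy are assumptions, since the abstract does not spell out the paper's conventions.

```python
import math


def nmse(reference, estimate):
    """Squared error normalised by the energy of the reference image."""
    num = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    den = sum(r ** 2 for r in reference)
    return num / den


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for intensities spanning `data_range`."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy four-pixel "images" with intensities in [0, 1]
ref = [0.0, 0.5, 1.0, 0.5]
est = [0.1, 0.5, 0.9, 0.5]
print(f"PSNR = {psnr(ref, est):.2f} dB, NMSE = {nmse(ref, est):.4f}")
```

Higher PSNR and lower NMSE both indicate a corrected image closer to the motion-free reference.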

[CV-35] UPMAD-Net: A Brain Tumor Segmentation Network with Uncertainty Guidance and Adaptive Multimodal Feature Fusion

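Quick read: This paper addresses the challenges of brain tumor segmentation, namely irregular tumor shapes, vague boundaries, and high variability. The key to the solution is combining deep learning with prior knowledge derived from a region-growing algorithm: a multi-scale feature fusion (MSFF) module and adaptive attention mechanisms (AAM) extract multi-scale features and capture global context, and a Monte Carlo Dropout (MC Dropout) strategy provides uncertainty estimation to improve robustness in low-confidence regions.

Link: https://arxiv.org/abs/2505.03494
Authors: Zhanyuan Jia, Ni Yao, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Fubao Zhu, Chen Zhao, Weihua Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures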


Abstract:Background: Brain tumor segmentation has a significant impact on the diagnosis and treatment of brain tumors. Accurate brain tumor segmentation remains challenging due to their irregular shapes, vague boundaries, and high variability. Objective: We propose a brain tumor segmentation method that combines deep learning with prior knowledge derived from a region-growing algorithm. Methods: The proposed method utilizes a multi-scale feature fusion (MSFF) module and adaptive attention mechanisms (AAM) to extract multi-scale features and capture global contextual information. To enhance the model’s robustness in low-confidence regions, the Monte Carlo Dropout (MC Dropout) strategy is employed for uncertainty estimation. Results: Extensive experiments demonstrate that the proposed method achieves superior performance on Brain Tumor Segmentation (BraTS) datasets, significantly outperforming various state-of-the-art methods. On the BraTS2021 dataset, the test Dice scores are 89.18% for Enhancing Tumor (ET) segmentation, 93.67% for Whole Tumor (WT) segmentation, and 91.23% for Tumor Core (TC) segmentation. On the BraTS2019 validation set, the validation Dice scores are 87.43%, 90.92%, and 90.40% for ET, WT, and TC segmentation, respectively. Ablation studies further confirmed the contribution of each module to segmentation accuracy, indicating that each component played a vital role in overall performance improvement. Conclusion: This study proposed a novel 3D brain tumor segmentation network based on the U-Net architecture. By incorporating the prior knowledge and employing the uncertainty estimation method, the robustness and performance were improved. The code for the proposed method is available at this https URL.
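
The Dice scores above and the MC Dropout uncertainty both reduce to short formulas. The sketch below computes Dice on binary masks and a per-voxel variance across several stochastic forward passes. The function names and the use of variance as the uncertainty measure are assumptions; the abstract does not say which statistic the network uses.

```python
import statistics


def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total


def mc_uncertainty(passes):
    """Per-voxel variance over T stochastic passes (each pass is a list of probabilities)."""
    return [statistics.pvariance(voxel) for voxel in zip(*passes)]


dice = dice_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])

# Three hypothetical dropout-enabled passes over two voxels
uncertainty = mc_uncertainty([[0.9, 0.2], [0.7, 0.6], [0.8, 0.4]])
```

Voxels with high variance across passes are the low-confidence regions that an uncertainty-guided method would treat with extra care.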

[CV-36] Blending 3D Geometry and Machine Learning for Multi-View Stereopsis

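Quick read: Traditional multi-view stereo (MVS) methods depend on photometric and geometric consistency constraints, whereas modern learning-based algorithms lack direct supervision of geometric consistency during learning. The key to the solution is GC MVSNet plus plus, which actively enforces geometric consistency (GC) of reference-view depth maps across multiple source views and at multiple scales during the learning phase, and which significantly accelerates learning by directly penalizing geometrically inconsistent pixels. The paper also introduces a densely connected cost regularization network with two distinct block designs to strengthen regularization.

Link: https://arxiv.org/abs/2505.03470
Authors: Vibhas Vats, Md. Alimoor Reza, David Crandall, Soon-heung Jung
Affiliations: Indiana University, Bloomington, IN 47408, USA; Drake University, Des Moines, IA 50311, USA; Electronics and Telecommunications Research Institute, Daejeon 34129, South Korea
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Machine Learning (cs.LG)
Comments: A pre-print – paper under-review. arXiv admin note: substantial text overlap with arXiv:2310.19583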


Abstract:Traditional multi-view stereo (MVS) methods primarily depend on photometric and geometric consistency constraints. In contrast, modern learning-based algorithms often rely on the plane sweep algorithm to infer 3D geometry, applying explicit geometric consistency (GC) checks only as a post-processing step, with no impact on the learning process itself. In this work, we introduce GC MVSNet plus plus, a novel approach that actively enforces geometric consistency of reference view depth maps across multiple source views (multi view) and at various scales (multi scale) during the learning phase (see Fig. 1). This integrated GC check significantly accelerates the learning process by directly penalizing geometrically inconsistent pixels, effectively halving the number of training iterations compared to other MVS methods. Furthermore, we introduce a densely connected cost regularization network with two distinct block designs, simple and feature dense, optimized to harness dense feature connections for enhanced regularization. Extensive experiments demonstrate that our approach achieves a new state of the art on the DTU and BlendedMVS datasets and secures second place on the Tanks and Temples benchmark. To our knowledge, GC MVSNet plus plus is the first method to enforce multi-view, multi-scale supervised geometric consistency during learning. Our code is available.

[CV-37] Nonperiodic dynamic CT reconstruction using backward-warping INR with regularization of diffeomorphism (BIRD)

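Quick read: This paper addresses motion artifacts in nonperiodic dynamic CT reconstruction, in particular the extreme limited-angle problem that arises with nonperiodic rapid motion such as a fast-beating heart. Traditional and deep learning methods are limited in generalization and detail preservation. Implicit neural representation (INR) techniques are promising but suffer from computational inefficiency, difficulty balancing deformation vector field (DVF) complexity against anatomical plausibility, and difficulty preserving fine details without additional patient-specific scans. The proposed BIRD framework addresses these issues through four key contributions: backward-warping deformation that computes each dynamic voxel directly and lowers computational cost, diffeomorphism-based DVF regularization that ensures anatomical plausibility, motion-compensated analytical reconstruction that enhances detail without extra scans, and a dimensional-reduction design for efficient 4D coordinate encoding.

Link: https://arxiv.org/abs/2505.03463
Authors: Muge Du, Zhuozhao Zheng, Wenying Wang, Guotao Quan, Wuliang Shi, Le Shen, Li Zhang, Liang Li, Yinong Liu, Yuxiang Xing
Affiliations: Tsinghua University; Institute of Precision Medicine, Tsinghua University; Department of Radiology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University; United Imaging Healthcare; Shanghai United Imaging Healthcare Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: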


Abstract:Dynamic computed tomography (CT) reconstruction faces significant challenges in addressing motion artifacts, particularly for nonperiodic rapid movements such as cardiac imaging with fast heart rates. Traditional methods struggle with the extreme limited-angle problems inherent in nonperiodic cases. Deep learning methods have improved performance but face generalization challenges. Recent implicit neural representation (INR) techniques show promise through self-supervised deep learning, but have critical limitations: computational inefficiency due to forward-warping modeling, difficulty balancing DVF complexity with anatomical plausibility, and challenges in preserving fine details without additional patient-specific pre-scans. This paper presents a novel INR-based framework, BIRD, for nonperiodic dynamic CT reconstruction. It addresses these challenges through four key contributions: (1) backward-warping deformation that enables direct computation of each dynamic voxel with significantly reduced computational cost, (2) diffeomorphism-based DVF regularization that ensures anatomically plausible deformations while maintaining representational capacity, (3) motion-compensated analytical reconstruction that enhances fine details without requiring additional pre-scans, and (4) dimensional-reduction design for efficient 4D coordinate encoding. Through various simulations and practical studies, including digital and physical phantoms and retrospective patient data, we demonstrate the effectiveness of our approach for nonperiodic dynamic CT reconstruction with enhanced details and reduced motion artifacts. The proposed framework enables more accurate dynamic CT reconstruction with potential clinical applications, such as one-beat cardiac reconstruction, cinematic image sequences for functional imaging, and motion artifact reduction in conventional CT scans.
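
Backward warping means that every output voxel pulls its value from a displaced location in the static image, so each dynamic voxel is computed directly. The 1D sketch below illustrates that idea with linear interpolation; the function name, the clamping at the borders, and the toy data are assumptions, and it does not model BIRD's neural representation or its diffeomorphism regularizer.

```python
def backward_warp(static, displacement):
    """out[i] = static sampled at i + displacement[i], with linear interpolation."""
    n = len(static)
    out = []
    for i, d in enumerate(displacement):
        x = min(max(i + d, 0.0), n - 1.0)  # clamp to the valid range
        lo = int(x)
        hi = min(lo + 1, n - 1)
        frac = x - lo
        out.append((1.0 - frac) * static[lo] + frac * static[hi])
    return out


static = [0.0, 10.0, 20.0, 30.0]
# Per-voxel displacement field (in voxels) for one time point
warped = backward_warp(static, [0.5, 0.5, -1.0, 0.0])
```

Forward warping would instead push each static voxel to a new position and then need to resolve holes and overlaps, which is one source of the extra cost the abstract mentions.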

[CV-38] Polar Coordinate-Based 2D Pose Prior with Neural Distance Field CVPR

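[Quick Read]: This paper addresses the performance degradation of deep learning-based human pose estimation (HPE) models on RGB video in real-world sports scenarios, caused by motion blur, occlusion, and domain shift across pose representations. The key to the solution is a 2D pose prior-guided refinement method based on Neural Distance Fields (NDF). It introduces a polar coordinate-based pose representation that explicitly includes joint connection lengths, allowing erroneous pose estimates to be corrected more accurately, and defines a novel non-geodesic distance metric that is better suited to the polar representation. In addition, a gradient-based batch-projection augmentation strategy mitigates data scarcity and improves generalization across multiple pose representations.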

Link: https://arxiv.org/abs/2505.03445
Authors: Qi Gan,Sao Mai Nguyen,Eric Fenaux,Stephan Clémençon,Mounîm El Yacoubi
Affiliations: LTCI, Télécom Paris, Institut Polytechnique de Paris, France; U2IS, ENSTA Paris, Institut Polytechnique de Paris, France; Ef-e-science, France; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted by CVPRW 2025

Abstract:Human pose capture is essential for sports analysis, enabling precise evaluation of athletes’ movements. While deep learning-based human pose estimation (HPE) models from RGB videos have achieved impressive performance on public datasets, their effectiveness in real-world sports scenarios is often hindered by motion blur, occlusions, and domain shifts across different pose representations. Fine-tuning these models can partially alleviate such challenges but typically requires large-scale annotated data and still struggles to generalize across diverse sports environments. To address these limitations, we propose a 2D pose prior-guided refinement approach based on Neural Distance Fields (NDF). Unlike existing approaches that rely solely on angular representations of human poses, we introduce a polar coordinate-based representation that explicitly incorporates joint connection lengths, enabling a more accurate correction of erroneous pose estimations. Additionally, we define a novel non-geodesic distance metric that separates angular and radial discrepancies, which we demonstrate is better suited for polar representations than traditional geodesic distances. To mitigate data scarcity, we develop a gradient-based batch-projection augmentation strategy, which synthesizes realistic pose samples through iterative refinement. Our method is evaluated on a long jump dataset, demonstrating its ability to improve 2D pose estimation across multiple pose representations, making it robust across different domains. Experimental results show that our approach enhances pose plausibility while requiring only limited training data. Code is available at: this https URL.
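The polar representation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-joint skeleton `BONES`, the helper names, and the weighted distance, which only shows what keeping angular and radial discrepancies as separate terms looks like. It is not the metric defined in the paper.

```python
import math

# Hypothetical skeleton: each bone is a (parent_joint, child_joint) index pair.
BONES = [(0, 1), (1, 2)]

def to_polar(joints, bones=BONES):
    """Convert 2D joint positions into per-bone (angle, length) pairs."""
    polar = []
    for parent, child in bones:
        dx = joints[child][0] - joints[parent][0]
        dy = joints[child][1] - joints[parent][1]
        polar.append((math.atan2(dy, dx), math.hypot(dx, dy)))
    return polar

def polar_distance(pose_a, pose_b, w_angle=1.0, w_radius=1.0):
    """Illustrative distance with separate angular and radial terms."""
    total = 0.0
    for (theta_a, r_a), (theta_b, r_b) in zip(pose_a, pose_b):
        # Wrap the angular difference into [-pi, pi] before squaring.
        d_theta = math.atan2(math.sin(theta_a - theta_b),
                             math.cos(theta_a - theta_b))
        total += w_angle * d_theta ** 2 + w_radius * (r_a - r_b) ** 2
    return math.sqrt(total)

pose = to_polar([(0.0, 0.0), (0.0, 2.0), (3.0, 2.0)])
stretched = to_polar([(0.0, 0.0), (0.0, 4.0), (3.0, 4.0)])
print(polar_distance(pose, stretched))
```

In the usage lines, the second pose only lengthens the first bone, so the distance comes entirely from the radial term. A purely angular representation would not see that change.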

[CV-39] Robustness in AI-Generated Detection: Enhancing Resistance to Adversarial Attacks

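[Quick Read]: This paper addresses the insufficient robustness of current detectors for AI-generated faces under adversarial attack. The study finds that existing detection methods achieve high accuracy under standard conditions but resist adversarial examples poorly. The key to the solution is adversarial training to mitigate the impact of adversarial examples, combined with diffusion inversion and reconstruction to further improve robustness. Experiments show that the method significantly strengthens resistance to adversarial perturbations.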

Link: https://arxiv.org/abs/2505.03435
Authors: Sun Haoxuan,Hong Yan,Zhan Jiahui,Chen Haoxing,Lan Jun,Zhu Huijia,Wang Weiqiang,Zhang Liqing,Zhang Jianfu
Affiliations: Shanghai Jiao Tong University; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid advancement of generative image technology has introduced significant security concerns, particularly in the domain of face generation detection. This paper investigates the vulnerabilities of current AI-generated face detection systems. Our study reveals that while existing detection methods often achieve high accuracy under standard conditions, they exhibit limited robustness against adversarial attacks. To address these challenges, we propose an approach that integrates adversarial training to mitigate the impact of adversarial examples. Furthermore, we utilize diffusion inversion and reconstruction to further enhance detection robustness. Experimental results demonstrate that minor adversarial perturbations can easily bypass existing detection systems, but our method significantly improves the robustness of these systems. Additionally, we provide an in-depth analysis of adversarial and benign examples, offering insights into the intrinsic characteristics of AI-generated content. All associated code will be made publicly available in a dedicated repository to facilitate further research and verification.
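To make the idea of adversarial training concrete, here is a minimal sketch on a toy logistic-regression detector. The abstract does not say which attack or model the authors use; the fast gradient sign method (FGSM) and all function names below are assumptions chosen only because they fit in a few lines.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    """Linear score w.x + b of a toy detector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def logistic_loss(w, b, x, y):
    """Binary cross-entropy for one sample with label y in {0, 1}."""
    p = sigmoid(score(w, b, x))
    eps = 1e-12  # guards against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def fgsm_example(w, b, x, y, epsilon):
    """Perturb x by epsilon in the sign direction of the input gradient."""
    p = sigmoid(score(w, b, x))
    grad_x = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic regression
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad_x)]

def adversarial_training_step(w, b, x, y, epsilon, lr):
    """One SGD step on the adversarial version of the sample."""
    x_adv = fgsm_example(w, b, x, y, epsilon)
    p = sigmoid(score(w, b, x_adv))
    new_w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
    new_b = b - lr * (p - y)
    return new_w, new_b

w, b, x, y = [1.0, -2.0], 0.0, [0.5, 0.5], 1
x_adv = fgsm_example(w, b, x, y, epsilon=0.1)
print(logistic_loss(w, b, x, y), logistic_loss(w, b, x_adv, y))
```

The perturbed sample has a higher loss than the clean one, and a training step on it lowers that loss again. That is the basic loop adversarial training repeats over a dataset.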

[CV-40] A Fusion-Guided Inception Network for Hyperspectral Image Super-Resolution

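[Quick Read]: This paper addresses the dependence on precise registration when fusing low-spatial-resolution hyperspectral images (HSIs) with high-spatial-resolution panchromatic or RGB images, a registration that is often hard to achieve in practice. The key to the solution is a single-image super-resolution model, the Fusion-Guided Inception Network (FGIN), which combines an early spectral-spatial fusion module, an Inception-like multi-scale feature extraction strategy, and an optimized upsampling module to integrate spectral and spatial information and improve reconstruction quality.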

Link: https://arxiv.org/abs/2505.03431
Authors: Usman Muhammad,Jorma Laaksonen
Affiliations: Aalto University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The fusion of low-spatial-resolution hyperspectral images (HSIs) with high-spatial-resolution conventional images (e.g., panchromatic or RGB) has played a significant role in recent advancements in HSI super-resolution. However, this fusion process relies on the availability of precise alignment between image pairs, which is often challenging in real-world scenarios. To mitigate this limitation, we propose a single-image super-resolution model called the Fusion-Guided Inception Network (FGIN). Specifically, we first employ a spectral-spatial fusion module to effectively integrate spectral and spatial information at an early stage. Next, an Inception-like hierarchical feature extraction strategy is used to capture multiscale spatial dependencies, followed by a dedicated multi-scale fusion block. To further enhance reconstruction quality, we incorporate an optimized upsampling module that combines bilinear interpolation with depthwise separable convolutions. Experimental evaluations on two publicly available hyperspectral datasets demonstrate the competitive performance of our method.
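The abstract pairs bilinear interpolation with depthwise separable convolutions in the upsampling module. A short parameter count shows why that choice is lightweight. The layer sizes (3x3 kernel, 64 channels) are assumptions for illustration and are not taken from the paper; biases are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (no bias)."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise

standard = standard_conv_params(3, 64, 64)
separable = depthwise_separable_params(3, 64, 64)
print(standard, separable, standard / separable)
```

For these assumed sizes the separable layer needs 4,672 weights against 36,864 for the standard layer, roughly an eightfold reduction.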

[CV-41] Phenotype-Guided Generative Model for High-Fidelity Cardiac MRI Synthesis: Advancing Pretraining and Clinical Applications

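[Quick Read]: This paper addresses the limited performance of artificial intelligence (AI) models in cardiac magnetic resonance (CMR) imaging caused by the lack of large-scale, high-quality annotated data. The key to the solution is Cardiac Phenotype-Guided CMR Generation (CPGG), which generates diverse CMR data covering a wide range of cardiac health states in two stages: the first stage trains a generative model on cardiac phenotypes extracted from CMR data; the second stage uses a masked autoregressive diffusion model, conditioned on these phenotypes, to generate high-fidelity CMR cine sequences. This expands the pretraining data and improves performance on downstream tasks such as diagnosis and cardiac phenotype prediction.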

Link: https://arxiv.org/abs/2505.03426
Authors: Ziyu Li,Yujian Hu,Zhengyao Ding,Yiheng Mao,Haitao Li,Fan Yi,Hongkun Zhang,Zhengxing Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cardiac Magnetic Resonance (CMR) imaging is a vital non-invasive tool for diagnosing heart diseases and evaluating cardiac health. However, the limited availability of large-scale, high-quality CMR datasets poses a major challenge to the effective application of artificial intelligence (AI) in this domain. Even the available unlabeled data, and the range of health statuses it covers, fall short of what model pretraining requires, which hinders the performance of AI models on downstream tasks. In this study, we present Cardiac Phenotype-Guided CMR Generation (CPGG), a novel approach for generating diverse CMR data that covers a wide spectrum of cardiac health status. The CPGG framework consists of two stages: in the first stage, a generative model is trained using cardiac phenotypes derived from CMR data; in the second stage, a masked autoregressive diffusion model, conditioned on these phenotypes, generates high-fidelity CMR cine sequences that capture both structural and functional features of the heart in a fine-grained manner. We synthesized a massive amount of CMR data to expand the pretraining data. Experimental results show that CPGG generates high-quality synthetic CMR data, significantly improving performance on various downstream tasks, including diagnosis and cardiac phenotype prediction. These gains are demonstrated across both public and private datasets, highlighting the effectiveness of our approach. Code is available at this https URL.

[CV-42] LiftFeat: 3D Geometry-Aware Local Feature Matching ICRA2025

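[Quick Read]: This paper addresses the challenge of extracting robust and discriminative visual features in difficult scenes such as drastic lighting changes, low-texture regions, or repetitive patterns. The key to the solution is a lightweight network, LiftFeat, which improves the robustness of raw descriptors by aggregating 3D geometric features. Specifically, a pre-trained monocular depth estimation model first generates pseudo surface normal labels that supervise the extraction of 3D geometric features; a 3D geometry-aware feature lifting module then fuses the surface normal features with the raw 2D descriptor features, strengthening 2D feature description under extreme conditions.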

Link: https://arxiv.org/abs/2505.03422
Authors: Yepeng Liu,Wenpeng Lai,Zhou Zhao,Yuxuan Xiong,Jinchi Zhu,Jun Cheng,Yongchao Xu
Affiliations: Wuhan University; SF Technology; Central China Normal University; Institute for Infocomm Research, A*STAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted at ICRA 2025

Abstract:Robust and efficient local feature matching plays a crucial role in applications such as SLAM and visual localization for robotics. Despite great progress, it is still very challenging to extract robust and discriminative visual features in scenarios with drastic lighting changes, low texture areas, or repetitive patterns. In this paper, we propose a new lightweight network called LiftFeat, which lifts the robustness of raw descriptors by aggregating 3D geometric features. Specifically, we first adopt a pre-trained monocular depth estimation model to generate pseudo surface normal labels, supervising the extraction of 3D geometric features in terms of predicted surface normals. We then design a 3D geometry-aware feature lifting module to fuse surface normal features with raw 2D descriptor features. Integrating such 3D geometric features enhances the discriminative ability of 2D feature description in extreme conditions. Extensive experimental results on relative pose estimation, homography estimation, and visual localization tasks demonstrate that our LiftFeat outperforms some lightweight state-of-the-art methods. Code will be released at: this https URL.

[CV-43] Mitigating Image Captioning Hallucinations in Vision-Language Models

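[Quick Read]: This paper addresses hallucination in vision-language models (VLMs), which usually stems from distribution shift between pretraining data and test samples and undermines reliability and real-world applicability. The key to the solution is a reinforcement learning-based test-time adaptation framework that, without retraining or auxiliary VLMs, updates only the learnable parameters of the language model's layer normalization (about 0.003% of the model parameters) to reduce the distribution shift between test and pretraining samples. A CLIP-based hallucination evaluation model provides dual rewards to the VLM to lower the hallucination rate.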

Link: https://arxiv.org/abs/2505.03420
Authors: Fei Zhao,Chengcui Zhang,Runlin Zhang,Tianyang Wang,Xi Li
Affiliations: The University of Alabama at Birmingham; University of Waterloo
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hallucinations in vision-language models (VLMs) hinder reliability and real-world applicability, usually stemming from distribution shifts between pretraining data and test samples. Existing solutions, such as retraining or fine-tuning on additional data, demand significant computational resources and labor-intensive data collection, while ensemble-based methods incur additional costs by introducing auxiliary VLMs. To address these challenges, we propose a novel test-time adaptation framework using reinforcement learning to mitigate hallucinations during inference without retraining or any auxiliary VLMs. By updating only the learnable parameters in the layer normalization of the language model (approximately 0.003% of the model parameters), our method reduces distribution shifts between test samples and pretraining samples. A CLIP-based hallucination evaluation model is proposed to provide dual rewards to VLMs. Experimental results demonstrate a 15.4% and 17.3% reduction in hallucination rates on LLaVA and InstructBLIP, respectively. Our approach outperforms state-of-the-art baselines with a 68.3% improvement in hallucination mitigation, demonstrating its effectiveness.

[CV-44] CXR-AD: Component X-ray Image Dataset for Industrial Anomaly Detection

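[Quick Read]: This paper addresses internal defect detection in industrial components. Current anomaly detection datasets focus mainly on surface defects in visible-light images, and no public X-ray dataset targets internal component defects. The key to the solution is the first publicly available component X-ray anomaly detection (CXR-AD) dataset, which contains real-world X-ray images of five industrial component categories with precise pixel-level mask annotations, intended to advance internal defect detection algorithms.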

Link: https://arxiv.org/abs/2505.03412
Authors: Haoyu Bai,Jie Wang,Gaomin Li,Xuan Li,Xiaohu Zhang,Xia Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Internal defect detection constitutes a critical process in ensuring component quality, for which anomaly detection serves as an effective solution. However, existing anomaly detection datasets predominantly focus on surface defects in visible-light images, lacking publicly available X-ray datasets targeting internal defects in components. To address this gap, we construct the first publicly accessible component X-ray anomaly detection (CXR-AD) dataset, comprising real-world X-ray images. The dataset covers five industrial component categories, including 653 normal samples and 561 defect samples with precise pixel-level mask annotations. We systematically analyze the dataset characteristics and identify three major technical challenges: (1) strong coupling between complex internal structures and defect regions, (2) inherent low contrast and high noise interference in X-ray imaging, and (3) significant variations in defect scales and morphologies. To evaluate dataset complexity, we benchmark three state-of-the-art anomaly detection frameworks (feature-based, reconstruction-based, and zero-shot learning methods). Experimental results demonstrate a 29.78% average performance degradation on CXR-AD compared to MVTec AD, highlighting the limitations of current algorithms in handling internal defect detection tasks. To the best of our knowledge, CXR-AD represents the first publicly available X-ray dataset for component anomaly detection, providing a real-world industrial benchmark to advance algorithm development and enhance precision in internal defect inspection technologies.

[CV-45] DDaTR: Dynamic Difference-aware Temporal Residual Network for Longitudinal Radiology Report Generation

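[Quick Read]: This paper addresses the difficulty of capturing spatial and temporal correlations during feature extraction in longitudinal radiology report generation (LRRG), which leaves differences across exams insufficiently modeled and reduces report accuracy. The key to the solution is a dynamic difference-aware temporal residual network (DDaTR) that introduces two modules at each stage of the visual encoder: the Dynamic Feature Alignment Module (DFAM), which aligns prior features across modalities to preserve the integrity of clinical information, and the Dynamic Difference-Aware Module (DDAM), which captures favorable difference information by identifying relationships across exams. A dynamic residual network also transmits longitudinal information unidirectionally to model temporal correlations.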

Link: https://arxiv.org/abs/2505.03401
Authors: Shanshan Song,Hui Tang,Honglong Yang,Xiaomeng Li
Affiliations: Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Radiology Report Generation (RRG) automates the creation of radiology reports from medical imaging, enhancing the efficiency of the reporting process. Longitudinal Radiology Report Generation (LRRG) extends RRG by incorporating the ability to compare current and prior exams, facilitating the tracking of temporal changes in clinical findings. Existing LRRG approaches only extract features from prior and current images using a visual pre-trained encoder, which are then concatenated to generate the final report. However, these methods struggle to effectively capture both spatial and temporal correlations during the feature extraction process. Consequently, the extracted features inadequately capture the information of difference across exams and thus underrepresent the expected progressions, leading to sub-optimal performance in LRRG. To address this, we develop a novel dynamic difference-aware temporal residual network (DDaTR). In DDaTR, we introduce two modules at each stage of the visual encoder to capture multi-level spatial correlations. The Dynamic Feature Alignment Module (DFAM) is designed to align prior features across modalities for the integrity of prior clinical information. Prompted by the enriched prior features, the dynamic difference-aware module (DDAM) captures favorable difference information by identifying relationships across exams. Furthermore, our DDaTR employs the dynamic residual network to unidirectionally transmit longitudinal information, effectively modelling temporal correlations. Extensive experiments demonstrated superior performance over existing methods on three benchmarks, proving its efficacy in both RRG and LRRG tasks.

[CV-46] EOPose: Exemplar-based object reposing using Generalized Pose Correspondences CVPR2025

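[Quick Read]: This paper addresses object reposing in images, particularly for e-commerce, where many product image variants must be produced quickly. The key to the solution is EOPose, an end-to-end framework for generic object reposing built on unsupervised keypoint correspondence detection between images of different objects of the same class. Using keypoint correspondences between a target pose-guidance image and the source object image, a novel three-step approach warps and re-renders the source, producing high-quality reposed output while preserving fine-grained details such as color, texture, and brand marks.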

Link: https://arxiv.org/abs/2505.03394
Authors: Sarthak Mehrotra,Rishabh Jain,Mayur Hemani,Balaji Krishnamurthy,Mausoom Sarkar
Affiliations: Indian Institute of Technology, Bombay; MDSR Lab, Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in CVPR 2025 AI4CC workshop

Abstract:Reposing objects in images has a myriad of applications, especially for e-commerce where several variants of product images need to be produced quickly. In this work, we leverage the recent advances in unsupervised keypoint correspondence detection between different object images of the same class to propose an end-to-end framework for generic object reposing. Our method, EOPose, takes a target pose-guidance image as input and uses its keypoint correspondence with the source object image to warp and re-render the latter into the target pose using a novel three-step approach. Unlike generative approaches, our method also preserves the fine-grained details of the object such as its exact colors, textures, and brand marks. We also prepare a new dataset of paired objects based on the Objaverse dataset to train and test our network. EOPose produces high-quality reposing output as evidenced by different image quality metrics (PSNR, SSIM and FID). Besides a description of the method and the dataset, the paper also includes detailed ablation and user studies to indicate the efficacy of the proposed method.

[CV-47] Attention-aggregated Attack for Boosting the Transferability of Facial Adversarial Examples

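[Quick Read]: This paper addresses the poor transferability of adversarial examples against class-specific deep models in fine-grained vision tasks such as face recognition. Existing methods do not adequately account for the reliance of face recognition (FR) models on specific facial features, which leads to unsatisfactory attack performance. The key contribution is the Attention-aggregated Attack (AAA), whose core idea is to imitate the attention of other FR models on clean face images and destroy the facial features critical to their decisions, improving the transferability of adversarial examples.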

Link: https://arxiv.org/abs/2505.03383
Authors: Jian-Wei Li,Wen-Ze Shao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adversarial examples have revealed the vulnerability of deep learning models and raised serious concerns about information security. The transfer-based attack is a hot topic in black-box attacks that are practical to real-world scenarios where the training datasets, parameters, and structure of the target model are unknown to the attacker. However, few methods consider the particularity of class-specific deep models for fine-grained vision tasks, such as face recognition (FR), giving rise to unsatisfactory attacking performance. In this work, we first investigate what in a face exactly contributes to the embedding learning of FR models and find that both decisive and auxiliary facial features are specific to each FR model, which is quite different from the biological mechanism of human visual system. Accordingly we then propose a novel attack method named Attention-aggregated Attack (AAA) to enhance the transferability of adversarial examples against FR, which is inspired by the attention divergence and aims to destroy the facial features that are critical for the decision-making of other FR models by imitating their attentions on the clean face images. Extensive experiments conducted on various FR models validate the superiority and robust effectiveness of the proposed method over existing methods.

[CV-48] Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant

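[Quick Read]: This paper addresses challenges that medical AI assistants face in clinical use, including limited accuracy on multimodal content and insufficient validation in real-world settings. The key to the solution is RCMed, a full-stack AI assistant that improves multimodal alignment in both input and output to achieve precise anatomical delineation, accurate localization, and reliable diagnosis. Its core innovation is a self-reinforcing correlation mechanism in which visual features inform language context while language semantics guide pixel-wise attention, forming a closed loop that refines both modalities. A color region description strategy further strengthens this correlation by translating anatomical structures into semantically rich text, so that shape-location-text relationships are learned across scales.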

Link: https://arxiv.org/abs/2505.03380
Authors: Haonan Wang,Jiaji Mao,Lehan Wang,Qixiang Zhang,Marawan Elbatel,Yi Qin,Huijun Hu,Baoxun Li,Wenhui Deng,Weifeng Qin,Hongrui Li,Jialin Liang,Jun Shen,Xiaomeng Li
Affiliations: HKUST; Sun Yat-Sen Memorial Hospital; Sun Yat-Sen University; Guangdong Provincial Key Laboratory of Malignant Tumor Epigenetics and Gene Regulation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:Medical AI assistants support doctors in disease diagnosis, medical image analysis, and report generation. However, they still face significant challenges in clinical use, including limited accuracy with multimodal content and insufficient validation in real-world settings. We propose RCMed, a full-stack AI assistant that improves multimodal alignment in both input and output, enabling precise anatomical delineation, accurate localization, and reliable diagnosis through hierarchical vision-language grounding. A self-reinforcing correlation mechanism allows visual features to inform language context, while language semantics guide pixel-wise attention, forming a closed loop that refines both modalities. This correlation is enhanced by a color region description strategy, translating anatomical structures into semantically rich text to learn shape-location-text relationships across scales. Trained on 20 million image-mask-description triplets, RCMed achieves state-of-the-art precision in contextualizing irregular lesions and subtle anatomical boundaries, excelling in 165 clinical tasks across 9 modalities. It achieved a 23.5% relative improvement in cell segmentation from microscopy images over prior methods. RCMed’s strong vision-language alignment enables exceptional generalization, with state-of-the-art performance in external validation across 20 clinically significant cancer types, including novel tasks. This work demonstrates how integrated multimodal models capture fine-grained patterns, enabling human-level interpretation in complex scenarios and advancing human-centric AI healthcare.

[CV-49] Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models

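[Quick Read]: This paper addresses how to efficiently annotate images captured by wearable devices in free-living settings for physical activity research, with a focus on reducing the manual annotation burden. The key to the solution is to use an open-source vision-language model (VLM) and a fine-tuned discriminative model (DM) to automatically identify sedentary behaviour in images, automating the annotation of the most prevalent behaviour of daily living; performance is better on datasets from populations similar to the training data.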

Link: https://arxiv.org/abs/2505.03374
Authors: Abram Schonfeldt,Benjamin Maylor,Xiaofang Chen,Ronald Clark,Aiden Doherty
Affiliations: University of Oxford; Chengdu Medical College
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Introduction: Data from wearable devices collected in free-living settings, and labelled with physical activity behaviours compatible with health research, are essential for both validating existing wearable-based measurement approaches and developing novel machine learning approaches. One common way of obtaining these labels relies on laborious annotation of sequences of images captured by cameras worn by participants through the course of a day. Methods: We compare the performance of three vision language models and two discriminative models on two free-living validation studies with 161 and 111 participants, collected in Oxfordshire, United Kingdom and Sichuan, China, respectively, using the Autographer (OMG Life, defunct) wearable camera. Results: We found that the best open-source vision-language model (VLM) and fine-tuned discriminative model (DM) achieved comparable performance when predicting sedentary behaviour from single images on unseen participants in the Oxfordshire study; median F1-scores: VLM = 0.89 (0.84, 0.92), DM = 0.91 (0.86, 0.95). Performance declined for light (VLM = 0.60 (0.56,0.67), DM = 0.70 (0.63, 0.79)), and moderate-to-vigorous intensity physical activity (VLM = 0.66 (0.53, 0.85); DM = 0.72 (0.58, 0.84)). When applied to the external Sichuan study, performance fell across all intensity categories, with median Cohen’s kappa-scores falling from 0.54 (0.49, 0.64) to 0.26 (0.15, 0.37) for the VLM, and from 0.67 (0.60, 0.74) to 0.19 (0.10, 0.30) for the DM. Conclusion: Freely available computer vision models could help annotate sedentary behaviour, typically the most prevalent activity of daily living, from wearable camera images within similar populations to seen data, reducing the annotation burden.
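The results above are reported as F1-scores and Cohen's kappa. For readers unfamiliar with kappa, the following sketch computes it from two label sequences with the standard formula (observed agreement corrected for chance agreement). The function name and the toy labels are illustrative and are not data from the study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention used here: both used one identical label throughout
    return (observed - expected) / (1.0 - expected)

human = ["sedentary", "sedentary", "light", "light"]
model = ["sedentary", "light", "light", "light"]
print(cohens_kappa(human, model))
```

In this toy case the annotators agree on three of four images (0.75) while chance agreement is 0.5, so kappa is 0.5. The gap between the two numbers is why kappa gives a stricter picture than raw accuracy.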

[CV-50] 3D Surface Reconstruction with Enhanced High-Frequency Details

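[Quick Read]: This paper addresses insufficient surface detail in neural implicit 3D reconstruction. Existing methods sample the entire image at random, which makes high-frequency surface details hard to learn and yields overly smooth reconstructions. The key to the solution is to use high-frequency information to guide surface detail reconstruction: high-frequency regions are obtained from pixel gradient changes and used to adjust the ray sampling strategy dynamically, and a high-frequency weighting method constrains high-frequency details during reconstruction, improving surface quality and detail.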

Link: https://arxiv.org/abs/2505.03362
Authors: Shikun Zhang,Yiqun Wang,Cunjian Chen,Yong Li,Qiuhong Ke
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Journal of Visual Communication and Image Representation

Abstract:Neural implicit 3D reconstruction can reproduce shapes without 3D supervision, and it learns the 3D scene through volume rendering methods and neural implicit representations. Current neural surface reconstruction methods tend to randomly sample the entire image, making it difficult to learn high-frequency details on the surface, and thus the reconstruction results tend to be too smooth. We designed a method (FreNeuS) based on high-frequency information to solve the problem of insufficient surface detail. Specifically, FreNeuS uses pixel gradient changes to easily acquire high-frequency regions in an image and uses the obtained high-frequency information to guide surface detail reconstruction. High-frequency information is first used to guide the dynamic sampling of rays, applying different sampling strategies according to variations in high-frequency regions. To further enhance the focus on surface details, we have designed a high-frequency weighting method that constrains the representation of high-frequency details during the reconstruction process. Qualitative and quantitative results show that our method can reconstruct fine surface details and obtain better surface reconstruction quality compared to existing methods. In addition, our method is more applicable and can be generalized to any NeuS-based work.
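One way to picture gradient-guided ray sampling is the sketch below. It is an assumption-laden illustration, not FreNeuS itself: the central-difference gradient, the `uniform_mix` blending weight, and both function names are hypothetical choices made to keep the example short.

```python
import math
import random

def gradient_magnitude(image):
    """Per-pixel gradient magnitude of a 2D grayscale image (list of rows)."""
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the image border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
    return mag

def sample_pixels(mag, n_rays, uniform_mix=0.5, seed=0):
    """Draw pixel coordinates, favouring high-gradient (high-frequency) pixels."""
    h, w = len(mag), len(mag[0])
    pixels = [(y, x) for y in range(h) for x in range(w)]
    total = sum(sum(row) for row in mag)
    weights = []
    for y, x in pixels:
        high_freq = mag[y][x] / total if total > 0 else 1.0 / len(pixels)
        # Blend uniform sampling with gradient-proportional sampling.
        weights.append(uniform_mix / len(pixels) + (1.0 - uniform_mix) * high_freq)
    return random.Random(seed).choices(pixels, weights=weights, k=n_rays)

step_image = [[0, 0, 1, 1] for _ in range(4)]  # vertical edge between columns 1 and 2
rays = sample_pixels(gradient_magnitude(step_image), n_rays=8, uniform_mix=0.0)
print(rays)
```

With `uniform_mix=0.0` every sampled pixel lies on the edge columns. A nonzero mix keeps some rays in smooth regions so they are not ignored entirely.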

[CV-51] Interpretable Zero-shot Learning with Infinite Class Concepts

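[Quick Read]: This paper addresses two problems in zero-shot learning (ZSL): limited generalization caused by reliance on human-annotated semantics or class definitions, and class semantics generated by Large-scale Language Models (LLMs) that may lack visual grounding. The key to the solution is to redefine class semantics with an emphasis on transferability and discriminability, and to propose the InfZSL framework, which uses LLMs to dynamically generate an unlimited number of phrase-level class concepts. An entropy-based scoring process with a "goodness" concept selection mechanism mitigates LLM hallucination and ensures that the selected concepts are highly transferable and discriminative.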

Link: https://arxiv.org/abs/2505.03361
Authors: Zihan Ye,Shreyank N Gowda,Shiming Chen,Yaochu Jin,Kaizhu Huang,Xiaobo Jin
Affiliations: Xi'an Jiaotong-Liverpool University; University of Nottingham; Mohamed bin Zayed University of Artificial Intelligence; Westlake University; Duke Kunshan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zero-shot learning (ZSL) aims to recognize unseen classes by aligning images with intermediate class semantics, like human-annotated concepts or class definitions. An emerging alternative leverages Large-scale Language Models (LLMs) to automatically generate class documents. However, these methods often face challenges with transparency in the classification process and may suffer from the notorious hallucination problem in LLMs, resulting in non-visual class semantics. This paper redefines class semantics in ZSL with a focus on transferability and discriminability, introducing a novel framework called Zero-shot Learning with Infinite Class Concepts (InfZSL). Our approach leverages the powerful capabilities of LLMs to dynamically generate an unlimited array of phrase-level class concepts. To address the hallucination challenge, we introduce an entropy-based scoring process that incorporates a "goodness" concept selection mechanism, ensuring that only the most transferable and discriminative concepts are selected. Our InfZSL framework not only demonstrates significant improvements on three popular benchmark datasets but also generates highly interpretable, image-grounded concepts. Code will be released upon acceptance.
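The abstract does not spell out the entropy-based scoring, so the sketch below is only one plausible reading: a concept whose activation concentrates on few classes (low entropy) is treated as more discriminative. The scoring rule, the function names, and the toy activations are all assumptions, and transferability is not modelled.

```python
import math

def entropy(probabilities):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def concept_score(class_activations):
    """Hypothetical score: 1 minus the normalised entropy over classes."""
    total = sum(class_activations)
    if total <= 0 or len(class_activations) < 2:
        return 0.0
    probs = [a / total for a in class_activations]
    return 1.0 - entropy(probs) / math.log(len(probs))

def select_concepts(activations_by_concept, top_k):
    """Keep the top_k concepts with the highest score."""
    ranked = sorted(activations_by_concept,
                    key=lambda name: concept_score(activations_by_concept[name]),
                    reverse=True)
    return ranked[:top_k]

# Toy activations of two LLM-generated concepts over three classes.
candidates = {
    "striped fur": [0.90, 0.05, 0.05],  # fires mostly for one class
    "has eyes": [0.34, 0.33, 0.33],     # fires for every class
}
print(select_concepts(candidates, top_k=1))
```

Under this assumed rule the near-uniform concept scores close to zero and is filtered out, which matches the intuition that a concept shared by every class cannot help tell classes apart.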

[CV-52] GUAVA: Generalizable Upper Body 3D Gaussian Avatar

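[Quick Read]: This paper addresses the reconstruction of a high-quality, animatable 3D human avatar from a single image, in particular one with expressive facial and hand motions. Conventional methods typically rely on multi-view or monocular video and require per-identity training, which is complex and time-consuming, and the limited expressiveness of SMPLX makes facial expressions hard to capture. The key to the solution is an expressive human model (EHM) with stronger facial expression capability, together with the GUAVA framework, which uses inverse texture mapping and projection sampling to quickly infer upper-body 3D Gaussians from a single image and a neural refiner to improve rendering quality, achieving sub-second reconstruction and real-time animation and rendering.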

Link: https://arxiv.org/abs/2505.03351
Authors: Dongbin Zhang,Yunfei Liu,Lijian Lin,Ye Zhu,Yang Li,Minghan Qin,Yu Li,Haoqian Wang
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; International Digital Economy Academy (IDEA)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Reconstructing a high-quality, animatable 3D human avatar with expressive facial and hand motions from a single image has gained significant attention due to its broad application potential. 3D human avatar reconstruction typically requires multi-view or monocular videos and training on individual IDs, which is both complex and time-consuming. Furthermore, limited by SMPLX’s expressiveness, these methods often focus on body motion but struggle with facial expressions. To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. We leverage inverse texture mapping and projection sampling techniques to infer Ubody (upper-body) Gaussians from a single image. The rendered images are refined through a neural refiner. Experimental results demonstrate that GUAVA significantly outperforms previous methods in rendering quality and offers significant speed improvements, with reconstruction times in the sub-second range (0.1s), and supports real-time animation and rendering.

[CV-53] A Vision-Language Model for Focal Liver Lesion Classification ALT

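[Quick Read]: This paper addresses the accurate classification of focal liver lesions (FLLs) in hepatology, which is crucial for diagnosis and treatment. Conventional supervised deep learning models depend on large-scale annotated datasets, which are often limited in medical imaging. To address this, the paper proposes Liver-VLM, a model designed specifically for FLL classification. Its key idea is to exploit the multimodal learning ability of vision-language models (VLMs): class information is embedded into the text encoder, cosine similarities between image and text embeddings are computed, and the model is optimized with a cross-entropy loss, aligning image features with class-level text features even with limited labeled data. Experiments show that Liver-VLM outperforms standard CLIP and MedCLIP on the MPCT-FLLs dataset, and that a lightweight ResNet18 backbone performs better under data-constrained conditions.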

Link: https://arxiv.org/abs/2505.03350
Authors: Song Jian,Hu Yuchang,Wang Hui,Chen Yen-Wei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, 4 tables, Innovation in Medicine and Healthcare Proceedings of 13th KES-InMed 2025

Abstract:Accurate classification of focal liver lesions is crucial for diagnosis and treatment in hepatology. However, traditional supervised deep learning models depend on large-scale annotated datasets, which are often limited in medical imaging. Recently, Vision-Language models (VLMs) such as Contrastive Language-Image Pre-training model (CLIP) has been applied to image classifications. Compared to the conventional convolutional neural network (CNN), which classifiers image based on visual information only, VLM leverages multimodal learning with text and images, allowing it to learn effectively even with a limited amount of labeled data. Inspired by CLIP, we pro-pose a Liver-VLM, a model specifically designed for focal liver lesions (FLLs) classification. First, Liver-VLM incorporates class information into the text encoder without introducing additional inference overhead. Second, by calculating the pairwise cosine similarities between image and text embeddings and optimizing the model with a cross-entropy loss, Liver-VLM ef-fectively aligns image features with class-level text features. Experimental results on MPCT-FLLs dataset demonstrate that the Liver-VLM model out-performs both the standard CLIP and MedCLIP models in terms of accuracy and Area Under the Curve (AUC). Further analysis shows that using a lightweight ResNet18 backbone enhances classification performance, particularly under data-constrained conditions.
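The alignment step described in the abstract (cosine similarity between an image embedding and class-level text embeddings, optimized with cross-entropy) can be sketched as follows. The 2D toy embeddings, the function names, and the temperature value of 0.07 are assumptions for illustration; the actual encoders and hyperparameters of Liver-VLM are not given here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classification_loss(image_emb, text_embs, target, temperature=0.07):
    """Cross-entropy over temperature-scaled image-to-text cosine similarities.

    Returns (loss, predicted_class_index).
    """
    logits = [cosine_similarity(image_emb, t) / temperature for t in text_embs]
    # Numerically stable log-softmax.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - log_norm for l in logits]
    predicted = max(range(len(logits)), key=logits.__getitem__)
    return -log_probs[target], predicted

# Toy class-level text embeddings for three lesion classes (made-up values).
text_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
image_emb = [0.9, 0.1]
print(classification_loss(image_emb, text_embs, target=0))
```

Minimizing this loss pulls an image embedding toward the text embedding of its own class and away from the others. Because the text embeddings depend only on the class names, they can be computed once and reused at inference.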

[CV-54] From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection

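[Quick read]: This paper addresses the limited fine-grained open-set detection capability of language-guided open-world aerial object detection, which stems from limited datasets. The key to the solution is building a large-scale language-guided open-set aerial detection dataset, the Multi-instance Open-set Aerial Dataset (MI-OAD), with multi-level language guidance ranging from words to phrases to sentences. An automatic annotation pipeline built around an open-source large vision-language model (the OS-W2S Label Engine) annotates aerial images across diverse scenes, which improves the effectiveness of open-set detection.

Link: https://arxiv.org/abs/2505.03334
Authors: Guoting Wei, Yu Liu, Xia Yuan, Xizhe Xue, Linlin Guo, Yifan Yang, Chunxia Zhao, Zongwen Bai, Haokui Zhang, Rong Xiao
Affiliations: Nanjing University of Science and Technology; Intellifusion Inc.; Northwestern Polytechnical University; Yan’an University; Zhejiang Lab; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Comments: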

Abstract:In recent years, language-guided open-world aerial object detection has gained significant attention due to its better alignment with real-world application needs. However, due to limited datasets, most existing language-guided methods primarily focus on vocabulary, which fails to meet the demands of more fine-grained open-world detection. To address this limitation, we propose constructing a large-scale language-guided open-set aerial detection dataset, encompassing three levels of language guidance: from words to phrases, and ultimately to sentences. Centered around an open-source large vision-language model and integrating image-operation-based preprocessing with BERT-based postprocessing, we present the OS-W2S Label Engine, an automatic annotation pipeline capable of handling diverse scene annotations for aerial images. Using this label engine, we expand existing aerial detection datasets with rich textual annotations and construct a novel benchmark dataset, called Multi-instance Open-set Aerial Dataset (MI-OAD), addressing the limitations of current remote sensing grounding data and enabling effective open-set aerial detection. Specifically, MI-OAD contains 163,023 images and 2 million image-caption pairs, approximately 40 times larger than comparable datasets. We also employ state-of-the-art open-set methods from the natural image domain, trained on our proposed dataset, to validate the model’s open-set detection capabilities. For instance, when trained on our dataset, Grounding DINO achieves improvements of 29.5 AP_50 and 33.7 Recall@10 for sentence inputs under zero-shot transfer conditions. Both the dataset and the label engine will be released publicly.

[CV-55] FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

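[Quick read]: This paper addresses scene text editing: modifying or adding text in images while keeping the newly generated text faithful and visually coherent with the background. Existing methods based on latent diffusion models (LDM) still generate inaccurate or unrecognizable characters when handling non-Latin scripts such as Chinese. The key to the proposed solution is the FLUX-Text framework, which carefully investigates glyph conditioning across the visual and textual modalities and introduces lightweight glyph and text embedding modules, improving glyph understanding and generation while retaining the original generative capabilities of FLUX-Fill.

Link: https://arxiv.org/abs/2505.03329
Authors: Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Lei Sun, Xiangxiang Chu
Affiliations: Amap, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures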

Abstract:The task of scene text editing is to modify or add texts on images while maintaining the fidelity of newly generated text and visual coherence with the background. Recent works based on latent diffusion models (LDM) show improved text editing results, yet still face challenges and often generate inaccurate or unrecognizable characters, especially for non-Latin ones (e.g., Chinese), which have complex glyph structures. To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing framework based on FLUX-Fill. Specifically, we carefully investigate glyph conditioning, considering both visual and textual modalities. To retain the original generative capabilities of FLUX-Fill while enhancing its understanding and generation of glyphs, we propose lightweight glyph and text embedding modules. Owing to the lightweight design, FLUX-Text is trained only with 100K training examples compared to current popular methods trained with 2.9M ones. With no bells and whistles, our method achieves state-of-the-art performance on text editing tasks. Qualitative and quantitative experiments on the public datasets demonstrate that our method surpasses previous works in text fidelity.

[CV-56] Very High-Resolution Forest Mapping with TanDEM-X InSAR Data and Self-Supervised Learning

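[Quick read]: This paper addresses the challenge of forest mapping at high resolution with TanDEM-X interferometric synthetic aperture radar (InSAR) data, in particular overcoming the intrinsic limitations of medium-resolution products in detecting narrow roads within vegetated areas and precisely delineating forest boundaries. The key to the solution is self-supervised learning, which extracts highly informative representations from the input features and is followed by a supervised training step, improving classification accuracy when only a small number of reliable labels are available. In a real-case scenario over the Amazon rainforest, this framework outperforms fully supervised methods, providing a promising starting point for large-scale, very high-resolution forest mapping.

Link: https://arxiv.org/abs/2505.03327
Authors: José-Luis Bueso-Bello, Benjamin Chauvel, Daniel Carcereri, Philipp Posovszky, Pietro Milillo, Jennifer Ruiz, Juan-Carlos Fernández-Diaz, Carolina González, Michele Martone, Ronny Hänsch, Paola Rizzoli
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Preprint submitted to Remote Sensing of Environment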

Abstract:Deep learning models have shown encouraging capabilities for accurately mapping forests at medium resolution with TanDEM-X interferometric SAR data. Such models, like most current state-of-the-art deep learning techniques in remote sensing, are trained in a fully-supervised way, which requires a large amount of labeled data for training and validation. In this work, our aim is to exploit the high-resolution capabilities of the TanDEM-X mission to map forests at 6 m. The goal is to overcome the intrinsic limitations posed by medium-resolution products, which affect, e.g., the detection of narrow roads within vegetated areas and the precise delineation of the contours of forested regions. To cope with the lack of extensive reliable reference datasets at such a high resolution, we investigate self-supervised learning techniques for extracting highly informative representations from the input features, followed by a supervised training step with a significantly smaller number of reliable labels. A 1 m resolution forest/non-forest reference map over Pennsylvania, USA, allows for comparing different training approaches for the development of an effective forest mapping framework with limited labeled samples. We select the best-performing approach over this test region and apply it in a real-case forest mapping scenario over the Amazon rainforest, where only very few labeled data at high resolution are available. In this challenging scenario, the proposed self-supervised framework significantly enhances the classification accuracy with respect to fully-supervised methods, trained using the same amount of labeled data, representing an extremely promising starting point for large-scale, very high-resolution forest mapping with TanDEM-X data.

[CV-57] SD-VSum: A Method and Dataset for Script-Driven Video Summarization

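[Quick read]: This paper addresses script-driven video summarization, that is, generating a video summary that matches a user-provided script. The key to the solution is a new network architecture, SD-VSum, built around a cross-modal attention mechanism that aligns and fuses information from the visual and text modalities, enabling personalized video summaries driven by the content of the user-provided script.

Link: https://arxiv.org/abs/2505.03319
Authors: Manolis Mylonas, Evlampios Apostolidis, Vasileios Mezaris
Affiliations: CERTH-ITI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Under review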

Abstract:In this work, we introduce the task of script-driven video summarization, which aims to produce a summary of the full-length video by selecting the parts that are most relevant to a user-provided script outlining the visual content of the desired summary. Next, we extend a recently-introduced large-scale dataset for generic video summarization (VideoXum) by producing natural language descriptions of the different human-annotated summaries that are available per video. In this way we make it compatible with the introduced task, since the available triplets of “video, summary and summary description” can be used for training a method that is able to produce different summaries for a given video, driven by the provided script about the content of each summary. Finally, we develop a new network architecture for script-driven video summarization (SD-VSum), that relies on the use of a cross-modal attention mechanism for aligning and fusing information from the visual and text modalities. Our experimental evaluations demonstrate the advanced performance of SD-VSum against state-of-the-art approaches for query-driven and generic (unimodal and multimodal) summarization from the literature, and document its capacity to produce video summaries that are adapted to each user’s needs about their content.

[CV-58] Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

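[Quick read]: This paper addresses the shallow reasoning and low accuracy of the reward signals produced by current multimodal reward models (RMs). The key solution is to introduce explicit long chain-of-thought (CoT) reasoning to make the model's reasoning more reliable and robust, and to improve direct-response accuracy through implicit reasoning capabilities. Specifically, the paper proposes UnifiedReward-Think, a unified CoT-based multimodal reward model trained with exploration-driven reinforcement fine-tuning: a cold start on a small amount of image generation preference data, large-scale unified multimodal preference data to elicit reasoning across vision tasks, and reinforcement fine-tuning with Group Relative Policy Optimization (GRPO), which together optimize the model's reasoning paths and output quality.

Link: https://arxiv.org/abs/2505.03318
Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL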

Abstract:Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model’s latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model’s cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model’s prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model’s reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model, (3) while incorrectly predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.
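
As a rough illustration of the "group relative" idea behind GRPO, the sketch below scores each sampled response against the other responses drawn for the same prompt by standardizing its reward within the group. It is a simplified sketch, not the paper's training code: the binary reward (1 if the predicted preference is correct, else 0), the use of the population standard deviation, and the function name are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four sampled reasoning chains on one preference pair.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group in which every response earns the same reward yields zero advantages,
# so it contributes no learning signal in this formulation.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.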

[CV-59] 3D Gaussian Splatting Data Compression with Mixture of Priors

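[Quick read]: This paper addresses inadequate entropy models and suboptimal quantization strategies in 3D Gaussian Splatting (3DGS) data compression. In both lossless and lossy scenarios, existing methods neither fully leverage hyperprior information to build robust conditional entropy models nor apply fine-grained, element-wise quantization to improve compression granularity. The key to the solution is a new Mixture of Priors (MoP) strategy: multiple lightweight MLPs process the hyperprior information to generate diverse prior features, which a gating mechanism integrates into the MoP feature. This feature both strengthens conditional entropy modeling for lossless compression and serves as guidance for a prior-guided Coarse-to-Fine Quantization (C2FQ) in lossy compression, improving compression granularity and efficiency.

Link: https://arxiv.org/abs/2505.03310
Authors: Lei Liu, Zhenghao Chen, Dong Xu
Affiliations: The University of Hong Kong; The University of Newcastle
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: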

Abstract:3D Gaussian Splatting (3DGS) data compression is crucial for enabling efficient storage and transmission in 3D scene modeling. However, its development remains limited due to inadequate entropy models and suboptimal quantization strategies for both lossless and lossy compression scenarios, where existing methods have yet to 1) fully leverage hyperprior information to construct robust conditional entropy models, and 2) apply fine-grained, element-wise quantization strategies for improved compression granularity. In this work, we propose a novel Mixture of Priors (MoP) strategy to simultaneously address these two challenges. Specifically, inspired by the Mixture-of-Experts (MoE) paradigm, our MoP approach processes hyperprior information through multiple lightweight MLPs to generate diverse prior features, which are subsequently integrated into the MoP feature via a gating mechanism. To enhance lossless compression, the resulting MoP feature is utilized as a hyperprior to improve conditional entropy modeling. Meanwhile, for lossy compression, we employ the MoP feature as guidance information in an element-wise quantization procedure, leveraging a prior-guided Coarse-to-Fine Quantization (C2FQ) strategy with a predefined quantization step value. Specifically, we expand the quantization step value into a matrix and adaptively refine it from coarse to fine granularity, guided by the MoP feature, thereby obtaining a quantization step matrix that facilitates element-wise quantization. Extensive experiments demonstrate that our proposed 3DGS data compression framework achieves state-of-the-art performance across multiple benchmarks, including Mip-NeRF360, BungeeNeRF, DeepBlending, and TankTemples.
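
The gating step described in this abstract, where several lightweight MLPs each produce a prior feature and a gate blends them, can be sketched in plain Python as below. This is only an illustration of the general Mixture-of-Experts pattern under stated assumptions: the layer sizes, the random weights standing in for trained parameters, the softmax gate, and every function name are hypothetical and are not taken from the paper.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(z):
    m = max(z)
    exps = [math.exp(a - m) for a in z]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(rng, n_in, n_out):
    """Randomly initialized (weights, bias) pair; stands in for trained parameters."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mixture_of_priors(hyperprior, experts, gate):
    """Blend the outputs of several small two-layer MLPs with softmax gate weights."""
    prior_feats = [linear(relu(linear(hyperprior, *hidden)), *out) for hidden, out in experts]
    gate_weights = softmax(linear(hyperprior, *gate))
    dim = len(prior_feats[0])
    mop = [sum(g * feat[d] for g, feat in zip(gate_weights, prior_feats)) for d in range(dim)]
    return mop, gate_weights

rng = random.Random(0)
n_in, n_hidden, n_out, n_experts = 4, 8, 6, 3  # hypothetical sizes
experts = [(random_layer(rng, n_in, n_hidden), random_layer(rng, n_hidden, n_out))
           for _ in range(n_experts)]
gate = random_layer(rng, n_in, n_experts)
mop_feature, gate_weights = mixture_of_priors([0.5, -1.0, 0.25, 2.0], experts, gate)
```

Because the gate weights are computed from the hyperprior itself, different inputs can emphasize different experts while the blended feature keeps a fixed size.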

[CV-60] Comparative Analysis of Lightweight Deep Learning Models for Memory-Constrained Devices

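[Quick read]: This paper addresses deploying lightweight deep learning models for image classification in resource-constrained environments, where the core challenge is balancing accuracy against computational efficiency. The key to the solution is a systematic evaluation of several state-of-the-art lightweight architectures (such as MobileNetV3 Small and EfficientNetV2-S), together with hyperparameter tuning, data augmentation, and training strategies to optimize performance, especially on complex datasets. The study highlights the importance of transfer learning for improving accuracy and computational efficiency, and characterizes the trade-offs among accuracy, inference speed, model size, and floating-point operations (FLOPs), offering a reference for model deployment under limited computational resources.

Link: https://arxiv.org/abs/2505.03303
Authors: Tasnim Shahriar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 10 figures, 4 tables, submitted to Springer - Pattern Recognition and Image Analysis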

Abstract:This paper presents a comprehensive evaluation of lightweight deep learning models for image classification, emphasizing their suitability for deployment in resource-constrained environments such as low-memory devices. Five state-of-the-art architectures - MobileNetV3 Small, ResNet18, SqueezeNet, EfficientNetV2-S, and ShuffleNetV2 - are benchmarked across three diverse datasets: CIFAR-10, CIFAR-100, and Tiny ImageNet. The models are assessed using four key performance metrics: classification accuracy, inference time, floating-point operations (FLOPs), and model size. Additionally, we investigate the impact of hyperparameter tuning, data augmentation, and training paradigms by comparing pretrained models with scratch-trained counterparts, focusing on MobileNetV3 Small. Our findings reveal that transfer learning significantly enhances model accuracy and computational efficiency, particularly for complex datasets like Tiny ImageNet. EfficientNetV2 consistently achieves the highest accuracy, while MobileNetV3 offers the best balance between accuracy and efficiency, and SqueezeNet excels in inference speed and compactness. This study highlights critical trade-offs between accuracy and efficiency, offering actionable insights for deploying lightweight models in real-world applications where computational resources are limited. By addressing these challenges, this research contributes to optimizing deep learning systems for edge computing and mobile platforms.

[CV-61] 3D Can Be Explored In 2D: Pseudo-Label Generation for LiDAR Point Clouds Using Sensor-Intensity-Based 2D Semantic Segmentation

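[Quick read]: This paper addresses the reliance of 3D LiDAR point cloud semantic segmentation on large annotated datasets and its exposure to domain shift. The key to the solution is a new 3D semantic segmentation pipeline that uses aligned scenes and state-of-the-art 2D semantic segmentation methods, avoiding direct 3D annotation and any dependence on other modalities (such as camera images) at inference time. The pipeline generates 2D views colored by sensor intensity from LiDAR scans, segments them with a model pretrained in the camera domain, back-projects the results onto the 3D point cloud, and merges the labels of each 3D point with a simple voting-based estimator, so inference requires neither prior 3D annotation nor other modalities.

Link: https://arxiv.org/abs/2505.03300
Authors: Andrew Caunes, Thierry Chateau, Vincent Frémont
Affiliations: Logiroad; LS2N - Ecole Centrale de Nantes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IV2024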

Abstract:Semantic segmentation of 3D LiDAR point clouds, essential for autonomous driving and infrastructure management, is best achieved by supervised learning, which demands extensive annotated datasets and faces the problem of domain shifts. We introduce a new 3D semantic segmentation pipeline that leverages aligned scenes and state-of-the-art 2D segmentation methods, avoiding the need for direct 3D annotation or reliance on additional modalities such as camera images at inference time. Our approach generates 2D views from LiDAR scans colored by sensor intensity and applies 2D semantic segmentation to these views using a camera-domain pretrained model. The segmented 2D outputs are then back-projected onto the 3D points, with a simple voting-based estimator that merges the labels associated with each 3D point. Our main contribution is a global pipeline for 3D semantic segmentation requiring no prior 3D annotation and no other modality for inference, which can be used for pseudo-label generation. We conduct a thorough ablation study and demonstrate the potential of the generated pseudo-labels for the Unsupervised Domain Adaptation task.
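
The final merging step, where each 3D point collects the labels back-projected from the 2D views that see it, can be illustrated with a plain majority vote. This is a minimal sketch of one possible voting estimator, not the authors' code: the `(point_id, label)` input format, the `ignore_label` convention, and the function name are assumptions.

```python
from collections import Counter, defaultdict

def merge_labels_by_vote(observations, ignore_label=None):
    """Majority vote per 3D point.

    observations: iterable of (point_id, label) pairs, one pair for each 2D view
    in which the point received a label after back-projection.
    """
    votes = defaultdict(Counter)
    for point_id, label in observations:
        if label != ignore_label:  # skip views where the point got no usable label
            votes[point_id][label] += 1
    return {point_id: counts.most_common(1)[0][0] for point_id, counts in votes.items()}

# Hypothetical back-projected labels: point 0 is seen in three views, point 1 in one,
# and point 2 only ever receives the ignore label.
observations = [
    (0, "road"), (0, "road"), (0, "car"),
    (1, "car"),
    (2, None),
]
pseudo_labels = merge_labels_by_vote(observations)
```

Points that never receive a usable label are simply left out of the result, which keeps them unlabeled in the generated pseudo-labels.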

[CV-62] Towards Efficient Benchmarking of Foundation Models in Remote Sensing: A Capabilities Encoding Approach CVPR2025

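[Quick read]: This paper addresses the fact that, although many remote sensing vision foundation models have been developed for Earth observation, none consistently outperforms the others across all downstream tasks. The key to the solution is a cost-effective method, based on "capabilities encoding", that predicts a model's performance on multiple downstream tasks without fine-tuning on each one, simplifying foundation model selection and offering a fresh perspective on the existing literature.

Link: https://arxiv.org/abs/2505.03299
Authors: Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre
Affiliations: IRISA, Université Bretagne Sud, UMR 6074, Vannes, France; Centre National d’Études Spatiales (CNES), Toulouse, France; UiT The Arctic University of Norway, Tromsø, Norway
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at the MORSE workshop of CVPR 2025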

Abstract:Foundation models constitute a significant advancement in computer vision: after a single, albeit costly, training phase, they can address a wide array of tasks. In the field of Earth observation, over 75 remote sensing vision foundation models have been developed in the past four years. However, none has consistently outperformed the others across all available downstream tasks. To facilitate their comparison, we propose a cost-effective method for predicting a model’s performance on multiple downstream tasks without the need for fine-tuning on each one. This method is based on what we call “capabilities encoding.” The utility of this novel approach is twofold: we demonstrate its potential to simplify the selection of a foundation model for a given new task, and we employ it to offer a fresh perspective on the existing literature, suggesting avenues for future research. Codes are available at this https URL.

[CV-63] Base-Detail Feature Learning Framework for Visible-Infrared Person Re-Identification IJCAI2025

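[Quick read]: This paper addresses the unsatisfactory performance of visible-infrared person re-identification (VIReID) caused by the substantial discrepancies between the visible (VIS) and infrared (IR) modalities. Existing methods do not fully exploit multimodal information: they focus on extracting discriminative features from shared information and neglect modality-specific details. The key to the solution is a Base-Detail Feature Learning Framework (BDLF), which mines detail and base features through a lossless detail feature extraction module and a complementary base embedding generation mechanism, respectively. A novel correlation restriction method ensures that the resulting features enrich both detail and base knowledge across VIS and IR features, making full use of modality-shared and modality-specific information.

Link: https://arxiv.org/abs/2505.03286
Authors: Zhihao Gong, Lian Wu, Yong Xu
Affiliations: Harbin Institute of Technology (Shenzhen); GuiZhou Education University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, 2025 34th International Joint Conference on Artificial Intelligence (IJCAI 2025)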

Abstract:Visible-infrared person re-identification (VIReID) provides a solution for ReID tasks in 24-hour scenarios; however, significant challenges persist in achieving satisfactory performance due to the substantial discrepancies between visible (VIS) and infrared (IR) modalities. Existing methods inadequately leverage information from different modalities, primarily focusing on digging distinguishing features from modality-shared information while neglecting modality-specific details. To fully utilize differentiated minutiae, we propose a Base-Detail Feature Learning Framework (BDLF) that enhances the learning of both base and detail knowledge, thereby capitalizing on both modality-shared and modality-specific information. Specifically, the proposed BDLF mines detail and base features through a lossless detail feature extraction module and a complementary base embedding generation mechanism, respectively, supported by a novel correlation restriction method that ensures the features gained by BDLF enrich both detail and base knowledge across VIS and IR features. Comprehensive experiments conducted on the SYSU-MM01, RegDB, and LLCM datasets validate the effectiveness of BDLF.

[CV-64] OccCylindrical: Multi-Modal Fusion with Cylindrical Representation for 3D Semantic Occupancy Prediction

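[Quick read]: This paper addresses 3D semantic occupancy prediction for autonomous vehicles (AVs) in complex environments, that is, how to understand the surroundings more accurately to ensure safe operation. Existing multisensor fusion methods mainly use sensor information in the Cartesian coordinate system and ignore the distribution of the sensor readings, which loses fine-grained detail and degrades performance. The key to the proposed solution, OccCylindrical, is to fuse and refine the features of the different modalities under cylindrical coordinates, which preserves more geometric detail and improves prediction performance.

Link: https://arxiv.org/abs/2505.03284
Authors: Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Yaoqi Huang, Hongyu Lyu, Nguyen Hoang Khoi Tran, Tzu-Yun Tseng, Stewart Worrall
Affiliations: Australian Centre for Robotics; University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: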

Abstract:The safe operation of autonomous vehicles (AVs) is highly dependent on their understanding of the surroundings. For this, the task of 3D semantic occupancy prediction divides the space around the sensors into voxels, and labels each voxel with both occupancy and semantic information. Recent perception models have used multisensor fusion to perform this task. However, existing multisensor fusion-based approaches focus mainly on using sensor information in the Cartesian coordinate system. This ignores the distribution of the sensor readings, leading to a loss of fine-grained details and performance degradation. In this paper, we propose OccCylindrical that merges and refines the different modality features under cylindrical coordinates. Our method preserves more fine-grained geometry detail that leads to better performance. Extensive experiments conducted on the nuScenes dataset, including challenging rainy and nighttime scenarios, confirm our approach’s effectiveness and state-of-the-art performance. The code will be available at: this https URL
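
To make the change of coordinate system concrete, the sketch below converts a Cartesian point to cylindrical coordinates and assigns it to a cell of a cylindrical voxel grid. It is a generic geometric illustration, not the OccCylindrical implementation: the grid resolution, the range limits, and the function names are hypothetical.

```python
import math

def to_cylindrical(x, y, z):
    """Convert a Cartesian point to (rho, phi, z), with phi in [-pi, pi]."""
    return math.hypot(x, y), math.atan2(y, x), z

def cylindrical_voxel_index(point, rho_max, z_min, z_max, grid):
    """Return the (rho, phi, z) bin of a point, or None if it falls outside the grid."""
    n_rho, n_phi, n_z = grid
    rho, phi, z = to_cylindrical(*point)
    if rho >= rho_max or not (z_min <= z < z_max):
        return None
    i_rho = int(rho * n_rho / rho_max)
    # Clamp so that phi == pi lands in the last angular bin.
    i_phi = min(int((phi + math.pi) * n_phi / (2.0 * math.pi)), n_phi - 1)
    i_z = int((z - z_min) * n_z / (z_max - z_min))
    return i_rho, i_phi, i_z

grid = (100, 8, 16)  # hypothetical number of radial, angular, and height bins
index = cylindrical_voxel_index((1.0, 0.5, 0.0), rho_max=50.0, z_min=-5.0, z_max=3.0, grid=grid)
far = cylindrical_voxel_index((60.0, 0.0, 0.0), rho_max=50.0, z_min=-5.0, z_max=3.0, grid=grid)
```

With uniform angular bins, cells close to the sensor cover less area than distant ones, which is one commonly cited reason to prefer cylindrical grids for sensors whose readings are denser at short range.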

[CV-65] DiffVQA: Video Quality Assessment Using Diffusion Feature Extractor

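[Quick read]: This paper addresses two problems in Video Quality Assessment (VQA): existing methods (such as convolutional neural networks and vision Transformers) align poorly with human perception in complex real-world scenarios, and available datasets are limited in scale and diversity. The key to the solution is a novel VQA framework, DiffVQA, which harnesses the strong generalization of diffusion models pre-trained on large-scale data. A control module adapts the model to reconstruct the input frames, a resizing branch and a cropping branch extract semantic and distortion features, and a parallel Mamba module strengthens the modeling of long-term temporal dynamics, improving both performance and cross-dataset generalization.

Link: https://arxiv.org/abs/2505.03261
Authors: Wei-Ting Chen, Yu-Jiet Vong, Yi-Tsung Lee, Sy-Yen Kuo, Qiang Gao, Sizhuo Ma, Jian Wang
Affiliations: National Taiwan University; Snap Inc.; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: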

Abstract:Video Quality Assessment (VQA) aims to evaluate video quality based on perceptual distortions and human preferences. Despite the promising performance of existing methods using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), they often struggle to align closely with human perceptions, particularly in diverse real-world scenarios. This challenge is exacerbated by the limited scale and diversity of available datasets. To address this limitation, we introduce a novel VQA framework, DiffVQA, which harnesses the robust generalization capabilities of diffusion models pre-trained on extensive datasets. Our framework adapts these models to reconstruct identical input frames through a control module. The adapted diffusion model is then used to extract semantic and distortion features from a resizing branch and a cropping branch, respectively. To enhance the model’s ability to handle long-term temporal dynamics, a parallel Mamba module is introduced, which extracts temporal coherence augmented features that are merged with the diffusion features to predict the final score. Experiments across multiple datasets demonstrate DiffVQA’s superior performance on intra-dataset evaluations and its exceptional generalization across datasets. These results confirm that leveraging a diffusion model as a feature extractor can offer enhanced VQA performance compared to CNN and ViT backbones.

[CV-66] PROM: Prioritize Reduction of Multiplications Over Lower Bit-Widths for Efficient CNNs

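[Quick read]: This paper addresses the limited efficiency gains obtained when quantizing depthwise-separable convolutional networks, whose computational cost is distributed unevenly. Existing quantization methods do not exploit this imbalance and therefore fall short of the achievable energy and storage savings. The key to the solution is PROM, which selectively uses ternary weights for pointwise convolutions while keeping 8-bit weights for the other modules. Combined with 8-bit activation quantization, this turns pointwise convolutions into int8 additions, substantially reducing energy cost and storage requirements.

Link: https://arxiv.org/abs/2505.03254
Authors: Lukas Meiner, Jens Mehnert, Alexandru Paul Condurache
Affiliations: Robert Bosch GmbH; Universität zu Lübeck
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: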

Abstract:Convolutional neural networks (CNNs) are crucial for computer vision tasks on resource-constrained devices. Quantization effectively compresses these models, reducing storage size and energy cost. However, in modern depthwise-separable architectures, the computational cost is distributed unevenly across its components, with pointwise operations being the most expensive. By applying a general quantization scheme to this imbalanced cost distribution, existing quantization approaches fail to fully exploit potential efficiency gains. To this end, we introduce PROM, a straightforward approach for quantizing modern depthwise-separable convolutional networks by selectively using two distinct bit-widths. Specifically, pointwise convolutions are quantized to ternary weights, while the remaining modules use 8-bit weights, which is achieved through a simple quantization-aware training procedure. Additionally, by quantizing activations to 8-bit, our method transforms pointwise convolutions with ternary weights into int8 additions, which enjoy broad support across hardware platforms and effectively eliminates the need for expensive multiplications. Applying PROM to MobileNetV2 reduces the model’s energy cost by more than an order of magnitude (23.9x) and its storage size by 2.7x compared to the float16 baseline while retaining similar classification performance on ImageNet. Our method advances the Pareto frontier for energy consumption vs. top-1 accuracy for quantized convolutional models on ImageNet. PROM addresses the challenges of quantizing depthwise-separable convolutional networks to both ternary and 8-bit weights, offering a simple way to reduce energy cost and storage size.
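
The claim that ternary weights turn a pointwise convolution into additions can be checked with a small sketch: when every weight is -1, 0, or +1, each output channel is just a signed sum of int8 activations. The code below is an illustration under assumptions, not the PROM quantizer: the threshold-based ternarization, the omission of per-layer scale factors, and all names are hypothetical.

```python
def ternarize(weights, threshold):
    """Map real-valued weights to {-1, 0, +1} with a magnitude threshold (assumed scheme)."""
    return [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]

def ternary_dot(ternary_weights, int8_activations):
    """Dot product with ternary weights: only additions and subtractions are performed."""
    acc = 0  # hardware would use a wider integer accumulator here
    for w, a in zip(ternary_weights, int8_activations):
        if w == 1:
            acc += a
        elif w == -1:
            acc -= a
    return acc

def pointwise_conv_pixel(ternary_kernel, int8_pixel):
    """A 1x1 convolution at one pixel: one ternary dot product per output channel."""
    return [ternary_dot(row, int8_pixel) for row in ternary_kernel]

# Hypothetical real-valued 1x1 kernel with 2 output channels and 4 input channels.
kernel = [ternarize(row, threshold=0.2) for row in [[0.8, -0.05, -0.6, 0.1],
                                                   [-0.3, 0.9, 0.0, 0.25]]]
pixel = [10, -3, 7, 127]  # int8 activations of one pixel across the 4 input channels
out = pointwise_conv_pixel(kernel, pixel)
```

The result matches an ordinary multiply-accumulate over the same ternary weights, but no multiplication is executed in `ternary_dot`.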

[CV-67] Seeing the Abstract: Translating the Abstract Language for Vision Language Models CVPR25

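[Quick read]: This paper addresses the weakness of current Vision Language Models (VLMs) in handling abstract language: because their text corpora lack sufficient abstract vocabulary, the models struggle to represent abstract-oriented language. The key to the solution is a training-free, model-agnostic method, the Abstract-to-Concrete Translator (ACT), which uses pre-trained models and existing multimodal databases to shift abstract representations towards concrete ones that are better represented in the VLM latent space, improving performance on text-to-image retrieval.

Link: https://arxiv.org/abs/2505.03242
Authors: Davide Talon, Federico Girella, Ziyue Liu, Marco Cristani, Yiming Wang
Affiliations: Fondazione Bruno Kessler; University of Verona; Polytechnic Institute of Turin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR25. Project page: this https URL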

Abstract:Natural language goes beyond dryly describing visual content. It contains rich abstract concepts to express feeling, creativity and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and under-estimated value, with extensive analysis. Particularly, we focus our investigation on the fashion domain, a highly-representative field with abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling the concrete ones, providing novel information, and being useful in the retrieval task. However, a critical challenge emerges: current general-purpose or fashion-specific VLMs are pre-trained with databases that lack sufficient abstract words in their text corpora, thus hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms the fine-tuned VLMs in both same- and cross-dataset settings, exhibiting its effectiveness with a strong generalization capability. Moreover, the improvement introduced by ACT is consistent with various VLMs, making it a plug-and-play solution.

[CV-68] Dual-Domain Masked Image Modeling: A Self-Supervised Pretraining Strategy Using Spatial and Frequency Domain Masking for Hyperspectral Data

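[Quick read]: This paper addresses the scarcity of labeled hyperspectral image (HSI) data, which limits deep learning and in particular transformer-based architectures. The key to the solution is a self-supervised pretraining strategy, Spatial-Frequency Masked Image Modeling (SFMIM), which exploits large amounts of unlabeled data through a dual-domain masking mechanism that masks in both the spatial and frequency domains, so the model learns higher-order spectral-spatial correlations by reconstructing the masked components.

Link: https://arxiv.org/abs/2505.03220
Authors: Shaheer Mohamed, Tharindu Fernando, Sridha Sridharan, Peyman Moghadam, Clinton Fookes
Affiliations: Queensland University of Technology; CSIRO Robotics, Data61, CSIRO
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint to appear in IEEE IGARSS 2025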

Abstract:Hyperspectral images (HSIs) capture rich spectral signatures that reveal vital material properties, offering broad applicability across various domains. However, the scarcity of labeled HSI data limits the full potential of deep learning, especially for transformer-based architectures that require large-scale training. To address this constraint, we propose Spatial-Frequency Masked Image Modeling (SFMIM), a self-supervised pretraining strategy for hyperspectral data that utilizes the large portion of unlabeled data. Our method introduces a novel dual-domain masking mechanism that operates in both spatial and frequency domains. The input HSI cube is initially divided into non-overlapping patches along the spatial dimension, with each patch comprising the entire spectrum of its corresponding spatial location. In spatial masking, we randomly mask selected patches and train the model to reconstruct the masked inputs using the visible patches. Concurrently, in frequency masking, we remove portions of the frequency components of the input spectra and predict the missing frequencies. By learning to reconstruct these masked components, the transformer-based encoder captures higher-order spectral-spatial correlations. We evaluate our approach on three publicly available HSI classification benchmarks and demonstrate that it achieves state-of-the-art performance. Notably, our model shows rapid convergence during fine-tuning, highlighting the efficiency of our pretraining strategy.
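
The two masking operations described in this abstract can be sketched in plain Python: random selection of spatial patches to hide, and removal of chosen frequency components from a pixel's spectrum via a discrete Fourier transform. This is a toy sketch under assumptions, not the SFMIM implementation: the 75% mask ratio, the 8-band spectrum, the choice of bins, and all function names are hypothetical.

```python
import cmath
import random

def dft(signal):
    """Naive discrete Fourier transform (fine for short spectra)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(coeffs):
    """Inverse of dft."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def mask_frequencies(spectrum, masked_bins):
    """Zero the chosen frequency bins (and their conjugate partners) of one spectrum."""
    n = len(spectrum)
    coeffs = dft(spectrum)
    for k in masked_bins:
        coeffs[k % n] = 0
        coeffs[(-k) % n] = 0  # masking the mirrored bin keeps the result real-valued
    return [value.real for value in idft(coeffs)]

def sample_spatial_mask(num_patches, mask_ratio, rng):
    """Pick the indices of the patches that are hidden from the encoder."""
    return set(rng.sample(range(num_patches), int(num_patches * mask_ratio)))

rng = random.Random(0)
spectrum = [0.2, 0.4, 0.9, 0.7, 0.3, 0.1, 0.5, 0.6]  # hypothetical 8-band spectrum
masked_patches = sample_spatial_mask(num_patches=16, mask_ratio=0.75, rng=rng)
without_mean = mask_frequencies(spectrum, masked_bins=[0])  # drop the constant component
unchanged = mask_frequencies(spectrum, masked_bins=[])      # nothing masked
```

A pretraining objective would then ask the encoder to reconstruct the hidden patches and the removed frequency components from what remains visible.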

[CV-69] DCS-ST for Classification of Breast Cancer Histopathology Images with Limited Annotations

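[Quick read]: This paper addresses the performance decline of deep learning methods for breast cancer histopathology image classification when annotated data is limited, a critical challenge in medical imaging because annotation is costly and requires expertise. The key to the solution is improving the model's generalization and classification accuracy in the small-sample setting.

Link: https://arxiv.org/abs/2505.03204
Authors: Liu Suxing, Byungwon Min
Affiliations: Jiangxi Arts & Ceramics Technology Institute; Mokwon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: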

Abstract:Deep learning methods have shown promise in classifying breast cancer histopathology images, but their performance often declines with limited annotated data, a critical challenge in medical imaging due to the high cost and expertise required for annotations.

[CV-70] PiCo: Enhancing Text-Image Alignment with Improved Noise Selection and Precise Mask Control in Diffusion Models

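[Quick Read]: This paper addresses text-image alignment under complex text prompts, a major challenge for existing generative AI models in text-to-image compositional generation. It proposes PiCo, a training-free approach built on two core components: a noise selection module and a referring mask module. The noise selection module assesses the quality of the random noise and determines whether it suits the target text, while the referring mask module generates pixel-level masks to precisely modulate the cross-attention maps, guiding a reasonable interaction between text and image features.

Link: https://arxiv.org/abs/2505.03203
Authors: Chang Xie, Chenyi Zhuang, Pan Gao
Affiliations: Nanjing University of Aeronautics and Astronautics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: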

Abstract:Advanced diffusion models have made notable progress in text-to-image compositional generation. However, it is still a challenge for existing models to achieve text-image alignment when confronted with complex text prompts. In this work, we highlight two factors that affect this alignment: the quality of the randomly initialized noise and the reliability of the generated controlling mask. We then propose PiCo (Pick-and-Control), a novel training-free approach with two key components to tackle these two factors. First, we develop a noise selection module to assess the quality of the random noise and determine whether the noise is suitable for the target text. A fast sampling strategy is utilized to ensure efficiency in the noise selection stage. Second, we introduce a referring mask module to generate pixel-level masks and to precisely modulate the cross-attention maps. The referring mask is applied to the standard diffusion process to guide the reasonable interaction between text and image features. Extensive experiments have been conducted to verify the effectiveness of PiCo in liberating users from the tedious process of random generation and in enhancing the text-image alignment for diverse text descriptions.

[CV-71] CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization

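[Quick Read]: This paper addresses the degradation of traditional audio-only speech processing systems in challenging conditions by exploiting the inherent synchronization between a speaker's lip movements, voice, and linguistic content to make speech recognition and related tasks more robust and accurate. The key to the solution is CoGenAV, a data-efficient model trained on a dual objective derived from natural audio-visual synchrony (contrastive feature alignment and generative text prediction). Using only 223 hours of labeled data from the LRS2 dataset, it learns versatile cross-modal representations that capture fundamental correlations between modalities.

Link: https://arxiv.org/abs/2505.03186
Authors: Detao Bai, Zhiheng Ma, Xihan Wei, Liefeng Bo
Affiliations: Tongyi Lab, Alibaba Group; Shenzhen University of Advanced Technology
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: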

Abstract:The inherent synchronization between a speaker’s lip movements, voice, and the underlying linguistic content offers a rich source of information for improving speech processing tasks, especially in challenging conditions where traditional audio-only systems falter. We introduce CoGenAV, a powerful and data-efficient model designed to learn versatile audio-visual representations applicable across a wide range of speech and audio-visual tasks. CoGenAV is trained by optimizing a dual objective derived from natural audio-visual synchrony, contrastive feature alignment and generative text prediction, using only 223 hours of labeled data from the LRS2 dataset. This contrastive-generative synchronization strategy effectively captures fundamental cross-modal correlations. We showcase the effectiveness and versatility of the learned CoGenAV representations on multiple benchmarks. When utilized for Audio-Visual Speech Recognition (AVSR) on LRS2, these representations contribute to achieving a state-of-the-art Word Error Rate (WER) of 1.27. They also enable strong performance in Visual Speech Recognition (VSR) with a WER of 22.0 on LRS2, and significantly improve performance in noisy environments by over 70%. Furthermore, CoGenAV representations benefit speech reconstruction tasks, boosting performance in Speech Enhancement and Separation, and achieve competitive results in audio-visual synchronization tasks like Active Speaker Detection (ASD). Our model will be open-sourced to facilitate further development and collaboration within both academia and industry.
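
A dual objective of this kind, a contrastive alignment term plus a generative text-prediction term, might look like the sketch below. The abstract does not say which contrastive loss is used, so the InfoNCE form, the temperature, the single audio-to-video direction, and the weighting are all assumptions, as are the function names.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_z for v in row]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(audio, video, temperature=0.1):
    """InfoNCE over a batch: audio[i] should score highest against video[i]."""
    losses = []
    for i, audio_vec in enumerate(audio):
        logits = [dot(audio_vec, video_vec) / temperature for video_vec in video]
        losses.append(-log_softmax(logits)[i])
    return sum(losses) / len(losses)


def text_prediction_loss(token_logits, target_ids):
    """Mean cross-entropy of the target token at each decoding step."""
    losses = [-log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)


def dual_objective(audio, video, token_logits, target_ids, weight=1.0):
    """Sum of the alignment term and the weighted text-prediction term."""
    return contrastive_loss(audio, video) + weight * text_prediction_loss(
        token_logits, target_ids
    )
```

With unit-norm embeddings, matched audio-video pairs drive the first term toward zero, while mismatched pairs are penalized in proportion to `1 / temperature`.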

[CV-72] Interactive Instance Annotation with Siamese Networks

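[Quick Read]: This paper addresses the time-consuming and labor-intensive process of annotating instance masks, proposing SiamAnno, a Siamese-network-based framework aimed at cross-domain annotation tasks. The key to the solution is one-shot learning: the model takes a bounding box as input and predicts object boundaries, which annotators can then adjust. Without fine-tuning, it achieves state-of-the-art performance across multiple datasets, demonstrating that it can handle domain and environment shifts.

Link: https://arxiv.org/abs/2505.03184
Authors: Xiang Xu, Ruotong Li, Mengjun Yi, Baile XU, Furao Shen, Jian Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: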

Abstract:Annotating instance masks is time-consuming and labor-intensive. A promising solution is to predict contours using a deep learning model and then allow users to refine them. However, most existing methods focus on in-domain scenarios, limiting their effectiveness for cross-domain annotation tasks. In this paper, we propose SiamAnno, a framework inspired by the use of Siamese networks in object tracking. SiamAnno leverages one-shot learning to annotate previously unseen objects by taking a bounding box as input and predicting object boundaries, which can then be adjusted by annotators. Trained on one dataset and tested on another without fine-tuning, SiamAnno achieves state-of-the-art (SOTA) performance across multiple datasets, demonstrating its ability to handle domain and environment shifts in cross-domain tasks. We also provide more comprehensive results compared to previous work, establishing a strong baseline for future research. To our knowledge, SiamAnno is the first model to explore Siamese architecture for instance annotation.

[CV-73] seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models

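[Quick Read]: This paper addresses a performance trade-off in current self-supervised visual representation learning: invariance to transformations suits tasks such as image classification, whereas equivariance to transformations suits more fine-grained tasks. The key to the solution is seq-JEPA, a world modeling paradigm based on the joint-embedding predictive architecture that uses architectural inductive biases to learn two segregated representations simultaneously, one equivariant to the specified transformations and one invariant to them, without the additional equivariance predictor or loss term that conventional methods require.

Link: https://arxiv.org/abs/2505.03176
Authors: Hafez Ghaemi, Eilif Muller, Shahab Bakhtiari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: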

Abstract:Current self-supervised algorithms mostly rely on transformations such as data augmentation and masking to learn visual representations. This is achieved by inducing invariance or equivariance with respect to these transformations after encoding two views of an image. This dominant two-view paradigm can limit the flexibility of learned representations for downstream adaptation by creating performance trade-offs between invariance-related tasks such as image classification and more fine-grained equivariance-related tasks. In this work, we introduce seq-JEPA, a world modeling paradigm based on joint-embedding predictive architecture that leverages architectural inductive biases to resolve this trade-off. Without requiring an additional equivariance predictor or loss term, seq-JEPA simultaneously learns two architecturally segregated representations: one equivariant to the specified transformations and another invariant to them and suited for tasks such as classification. To do so, our model processes a short sequence of different views (observations) of an input image. Each encoded view is concatenated with embeddings corresponding to the relative transformation (action) producing the next observation in the sequence. A transformer encoder outputs an aggregate representation of this sequence, which is subsequently conditioned on the action leading to the next observation to predict its representation. Empirically, seq-JEPA achieves strong performance on equivariant benchmarks and image classification without sacrificing one for the other. Additionally, our framework excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.

[CV-74] Automated Data Curation Using GPS NLP to Generate Instruction-Action Pairs for Autonomous Vehicle Vision-Language Navigation Datasets

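[Quick Read]: This paper addresses the high cost and low efficiency of relying on manually annotated Instruction-Action (IA) data pairs when training robotic systems, especially autonomous vehicles. The key to the solution is combining voice instructions from mobile GPS applications with Natural Language Processing (NLP) to generate IA commands and responses automatically, with no human generating or retroactively tagging the data. Using this approach, the authors build ADVLAT-Engine, a fully automated data collection prototype, and show how to collect and categorize diverse instructions from freely available mobile applications and pair them with video data to form complete vision-language-action triads.

Link: https://arxiv.org/abs/2505.03174
Authors: Guillermo Roque, Erika Maquiling, Jose Giovanni Tapia Lopez, Ross Greer
Affiliations: University of California, Merced
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: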

Abstract:Instruction-Action (IA) data pairs are valuable for training robotic systems, especially autonomous vehicles (AVs), but having humans manually annotate this data is costly and time-inefficient. This paper explores the potential of using mobile application Global Positioning System (GPS) references and Natural Language Processing (NLP) to automatically generate large volumes of IA commands and responses without having a human generate or retroactively tag the data. In our pilot data collection, by driving to various destinations and collecting voice instructions from GPS applications, we demonstrate a means to collect and categorize the diverse sets of instructions, further accompanied by video data to form complete vision-language-action triads. We provide details on our completely automated data collection prototype system, ADVLAT-Engine. We characterize collected GPS voice instructions into eight different classifications, highlighting the breadth of commands and referentialities available for curation from freely available mobile applications. Through research and exploration into the automation of IA data pairs using GPS references, the potential to increase the speed and volume at which high-quality IA datasets are created, while minimizing cost, can pave the way for robust vision-language-action (VLA) models to serve tasks in vision-language navigation (VLN) and human-interactive autonomous systems.

[CV-75] RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph

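[Quick Read]: This paper addresses the challenge that long video understanding poses for Large Multi-modal Models (LMMs), which struggle with videos lasting minutes to hours because they lack explicit memory and retrieval mechanisms. The key to the solution is RAVU (Retrieval Augmented Video Understanding), a framework that enhances video understanding through compositional reasoning over a spatio-temporal graph. It builds a graph representation of the video that captures spatial and temporal relationships between entities and serves as long-term memory, allowing objects and their actions to be tracked over time. Complex queries are decomposed into a sequence of reasoning steps that are executed on the graph to retrieve the relevant key information, which improves accuracy on long videos, particularly for queries that require multi-hop reasoning and cross-frame tracking.

Link: https://arxiv.org/abs/2505.03173
Authors: Sameer Malik, Moyuru Yamada, Ayush Singh, Dishank Aggarwal
Affiliations: Fujitsu Research of India Private Limited
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: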

Abstract:Comprehending long videos remains a significant challenge for Large Multi-modal Models (LMMs). Current LMMs struggle to process videos that run from minutes to hours because they lack explicit memory and retrieval mechanisms. To address this limitation, we propose RAVU (Retrieval Augmented Video Understanding), a novel framework for video understanding enhanced by retrieval with compositional reasoning over a spatio-temporal graph. We construct a graph representation of the video, capturing both spatial and temporal relationships between entities. This graph serves as a long-term memory, allowing us to track objects and their actions across time. To answer complex queries, we decompose the queries into a sequence of reasoning steps and execute these steps on the graph, retrieving relevant key information. Our approach enables more accurate understanding of long videos, particularly for queries that require multi-hop reasoning and tracking objects across frames. Our approach demonstrates superior performance with limited retrieved frames (5-10) compared with other SOTA methods and baselines on two major video QA datasets, NExT-QA and EgoSchema.

[CV-76] StableMotion: Training Motion Cleanup Models with Unpaired Corrupted Data

[Quick Read]: This paper addresses visually jarring artifacts in motion capture (mocap) data caused by inaccurate sensors and post-processing, which are costly and time-consuming to clean up manually. The key to the solution is StableMotion, which introduces motion quality indicators so that motion cleanup models can be trained directly on unpaired corrupted datasets, without paired corrupted-to-clean data, thereby automating the cleanup.

Link: https://arxiv.org/abs/2505.03154
Authors: Yuxuan Mu, Hung Yu Ling, Yi Shi, Ismael Baira Ojeda, Pengcheng Xi, Chang Shu, Fabio Zinno, Xue Bin Peng
Affiliations: Simon Fraser University; Electronic Arts; National Research Council Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: 17 pages, 13 figures

Abstract:Motion capture (mocap) data often exhibits visually jarring artifacts due to inaccurate sensors and post-processing. Cleaning this corrupted data can require substantial manual effort from human experts, which can be a costly and time-consuming process. Previous data-driven motion cleanup methods offer the promise of automating this cleanup process, but often require in-domain paired corrupted-to-clean training data. Constructing such paired datasets requires access to high-quality, relatively artifact-free motion clips, which often necessitates laborious manual cleanup. In this work, we present StableMotion, a simple yet effective method for training motion cleanup models directly from unpaired corrupted datasets that need cleanup. The core component of our method is the introduction of motion quality indicators, which can be easily annotated through manual labeling or heuristic algorithms and enable training of quality-aware motion generation models on raw motion data with mixed quality. At test time, the model can be prompted to generate high-quality motions using the quality indicators. Our method can be implemented through a simple diffusion-based framework, leading to a unified motion generate-discriminate model, which can be used to both identify and fix corrupted frames. We demonstrate that our proposed method is effective for training motion cleanup models on raw mocap data in production scenarios by applying StableMotion to SoccerMocap, a 245-hour soccer mocap dataset containing real-world motion artifacts. The trained model effectively corrects a wide range of motion artifacts, reducing motion pops and frozen frames by 68% and 81%, respectively. See this https URL for more results.

[CV-77] Robust Fairness Vision-Language Learning for Medical Image Analysis

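[Quick Read]: This paper addresses shortcomings in the fairness and robustness of Vision-Language Models (VLMs) for medical image analysis, aiming for consistent performance across patient groups. The key to the solution is a framework that identifies and adjusts faulty image-text pairs through a Dynamic Bad Pair Mining algorithm and uses the Sinkhorn distance to keep the loss distributions of protected groups from deviating from the total loss, improving both fairness and robustness. Experiments show an improvement of up to 8.6% in equity-scaled AUC.

Link: https://arxiv.org/abs/2505.03153
Authors: Sparsh Bansal, Mingyang Wu, Xin Wang, Shu Hu
Affiliations: Purdue University; University at Albany, State University of New York
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: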

Abstract:The advent of Vision-Language Models (VLMs) in medical image analysis has the potential to help process multimodal inputs and increase performance over traditional inference methods. However, when considering the domain in which these models will be implemented, fairness and robustness are important to ensure the model stays true for any patient. In this paper, we introduce a framework for ensuring robustness and fairness of VLMs. This framework modifies the loss function at training by identifying and adjusting faulty image-text pairs through a Dynamic Bad Pair Mining algorithm and also utilizing Sinkhorn distance to ensure the loss distributions of protected groups do not deviate from the total loss. Experimental testing of our framework shows up to an 8.6% improvement when looking at equity-scaled AUC.
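
To make the Sinkhorn term concrete, the sketch below computes an entropy-regularized optimal-transport cost between two sets of per-sample losses, for example one protected group versus the whole batch. It is only an illustration under assumptions: uniform sample weights, a squared-difference ground cost, and the `epsilon` and `iterations` values are not taken from the paper.

```python
import math


def sinkhorn_distance(xs, ys, epsilon=0.1, iterations=200):
    """Entropy-regularized transport cost between two 1-D samples.

    Both samples get uniform weights and the ground cost is the squared
    difference. Very large costs relative to `epsilon` would underflow the
    kernel, so inputs are expected to be on the scale of typical loss values.
    """
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternate scaling so row sums match `a` and column sums match `b`.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m)
    )


if __name__ == "__main__":
    batch_losses = [0.1, 0.2, 0.3, 0.4]
    group_losses = [0.3, 0.4]
    print(sinkhorn_distance(group_losses, batch_losses))
```

Added to the training loss as a penalty, a term like this grows when a group's losses drift away from the overall loss distribution.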

[CV-78] Motion-compensated cardiac MRI using low-rank diffeomorphic flow (DMoCo)

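[Quick Read]: This paper addresses motion-compensated image reconstruction for free-breathing, ungated 3D cardiac magnetic resonance imaging (MRI). The core challenge is to model and compensate for cardiac motion across respiratory phases without an external gating signal. The key to the solution is a low-rank model that compactly represents the family of diffeomorphisms parameterized by motion phase: the image volume at each motion phase is expressed as a deformation of a single static image template, and the diffeomorphism for each phase is obtained by integrating a parametric velocity field along a path from the reference template phase to that motion phase, yielding more accurate motion compensation and image recovery.

Link: https://arxiv.org/abs/2505.03149
Authors: Joseph William Kettelkamp, Ludovica Romanin, Sarv Priya, Mathews Jacob
Affiliations: University of Virginia; Siemens Healthineers International AG; University of Wisconsin in Madison; University of Iowa Hospitals and Clinics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: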

Abstract:We introduce an unsupervised motion-compensated image reconstruction algorithm for free-breathing and ungated 3D cardiac magnetic resonance imaging (MRI). We express the image volume corresponding to each specific motion phase as the deformation of a single static image template. The main contribution of the work is the low-rank model for the compact joint representation of the family of diffeomorphisms, parameterized by the motion phases. The diffeomorphism at a specific motion phase is obtained by integrating a parametric velocity field along a path connecting the reference template phase to the motion phase. The velocity field at different phases is represented using a low-rank model. The static template and the low-rank motion model parameters are learned directly from the k-space data in an unsupervised fashion. The more constrained motion model is observed to offer improved recovery compared to current motion-resolved and motion-compensated algorithms for free-breathing 3D cine MRI.

[CV-79] Enhancing Glass Defect Detection with Diffusion Models: Addressing Imbalanced Datasets in Manufacturing Quality Control

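[Quick Read]: This paper addresses visual defect detection in industrial glass manufacturing, in particular the dataset imbalance caused by the low frequency of defective products, which limits the performance of deep learning models and computer vision systems. The key to the solution is using Denoising Diffusion Probabilistic Models (DDPMs) to generate synthetic images of defective glass products for data augmentation, mitigating class imbalance and improving the image classification performance of standard CNN architectures (ResNet50V2, EfficientNetB0, and MobileNetV2) in anomaly detection.

Link: https://arxiv.org/abs/2505.03134
Authors: Sajjad Rezvani Boroujeni, Hossein Abedi, Tom Bush
Affiliations: Actual Reality Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 7 figures, submitted to Computer and Decision Making An International Journal (COMDEM)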

Abstract:Visual defect detection in industrial glass manufacturing remains a critical challenge due to the low frequency of defective products, leading to imbalanced datasets that limit the performance of deep learning models and computer vision systems. This paper presents a novel approach using Denoising Diffusion Probabilistic Models (DDPMs) to generate synthetic defective glass product images for data augmentation, effectively addressing class imbalance issues in manufacturing quality control and automated visual inspection. The methodology significantly enhances image classification performance of standard CNN architectures (ResNet50V2, EfficientNetB0, and MobileNetV2) in detecting anomalies by increasing the minority class representation. Experimental results demonstrate substantial improvements in key machine learning metrics, particularly in recall for defective samples across all tested deep neural network architectures while maintaining perfect precision. The most dramatic improvement was observed in ResNet50V2’s overall classification accuracy, which increased from 78 percent to 93 percent when trained with the augmented data. This work provides a scalable, cost-effective approach to enhancing automated defect detection in glass manufacturing that can potentially be extended to other industrial quality assurance systems and industries with similar class imbalance challenges.
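
For readers unfamiliar with DDPMs, the forward (noising) process that such a generator learns to invert can be sketched in a few lines. The linear schedule and its endpoints below are the commonly used defaults and are an assumption here, since the abstract gives no training details; the function names are illustrative only.

```python
import math
import random


def linear_betas(steps, start=1e-4, end=0.02):
    """Linearly spaced noise variances, one per diffusion step."""
    return [start + (end - start) * t / (steps - 1) for t in range(steps)]


def alpha_bar(betas):
    """Cumulative product of (1 - beta): the fraction of signal variance kept."""
    out, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        out.append(product)
    return out


def noise_image(pixels, t, alpha_bars, rng):
    """Sample x_t directly from x_0: sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise."""
    scale = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [scale * p + sigma * rng.gauss(0.0, 1.0) for p in pixels]


if __name__ == "__main__":
    bars = alpha_bar(linear_betas(1000))
    rng = random.Random(0)
    defect_patch = [0.2, 0.8, 0.5, 0.1]  # stand-in for a flattened image
    print(noise_image(defect_patch, 10, bars, rng))
    print(noise_image(defect_patch, 999, bars, rng))
```

A denoising network trained to predict the added noise at each step can then be run in reverse from pure noise to synthesize new minority-class images.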

[CV-80] VISLIX: An XAI Framework for Validating Vision Models with Slice Discovery and Analysis

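[Quick Read]: This paper addresses the challenges of data slicing for computer vision model validation: its dependence on additional image metadata or visual concepts, the heavy manual effort of understanding data slices, and the lack of a human-in-the-loop solution. The key to the solution is VISLIX, a framework that uses state-of-the-art foundation models, requires no image metadata or visual concepts, automatically generates natural language insights, and lets users test data slice hypotheses interactively, making model validation more efficient and thorough.

Link: https://arxiv.org/abs/2505.03132
Authors: Xinyuan Yan, Xiwei Xuan, Jorge Piazentin Ono, Jiajing Guo, Vikram Mohanty, Shekar Arvind Kumar, Liang Gou, Bei Wang, Liu Ren
Affiliations: Scientific Computing and Imaging Institute, University of Utah, USA; University of California, Davis, USA; Bosch Research North America and Bosch Center for Artificial Intelligence (BCAI), USA; Robert Bosch GmbH, Germany; Splunk Technology, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: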

Abstract:Real-world machine learning models require rigorous evaluation before deployment, especially in safety-critical domains like autonomous driving and surveillance. The evaluation of machine learning models often focuses on data slices, which are subsets of the data that share a set of characteristics. Data slice finding automatically identifies conditions or data subgroups where models underperform, aiding developers in mitigating performance issues. Despite its popularity and effectiveness, data slicing for vision model validation faces several challenges. First, data slicing often needs additional image metadata or visual concepts, and falls short in certain computer vision tasks, such as object detection. Second, understanding data slices is a labor-intensive and mentally demanding process that heavily relies on the expert’s domain knowledge. Third, data slicing lacks a human-in-the-loop solution that allows experts to form hypotheses and test them interactively. To overcome these limitations and better support the machine learning operations lifecycle, we introduce VISLIX, a novel visual analytics framework that employs state-of-the-art foundation models to help domain experts analyze slices in computer vision models. Our approach does not require image metadata or visual concepts, automatically generates natural language insights, and allows users to test data slice hypotheses interactively. We evaluate VISLIX with an expert study and three use cases that demonstrate the effectiveness of our tool in providing comprehensive insights for validating object detection models.

[CV-81] TimeTracker: Event-based Continuous Point Tracking for Video Frame Interpolation with Non-linear Motion CVPR2025

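[Quick Read]: This paper addresses non-linear motion in event-based video frame interpolation (VFI), that is, how to cope with dynamic changes in motion direction and speed within a scene. Existing methods either use events to estimate sparse optical flow or fuse events with image features to estimate dense optical flow, but because the continuous motion cues from events do not align temporally with the dense spatial information of images, motion errors often degrade interpolation quality. The key to the solution is TimeTracker, a VFI framework based on continuous point tracking: a Scene-Aware Region Segmentation (SARS) module divides the scene into similar patches, a Continuous Trajectory guided Motion Estimation (CTME) module tracks each patch's continuous motion trajectory through events to identify spatiotemporal feature correlations more accurately, and intermediate frames are then generated through global motion optimization and frame refinement.

Link: https://arxiv.org/abs/2505.03116
Authors: Haoyue Liu, Jinghan Xu, Yi Chang, Hanyu Zhou, Haozhi Zhao, Lin Wang, Luxin Yan
Affiliations: National Key Lab of Multispectral Information Intelligent Processing Technology; Huazhong University of Science and Technology; School of Artificial Intelligence and Automation; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025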

Abstract:Video frame interpolation (VFI) that leverages the bio-inspired event cameras as guidance has recently shown better performance and memory efficiency than the frame-based methods, thanks to the event cameras’ advantages, such as high temporal resolution. A hurdle for event-based VFI is how to effectively deal with non-linear motion, caused by the dynamic changes in motion direction and speed within the scene. Existing methods either use events to estimate sparse optical flow or fuse events with image features to estimate dense optical flow. Unfortunately, motion errors often degrade the VFI quality as the continuous motion cues from events do not align with the dense spatial information of images in the temporal dimension. In this paper, we find that, because object motion is continuous in space, tracking local regions over continuous time enables more accurate identification of spatiotemporal feature correlations. In light of this, we propose a novel continuous point tracking-based VFI framework, named TimeTracker. Specifically, we first design a Scene-Aware Region Segmentation (SARS) module to divide the scene into similar patches. Then, a Continuous Trajectory guided Motion Estimation (CTME) module is proposed to track the continuous motion trajectory of each patch through events. Finally, intermediate frames at any given time are generated through global motion optimization and frame refinement. Moreover, we collect a real-world dataset that features fast non-linear motion. Extensive experiments show that our method outperforms prior art in both motion estimation and frame interpolation quality.

[CV-82] Path and Bone-Contour Regularized Unpaired MRI-to-CT Translation

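[Quick Read]: This paper addresses accurate MRI-to-CT image translation when paired MRI and CT scans are unavailable, in particular the difficulty of translating anatomical structures, such as bone, that are clearly visible on CT but hard to distinguish on MRI. The key to the solution is a path- and bone-contour regularized approach to unpaired MRI-to-CT translation: MRI and CT images are projected into a shared latent space, the mapping is modeled as a continuous flow governed by neural ordinary differential equations, and the optimal mapping is obtained by minimizing the flow's path length. A trainable network additionally generates bone contours and encourages the model to focus on the contours and their adjacent regions, improving the accuracy of translated bone structures.

Link: https://arxiv.org/abs/2505.03114
Authors: Teng Zhou, Jax Luo, Yuping Sun, Yiheng Tan, Shun Yao, Nazim Haouchine, Scott Raymond
Affiliations: School of Computer Science and Technology, Guangdong University of Technology; Cleveland Clinic; Department of Neurosurgery, The First Affiliated Hospital, Sun Yat-sen University; Brigham and Women’s Hospital, Harvard Medical School; Department of Radiology, Medical Imaging Center, University Medical Center Groningen, University of Groningen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: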

Abstract:Accurate MRI-to-CT translation promises the integration of complementary imaging information without the need for additional imaging sessions. Given the practical challenges associated with acquiring paired MRI and CT scans, the development of robust methods capable of leveraging unpaired datasets is essential for advancing the MRI-to-CT translation. Current unpaired MRI-to-CT translation methods, which predominantly rely on cycle consistency and contrastive learning frameworks, frequently encounter challenges in accurately translating anatomical features that are highly discernible on CT but less distinguishable on MRI, such as bone structures. This limitation renders these approaches less suitable for applications in radiation therapy, where precise bone representation is essential for accurate treatment planning. To address this challenge, we propose a path- and bone-contour regularized approach for unpaired MRI-to-CT translation. In our method, MRI and CT images are projected to a shared latent space, where the MRI-to-CT mapping is modeled as a continuous flow governed by neural ordinary differential equations. The optimal mapping is obtained by minimizing the transition path length of the flow. To enhance the accuracy of translated bone structures, we introduce a trainable neural network to generate bone contours from MRI and implement mechanisms to directly and indirectly encourage the model to focus on bone contours and their adjacent regions. Evaluations conducted on three datasets demonstrate that our method outperforms existing unpaired MRI-to-CT translation approaches, achieving lower overall error rates. Moreover, in a downstream bone segmentation task, our approach exhibits superior performance in preserving the fidelity of bone structures. Our code is available at: this https URL.

[CV-83] Image Recognition with Online Lightweight Vision Transformer: A Survey

[Quick Read]: This paper addresses the computational and memory inefficiency of vision transformers in order to make them more practical in real-world applications. The key to the solution lies in three core strategies (efficient component design, dynamic networks, and knowledge distillation) for producing lightweight vision transformers that reduce parameter count and computational cost while preserving recognition accuracy.

Link: https://arxiv.org/abs/2505.03113
Authors: Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu, Shibiao Xu
Affiliations: Beijing University of Posts and Telecommunications; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology, Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet they lack inductive biases and efficiency benefits and face significant computational and memory challenges that limit their real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: this https URL

[CV-84] Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability CVPR2025

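[Quick Read]: This paper addresses a performance bottleneck in diffusion models: the same network layers must learn both structural and textural information, unlike traditional deep learning architectures (such as ResNet or GANs), which capture or generate semantic information at different layers. The key to the solution comes from analyzing how U-Net parameters contribute to denoising, which shows that properly zeroing out certain parameters (including large ones) improves generation quality. Building on this finding, the paper proposes MaskUNet, which fully leverages timestep- and sample-dependent effective U-Net parameters and markedly improves generation with a negligible increase in parameter count.

Link: https://arxiv.org/abs/2505.03097
Authors: Lei Wang, Senmao Li, Fei Yang, Jianye Wang, Ziheng Zhang, Yuhan Liu, Yaxing Wang, Jian Yang
Affiliations: PCA Lab, VCIP, College of Computer Science, Nankai University; Shenzhen Futian, NKIARI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025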

Abstract:Diffusion models focus on constructing basic image structures in early stages, while the refined details, including local features and textures, are generated in later stages. Thus the same network layers are forced to learn both structural and textural information simultaneously, differing significantly from traditional deep learning architectures (e.g., ResNet or GANs), which capture or generate the image semantic information at different layers. This difference inspires us to explore the time-wise diffusion models. We initially investigate the key contributions of the U-Net parameters to the denoising process and identify that properly zeroing out certain parameters (including large parameters) contributes to denoising, substantially improving the generation quality on the fly. Capitalizing on this discovery, we propose a simple yet effective method, termed "MaskUNet", that enhances generation quality with negligible parameter overhead. Our method fully leverages timestep- and sample-dependent effective U-Net parameters. To optimize MaskUNet, we offer two fine-tuning strategies: a training-based approach and a training-free approach, including tailored networks and optimization functions. In zero-shot inference on the COCO dataset, MaskUNet achieves the best FID score and further demonstrates its effectiveness in downstream task evaluations. Project page: this https URL

[CV-85] Estimating the Diameter at Breast Height of Trees in a Forest With a Single 360 Camera

【速读】:该论文旨在解决森林资源调查中对胸径(DBH)准确测量的需求,传统方法依赖于成本高昂且操作复杂的LiDAR技术,而本文提出了一种低成本的替代方案。解决方案的关键在于使用消费级360视频相机结合半自动化处理流程,包括基于SfM的密集点云重建、通过Grounded Segment Anything模型进行语义树干分割以及基于RANSAC的截面形状和DBH估计,从而实现高精度的DBH测量。

Link: https://arxiv.org/abs/2505.03093
Authors: Siming He,Zachary Osman,Fernando Cladera,Dexter Ong,Nitant Rai,Patrick Corey Green,Vijay Kumar,Pratik Chaudhari
Institutions: University of Pennsylvania; Virginia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Forest inventories rely on accurate measurements of the diameter at breast height (DBH) for ecological monitoring, resource management, and carbon accounting. While LiDAR-based techniques can achieve centimeter-level precision, they are cost-prohibitive and operationally complex. We present a low-cost alternative that only needs a consumer-grade 360 video camera. Our semi-automated pipeline comprises (i) a dense point cloud reconstruction using Structure from Motion (SfM) photogrammetry software called Agisoft Metashape, (ii) semantic trunk segmentation by projecting Grounded Segment Anything (SAM) masks onto the 3D cloud, and (iii) a robust RANSAC-based technique to estimate cross section shape and DBH. We introduce an interactive visualization tool for inspecting segmented trees and their estimated DBH. On 61 acquisitions of 43 trees under a variety of conditions, our method attains median absolute relative errors of 5-9% with respect to “ground-truth” manual measurements. This is only 2-4% higher than LiDAR-based estimates, while employing a single 360 camera that costs orders of magnitude less, requires minimal setup, and is widely available.
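
Step (iii) of the pipeline can be illustrated with a minimal sketch: fit a circle to a horizontal slice of trunk points with RANSAC and report twice the radius as the DBH. This is not the authors' implementation. The function names, the three-point circle model, the iteration count, and the inlier tolerance are all assumptions chosen for illustration, and the paper estimates a more general cross-section shape.

```python
import math
import random


def circle_from_three(p1, p2, p3):
    """Return (cx, cy, r) of the circle through three 2D points, or None if collinear."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        return None
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    return cx, cy, math.hypot(x1 - cx, y1 - cy)


def ransac_circle(points, n_iter=200, tol=0.01, seed=0):
    """Hypothetical helper: keep the 3-point circle that explains the most points.

    points: list of (x, y) in metres from one slice of the trunk point cloud.
    tol:    maximum radial distance (metres) for a point to count as an inlier.
    """
    rng = random.Random(seed)
    best, best_inliers = None, -1
    for _ in range(n_iter):
        model = circle_from_three(*rng.sample(points, 3))
        if model is None:
            continue
        cx, cy, r = model
        inliers = sum(abs(math.hypot(x - cx, y - cy) - r) < tol for x, y in points)
        if inliers > best_inliers:
            best, best_inliers = model, inliers
    return best


def dbh_from_slice(points):
    """DBH estimate (metres) as the diameter of the RANSAC circle."""
    _, _, r = ransac_circle(points)
    return 2.0 * r
```

Points that belong to branches or neighbouring vegetation fall outside the tolerance band and do not influence the selected circle, which is the reason for preferring RANSAC over a plain least-squares fit in this setting.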

[CV-86] Sim2Real Transfer for Vision-Based Grasp Verification

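[Quick read]: This paper addresses grasp verification for robots, in particular the limitations of traditional force- and tactile-sensor methods when handling deformable and non-rigid objects. The key to the solution is a vision-based grasp verification method with a two-stage architecture: a YOLO-based object detection model first detects and localizes the robot's gripper, and a ResNet-based classifier then determines whether an object is present. To compensate for the limits of real-world data capture, the authors also introduce the synthetic dataset HSR-GraspSynth and explore Visual Question Answering capabilities as a zero-shot baseline for comparison.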

Link: https://arxiv.org/abs/2505.03046
Authors: Pau Amargant,Peter Hönig,Markus Vincze
Institutions: Faculty of Electrical Engineering, Technical University of Vienna; Polytechnic University of Catalonia
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Austrian Robotics Workshop 2025

Abstract:The verification of successful grasps is a crucial aspect of robot manipulation, particularly when handling deformable objects. Traditional methods relying on force and tactile sensors often struggle with deformable and non-rigid objects. In this work, we present a vision-based approach for grasp verification to determine whether the robotic gripper has successfully grasped an object. Our method employs a two-stage architecture: first, a YOLO-based object detection model detects and locates the robot’s gripper; then, a ResNet-based classifier determines the presence of an object. To address the limitations of real-world data capture, we introduce HSR-GraspSynth, a synthetic dataset designed to simulate diverse grasping scenarios. Furthermore, we explore the use of Visual Question Answering capabilities as a zero-shot baseline to which we compare our model. Experimental results demonstrate that our approach achieves high accuracy in real-world environments, with potential for integration into grasping pipelines. Code and datasets are publicly available at this https URL.

[CV-87] An Explainable Anomaly Detection Framework for Monitoring Depression and Anxiety Using Consumer Wearable Devices

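[Quick read]: This paper addresses how to identify worsening depression and anxiety symptoms early from consumer wearable data using an explainable anomaly detection framework. The key to the solution is an LSTM autoencoder that learns normal patterns of sleep duration, step count, and resting heart rate from participants' healthy baselines. Symptom-worsening episodes are identified by a significant rise (≥5 points) in self-reported depression or anxiety scores, and SHAP analysis improves the model's interpretability, supporting clinically meaningful anomaly detection.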

Link: https://arxiv.org/abs/2505.03039
Authors: Yuezhou Zhang,Amos A. Folarin,Callum Stewart,Heet Sankesara,Yatharth Ranjan,Pauline Conde,Akash Roy Choudhury,Shaoxiong Sun,Zulqarnain Rashid,Richard J.B. Dobson
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments:

Abstract:Continuous monitoring of behavior and physiology via wearable devices offers a novel, objective method for the early detection of worsening depression and anxiety. In this study, we present an explainable anomaly detection framework that identifies clinically meaningful increases in symptom severity using consumer-grade wearable data. Leveraging data from 2,023 participants with defined healthy baselines, our LSTM autoencoder model learned normal health patterns of sleep duration, step count, and resting heart rate. Anomalies were flagged when self-reported depression or anxiety scores increased by ≥5 points (a threshold considered clinically significant). The model achieved an adjusted F1-score of 0.80 (precision = 0.73, recall = 0.88) in detecting 393 symptom-worsening episodes across 341 participants, with higher performance observed for episodes involving concurrent depression and anxiety escalation (F1 = 0.84) and for more pronounced symptom changes (≥10-point increases, F1 = 0.85). Model interpretability was supported by SHAP-based analysis, which identified resting heart rate as the most influential feature in 71.4% of detected anomalies, followed by physical activity and sleep. Together, our findings highlight the potential of explainable anomaly detection to enable personalized, scalable, and proactive mental health monitoring in real-world settings.
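
As a quick consistency check on the reported figures, the short sketch below recomputes F1 from the stated precision and recall. It assumes the standard harmonic-mean definition; the abstract calls the score "adjusted" without defining the adjustment, so the paper's exact computation may differ. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Standard F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
print(round(f1_score(0.73, 0.88), 2))  # prints 0.8
```

Under that assumption the reported precision and recall are consistent with the reported F1 of 0.80.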

[CV-88] Lesion-Aware Generative Artificial Intelligence for Virtual Contrast-Enhanced Mammography in Breast Cancer

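[Quick read]: This paper addresses two drawbacks of Contrast-Enhanced Spectral Mammography (CESM): higher radiation exposure and the potential side effects of the iodinated contrast agent. The key to the solution is Seg-CycleGAN, a generative deep learning framework that achieves virtual contrast enhancement by synthesizing high-quality dual-energy subtracted images from low-energy images. Lesion segmentation maps guide generation to improve lesion reconstruction, and localized loss terms on lesion areas, added to the standard CycleGAN architecture, improve synthesis in diagnostically relevant regions.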

Link: https://arxiv.org/abs/2505.03018
Authors: Aurora Rofena,Arianna Manchia,Claudia Lucia Piccolo,Bruno Beomonte Zobel,Paolo Soda,Valerio Guarrasi
Institutions: Unit of Computer Systems and Bioinformatics, Department of Engineering, University Campus Bio-Medico of Rome, Italy; Department of Radiology, Fondazione Policlinico Campus Bio-Medico, Italy; Department of Radiology, Università Campus Bio-Medico di Roma, Italy; Department of Radiation Sciences, Radiation Physics, Biomedical Engineering, Umeå University, Sweden
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Contrast-Enhanced Spectral Mammography (CESM) is a dual-energy mammographic technique that improves lesion visibility through the administration of an iodinated contrast agent. It acquires both a low-energy image, comparable to standard mammography, and a high-energy image, which are then combined to produce a dual-energy subtracted image highlighting lesion contrast enhancement. While CESM offers superior diagnostic accuracy compared to standard mammography, its use entails higher radiation exposure and potential side effects associated with the contrast medium. To address these limitations, we propose Seg-CycleGAN, a generative deep learning framework for Virtual Contrast Enhancement in CESM. The model synthesizes high-fidelity dual-energy subtracted images from low-energy images, leveraging lesion segmentation maps to guide the generative process and improve lesion reconstruction. Building upon the standard CycleGAN architecture, Seg-CycleGAN introduces localized loss terms focused on lesion areas, enhancing the synthesis of diagnostically relevant regions. Experiments on the CESM@UCBM dataset demonstrate that Seg-CycleGAN outperforms the baseline in terms of PSNR and SSIM, while maintaining competitive MSE and VIF. Qualitative evaluations further confirm improved lesion fidelity in the generated images. These results suggest that segmentation-aware generative models offer a viable pathway toward contrast-free CESM alternatives.

[CV-89] GIF: Generative Inspiration for Face Recognition at Scale

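[Quick read]: This paper addresses the high computational cost of Softmax over the massive label spaces of Face Recognition (FR). Existing methods estimate the output from a subset of identities, but the computational cost still grows linearly with the number of identities in the dataset, only at a reduced ratio. The key to the solution is replacing atomic scalar labels with structured identity codes, i.e., sequences of integers: a tokenization scheme converts each scalar label into a structured code, and the FR backbone is trained to predict the code of each input rather than its scalar label, which changes the relationship between computational cost and number of identities from linear to logarithmic.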

Link: https://arxiv.org/abs/2505.03012
Authors: Saeed Ebrahimi,Sahar Rahimi,Ali Dabouei,Srinjoy Das,Jeremy M. Dawson,Nasser M. Nasrabadi
Institutions: West Virginia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Aiming to reduce the computational cost of Softmax in massive label space of Face Recognition (FR) benchmarks, recent studies estimate the output using a subset of identities. Although promising, the association between the computation cost and the number of identities in the dataset remains linear only with a reduced ratio. A shared characteristic among available FR methods is the employment of atomic scalar labels during training. Consequently, the input to label matching is through a dot product between the feature vector of the input and the Softmax centroids. Inspired by generative modeling, we present a simple yet effective method that substitutes scalar labels with structured identity code, i.e., a sequence of integers. Specifically, we propose a tokenization scheme that transforms atomic scalar labels into structured identity codes. Then, we train an FR backbone to predict the code for each input instead of its scalar label. As a result, the associated computational cost becomes logarithmic w.r.t. number of identities. We demonstrate the benefits of the proposed method by conducting experiments. In particular, our method outperforms its competitors by 1.52%, and 0.6% at TAR@FAR =1e-4 on IJB-B and IJB-C, respectively, while transforming the association between computational cost and the number of identities from linear to logarithmic. See code at this https URL
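
To see why a structured code makes the cost logarithmic, consider the simplest possible tokenization: writing each scalar label as fixed-length digits in some base. The abstract does not describe the paper's actual tokenization scheme, so the base-B encoding, the function names, and the numbers below are assumptions for illustration only.

```python
def code_length(num_identities, base):
    """Smallest L such that base**L >= num_identities (integer arithmetic only)."""
    length, capacity = 1, base
    while capacity < num_identities:
        capacity *= base
        length += 1
    return length


def label_to_code(label, base, length):
    """Encode a scalar label as `length` base-`base` digits, most significant first."""
    digits = []
    for _ in range(length):
        label, digit = divmod(label, base)
        digits.append(digit)
    return digits[::-1]


def code_to_label(code, base):
    """Inverse of label_to_code."""
    label = 0
    for digit in code:
        label = label * base + digit
    return label


num_identities, base = 1_000_000, 100
length = code_length(num_identities, base)
print(length)                              # 3
print(label_to_code(123456, base, length)) # [12, 34, 56]
print(length * base)                       # 300 output logits instead of 1,000,000
```

With one million identities, a flat Softmax scores one million classes per sample, whereas predicting three base-100 tokens scores 300 logits in total. The code length grows as the logarithm of the number of identities, which matches the linear-to-logarithmic claim in the abstract.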

[CV-90] NTIRE 2025 Challenge on UGC Video Enhancement: Methods and Results

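[Quick read]: This paper addresses user-generated content (UGC) video enhancement, specifically improving visual quality when no reference ground truth is available. The key to the solution is developing algorithms that handle real-world degradations (noise, blur, faded colors, compression artifacts, and so on), with results judged by crowdsourced subjective quality assessment so that they reflect practical needs.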

Link: https://arxiv.org/abs/2505.03007
Authors: Nikolay Safonov,Alexey Bryncev,Andrey Moskalenko,Dmitry Kulikov,Dmitry Vatolin,Radu Timofte,Haibo Lei,Qifan Gao,Qing Luo,Yaqing Li,Jie Song,Shaozhe Hao,Meisong Zheng,Jingyi Xu,Chengbin Wu,Jiahui Liu,Ying Chen,Xin Deng,Mai Xu,Peipei Liang,Jie Ma,Junjie Jin,Yingxue Pang,Fangzhou Luo,Kai Chen,Shijie Zhao,Mingyang Wu,Renjie Li,Yushen Zuo,Shengyun Zhong,Zhengzhong Tu
Institutions: Moscow State University; University of Würzburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents an overview of the NTIRE 2025 Challenge on UGC Video Enhancement. The challenge constructed a set of 150 user-generated content videos without reference ground truth, which suffer from real-world degradations such as noise, blur, faded colors, compression artifacts, etc. The goal of the participants was to develop an algorithm capable of improving the visual quality of such videos. Given the widespread use of UGC on short-form video platforms, this task holds substantial practical importance. The evaluation was based on subjective quality assessment in crowdsourcing, obtaining votes from over 8000 assessors. The challenge attracted more than 25 teams submitting solutions, 7 of which passed the final phase with source code verification. The outcomes may provide insights into the state-of-the-art in UGC video enhancement and highlight emerging trends and effective strategies in this evolving research area. All data, including the processed videos and subjective comparison votes and scores, is made publicly available at this https URL.

[CV-91] Completing Spatial Transcriptomics Data for Gene Expression Prediction Benchmarking

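[Quick read]: This paper addresses obstacles to the practical use of Spatial Transcriptomics: high cost, the need for specialized expertise, slow clinical integration, and missing data caused by inefficient gene capture. The key to the solution is SpaCKLE, a transformer-based gene expression completion model that reduces mean squared error by over 82.5% and thereby improves the accuracy of gene expression prediction from histology images.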

Link: https://arxiv.org/abs/2505.02980
Authors: Daniela Ruiz,Paula Cardenas,Leonardo Manrique,Daniela Vega,Gabriel Mejia,Pablo Arbelaez
Institutions: Universidad de los Andes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2407.13027

Abstract:Spatial Transcriptomics is a groundbreaking technology that integrates histology images with spatially resolved gene expression profiles. Among the various Spatial Transcriptomics techniques available, Visium has emerged as the most widely adopted. However, its accessibility is limited by high costs, the need for specialized expertise, and slow clinical integration. Additionally, gene capture inefficiencies lead to significant dropout, corrupting acquired data. To address these challenges, the deep learning community has explored the gene expression prediction task directly from histology images. Yet, inconsistencies in datasets, preprocessing, and training protocols hinder fair comparisons between models. To bridge this gap, we introduce SpaRED, a systematically curated database comprising 26 public datasets, providing a standardized resource for model evaluation. We further propose SpaCKLE, a state-of-the-art transformer-based gene expression completion model that reduces mean squared error by over 82.5% compared to existing approaches. Finally, we establish the SpaRED benchmark, evaluating eight state-of-the-art prediction models on both raw and SpaCKLE-completed data, demonstrating SpaCKLE substantially improves the results across all the gene expression prediction models. Altogether, our contributions constitute the most comprehensive benchmark of gene expression prediction from histology images to date and a stepping stone for future research on Spatial Transcriptomics.

[CV-92] Adversarial Robustness Analysis of Vision-Language Models in Medical Image Segmentation

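[Quick read]: This paper addresses the adversarial robustness of Vision-Language Segmentation Models (VLSMs) in medical image analysis, evaluated on 2D medical images of different modalities (radiology, photography, and endoscopy). The key to the solution is fine-tuning pre-trained VLSMs and then applying adversarial attacks, namely projected gradient descent (PGD) and the fast gradient sign method (FGSM), to test robustness and analyze how adversarial examples affect model performance.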

Link: https://arxiv.org/abs/2505.02971
Authors: Anjila Budathoki,Manish Dhakal
Institutions: Georgia State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adversarial attacks have been fairly well explored for computer vision and vision-language models. However, the avenue of adversarial attack for the vision language segmentation models (VLSMs) is still under-explored, especially for medical image analysis. Thus, we have investigated the robustness of VLSMs against adversarial attacks for 2D medical images across different modalities: radiology, photography, and endoscopy. The main idea of this project was to assess the robustness of the fine-tuned VLSMs specifically in the medical domain setting to address the high risk scenario. First, we have fine-tuned pre-trained VLSMs for medical image segmentation with adapters. Then, we have employed adversarial attacks, projected gradient descent (PGD) and fast gradient sign method (FGSM), on that fine-tuned model to determine its robustness against adversaries. We have reported models’ performance decline to analyze the adversaries’ impact. The results exhibit significant drops in the DSC and IoU scores after the introduction of these adversaries. Furthermore, we also explored universal perturbation but were not able to find one for the medical images.
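
For readers unfamiliar with FGSM, the sketch below applies a single FGSM step to a toy logistic-regression model. It only illustrates the update rule x_adv = x + eps * sign(dL/dx); the model, the weights, and the function names are hypothetical and have nothing to do with the VLSMs evaluated in the paper. PGD, the other attack mentioned, repeats steps of this kind and projects the result back into an eps-ball around the original input.

```python
import math


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_logistic(x, y, w, b, eps):
    """One FGSM step against the toy model p = sigmoid(w . x + b).

    For binary cross-entropy loss, the gradient with respect to the
    input is dL/dx = (p - y) * w, so no autodiff library is needed here.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    grad = [(p - y) * wi for wi in w]
    # Move each input coordinate by eps in the direction that increases the loss.
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


x_adv = fgsm_logistic(x=[1.0, -1.0], y=1, w=[2.0, -1.0], b=0.0, eps=0.1)
```

In this toy case the perturbed input lowers the logit for the true class from 3.0 to about 2.7, so the loss increases even though each coordinate moved by only 0.1.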

[CV-93] Generating Narrated Lecture Videos from Slides with Synchronized Highlights

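[Quick read]: This paper addresses the considerable time and effort needed to turn static slides into engaging video lectures, which traditionally requires presenters to record explanations and visually guide the audience. The key to the solution is a novel highlight alignment module that maps spoken phrases to locations on a slide using several strategies (e.g., Levenshtein distance, LLM-based semantic analysis) at selectable granularity (line or word level) and uses timestamp-providing Text-to-Speech (TTS) for synchronization, enabling automatic generation of high-quality video lectures.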

Link: https://arxiv.org/abs/2505.02966
Authors: Alexander Holmberg
Institutions: KTH Royal Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Turning static slides into engaging video lectures takes considerable time and effort, requiring presenters to record explanations and visually guide their audience through the material. We introduce an end-to-end system designed to automate this process entirely. Given a slide deck, this system synthesizes a video lecture featuring AI-generated narration synchronized precisely with dynamic visual highlights. These highlights automatically draw attention to the specific concept being discussed, much like an effective presenter would. The core technical contribution is a novel highlight alignment module. This module accurately maps spoken phrases to locations on a given slide using diverse strategies (e.g., Levenshtein distance, LLM-based semantic analysis) at selectable granularities (line or word level) and utilizes timestamp-providing Text-to-Speech (TTS) for timing synchronization. We demonstrate the system’s effectiveness through a technical evaluation using a manually annotated slide dataset with 1000 samples, finding that LLM-based alignment achieves high location accuracy (F1 92%), significantly outperforming simpler methods, especially on complex, math-heavy content. Furthermore, the calculated generation cost averages under $1 per hour of video, offering potential savings of two orders of magnitude compared to conservative estimates of manual production costs. This combination of high accuracy and extremely low cost positions this approach as a practical and scalable tool for transforming static slides into effective, visually-guided video lectures.
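
The simplest of the alignment strategies named above, Levenshtein distance at line granularity, can be sketched as follows. This is an assumed minimal version, not the system's code: the function names, the lowercasing, and the length normalization are illustrative choices.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def best_matching_line(phrase, slide_lines):
    """Index of the slide line with the smallest length-normalized edit distance."""
    def score(line):
        distance = levenshtein(phrase.lower(), line.lower())
        return distance / max(len(phrase), len(line), 1)
    return min(range(len(slide_lines)), key=lambda i: score(slide_lines[i]))


lines = ["Introduction", "Gradient descent update", "Conclusion"]
print(best_matching_line("gradient descent update rule", lines))  # 1
```

A purely character-based match of this kind works when the narration reuses the slide wording. It has no notion of meaning, which is consistent with the abstract's finding that LLM-based alignment does better on complex, math-heavy content.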

[CV-94] Gone With the Bits: Revealing Racial Bias in Low-Rate Neural Compression for Facial Images

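[Quick read]: This paper addresses bias in neural image compression models, especially racial bias, which can lead to unfair outcomes for individuals in different groups. The key to the solution is a general, structured, and scalable framework for evaluating bias in neural image compression models, validated by analyzing nine popular models and their variants. The study shows that traditional distortion metrics fail to capture bias in neural compression models, that racial bias can be detected by examining facial phenotype degradation in image reconstructions, and that there is a trade-off between bias and the realism of decoded images.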

Link: https://arxiv.org/abs/2505.02949
Authors: Tian Qiu,Arjun Nichani,Rasta Tadayontahmasebi,Haewon Jeong
Institutions: University of California, Santa Barbara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACM FAccT '25

Abstract:Neural compression methods are gaining popularity due to their superior rate-distortion performance over traditional methods, even at extremely low bitrates below 0.1 bpp. As deep learning architectures, these models are prone to bias during the training process, potentially leading to unfair outcomes for individuals in different groups. In this paper, we present a general, structured, scalable framework for evaluating bias in neural image compression models. Using this framework, we investigate racial bias in neural compression algorithms by analyzing nine popular models and their variants. Through this investigation, we first demonstrate that traditional distortion metrics are ineffective in capturing bias in neural compression models. Next, we highlight that racial bias is present in all neural compression models and can be captured by examining facial phenotype degradation in image reconstructions. We then examine the relationship between bias and realism in the decoded images and demonstrate a trade-off across models. Finally, we show that utilizing a racially balanced training set can reduce bias but is not a sufficient bias mitigation strategy. We additionally show the bias can be attributed to compression model bias and classification model bias. We believe that this work is a first step towards evaluating and eliminating bias in neural image compression models.

[CV-95] A Wireless Collaborated Inference Acceleration Framework for Plant Disease Recognition

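[Quick read]: This paper addresses several problems in plant disease recognition: traditional manual recognition is inaccurate, costly, and inefficient; deep learning models are difficult to run on resource-constrained embedded devices; and offloading models to cloud servers runs into limited communication bandwidth, causing inference delay and high energy consumption. The key to the solution is a collaborative inference framework between edge devices and cloud servers: the deep neural network (DNN) model is pruned through deep reinforcement learning to speed up inference and reduce energy consumption, and a greedy strategy then determines the optimal split point for efficient collaborative inference acceleration.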

Link: https://arxiv.org/abs/2505.02877
Authors: Hele Zhu,Xinyi Huang,Haojia Gao,Mengfei Jiang,Haohua Que,Lei Mu
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Plant disease is a critical factor affecting agricultural production. Traditional manual recognition methods face significant drawbacks, including low accuracy, high costs, and inefficiency. Deep learning techniques have demonstrated significant benefits in identifying plant diseases, but they still face challenges such as inference delays and high energy consumption. Deep learning algorithms are difficult to run on resource-limited embedded devices. Offloading these models to cloud servers is confronted with the restriction of communication bandwidth, and all of these factors will influence the inference’s efficiency. We propose a collaborative inference framework for recognizing plant diseases between edge devices and cloud servers to enhance inference speed. The DNN model for plant disease recognition is pruned through deep reinforcement learning to improve the inference speed and reduce energy consumption. Then the optimal split point is determined by a greedy strategy to achieve the best collaborated inference acceleration. Finally, the system for collaborative inference acceleration in plant disease recognition has been implemented using Gradio to facilitate friendly human-machine interaction. Experiments indicate that the proposed collaborative inference framework significantly increases inference speed while maintaining acceptable recognition accuracy, offering a novel solution for rapidly diagnosing and preventing plant diseases.
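
The split-point idea can be made concrete with a small sketch. The abstract does not say how the greedy strategy works, so the version below is an assumption: it moves the split one layer deeper into the network for as long as the estimated end-to-end latency keeps falling. The latency model, the function names, and all numbers are hypothetical.

```python
def total_latency(split, edge_ms, cloud_ms, out_kb, input_kb, bandwidth_kb_per_ms):
    """Latency when layers [0, split) run on the edge and [split, n) in the cloud.

    edge_ms[i] / cloud_ms[i]: time of layer i on the edge device / cloud server.
    out_kb[i]: size of layer i's output; input_kb: size of the raw input.
    """
    sent_kb = input_kb if split == 0 else out_kb[split - 1]
    # If every layer runs on the edge, nothing is uploaded in this simple model.
    transfer = 0.0 if split == len(edge_ms) else sent_kb / bandwidth_kb_per_ms
    return sum(edge_ms[:split]) + transfer + sum(cloud_ms[split:])


def greedy_split(edge_ms, cloud_ms, out_kb, input_kb, bandwidth_kb_per_ms):
    """Advance the split point while doing so strictly reduces total latency."""
    args = (edge_ms, cloud_ms, out_kb, input_kb, bandwidth_kb_per_ms)
    split, best = 0, total_latency(0, *args)
    while split < len(edge_ms):
        candidate = total_latency(split + 1, *args)
        if candidate >= best:
            break
        split, best = split + 1, candidate
    return split, best


# Hypothetical 3-layer model: early layers shrink the data, the last is slow on the edge.
print(greedy_split(edge_ms=[4, 6, 20], cloud_ms=[1, 1, 2],
                   out_kb=[100, 10, 1], input_kb=600, bandwidth_kb_per_ms=10))
# (2, 13.0)
```

In this example, running the first two layers on the edge shrinks the upload from 600 KB to 10 KB, and the greedy search stops before the third layer because executing it on the edge would cost more than it saves. A greedy walk of this kind can stop at a local minimum, so an exhaustive scan over all split points is a reasonable alternative when the network has few layers.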

[CV-96] RESAnything: Attribute Prompting for Arbitrary Referring Segmentation

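[Quick read]: This paper addresses arbitrary referring expression segmentation (RES), in particular expressions broader and more complex than traditional methods can handle, including object- or part-level labels as well as implicit references to attributes such as the function, design, style, or material of an object or part. The key to the solution is an attribute prompting mechanism based on Chain-of-Thoughts (CoT) reasoning: a large language model (LLM) is systematically prompted to generate detailed descriptions of object/part attributes (such as shape, color, and location) for candidate segment proposals produced by a foundational image segmentation model, which enables deep reasoning about implicit queries without any part annotations for training or fine-tuning.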

Link: https://arxiv.org/abs/2505.02867
Authors: Ruiqi Wang,Hao Zhang
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 42 pages, 31 figures. For more details: this https URL

Abstract:We present an open-vocabulary and zero-shot method for arbitrary referring expression segmentation (RES), targeting input expressions that are more general than what prior works were designed to handle. Specifically, our inputs encompass both object- and part-level labels as well as implicit references pointing to properties or qualities of object/part function, design, style, material, etc. Our model, coined RESAnything, leverages Chain-of-Thoughts (CoT) reasoning, where the key idea is attribute prompting. We generate detailed descriptions of object/part attributes including shape, color, and location for potential segment proposals through systematic prompting of a large language model (LLM), where the proposals are produced by a foundational image segmentation model. Our approach encourages deep reasoning about object or part attributes related to function, style, design, etc., enabling the system to handle implicit queries without any part annotations for training or fine-tuning. As the first zero-shot and LLM-based RES method, RESAnything achieves clearly superior performance among zero-shot methods on traditional RES benchmarks and significantly outperforms existing methods on challenging scenarios involving implicit queries and complex part-level relations. Finally, we contribute a new benchmark dataset to offer ~3K carefully curated RES instances to assess part-level, arbitrary RES solutions.

[CV-97] STG: Spatiotemporal Graph Neural Network with Fusion and Spatiotemporal Decoupling Learning for Prognostic Prediction of Colorectal Cancer Liver Metastasis

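[Quick read]: This paper addresses the inability of clinical models for predicting Colorectal Cancer Liver Metastasis (CRLM) progression to effectively integrate tumor spatial heterogeneity, dynamic evolution, and multimodal data relationships, with the aim of improving predictive accuracy. The key to the solution is a Multimodal Spatiotemporal Graph Neural Network (STG) framework that builds preoperative CT imaging and clinical data into a heterogeneous graph, jointly models tumor distribution and temporal evolution through spatial topology and cross-modal edges, aggregates spatiotemporal neighborhood information with GraphSAGE, and combines supervised and contrastive learning to strengthen temporal feature capture and robustness.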

Link: https://arxiv.org/abs/2505.03123
Authors: Yiran Zhu,Wei Yang,Yan su,Zesheng Li,Chengchang Pan,Honggang Qi
Institutions: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 9 pages, 4 figures, 5 tables

Abstract:We propose a multimodal spatiotemporal graph neural network (STG) framework to predict colorectal cancer liver metastasis (CRLM) progression. Current clinical models do not effectively integrate the tumor’s spatial heterogeneity, dynamic evolution, and complex multimodal data relationships, limiting their predictive accuracy. Our STG framework combines preoperative CT imaging and clinical data into a heterogeneous graph structure, enabling joint modeling of tumor distribution and temporal evolution through spatial topology and cross-modal edges. The framework uses GraphSAGE to aggregate spatiotemporal neighborhood information and leverages supervised and contrastive learning strategies to enhance the model’s ability to capture temporal features and improve robustness. A lightweight version of the model reduces parameter count by 78.55%, maintaining near-state-of-the-art performance. The model jointly optimizes recurrence risk regression and survival analysis tasks, with contrastive loss improving feature representational discriminability and cross-modal consistency. Experimental results on the MSKCC CRLM dataset show a time-adjacent accuracy of 85% and a mean absolute error of 1.1005, significantly outperforming existing methods. The innovative heterogeneous graph construction and spatiotemporal decoupling mechanism effectively uncover the associations between dynamic tumor microenvironment changes and prognosis, providing reliable quantitative support for personalized treatment decisions.

[CV-98] Dual Prompting for Diverse Count-level PET Denoising

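[Quick read]: This paper addresses the limited generalization of a unified model for positron emission tomography (PET) image denoising across different count levels. The key to the solution is dual prompts: an explicit count-level prompt that provides specific prior information and an implicit general denoising prompt that encodes essential PET denoising knowledge. A novel prompt fusion module unifies the heterogeneous prompts and a prompt-feature interaction module injects them into the features, so that the prompts dynamically guide the noise-conditioned denoising process.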

Link: https://arxiv.org/abs/2505.03037
Authors: Xiaofeng Liu,Yongsong Huang,Thibault Marin,Samira Vafay Eslahi,Tiss Amal,Yanis Chemli,Keith Johnson,Georges El Fakhri,Jinsong Ouyang
Institutions: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: Published in IEEE International Symposium on Biomedical Imaging (ISBI) 2025

Abstract:Positron emission tomography (PET) volumes to be denoised inherently come with diverse count levels, which poses challenges for a unified model that must tackle varied cases. In this work, we turn to the recently flourishing prompt learning to achieve generalizable PET denoising with different count levels. Specifically, we propose dual prompts to guide the PET denoising in a divide-and-conquer manner, i.e., an explicit count-level prompt to provide the specific prior information and an implicit general denoising prompt to encode the essential PET denoising knowledge. Then, a novel prompt fusion module is developed to unify the heterogeneous prompts, followed by a prompt-feature interaction module to inject prompts into the features. The prompts are able to dynamically guide the noise-conditioned denoising process. Therefore, we are able to efficiently train a unified denoising model for various count levels, and deploy it to different cases with personalized prompts. We evaluated on 1940 low-count PET 3D volumes with uniformly randomly selected 13-22% fractions of events from 97 18F-MK6240 tau PET studies. The results show that our dual prompting largely improves performance with an informed count level and outperforms the count-conditional model.

[CV-99] Floating Car Observers in Intelligent Transportation Systems: Detection Modeling and Temporal Insights

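[Quick read]: This paper addresses how to model Floating Car Observer (FCO) detections in microscopic traffic simulation in order to evaluate their potential for Intelligent Transportation Systems (ITS). The key to the solution is a range of modeling approaches, from 2D raytracing to high-fidelity co-simulation combined with 3D object detection algorithms, that closely replicate FCO detection behavior, together with a neural network-based emulation technique that approximates high-fidelity co-simulation results quickly and scalably. The approach preserves the unique characteristics of FCO detections and offers a practical route to using such data in digital twins of traffic networks.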

Link: https://arxiv.org/abs/2505.02845
Authors: Jeremias Gerner,Klaus Bogenberger,Stefanie Schmidtner
Institutions: Unknown
Categories: Physics and Society (physics.soc-ph); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments:

Abstract:Floating Car Observers (FCOs) extend traditional Floating Car Data (FCD) by integrating onboard sensors to detect and localize other traffic participants, providing richer and more detailed traffic data. In this work, we explore various modeling approaches for FCO detections within microscopic traffic simulations to evaluate their potential for Intelligent Transportation System (ITS) applications. These approaches range from 2D raytracing to high-fidelity co-simulations that emulate real-world sensors and integrate 3D object detection algorithms to closely replicate FCO detections. Additionally, we introduce a neural network-based emulation technique that effectively approximates the results of high-fidelity co-simulations. This approach captures the unique characteristics of FCO detections while offering a fast and scalable solution for modeling. Using this emulation method, we investigate the impact of FCO data in a digital twin of a traffic network modeled in SUMO. Results demonstrate that even at a 20% penetration rate, FCOs using LiDAR-based detections can identify 65% of vehicles across various intersections and traffic demand scenarios. Further potential emerges when temporal insights are integrated, enabling the recovery of previously detected but currently unseen vehicles. By employing data-driven methods, we recover over 80% of these vehicles with minimal positional deviations. These findings underscore the potential of FCOs for ITS, particularly in enhancing traffic state estimation and monitoring under varying penetration rates and traffic conditions.

[CV-100] Physical foundations for trustworthy medical imaging: a review for artificial intelligence researchers

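【Quick read】: This paper addresses the limited understanding that AI practitioners working in medical imaging have of the physical principles behind medical image acquisition, a gap that keeps them from fully exploiting AI in this field. The key to the solution is integrating physics knowledge into AI algorithms, in particular using physics-based constraints to make generative models and reconstruction algorithms more trustworthy and robust, which in turn improves the learning of medical imaging features.

Link: https://arxiv.org/abs/2505.02843
Authors: Miriam Cobo, David Corral Fontecha, Wilson Silva, Lara Lloret Iglesias
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 17 pages, 2 figures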

Abstract:Artificial intelligence in medical imaging has seen unprecedented growth in the last years, due to rapid advances in deep learning and computing resources. Applications cover the full range of existing medical imaging modalities, with unique characteristics driven by the physics of each technique. Yet, artificial intelligence professionals entering the field, and even experienced developers, often lack a comprehensive understanding of the physical principles underlying medical image acquisition, which hinders their ability to fully leverage its potential. The integration of physics knowledge into artificial intelligence algorithms enhances their trustworthiness and robustness in medical imaging, especially in scenarios with limited data availability. In this work, we review the fundamentals of physics in medical images and their impact on the latest advances in artificial intelligence, particularly, in generative models and reconstruction algorithms. Finally, we explore the integration of physics knowledge into physics-inspired machine learning models, which leverage physics-based constraints to enhance the learning of medical imaging features.

Artificial Intelligence

[AI-0] AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control

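【Quick read】: This paper addresses the challenge of whole-body motion control on real humanoid robots with high degrees of freedom (DoF) and nonlinear dynamics, in particular enlarging their operational workspace and task capability. The key to the solution is the Adaptive Motion Optimization (AMO) framework, which combines sim-to-real reinforcement learning (RL) with trajectory optimization for real-time, adaptive whole-body control, and which reduces distribution bias in motion-imitation RL by constructing a hybrid dataset and training a network capable of robust adaptation.

Link: https://arxiv.org/abs/2505.03738
Authors: Jialong Li, Xuxin Cheng, Tianshu Huang, Shiqi Yang, Ri-Zhao Qiu, Xiaolong Wang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: website: this https URL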

Abstract:Humanoid robots derive much of their dexterity from hyper-dexterous whole-body movements, enabling tasks that require a large operational workspace: such as picking objects off the ground. However, achieving these capabilities on real humanoids remains challenging due to their high degrees of freedom (DoF) and nonlinear dynamics. We propose Adaptive Motion Optimization (AMO), a framework that integrates sim-to-real reinforcement learning (RL) with trajectory optimization for real-time, adaptive whole-body control. To mitigate distribution bias in motion imitation RL, we construct a hybrid AMO dataset and train a network capable of robust, on-demand adaptation to potentially O.O.D. commands. We validate AMO in simulation and on a 29-DoF Unitree G1 humanoid robot, demonstrating superior stability and an expanded workspace compared to strong baselines. Finally, we show that AMO’s consistent performance supports autonomous task execution via imitation learning, underscoring the system’s versatility and robustness.

[AI-1] Demonstrating ViSafe: Vision-enabled Safety for High-speed Detect and Avoid

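【Quick read】: This paper addresses assured safe separation of aircraft in high-density airspace, and in particular how to give resource-constrained aerial systems this safety-critical capability. The key to the solution is ViSafe, a vision-based high-speed airborne collision avoidance system that tightly integrates a learning-based edge-AI framework with a custom multi-camera hardware prototype designed under SWaP-C constraints to deliver end-to-end Detect and Avoid (DAA). ViSafe uses perceptual input-focused control barrier functions (CBFs) to design, encode, and enforce safety thresholds, providing provably safe runtime guarantees in high-speed aerial operations.

Link: https://arxiv.org/abs/2505.03694
Authors: Parv Kapoor, Ian Higgins, Nikhil Keetha, Jay Patrikar, Brady Moon, Zelin Ye, Yao He, Ivan Cisneros, Yaoyu Hu, Changliu Liu, Eunsuk Kang, Sebastian Scherer
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 13 pages, RSS 2025 Demo track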

Abstract:Assured safe-separation is essential for achieving seamless high-density operation of airborne vehicles in a shared airspace. To equip resource-constrained aerial systems with this safety-critical capability, we present ViSafe, a high-speed vision-only airborne collision avoidance system. ViSafe offers a full-stack solution to the Detect and Avoid (DAA) problem by tightly integrating a learning-based edge-AI framework with a custom multi-camera hardware prototype designed under SWaP-C constraints. By leveraging perceptual input-focused control barrier functions (CBF) to design, encode, and enforce safety thresholds, ViSafe can provide provably safe runtime guarantees for self-separation in high-speed aerial operations. We evaluate ViSafe’s performance through an extensive test campaign involving both simulated digital twins and real-world flight scenarios. By independently varying agent types, closure rates, interaction geometries, and environmental conditions (e.g., weather and lighting), we demonstrate that ViSafe consistently ensures self-separation across diverse scenarios. In first-of-its-kind real-world high-speed collision avoidance tests with closure rates reaching 144 km/h, ViSafe sets a new benchmark for vision-only autonomous collision avoidance, establishing a new standard for safety in high-speed aerial navigation.
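To make the control barrier function idea concrete, here is a toy one-dimensional sketch, not ViSafe's perception-driven formulation. It assumes the controller directly commands the rate of change of the separation distance, defines the barrier as h = distance - min_distance, and enforces the standard condition dh/dt >= -alpha * h. The names `cbf_safe_rate`, `min_distance`, and `alpha` are hypothetical.

```python
def cbf_safe_rate(distance, nominal_rate, min_distance=10.0, alpha=0.5):
    """Filter a commanded separation rate through a 1D control barrier function.

    Barrier: h = distance - min_distance (safe while h >= 0).
    Condition enforced: d(distance)/dt >= -alpha * h.
    The nominal command is kept whenever it already satisfies the condition.
    """
    h = distance - min_distance
    return max(nominal_rate, -alpha * h)


def simulate(initial_distance=100.0, nominal_rate=-20.0, dt=0.1, steps=200):
    """Euler-integrate the separation distance under the filtered command."""
    distance = initial_distance
    trajectory = [distance]
    for _ in range(steps):
        distance += dt * cbf_safe_rate(distance, nominal_rate)
        trajectory.append(distance)
    return trajectory


# The nominal command keeps closing at 20 units/s; the filter slows the
# approach so that the separation never drops below min_distance.
trajectory = simulate()
print(min(trajectory))
```

With this explicit Euler step the barrier stays non-negative as long as alpha * dt <= 1; a real system would solve a small optimization over multi-dimensional controls rather than take a scalar maximum.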

[AI-2] Graph Drawing for LLMs: An Empirical Evaluation

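【Quick read】: This paper addresses how to improve the performance of Large Language Models (LLMs) on graph-related tasks that rely on the visual modality. The key to the solution is choosing the right layout paradigm, optimizing the readability of the input drawing from a human perspective, and selecting an effective prompting technique, so that the model understands and processes the graph structure better.

Link: https://arxiv.org/abs/2505.03678
Authors: Walter Didimo, Fabrizio Montecchiani, Tommaso Piselli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: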

Abstract:Our work contributes to the fast-growing literature on the use of Large Language Models (LLMs) to perform graph-related tasks. In particular, we focus on usage scenarios that rely on the visual modality, feeding the model with a drawing of the graph under analysis. We investigate how the model’s performance is affected by the chosen layout paradigm, the aesthetics of the drawing, and the prompting technique used for the queries. We formulate three corresponding research questions and present the results of a thorough experimental analysis. Our findings reveal that choosing the right layout paradigm and optimizing the readability of the input drawing from a human perspective can significantly improve the performance of the model on the given task. Moreover, selecting the most effective prompting technique is a challenging yet crucial task for achieving optimal performance.

[AI-3] Gap the (Theory of) Mind: Sharing Beliefs About Teammates Goals Boosts Collaboration Perception Not Performance

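【Quick read】: This paper addresses human-agent teams in which goals cannot be communicated directly, asking whether an AI agent's inference of its human teammate's goals can improve task performance and perceived collaboration. The key to the solution is evaluating whether an agent that proactively shares its inferred understanding of the teammate's goals supports strategic adaptation and the subjective sense of collaboration. The experiments show no significant gains in task performance or overall satisfaction, but they suggest that sharing this information strengthens trust and perceived collaboration.

Link: https://arxiv.org/abs/2505.03674
Authors: Yotam Amitai, Reuth Mirsky, Ofra Amir
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: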

Abstract:In human-agent teams, openly sharing goals is often assumed to enhance planning, collaboration, and effectiveness. However, direct communication of these goals is not always feasible, requiring teammates to infer their partner’s intentions through actions. Building on this, we investigate whether an AI agent’s ability to share its inferred understanding of a human teammate’s goals can improve task performance and perceived collaboration. Through an experiment comparing three conditions, no recognition (NR), viable goals (VG), and viable goals on-demand (VGod), we find that while goal-sharing information did not yield significant improvements in task performance or overall satisfaction scores, thematic analysis suggests that it supported strategic adaptations and subjective perceptions of collaboration. Cognitive load assessments revealed no additional burden across conditions, highlighting the challenge of balancing informativeness and simplicity in human-agent interactions. These findings highlight the nuanced trade-off of goal-sharing: while it fosters trust and enhances perceived collaboration, it can occasionally hinder objective performance gains.

[AI-4] Learning Symbolic Persistent Macro-Actions for POMDP Solving Over Time

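【Quick read】: This paper addresses interpretable decision-making under uncertainty, particularly when macro-actions are required. The key to the solution is combining temporal logical reasoning with Partially Observable Markov Decision Processes (POMDPs): a fragment of Linear Temporal Logic (LTL) based on Event Calculus (EC) generates persistent macro-actions that guide Monte Carlo Tree Search (MCTS)-based POMDP solvers over a time horizon, significantly reducing inference time while keeping performance robust. The macro-actions are learned via Inductive Logic Programming (ILP) from a few execution traces, so no hand-designed heuristics are needed and only the POMDP transition model has to be specified.

Link: https://arxiv.org/abs/2505.03668
Authors: Celeste Veronese, Daniele Meli, Alessandro Farinelli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at 9th Conference on Neurosymbolic Learning and Reasoning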

Abstract:This paper proposes an integration of temporal logical reasoning and Partially Observable Markov Decision Processes (POMDPs) to achieve interpretable decision-making under uncertainty with macro-actions. Our method leverages a fragment of Linear Temporal Logic (LTL) based on Event Calculus (EC) to generate persistent (i.e., constant) macro-actions, which guide Monte Carlo Tree Search (MCTS)-based POMDP solvers over a time horizon, significantly reducing inference time while ensuring robust performance. Such macro-actions are learnt via Inductive Logic Programming (ILP) from a few traces of execution (belief-action pairs), thus eliminating the need for manually designed heuristics and requiring only the specification of the POMDP transition model. In the Pocman and Rocksample benchmark scenarios, our learned macro-actions demonstrate increased expressiveness and generality when compared to time-independent heuristics, indeed offering substantial computational efficiency improvements.

[AI-5] Counterfactual Inference for Eliminating Sentiment Bias in Recommender Systems

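【Quick read】: This paper addresses the drop in recommendation accuracy caused by sentiment bias in Recommender Systems (RSs): accuracy is worse for users or items with negative reviews, which treats critical users and niche items unfairly. The key to the solution is a two-stage treatment from the perspective of counterfactual inference: at the training stage, a causal graph is built to model how sentiment influences the final rating score; at the inference stage, the direct and indirect effects are decoupled and the indirect effect of sentiment bias is removed through counterfactual inference.

Link: https://arxiv.org/abs/2505.03655
Authors: Le Pan, Yuanjiang Cao, Chengkai Huang, Wenjie Zhang, Lina Yao
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: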

Abstract:Recommender Systems (RSs) aim to provide personalized recommendations for users. A newly discovered bias, known as sentiment bias, uncovers a common phenomenon within Review-based RSs (RRSs): the recommendation accuracy of users or items with negative reviews deteriorates compared with users or items with positive reviews. Critical users and niche items are disadvantaged by such unfair recommendations. We study this problem from the perspective of counterfactual inference with two stages. At the model training stage, we build a causal graph and model how sentiment influences the final rating score. During the inference stage, we decouple the direct and indirect effects to mitigate the impact of sentiment bias and remove the indirect effect using counterfactual inference. We have conducted extensive experiments, and the results validate that our model can achieve comparable performance on rating prediction for better recommendations and effective mitigation of sentiment bias. To the best of our knowledge, this is the first work to employ counterfactual inference on sentiment bias mitigation in RSs.

[AI-6] BURNS: Backward Underapproximate Reachability for Neural-Feedback-Loop Systems

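【Quick read】: This paper addresses the lack of rigorous performance or safety guarantees for learning-enabled planning and control algorithms. The key to the solution is an algorithm that computes underapproximate backward reachable sets of nonlinear discrete-time neural feedback loops: it overapproximates the system dynamics function so that the underapproximate backward reachable sets can be obtained by solving mixed-integer linear programs, and these sets are then used to verify goal-reaching properties.

Link: https://arxiv.org/abs/2505.03643
Authors: Chelsea Sidrane, Jana Tumova
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Systems and Control (eess.SY)
Comments: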

Abstract:Learning-enabled planning and control algorithms are increasingly popular, but they often lack rigorous guarantees of performance or safety. We introduce an algorithm for computing underapproximate backward reachable sets of nonlinear discrete time neural feedback loops. We then use the backward reachable sets to check goal-reaching properties. Our algorithm is based on overapproximating the system dynamics function to enable computation of underapproximate backward reachable sets through solutions of mixed-integer linear programs. We rigorously analyze the soundness of our algorithm and demonstrate it on a numerical example. Our work expands the class of properties that can be verified for learning-enabled systems.

[AI-7] Synthesizing Images on Perceptual Boundaries of ANNs for Uncovering and Manipulating Human Perceptual Variability ICML2025

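【Quick read】: This paper addresses individual differences in human perception and decision-making under uncertainty and ambiguity. The core challenge is how to study and model this perceptual variability across individuals systematically. The key to the solution is a computational framework, BAM (Boundary Alignment Manipulation framework), which combines perceptual boundary sampling in artificial neural networks (ANNs) with human behavioral experiments: it generates stimuli along ANN decision boundaries that induce significant perceptual variability, validates them in large-scale behavioral experiments, and thereby makes it possible to predict and manipulate individual perceptual decisions.

Link: https://arxiv.org/abs/2505.03641
Authors: Chen Wei, Chi Zhang, Jiachen Zou, Haotian Deng, Dietmar Heinke, Quanying Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: accepted at ICML 2025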

Abstract:Human decision-making in cognitive tasks and daily life exhibits considerable variability, shaped by factors such as task difficulty, individual preferences, and personal experiences. Understanding this variability across individuals is essential for uncovering the perceptual and decision-making mechanisms that humans rely on when faced with uncertainty and ambiguity. We present a computational framework BAM (Boundary Alignment Manipulation framework) that combines perceptual boundary sampling in ANNs and human behavioral experiments to systematically investigate this phenomenon. Our perceptual boundary sampling algorithm generates stimuli along ANN decision boundaries that intrinsically induce significant perceptual variability. The efficacy of these stimuli is empirically validated through large-scale behavioral experiments involving 246 participants across 116,715 trials, culminating in the variMNIST dataset containing 19,943 systematically annotated images. Through personalized model alignment and adversarial generation, we establish a reliable method for simultaneously predicting and manipulating the divergent perceptual decisions of pairs of participants. This work bridges the gap between computational models and human individual difference research, providing new tools for personalized perception analysis.

[AI-8] Rainbow Delay Compensation: A Multi-Agent Reinforcement Learning Framework for Mitigating Delayed Observation

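【Quick read】: This paper addresses the performance degradation that observation delays cause in multi-agent reinforcement learning (MARL), especially under stochastic individual delays. The core challenge is that each agent's local observation usually consists of several components originating from other agents or from dynamic entities in the environment, and these discrete components with different delay characteristics complicate decision-making. The paper proposes Rainbow Delay Compensation (RDC), a MARL training framework whose key idea is to compensate for stochastic individual delays, which lets algorithms approach ideal delay-free performance in some delay scenarios while retaining good generalization.

Link: https://arxiv.org/abs/2505.03586
Authors: Songchen Fu, Siang Chen, Shaojing Zhao, Letian Bai, Ta Li, Yonghong Yan
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: The code will be open-sourced in the RDC-pymarl project under this https URL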

Abstract:In real-world multi-agent systems (MASs), observation delays are ubiquitous, preventing agents from making decisions based on the environment’s true state. An individual agent’s local observation often consists of multiple components from other agents or dynamic entities in the environment. These discrete observation components with varying delay characteristics pose significant challenges for multi-agent reinforcement learning (MARL). In this paper, we first formulate the decentralized stochastic individual delay partially observable Markov decision process (DSID-POMDP) by extending the standard Dec-POMDP. We then propose the Rainbow Delay Compensation (RDC), a MARL training framework for addressing stochastic individual delays, along with recommended implementations for its constituent modules. We implement the DSID-POMDP’s observation generation pattern using standard MARL benchmarks, including MPE and SMAC. Experiments demonstrate that baseline MARL methods suffer severe performance degradation under fixed and unfixed delays. The RDC-enhanced approach mitigates this issue, remarkably achieving ideal delay-free performance in certain delay scenarios while maintaining generalization capability. Our work provides a novel perspective on multi-agent delayed observation problems and offers an effective solution framework.

[AI-9] BCause: Human-AI collaboration to improve hybrid mapping and ideation in argumentation-grounded deliberation

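【Quick read】: This paper addresses the scattered discourse, shallow understanding, and disconnect from actionable policy outcomes that affect public deliberation. The key to the solution is BCause, a discussion system that combines generative AI with human-machine collaboration to turn unstructured dialogue into structured, actionable democratic processes, enabling deeper analysis of public issues and more effective responses to them.

Link: https://arxiv.org/abs/2505.03584
Authors: Lucas Anastasiou, Anna De Liddo
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 5 pages, 3 figures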

Abstract:Public deliberation, as in open discussion of issues of public concern, often suffers from scattered and shallow discourse, poor sensemaking, and a disconnect from actionable policy outcomes. This paper introduces BCause, a discussion system leveraging generative AI and human-machine collaboration to transform unstructured dialogue around public issues (such as urban living, policy changes, and current socio-economic transformations) into structured, actionable democratic processes. We present three innovations: (i) importing and transforming unstructured transcripts into argumentative discussions, (ii) geo-deliberated problem-sensing via a Telegram bot for local issue reporting, and (iii) smart reporting with customizable widgets (e.g., summaries, topic modelling, policy recommendations, clustered arguments). The system’s human-AI partnership preserves critical human participation to ensure ethical oversight, contextual relevance, and creative synthesis.

[AI-10] LlamaFirewall: An open source guardrail system for building secure AI agents

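【Quick read】: This paper addresses the new security risks that arise when Large Language Models (LLMs) act as autonomous agents on complex tasks, including prompt injection, agent misalignment, and insecure code generation, which existing measures such as model fine-tuning or chatbot-focused guardrails do not fully cover. The key to the solution is a real-time guardrail monitor that serves as a final layer of defense and supports the definition and enforcement of system-level, use-case-specific safety policies. The paper presents LlamaFirewall, an open-source guardrail framework with three core guardrails: PromptGuard 2 (a jailbreak detector), Agent Alignment Checks (a chain-of-thought auditor), and CodeShield (an online static analysis engine), which together mitigate these risks.

Link: https://arxiv.org/abs/2505.03574
Authors: Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, Alekhya Gampa, Beto de Paola, Dominik Gabi, James Crnkovich, Jean-Christophe Testud, Kat He, Rashnil Chaturvedi, Wu Zhou, Joshua Saxe
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: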

Abstract:Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing security measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor to serve as a final layer of defense, and support system level, use case specific safety policy definition and enforcement. We introduce LlamaFirewall, an open-source security focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI Agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code risks through three powerful guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state of the art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment, which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, an online static analysis engine that is both fast and extensible, aimed at preventing the generation of insecure or dangerous code by coding agents. Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent’s security guardrails.

[AI-11] OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents

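【Quick read】: This paper addresses the lack of a unified, extensible, automatically validated benchmark for evaluating GUI-navigation AI agents on complex, multimodal desktop tasks. The key to the solution is the OSUniverse benchmark, whose tasks are organized in levels of increasing complexity, from basic precision clicking to multistep, multiapplication tests that demand dexterity, precision, and clear thinking. The benchmark is calibrated so that current SOTA (State of the Art) agents score no higher than 50%, while an average office worker can complete every task with perfect accuracy. It also introduces an automated validation mechanism with an average error rate below 2%, giving a reliable basis for measuring the progress, capabilities, and effectiveness of AI agents over the short and medium term.

Link: https://arxiv.org/abs/2505.03570
Authors: Mariya Davydova, Daniel Jeffries, Patrick Barker, Arturo Márquez Flores, Sinéad Ryan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: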

Abstract:In this paper, we introduce OSUniverse: a benchmark of complex, multimodal desktop-oriented tasks for advanced GUI-navigation AI agents that focuses on ease of use, extensibility, comprehensive coverage of test cases, and automated validation. We divide the tasks in increasing levels of complexity, from basic precision clicking to multistep, multiapplication tests requiring dexterity, precision, and clear thinking from the agent. In version one of the benchmark, presented here, we have calibrated the complexity of the benchmark test cases to ensure that the SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white collar worker can perform all these tasks with perfect accuracy. The benchmark can be scored manually, but we also introduce an automated validation mechanism that has an average error rate less than 2%. Therefore, this benchmark presents solid ground for fully automated measuring of progress, capabilities and the effectiveness of GUI-navigation AI agents over the short and medium-term horizon. The source code of the benchmark is available at this https URL.

[AI-12] Ergodic Generative Flows ICML2025

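【Quick read】: This paper addresses the difficulties of training Generative Flow Networks (GFNs) in continuous settings and for Imitation Learning (IL), including the intractability of the flow-matching loss, the limited testing of non-acyclic training, and the reliance on a separate reward model in imitation learning. The key to the solution is a family of generative flows called Ergodic Generative Flows (EGFs), which exploit ergodicity to build simple generative flows from globally defined transformations with universality guarantees and a tractable flow-matching loss (FM loss), together with a new loss that couples cross-entropy with weak flow-matching control, the KL-weakFM loss, for IL training without a separate reward model.

Link: https://arxiv.org/abs/2505.03561
Authors: Leo Maxime Brunswic, Mateo Clemente, Rui Heng Yang, Adam Sigal, Amir Rasouli, Yinchuan Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Differential Geometry (math.DG); Dynamical Systems (math.DS)
Comments: 20 pages, 5 figures, 1 table, accepted at ICML 2025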

Abstract:Generative Flow Networks (GFNs) were initially introduced on directed acyclic graphs to sample from an unnormalized distribution density. Recent works have extended the theoretical framework for generative methods allowing more flexibility and enhancing application range. However, many challenges remain in training GFNs in continuous settings and for imitation learning (IL), including intractability of flow-matching loss, limited tests of non-acyclic training, and the need for a separate reward model in imitation learning. The present work proposes a family of generative flows called Ergodic Generative Flows (EGFs) which are used to address the aforementioned issues. First, we leverage ergodicity to build simple generative flows with finitely many globally defined transformations (diffeomorphisms) with universality guarantees and tractable flow-matching loss (FM loss). Second, we introduce a new loss involving cross-entropy coupled to weak flow-matching control, coined KL-weakFM loss. It is designed for IL training without a separate reward model. We evaluate IL-EGFs on toy 2D tasks and real-world datasets from NASA on the sphere, using the KL-weakFM loss. Additionally, we conduct toy 2D reinforcement learning experiments with a target reward, using the FM loss.

[AI-13] Rapid AI-based generation of coverage paths for dispensing applications

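【Quick read】: This paper addresses coverage path planning for Thermal Interface Materials (TIM), which plays a crucial role in the design of power electronics and electronic control units. Existing practice relies on manual work by experts or on computationally expensive optimization. The paper proposes a new AI-based approach whose key idea is an Artificial Neural Network (ANN) that takes the target cooling area as input and directly outputs the dispense path; it requires no labels, and the resulting paths can be transferred directly to automated manufacturing equipment without air entrapments.

Link: https://arxiv.org/abs/2505.03560
Authors: Simon Baeuerle, Ian F. Mendonca, Kristof Van Laerhoven, Ralf Mikut, Andreas Steimer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: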

Abstract:Coverage Path Planning of Thermal Interface Materials (TIM) plays a crucial role in the design of power electronics and electronic control units. Up to now, this is done manually by experts or by using optimization approaches with a high computational effort. We propose a novel AI-based approach to generate dispense paths for TIM and similar dispensing applications. It is a drop-in replacement for optimization-based approaches. An Artificial Neural Network (ANN) receives the target cooling area as input and directly outputs the dispense path. Our proposed setup does not require labels and we show its feasibility on multiple target areas. The resulting dispense paths can be directly transferred to automated manufacturing equipment and do not exhibit air entrapments. The approach of using an ANN to predict process parameters for a desired target state in real-time could potentially be transferred to other manufacturing processes.

[AI-14] A Hashgraph-Inspired Consensus Mechanism for Reliable Multi-Model Reasoning

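【Quick read】: This paper addresses inconsistent outputs and hallucinations from Large Language Models (LLMs), which stand in the way of reliable AI systems. When different proprietary reasoning models (RMs) receive the same complex request, they often produce divergent results because of differences in training and inference. The paper proposes a consensus mechanism inspired by distributed ledger technology that treats each RM as a black-box peer and reaches agreement among multiple RMs through gossip-about-gossip communication and virtual voting. Its key idea is to build on the Hashgraph algorithm so that RMs iteratively exchange and update their answers, using the information from each round to improve accuracy and confidence in later rounds; this goes beyond simple majority voting by integrating and cross-verifying the knowledge of every model.

Link: https://arxiv.org/abs/2505.03553
Authors: Kolawole E. Ogunsina, Morayo A. Ogunsina
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 15 pages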

Abstract:Inconsistent outputs and hallucinations from large language models (LLMs) are major obstacles to reliable AI systems. When different proprietary reasoning models (RMs), such as those by OpenAI, Google, Anthropic, DeepSeek, and xAI, are given the same complex request, they often produce divergent results due to variations in training and inference. This paper proposes a novel consensus mechanism, inspired by distributed ledger technology, to validate and converge these outputs, treating each RM as a black-box peer. Building on the Hashgraph consensus algorithm, our approach employs gossip-about-gossip communication and virtual voting to achieve agreement among an ensemble of RMs. We present an architectural design for a prototype system in which RMs iteratively exchange and update their answers, using information from each round to improve accuracy and confidence in subsequent rounds. This approach goes beyond simple majority voting by incorporating the knowledge and cross-verification content of every model. We justify the feasibility of this Hashgraph-inspired consensus for AI ensembles and outline its advantages over traditional ensembling techniques in reducing nonfactual outputs. Preliminary considerations for implementation, evaluation criteria for convergence and accuracy, and potential challenges are discussed. The proposed mechanism demonstrates a promising direction for multi-agent AI systems to self-validate and deliver high-fidelity responses in complex tasks.

[AI-15] STORY2GAME: Generating (Almost) Everything in an Interactive Fiction Game

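【Quick read】: This paper addresses how to use large language models to generate text-based interactive fiction games, in particular how to build the action code in a game engine after a story has been generated so that the story can be played interactively. The key to the solution is using the LLM-generated preconditions and effects of actions as guides for which aspects of the game state must be tracked and changed, which allows story generation to be more open-ended while staying grounded in the game state.

Link: https://arxiv.org/abs/2505.03547
Authors: Eric Zhou, Shreyas Basavatia, Moontashir Siam, Zexin Chen, Mark O. Riedl
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: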

Abstract:We introduce STORY2GAME, a novel approach to using Large Language Models to generate text-based interactive fiction games that starts by generating a story, populates the world, and builds the code for actions in a game engine that enables the story to play out interactively. Whereas a given set of hard-coded actions can artificially constrain story generation, the ability to generate actions means the story generation process can be more open-ended but still allow for experiences that are grounded in a game state. The key to successful action generation is to use LLM-generated preconditions and effects of actions in the stories as guides for what aspects of the game state must be tracked and changed by the game engine when a player performs an action. We also introduce a technique for dynamically generating new actions to accommodate the player’s desire to perform actions that they think of that are not part of the story. Dynamic action generation may require on-the-fly updates to the game engine’s state representation and revision of previously generated actions. We evaluate the success rate of action code generation with respect to whether a player can interactively play through the entire generated story.
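The role of preconditions and effects can be sketched with a tiny game-state model. This is an illustrative assumption, not STORY2GAME's engine: the `Action` class, the flat dictionary state, and the attribute names are hypothetical. Preconditions name the state attributes that must hold before an action runs, and effects name the attributes the engine must change afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """A hypothetical action described by preconditions and effects."""
    name: str
    preconditions: dict = field(default_factory=dict)  # attribute -> required value
    effects: dict = field(default_factory=dict)        # attribute -> new value


def can_apply(state, action):
    """An action is applicable when every precondition holds in the state."""
    return all(state.get(key) == value for key, value in action.preconditions.items())


def apply_action(state, action):
    """Return a new state with the action's effects applied."""
    if not can_apply(state, action):
        raise ValueError(f"preconditions of '{action.name}' are not satisfied")
    new_state = dict(state)
    new_state.update(action.effects)
    return new_state


# Toy story fragment: the player must pick up a key before unlocking a door.
state = {"player_has_key": False, "door_locked": True}
take_key = Action("take key", {"player_has_key": False}, {"player_has_key": True})
unlock_door = Action(
    "unlock door",
    {"player_has_key": True, "door_locked": True},
    {"door_locked": False},
)

print(can_apply(state, unlock_door))
state = apply_action(state, take_key)
state = apply_action(state, unlock_door)
print(state)
```

The union of all attributes mentioned in preconditions and effects tells the engine which parts of the state it has to track, which is the guiding idea the abstract describes.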

[AI-16] Augmenting Human Cognition through Everyday AR

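【Quick read】: This paper addresses how to turn augmented reality (AR) from a conventional information display into an intelligent "thinking tool" that actively perceives the environment and supports people in carrying out tasks. The key to the solution is integrating semantic, context-aware AI so that always-on AR can seamlessly connect digital cognition with the physical environment, enabling more proactive and context-sensitive interaction.

Link: https://arxiv.org/abs/2505.03492
Authors: Xiaoan Liu
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 3 pages, 4 figures. Position paper accepted to CHI’25 Workshop ‘Everyday AR through AI-in-the-Loop’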

Abstract:As spatial computing and multimodal LLMs mature, AR is tending to become an intuitive “thinking tool,” embedding semantic and context-aware intelligence directly into everyday environments. This paper explores how always-on AR can seamlessly bridge digital cognition and physical affordances, enabling proactive, context-sensitive interactions that enhance human task performance and understanding.

[AI-17] A new membership inference attack that spots memorization in generative and predictive models: Loss-Based with Reference Model algorithm (LBRM)

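【Quick read】: This paper addresses the possibility that time series imputation models unintentionally memorize training data, and the privacy risks that follow. The key to the solution is the Loss-Based with Reference Model (LBRM) algorithm, which uses a reference model to make membership inference attacks more accurate at distinguishing training data from test data. On average the method improves AUROC by about 40% without fine-tuning and by about 60% with fine-tuning, showing clear gains in detection accuracy and applicability.

Link: https://arxiv.org/abs/2505.03490
Authors: Faiz Taleb, Ivan Gazeau, Maryline Laurent
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: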

Abstract:Generative models can unintentionally memorize training data, posing significant privacy risks. This paper addresses the memorization phenomenon in time series imputation models, introducing the Loss-Based with Reference Model (LBRM) algorithm. The LBRM method leverages a reference model to enhance the accuracy of membership inference attacks, distinguishing between training and test data. Our contributions are twofold: first, we propose an innovative method to effectively extract and identify memorized training data, significantly improving detection accuracy. On average, without fine-tuning, the AUROC improved by approximately 40%. With fine-tuning, the AUROC increased by approximately 60%. Second, we validate our approach through membership inference attacks on two types of architectures designed for time series imputation, demonstrating the robustness and versatility of the LBRM approach in different contexts. These results highlight the significant enhancement in detection accuracy provided by the LBRM approach, addressing privacy risks in time series imputation models.
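The general idea of calibrating a loss-based membership score with a reference model can be sketched as follows. The abstract does not give LBRM's exact scoring rule, so the difference score below is an assumption, and the loss values are synthetic toy numbers chosen for illustration, not results from the paper. A sample is scored as more likely to be a training member when the target model's loss on it is low relative to the loss of a reference model that never saw it.

```python
def membership_scores(target_losses, reference_losses):
    """Assumed scoring rule: reference loss minus target loss (higher = member)."""
    return [ref - tgt for tgt, ref in zip(target_losses, reference_losses)]


def auroc(scores, labels):
    """AUROC as the probability that a member outscores a non-member (ties count 0.5)."""
    members = [s for s, y in zip(scores, labels) if y == 1]
    non_members = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if m > n else 0.5 if m == n else 0.0
        for m in members
        for n in non_members
    )
    return wins / (len(members) * len(non_members))


# Synthetic toy losses: the first three samples are training members.
labels = [1, 1, 1, 0, 0, 0]
target_losses = [0.2, 0.9, 0.5, 0.3, 0.8, 0.6]
reference_losses = [0.6, 1.5, 1.0, 0.3, 0.9, 0.5]

# Baseline attack: use the negated target loss alone.
baseline = auroc([-loss for loss in target_losses], labels)
# Reference-calibrated attack: hard samples no longer look like non-members.
calibrated = auroc(membership_scores(target_losses, reference_losses), labels)
print(baseline, calibrated)
```

The reference model removes the part of the loss that reflects how intrinsically hard a sample is, which is why the calibrated score separates the toy members more cleanly than the raw loss.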

[AI-18] am-ELO: A Stable Framework for Arena-based LLM Evaluation ICML2025

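【Quick read】: This paper addresses the instability of arena-based evaluation frameworks built on the ELO rating system when they are used to evaluate modern AI models, especially large language models (LLMs): rankings are inconsistent and differences in annotator ability are not taken into account. The key to the solution is an improved ELO rating system, m-ELO, which replaces the conventional iterative update with Maximum Likelihood Estimation (MLE) and comes with a theoretical proof of the consistency and stability of the resulting model ranking. The paper also proposes am-ELO, which modifies the ELO probability function to incorporate annotator ability, so that model scores and annotator reliability are estimated jointly.

Link: https://arxiv.org/abs/2505.03475
Authors: Zirui Liu, Jiatong Li, Yan Zhuang, Qi Liu, Shuanghong Shen, Jie Ouyang, Mingyue Cheng, Shijin Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML2025 Accepted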

Abstract:Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). The existing framework, based on the ELO rating system, suffers from an inevitable instability problem due to ranking inconsistency and the lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework to address these issues by enhancing the ELO Rating System. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we propose am-ELO, which modifies the Elo Rating’s probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, proving that this framework offers a more robust, accurate, and stable evaluation method for LLMs.
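
To make the contrast with iterative updates concrete, here is a minimal sketch of fitting ratings by maximum likelihood under the standard Elo win probability (a Bradley-Terry style model). It is not the paper's m-ELO code: the gradient-ascent fit, the function names, and the toy match list are assumptions for illustration, and the annotator-ability term of am-ELO is left out.

```python
import math


def fit_mle_scores(matches, n_models, steps=500, lr=1.0):
    """Fit zero-mean log-odds scores by gradient ascent on the log-likelihood.

    matches is a list of (winner, loser) index pairs.
    """
    theta = [0.0] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for winner, loser in matches:
            p_win = 1.0 / (1.0 + math.exp(-(theta[winner] - theta[loser])))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        theta = [t + lr * g / len(matches) for t, g in zip(theta, grad)]
        mean = sum(theta) / n_models  # scores are only identified up to a shift
        theta = [t - mean for t in theta]
    return theta


def to_elo_scale(theta):
    """Convert log-odds scores to the conventional 400-point Elo scale."""
    return [400.0 * t / math.log(10.0) for t in theta]


# Toy record: every pair plays three times and the lower index wins twice.
matches = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1), (0, 2), (0, 2), (2, 0)]
scores = fit_mle_scores(matches, n_models=3)
scores_reversed_order = fit_mle_scores(list(reversed(matches)), n_models=3)
```

Because the likelihood is a sum over all matches, the fitted scores do not depend on the order in which matches are listed (up to floating-point rounding). A sequential Elo update, by contrast, can yield different ratings when the same matches are replayed in a different order.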

[AI-19] Detecting Quishing Attacks with Machine Learning Techniques Through QR Code Analysis

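【Quick read】: This paper tackles QR-code-based phishing (quishing), a cybersecurity threat in which attackers use QR codes to bypass traditional phishing defenses. Existing detection methods rely mainly on analyzing the URL embedded in the QR code, which may expose users to malicious content and cannot handle non-URL data types encoded in QR codes. The key to the proposed solution is to analyze the structure and pixel patterns of the QR code directly, without extracting the embedded content, which allows broader detection coverage. The authors build a dataset of malicious and benign QR codes, train several machine learning models, and obtain the best detection performance with XGBoost, demonstrating the feasibility of QR-centric detection.

Link: https://arxiv.org/abs/2505.03451
Authors: Fouad Trad, Ali Chehab
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted in 8th International Conference on Optimization and Learning (OLA2025)

Click to view abstract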

Abstract:The rise of QR-code-based phishing (“Quishing”) poses a growing cybersecurity threat, as attackers increasingly exploit QR codes to bypass traditional phishing defenses. Existing detection methods predominantly focus on URL analysis, which requires the extraction of the QR code payload, and may inadvertently expose users to malicious content. Moreover, QR codes can encode various types of data beyond URLs, such as Wi-Fi credentials and payment information, making URL-based detection insufficient for broader security concerns. To address these gaps, we propose the first framework for quishing detection that directly analyzes QR code structure and pixel patterns without extracting the embedded content. We generated a dataset of phishing and benign QR codes and used it to train and evaluate multiple machine learning models, including Logistic Regression, Decision Trees, Random Forest, Naive Bayes, LightGBM, and XGBoost. Our best-performing model (XGBoost) achieves an AUC of 0.9106, demonstrating the feasibility of QR-centric detection. Through feature importance analysis, we identify key visual indicators of malicious intent and refine our feature set by removing non-informative pixels, improving performance to an AUC of 0.9133 with a reduced feature space. Our findings reveal that the structural features of QR codes correlate strongly with phishing risk. This work establishes a foundation for quishing mitigation and highlights the potential of direct QR analysis as a critical layer in modern phishing defenses.

[AI-20] The Steganographic Potentials of Language Models ICLR2025

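【Quick read】: This paper examines the detection challenge posed by large language models (LLMs) hiding information in text (steganography), and the potential threat this capability poses to the alignment of AI agents and the faithfulness of model reasoning. The key to its approach is to fine-tune LLMs with reinforcement learning (RL) and study their steganographic capabilities in three settings: developing covert encoding schemes, performing steganography when prompted, and hiding reasoning in realistic scenarios without explicit prompting. The study finds that current models have only rudimentary steganographic abilities in terms of security and capacity, but that explicit algorithmic guidance markedly enhances their capacity for information concealment.

Link: https://arxiv.org/abs/2505.03439
Authors: Artem Karpov, Tinuade Adeleke, Seong Hah Cho, Natalia Perez-Campanero
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Published at Building Trust Workshop at ICLR 2025

Click to view abstract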

Abstract:The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to detection and thwarting of unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert encoding schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic scenarios where hidden reasoning is likely, but not prompted. In these scenarios, we detect the intention of LLMs to hide their reasoning as well as their steganography performance. Our findings in the fine-tuning experiments as well as in behavioral non-fine-tuning evaluations reveal that while current models exhibit rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.

[AI-21] Procedural Memory Is Not All You Need: Bridging Cognitive Gaps in LLM-Based Agents

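【Quick read】: This paper addresses the limitations of large language models (LLMs) in complex, unpredictable environments. Although these models excel at procedural tasks, their reliance on procedural memory limits their performance in “wicked” learning environments that demand adaptive intelligence. The key to the proposed solution is to add semantic memory and associative learning systems within a modular architecture that decouples these cognitive functions, bridging the gap between narrow procedural expertise and the adaptive intelligence needed for real-world problem-solving.

Link: https://arxiv.org/abs/2505.03434
Authors: Schaun Wheeler, Olivier Jeunen
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the workshop on Hybrid AI for Human-Centric Personalization (HyPer), co-located with ACM UMAP '25

Click to view abstract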

Abstract:Large Language Models (LLMs) represent a landmark achievement in Artificial Intelligence (AI), demonstrating unprecedented proficiency in procedural tasks such as text generation, code completion, and conversational coherence. These capabilities stem from their architecture, which mirrors human procedural memory – the brain’s ability to automate repetitive, pattern-driven tasks through practice. However, as LLMs are increasingly deployed in real-world applications, it becomes impossible to ignore their limitations operating in complex, unpredictable environments. This paper argues that LLMs, while transformative, are fundamentally constrained by their reliance on procedural memory. To create agents capable of navigating “wicked” learning environments – where rules shift, feedback is ambiguous, and novelty is the norm – we must augment LLMs with semantic memory and associative learning systems. By adopting a modular architecture that decouples these cognitive functions, we can bridge the gap between narrow procedural expertise and the adaptive intelligence required for real-world problem-solving.

[AI-22] Framework GNN-AID: Graph Neural Network Analysis Interpretation and Defense

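【Quick read】: This paper addresses the lack of interpretability and robustness in machine learning models for graph data: existing tools generally offer little support for graph data and rarely integrate the two aspects into a single solution. The key solution is GNN-AID (Graph Neural Network Analysis, Interpretation, and Defense), an open-source framework for analyzing graph data and graph neural network (GNN) behavior using attacks, defenses, and interpretability methods. It supports a range of trust-enhancing techniques and provides visualization tools and no-code features that simplify the exploration and analysis of GNNs.

Link: https://arxiv.org/abs/2505.03424
Authors: Kirill Lukyanov, Mikhail Drobyshevskiy, Georgii Sazonov, Mikhail Soloviov, Ilya Makarov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract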

Abstract:The growing need for Trusted AI (TAI) highlights the importance of interpretability and robustness in machine learning models. However, many existing tools overlook graph data and rarely combine these two aspects into a single solution. Graph Neural Networks (GNNs) have become a popular approach, achieving top results across various tasks. We introduce GNN-AID (Graph Neural Network Analysis, Interpretation, and Defense), an open-source framework designed for graph data to address this gap. Built as a Python library, GNN-AID supports advanced trust methods and architectural layers, allowing users to analyze graph datasets and GNN behavior using attacks, defenses, and interpretability methods. GNN-AID is built on PyTorch-Geometric, offering preloaded datasets, models, and support for any GNNs through customizable interfaces. It also includes a web interface with tools for graph visualization and no-code features like an interactive model builder, simplifying the exploration and analysis of GNNs. The framework also supports MLOps techniques, ensuring reproducibility and result versioning to track and revisit analyses efficiently. GNN-AID is a flexible tool for developers and researchers. It helps developers create, analyze, and customize graph models, while also providing access to prebuilt datasets and models for quick experimentation. Researchers can use the framework to explore advanced topics on the relationship between interpretability and robustness, test defense strategies, and combine methods to protect against different types of attacks. We also show how defenses against evasion and poisoning attacks can conflict when applied to graph data, highlighting the complex connections between defense strategies. 
GNN-AID is available at this https URL.

[AI-23] Automatic Calibration for Membership Inference Attack on Large Language Models

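【Quick read】: This paper addresses two weaknesses of existing Membership Inference Attacks (MIAs) that judge whether a specific text was part of the pre-training data of a large language model (LLM): a high false positive rate, and a dependence on additional reference models for probability calibration that limits practicality. The key to the solution is a new framework, Automatic Calibration Membership Inference Attack (ACMIA), which introduces a tunable temperature parameter to calibrate output probabilities, widening the probability gap between members and non-members and making inference more reliable and robust.

Link: https://arxiv.org/abs/2505.03392
Authors: Saleh Zare Zade, Yao Qiang, Xiangyu Zhou, Hui Zhu, Mohammad Amin Roshani, Prashant Khanduri, Dongxiao Zhu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract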

Abstract:Membership Inference Attacks (MIAs) have recently been employed to determine whether a specific text was part of the pre-training data of Large Language Models (LLMs). However, existing methods often misinfer non-members as members, leading to a high false positive rate, or depend on additional reference models for probability calibration, which limits their practicality. To overcome these challenges, we introduce a novel framework called Automatic Calibration Membership Inference Attack (ACMIA), which utilizes a tunable temperature to calibrate output probabilities effectively. This approach is inspired by our theoretical insights into maximum likelihood estimation during the pre-training of LLMs. We introduce ACMIA in three configurations designed to accommodate different levels of model access and increase the probability gap between members and non-members, improving the reliability and robustness of membership inference. Extensive experiments on various open-source LLMs demonstrate that our proposed attack is highly effective, robust, and generalizable, surpassing state-of-the-art baselines across three widely used benchmarks. Our code is available at this https URL (GitHub).
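
The abstract says only that ACMIA calibrates output probabilities with a tunable temperature; it does not give the scoring rule or the three configurations. The sketch below therefore shows just the generic building block, a temperature-scaled log-probability of the observed tokens. The function names and toy logits are hypothetical, and how ACMIA turns such quantities into a membership decision is not reproduced here.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scaled))
    return [s - log_norm for s in scaled]


def mean_token_log_prob(token_logits, token_ids, temperature=1.0):
    """Average log-probability the model assigns to the observed tokens."""
    total = 0.0
    for logits, token in zip(token_logits, token_ids):
        total += log_softmax(logits, temperature)[token]
    return total / len(token_ids)


# Toy next-token logits over a 3-word vocabulary for a 2-token text.
token_logits = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
token_ids = [0, 1]

score_default = mean_token_log_prob(token_logits, token_ids, temperature=1.0)
score_flattened = mean_token_log_prob(token_logits, token_ids, temperature=2.0)
```

A temperature above 1 flattens each distribution, which lowers the log-probability of tokens the model already preferred; a temperature below 1 sharpens it. Tuning this single parameter changes how strongly confident predictions contribute to a text-level score.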

[AI-24] SPAP: Structured Pruning via Alternating Optimization and Penalty Methods

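【Quick read】: This paper addresses the constraints that heavy computational and memory demands place on deploying large language models (LLMs). Existing structured pruning methods suffer from performance degradation, reliance on heuristic metrics, or the need for expensive fine-tuning. The proposed solution is SPAP (Structured Pruning via Alternating Optimization and Penalty Methods). Its key ideas are to formulate pruning as a mixed-integer optimization model, to use a penalty method that makes pruning decisions so as to minimize pruning error, and to introduce an alternating minimization algorithm tailored to the splittable problem structure for efficient weight updates and performance recovery.

Link: https://arxiv.org/abs/2505.03373
Authors: Hanyu Hu, Xiaoming Yuan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments:

Click to view abstract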

Abstract:The deployment of large language models (LLMs) is often constrained by their substantial computational and memory demands. While structured pruning presents a viable approach by eliminating entire network components, existing methods suffer from performance degradation, reliance on heuristic metrics, or expensive finetuning. To address these challenges, we propose SPAP (Structured Pruning via Alternating Optimization and Penalty Methods), a novel and efficient structured pruning framework for LLMs grounded in optimization theory. SPAP formulates the pruning problem through a mixed-integer optimization model, employs a penalty method that effectively makes pruning decisions to minimize pruning errors, and introduces an alternating minimization algorithm tailored to the splittable problem structure for efficient weight updates and performance recovery. Extensive experiments on OPT, LLaMA-3/3.1/3.2, and Qwen2.5 models demonstrate SPAP’s superiority over state-of-the-art methods, delivering linear inference speedups (1.29× at 30% sparsity) and proportional memory reductions. Our work offers a practical, optimization-driven solution for pruning LLMs while preserving model performance.

[AI-25] Validating the Effectiveness of a Large Language Model-based Approach for Identifying Children’s Development across Various Free Play Settings in Kindergarten

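【Quick read】: This paper addresses the difficulty of assessing children’s development during free play. Traditional methods rely on direct observation by teachers, parents, or researchers, which makes it hard to capture comprehensive developmental information from free play and to provide timely feedback. The key to the solution is to combine large language models (LLMs) with learning analytics: an LLM analyzes children’s self-narratives of their play experiences to identify cognitive, motor, and social abilities, and learning analytics techniques compute performance scores across different play settings, supporting accurate assessment of children’s developmental trajectories.

Link: https://arxiv.org/abs/2505.03369
Authors: Yuanyuan Yang, Yuan Shen, Tianchen Sun, Yangbin Xie
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 15 pages, 4 figures

Click to view abstract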

Abstract:Free play is a fundamental aspect of early childhood education, supporting children’s cognitive, social, emotional, and motor development. However, assessing children’s development during free play poses significant challenges due to the unstructured and spontaneous nature of the activity. Traditional assessment methods often rely on direct observations by teachers, parents, or researchers, which may fail to capture comprehensive insights from free play and provide timely feedback to educators. This study proposes an innovative approach combining Large Language Models (LLMs) with learning analytics to analyze children’s self-narratives of their play experiences. The LLM identifies developmental abilities, while performance scores across different play settings are calculated using learning analytics techniques. We collected 2,224 play narratives from 29 children in a kindergarten, covering four distinct play areas over one semester. According to the evaluation results from eight professionals, the LLM-based approach achieved high accuracy in identifying cognitive, motor, and social abilities, with accuracy exceeding 90% in most domains. Moreover, significant differences in developmental outcomes were observed across play settings, highlighting each area’s unique contributions to specific abilities. These findings confirm that the proposed approach is effective in identifying children’s development across various free play settings. This study demonstrates the potential of integrating LLMs and learning analytics to provide child-centered insights into developmental trajectories, offering educators valuable data to support personalized learning and enhance early childhood education practices.

[AI-26] Domain Adversarial Training for Mitigating Gender Bias in Speech-based Mental Health Detection

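【Quick read】: This paper addresses gender bias in speech-based models for detecting depression and post-traumatic stress disorder (PTSD), which can lead to unfair and inaccurate predictions. The key to the solution is a domain adversarial training approach that treats different genders as distinct domains and integrates this information into a pretrained speech foundation model, reducing the effect of gender differences on model performance.

Link: https://arxiv.org/abs/2505.03359
Authors: June-Woo Kim, Haram Yoon, Wonkyo Oh, Dawoon Jung, Sung-Hoon Yoon, Dae-Jin Kim, Dong-Ho Lee, Sang-Yeol Lee, Chan-Mo Yang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to EMBC 2025

Click to view abstract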

Abstract:Speech-based AI models are emerging as powerful tools for detecting depression and the presence of Post-traumatic stress disorder (PTSD), offering a non-invasive and cost-effective way to assess mental health. However, these models often struggle with gender bias, which can lead to unfair and inaccurate predictions. This study addresses this issue by introducing a domain adversarial training approach that explicitly considers gender differences in speech-based depression and PTSD detection. Specifically, we treat different genders as distinct domains and integrate this information into a pretrained speech foundation model. We then validate its effectiveness on the E-DAIC dataset to assess its impact on performance. Experimental results show that our method notably improves detection performance, increasing the F1-score by up to 13.29 percentage points compared to the baseline. This highlights the importance of addressing demographic disparities in AI-driven mental health assessment.

[AI-27] Safer Prompts: Reducing IP Risk in Visual Generative AI

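【Quick read】: This paper addresses the risk of intellectual property (IP) infringement that arises when generative AI memorizes and reproduces specific content during image generation. The key to the solution is prompt engineering, in particular Chain of Thought Prompting and Task Instruction Prompting, which lowers the similarity between generated images and the training data of diffusion models and thereby reduces the risk of IP infringement.

Link: https://arxiv.org/abs/2505.03338
Authors: Lena Reissinger, Yuanyuan Li, Anna-Carolina Haensch, Neeraj Sarna
Affiliations: Unknown
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract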

Abstract:Visual Generative AI models have demonstrated remarkable capability in generating high-quality images from simple inputs like text prompts. However, because these models are trained on images from diverse sources, they risk memorizing and reproducing specific content, raising concerns about intellectual property (IP) infringement. Recent advances in prompt engineering offer a cost-effective way to enhance generative AI performance. In this paper, we evaluate the effectiveness of prompt engineering techniques in mitigating IP infringement risks in image generation. Our findings show that Chain of Thought Prompting and Task Instruction Prompting significantly reduce the similarity between generated images and the training data of diffusion models, thereby lowering the risk of IP infringement.

[AI-28] Avoid Recommending Out-of-Domain Items: Constrained Generative Recommendation with LLMs

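【Quick read】: This paper addresses the problem that large language models (LLMs) used in generative recommender systems may recommend out-of-domain (OOD) items. The key to the solution is a pair of methods: the retrieval-based RecLM-ret and the constrained-generation-based RecLM-cgen. RecLM-cgen constrains the generation process so that recommendations stay in domain; it outperforms RecLM-ret and other LLM-based recommender models in accuracy while eliminating OOD recommendations, and it is lightweight and easy to integrate.

Link: https://arxiv.org/abs/2505.03336
Authors: Hao Liao, Wensheng Lu, Jianxun Lian, Mingqi Wu, Shuo Wang, Yong Zhang, Yitian Huang, Mingyang Zhou, Xing Xie
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 13 pages

Click to view abstract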

Abstract:Large Language Models (LLMs) have shown promise for generative recommender systems due to their transformative capabilities in user interaction. However, ensuring they do not recommend out-of-domain (OOD) items remains a challenge. We study two distinct methods to address this issue: RecLM-ret, a retrieval-based method, and RecLM-cgen, a constrained generation method. Both methods integrate seamlessly with existing LLMs to ensure in-domain recommendations. Comprehensive experiments on three recommendation datasets demonstrate that RecLM-cgen consistently outperforms RecLM-ret and existing LLM-based recommender models in accuracy while eliminating OOD recommendations, making it the preferred method for adoption. Additionally, RecLM-cgen maintains strong generalist capabilities and is a lightweight plug-and-play module for easy integration into LLMs, offering valuable practical benefits for the community. Source code is available at this https URL
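
One common way to keep generated item titles inside a catalog is to decode against a prefix tree of valid titles. The abstract does not say whether RecLM-cgen works exactly this way, so treat the following as an assumed, simplified sketch: tokens are whitespace-separated words, `score_token` is a hypothetical stand-in for the LLM's next-token score, and the catalog is made up.

```python
END = "<end>"  # hypothetical end-of-title marker


def build_trie(titles):
    """Build a nested-dict prefix tree over whitespace-tokenized titles."""
    root = {}
    for title in titles:
        node = root
        for token in title.split():
            node = node.setdefault(token, {})
        node[END] = {}
    return root


def constrained_greedy_decode(trie, score_token):
    """Greedy decoding that only considers tokens continuing a valid title.

    score_token(prefix, token) -> float stands in for the model's score.
    Assumes a non-empty catalog.
    """
    node, output = trie, []
    while True:
        allowed = list(node.keys())
        best = max(allowed, key=lambda token: score_token(output, token))
        if best == END:
            return " ".join(output)
        output.append(best)
        node = node[best]


catalog = ["The Matrix", "The Matrix Reloaded", "Toy Story"]
trie = build_trie(catalog)

# Toy scorer that would prefer an out-of-catalog word if it were allowed.
preferences = {"The": 1.0, "Godfather": 5.0, "Matrix": 0.5,
               "Reloaded": 0.2, END: 0.3, "Toy": 0.1, "Story": 0.1}
recommended = constrained_greedy_decode(
    trie, lambda prefix, token: preferences.get(token, 0.0)
)
```

Although the toy scorer rates "Godfather" highest, that token is never offered as a candidate because no catalog title continues with it, so the decoded title is always a catalog entry.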

[AI-29] AI-Driven Scholarly Peer Review via Persistent Workflow Prompting Meta-Prompting and Meta-Reasoning

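【Quick read】: This paper addresses the challenge that critical peer review of scientific manuscripts poses for large language models (LLMs), which stems partly from data limitations and the complexity of expert reasoning. The key to the solution is a prompt engineering methodology called Persistent Workflow Prompting (PWP), which works through standard LLM chat interfaces and requires neither code nor APIs, making it broadly applicable. At the core of PWP is a hierarchical, modular architecture that defines detailed analysis workflows; the prompt is refined iteratively through meta-prompting and meta-reasoning so that it systematically codifies expert review workflows, including tacit knowledge. The prompt is submitted once at the start of a session, after which subsequent queries trigger the persistent workflows and guide modern reasoning LLMs through systematic, multimodal evaluations.

Link: https://arxiv.org/abs/2505.03332
Authors: Evgeny Markhasin
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 22 pages, 36 pages (references and appendixes)

Click to view abstract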

Abstract:Critical peer review of scientific manuscripts presents a significant challenge for Large Language Models (LLMs), partly due to data limitations and the complexity of expert reasoning. This report introduces Persistent Workflow Prompting (PWP), a potentially broadly applicable prompt engineering methodology designed to bridge this gap using standard LLM chat interfaces (zero-code, no APIs). We present a proof-of-concept PWP prompt for the critical analysis of experimental chemistry manuscripts, featuring a hierarchical, modular architecture (structured via Markdown) that defines detailed analysis workflows. We develop this PWP prompt through iterative application of meta-prompting techniques and meta-reasoning aimed at systematically codifying expert review workflows, including tacit knowledge. Submitted once at the start of a session, this PWP prompt equips the LLM with persistent workflows triggered by subsequent queries, guiding modern reasoning LLMs through systematic, multimodal evaluations. Demonstrations show the PWP-guided LLM identifying major methodological flaws in a test case while mitigating LLM input bias and performing complex tasks, including distinguishing claims from evidence, integrating text/photo/figure analysis to infer parameters, executing quantitative feasibility checks, comparing estimates against claims, and assessing a priori plausibility. To ensure transparency and facilitate replication, we provide full prompts, detailed demonstration analyses, and logs of interactive chats as supplementary resources. Beyond the specific application, this work offers insights into the meta-development process itself, highlighting the potential of PWP, informed by detailed workflow formalization, to enable sophisticated analysis using readily available LLMs for complex scientific tasks.

[AI-30] Artificial Behavior Intelligence: Technology Challenges and Future Directions

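【Quick read】: This paper addresses how to accurately understand and predict human behavior across a variety of AI application domains. The core challenges include learning behavioral intelligence from limited data, quantifying uncertainty in complex behavior prediction, and optimizing model structures for low-power, real-time inference. The key to the solution is to use large-scale pretrained models (such as large language models, vision foundation models, and multimodal integration models) to improve the accuracy and interpretability of behavior recognition, and to tackle deployment challenges with optimization strategies such as lightweight transformers, graph-based recognition architectures, energy-aware loss functions, and multimodal knowledge distillation.

Link: https://arxiv.org/abs/2505.03315
Authors: Kanghyun Jo, Jehwan Choi, Kwanho Kim, Seongmin Kim, Duy-Linh Nguyen, Xuan-Thuy Vo, Adri Priadana, Tien-Dat Tran
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures, Pre-print for IWIS2025

Click to view abstract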

Abstract:Understanding and predicting human behavior has emerged as a core capability in various AI application domains such as autonomous driving, smart healthcare, surveillance systems, and social robotics. This paper defines the technical framework of Artificial Behavior Intelligence (ABI), which comprehensively analyzes and interprets human posture, facial expressions, emotions, behavioral sequences, and contextual cues. It details the essential components of ABI, including pose estimation, face and emotion recognition, sequential behavior analysis, and context-aware modeling. Furthermore, we highlight the transformative potential of recent advances in large-scale pretrained models, such as large language models (LLMs), vision foundation models, and multimodal integration models, in significantly improving the accuracy and interpretability of behavior recognition. Our research team has a strong interest in the ABI domain and is actively conducting research, particularly focusing on the development of intelligent lightweight models capable of efficiently inferring complex human behaviors. This paper identifies several technical challenges that must be addressed to deploy ABI in real-world applications including learning behavioral intelligence from limited data, quantifying uncertainty in complex behavior prediction, and optimizing model structures for low-power, real-time inference. To tackle these challenges, our team is exploring various optimization strategies including lightweight transformers, graph-based recognition architectures, energy-aware loss functions, and multimodal knowledge distillation, while validating their applicability in real-time environments.

[AI-31] Mamba-Diffusion Model with Learnable Wavelet for Controllable Symbolic Music Generation

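【Quick read】: This paper addresses a challenge in symbolic music generation: symbolic music is usually represented as sequences of discrete events, and standard diffusion models are not well suited to discrete data. The key to the solution is to represent symbolic music as image-like pianorolls so that diffusion models can be applied effectively. The study also introduces a new diffusion model that combines the proposed Transformer-Mamba block with a learnable wavelet transform, and uses classifier-free guidance to generate symbolic music with target chords.

Link: https://arxiv.org/abs/2505.03314
Authors: Jincheng Zhang, György Fazekas, Charalampos Saitis
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract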

Abstract:The recent surge in the popularity of diffusion models for image synthesis has attracted new attention to their potential for generation tasks in other domains. However, their applications to symbolic music generation remain largely under-explored because symbolic music is typically represented as sequences of discrete events and standard diffusion models are not well-suited for discrete data. We represent symbolic music as image-like pianorolls, facilitating the use of diffusion models for the generation of symbolic music. Moreover, this study introduces a novel diffusion model that incorporates our proposed Transformer-Mamba block and learnable wavelet transform. Classifier-free guidance is utilised to generate symbolic music with target chords. Our evaluation shows that our method achieves compelling results in terms of music quality and controllability, outperforming the strong baseline in pianoroll generation. Our code is available at this https URL.
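
Classifier-free guidance is commonly written as an extrapolation from the model's unconditional noise prediction toward its conditional one. The abstract gives neither the paper's exact formula nor its guidance scale, so the sketch below uses that common form with hypothetical names and toy vectors; here the condition would be the target chord.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions element-wise.

    guidance_scale = 0 ignores the condition, 1 uses the conditional
    prediction as is, and values above 1 push further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for two pianoroll cells (made-up numbers).
eps_uncond = [0.0, 1.0]
eps_cond = [2.0, 3.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0)
```

In practice the two predictions come from the same network, evaluated once with the condition and once with it dropped, which is why no separate classifier is needed.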

[AI-32] The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

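【Quick read】: This paper addresses flexible policy representation and imitation learning for robot manipulation, in particular generalizing across a range of complex tasks from only a few demonstrations. The key to the solution is a new method, Mixture of Discrete-time Gaussian Processes (MiDiGap), which learns efficiently from camera observations alone, trains quickly on a CPU, and scales linearly. With inference-time steering tools that use evidence such as collision signals and robot kinematic constraints, MiDiGap gains new generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer.

Link: https://arxiv.org/abs/2505.03296
Authors: Jan Ole von Hartz, Adrian Röfer, Joschka Boedecker, Abhinav Valada
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted for publication to IEEE Transactions on Robotics

Click to view abstract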

Abstract:We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstrations using only camera observations and generalizes across a wide range of challenging tasks. It excels at long-horizon behaviors such as making coffee, highly constrained motions such as opening doors, dynamic actions such as scooping with a spatula, and multimodal tasks such as hanging a mug. MiDiGap learns these tasks on a CPU in less than a minute and scales linearly to large datasets. We also develop a rich suite of tools for inference-time steering using evidence such as collision signals and robot kinematic constraints. This steering enables novel generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer. MiDiGap achieves state-of-the-art performance on diverse few-shot manipulation benchmarks. On constrained RLBench tasks, it improves policy success by 76 percentage points and reduces trajectory cost by 67%. On multimodal tasks, it improves policy success by 48 percentage points and increases sample efficiency by a factor of 20. In cross-embodiment transfer, it more than doubles policy success. We make the code publicly available at this https URL.

[AI-33] Capability-Driven Skill Generation with LLMs: A RAG-Based Approach for Reusing Existing Libraries and Interfaces

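【Quick read】: This paper addresses the time-consuming and challenging task, in modern automation systems, of developing skill implementations that conform to a given capability. The key to the solution is to treat capabilities as contracts for skill implementations and to use large language models to generate executable code from natural language user input. The method also integrates existing software libraries and interface technologies, which enables skill implementations to be generated across different target languages, and introduces a framework that lets users bring their own libraries and resource interfaces into the code generation process.

Link: https://arxiv.org/abs/2505.03295
Authors: Luis Miguel Vieira da Silva, Aljosha Köcher, Nicolas König, Felix Gehlhoff, Alexander Fay
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO); Software Engineering (cs.SE)
Comments:

Click to view abstract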

Abstract:Modern automation systems increasingly rely on modular architectures, with capabilities and skills as one solution approach. Capabilities define the functions of resources in a machine-readable form and skills provide the concrete implementations that realize those capabilities. However, the development of a skill implementation conforming to a corresponding capability remains a time-consuming and challenging task. In this paper, we present a method that treats capabilities as contracts for skill implementations and leverages large language models to generate executable code based on natural language user input. A key feature of our approach is the integration of existing software libraries and interface technologies, enabling the generation of skill implementations across different target languages. We introduce a framework that allows users to incorporate their own libraries and resource interfaces into the code generation process through a retrieval-augmented generation architecture. The proposed method is evaluated using an autonomous mobile robot controlled via Python and ROS 2, demonstrating the feasibility and flexibility of the approach.

[AI-34] Physics-inspired Energy Transition Neural Network for Sequence Learning

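[Quick read]: This paper addresses the fact that Transformers capture long-term dependencies mainly through comprehensive pairwise modeling rather than through an inherent inductive bias toward sequence semantics. The key to the solution is a recurrent structure inspired by physical energy transition models, the "Physics-inspired Energy Transition Neural Network" (PETNN). Its memory mechanism stores information effectively over long-term dependencies, and it shows better performance than Transformer-based methods together with lower computational complexity.

Link: https://arxiv.org/abs/2505.03281
Authors: Zhou Wu,Junyi An,Baile Xu,Furao Shen,Jian Zhao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract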

Abstract:Recently, the superior performance of Transformers has made them a more robust and scalable solution for sequence modeling than traditional recurrent neural networks (RNNs). However, the effectiveness of Transformers in capturing long-term dependencies is primarily attributed to their comprehensive pair-modeling process rather than inherent inductive biases toward sequence semantics. In this study, we explore the capabilities of pure RNNs and reassess their long-term learning mechanisms. Inspired by the physics energy transition models that track energy changes over time, we propose an effective recurrent structure called the "Physics-inspired Energy Transition Neural Network" (PETNN). We demonstrate that PETNN’s memory mechanism effectively stores information over long-term dependencies. Experimental results indicate that PETNN outperforms transformer-based methods across various sequence tasks. Furthermore, owing to its recurrent nature, PETNN exhibits significantly lower complexity. Our study presents an optimal foundational recurrent architecture and highlights the potential for developing effective recurrent neural networks in fields currently dominated by Transformers.

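[AI-35] RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation

[Quick read]: This paper addresses the prompt bloat and selection complexity that large language models (LLMs) face as they use a growing number of external tools, particularly tools defined by the Model Context Protocol (MCP). The key to the solution is the RAG-MCP framework, which offloads tool discovery: it uses semantic retrieval over an external index to identify the MCPs most relevant to a given query and passes only the selected tool descriptions to the model, which significantly shortens the prompt and simplifies decision-making.

Link: https://arxiv.org/abs/2505.03275
Authors: Tiantian Gan,Qiyao Sun
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract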

Abstract:Large language models (LLMs) struggle to effectively utilize a growing number of external tools, such as those defined by the Model Context Protocol (MCP), due to prompt bloat and selection complexity. We introduce RAG-MCP, a Retrieval-Augmented Generation framework that overcomes this challenge by offloading tool discovery. RAG-MCP uses semantic retrieval to identify the most relevant MCP(s) for a given query from an external index before engaging the LLM. Only the selected tool descriptions are passed to the model, drastically reducing prompt size and simplifying decision-making. Experiments, including an MCP stress test, demonstrate RAG-MCP significantly cuts prompt tokens (e.g., by over 50%) and more than triples tool selection accuracy (43.13% vs 13.62% baseline) on benchmark tasks. RAG-MCP enables scalable and accurate tool integration for LLMs.
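
The retrieve-then-prompt idea described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function stands in for a real semantic embedding model, and the `TOOLS` index, `retrieve_tools`, and `build_prompt` are hypothetical names invented for this example.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a semantic embedding: lowercase bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_tools(query, tool_index, k=1):
    # Rank every tool description against the query and keep the top k names.
    q = embed(query)
    ranked = sorted(tool_index,
                    key=lambda name: cosine(q, embed(tool_index[name])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, tool_index, k=1):
    # Only the retrieved tool descriptions are placed in the prompt.
    selected = retrieve_tools(query, tool_index, k)
    lines = [f"- {name}: {tool_index[name]}" for name in selected]
    return "Available tools:\n" + "\n".join(lines) + f"\n\nUser query: {query}"

# Hypothetical external index of tool descriptions.
TOOLS = {
    "weather": "get the current weather forecast for a city",
    "calendar": "create or list calendar events and meetings",
    "search": "search the web for pages matching keywords",
}

print(build_prompt("what is the weather forecast in Paris", TOOLS, k=1))
```

With `k=1` the prompt carries a single tool description instead of the whole index, which is the source of the prompt-size reduction the abstract reports; a real system would presumably swap the toy embedding for a learned one.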

[AI-36] Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models

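[Quick read]: This paper addresses the limited effectiveness of natural language processing and Machine Learning (ML) techniques in Requirements Engineering (RE) caused by the scarcity of high-quality datasets. The key to the solution is Synthline, a Product Line (PL) approach that uses large language models to systematically generate synthetic RE data for classification tasks, supplementing scarce real data. The empirical evaluation shows that although synthetic data is less diverse than real data, it is still a viable training resource, and that combining it with real data substantially improves downstream model performance.

Link: https://arxiv.org/abs/2505.03265
Authors: Abdelkarim El-Hajjami,Camille Salinesi
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract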

Abstract:While modern Requirements Engineering (RE) heavily relies on natural language processing and Machine Learning (ML) techniques, their effectiveness is limited by the scarcity of high-quality datasets. This paper introduces Synthline, a Product Line (PL) approach that leverages Large Language Models to systematically generate synthetic RE data for classification-based use cases. Through an empirical evaluation conducted in the context of using ML for the identification of requirements specification defects, we investigated both the diversity of the generated data and its utility for training downstream models. Our analysis reveals that while synthetic datasets exhibit less diversity than real data, they are good enough to serve as viable training resources. Moreover, our evaluation shows that combining synthetic and real data leads to substantial performance improvements. Specifically, hybrid approaches achieve up to 85% improvement in precision and a 2x increase in recall compared to models trained exclusively on real data. These findings demonstrate the potential of PL-based synthetic data generation to address data scarcity in RE. We make both our implementation and generated datasets publicly available to support reproducibility and advancement in the field.

[AI-37] Accelerating Evolution: Integrating PSO Principles into Real-Coded Genetic Algorithm Crossover

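[Quick read]: This paper aims to balance crossover efficiency against diversity maintenance in real-coded genetic algorithms, in order to improve performance on complex optimization problems. The key to the solution is a Particle Swarm Optimization-inspired Crossover (PSOX) operator, which draws on the current global best solution as well as historical best solutions from multiple generations, so that the algorithm converges faster toward promising regions of the search space while maintaining population diversity.

Link: https://arxiv.org/abs/2505.03217
Authors: Xiaobo Jin,JiaShu Tu
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 2 figures, 4 tables

Click to view abstract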

Abstract:This study introduces an innovative crossover operator named Particle Swarm Optimization-inspired Crossover (PSOX), which is specifically developed for real-coded genetic algorithms. Departing from conventional crossover approaches that only exchange information between individuals within the same generation, PSOX uniquely incorporates guidance from both the current global best solution and historical optimal solutions across multiple generations. This novel mechanism enables the algorithm to maintain population diversity while simultaneously accelerating convergence toward promising regions of the search space. The effectiveness of PSOX is rigorously evaluated through comprehensive experiments on 15 benchmark test functions with diverse characteristics, including unimodal, multimodal, and highly complex landscapes. Comparative analysis against five state-of-the-art crossover operators reveals that PSOX consistently delivers superior performance in terms of solution accuracy, algorithmic stability, and convergence speed, especially when combined with an appropriate mutation strategy. Furthermore, the study provides an in-depth investigation of how different mutation rates influence PSOX’s performance, yielding practical guidelines for parameter tuning when addressing optimization problems with varying landscape properties.
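
To make the idea of a crossover guided by both the global best and a historical best more concrete, here is a minimal sketch. The abstract does not give the PSOX update rule, so the formula below is an assumption borrowed from the standard PSO velocity update; the function name `psox_like_crossover` and the coefficients `c1` and `c2` are hypothetical.

```python
import random

def psox_like_crossover(parent, global_best, historical_best,
                        c1=0.5, c2=0.5, rng=random):
    """Pull each gene of `parent` toward two guides (hypothetical PSO-style rule)."""
    child = []
    for p, g, h in zip(parent, global_best, historical_best):
        a = c1 * rng.random()  # random pull toward the current global best
        b = c2 * rng.random()  # random pull toward a historical best
        child.append(p + a * (g - p) + b * (h - p))
    return child

parent = [0.0, 10.0, -3.0]
global_best = [1.0, 8.0, -1.0]
historical_best = [2.0, 9.0, -2.0]
print(psox_like_crossover(parent, global_best, historical_best,
                          rng=random.Random(0)))
```

With `c1 + c2 <= 1`, each child gene is a convex combination of the parent and the two guides, so it stays inside the range they span. A mutation step, which the abstract says matters for performance, would be applied separately.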

[AI-38] DocSpiral: A Platform for Integrated Assistive Document Annotation through Human-in-the-Spiral

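[Quick read]: This paper aims to solve the problem of acquiring structured data from domain-specific, image-based documents such as scanned reports, which are crucial for many downstream tasks but hard to process because of document variability. The key to the solution is DocSpiral, the first assistive document annotation platform that places humans in a spiral loop: an iterative mechanism lets models progressively depend less on manual intervention, which significantly reduces annotation time and improves model performance.

Link: https://arxiv.org/abs/2505.03214
Authors: Qiang Sun,Sirui Li,Tingting Bi,Du Huynh,Mark Reynolds,Yuanyi Luo,Wei Liu
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract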

Abstract:Acquiring structured data from domain-specific, image-based documents such as scanned reports is crucial for many downstream tasks but remains challenging due to document variability. Many of these documents exist as images rather than as machine-readable text, which requires human annotation to train automated extraction systems. We present DocSpiral, the first Human-in-the-Spiral assistive document annotation platform, designed to address the challenge of extracting structured information from domain-specific, image-based document collections. Our spiral design establishes an iterative cycle in which human annotations train models that progressively require less manual intervention. DocSpiral integrates document format normalization, comprehensive annotation interfaces, an evaluation metrics dashboard, and API endpoints for the development of AI/ML models into a unified workflow. Experiments demonstrate that our framework reduces annotation time by at least 41% while showing consistent performance gains across three iterations during model training. By making this annotation platform freely accessible, we aim to lower barriers to AI/ML model development in document processing, facilitating the adoption of large language models in image-based, document-intensive fields such as geoscience and healthcare. The system is freely available at: this https URL. The demonstration video is available at: this https URL.

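[AI-39] A Trustworthy Multi-LLM Network: Challenges, Solutions, and A Use Case

[Quick read]: This paper addresses two problems in communications and networking. First, different Large Language Models (LLMs) propose inconsistent optimization strategies because of differences in their structure and training data. Second, the limitations of a single LLM's training data, together with potentially malicious devices, lead to low-confidence or biased responses. The key to the solution is a blockchain-based collaborative framework that connects multiple LLMs into a Trustworthy Multi-LLM Network (MultiLLMN), which cooperatively evaluates and selects the most reliable, high-quality responses to complex network optimization problems.

Link: https://arxiv.org/abs/2505.03196
Authors: Haoxiang Luo,Gang Sun,Yinqiu Liu,Dusit Niyato,Hongfang Yu,Mohammed Atiquzzaman,Schahram Dustdar
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract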

Abstract:Large Language Models (LLMs) demonstrate strong potential across a variety of tasks in communications and networking due to their advanced reasoning capabilities. However, because different LLMs have different model structures and are trained using distinct corpora and methods, they may offer varying optimization strategies for the same network issues. Moreover, the limitations of an individual LLM’s training data, aggravated by the potential maliciousness of its hosting device, can result in responses with low confidence or even bias. To address these challenges, we propose a blockchain-enabled collaborative framework that connects multiple LLMs into a Trustworthy Multi-LLM Network (MultiLLMN). This architecture enables the cooperative evaluation and selection of the most reliable and high-quality responses to complex network optimization problems. Specifically, we begin by reviewing related work and highlighting the limitations of existing LLMs in collaboration and trust, emphasizing the need for trustworthiness in LLM-based systems. We then introduce the workflow and design of the proposed Trustworthy MultiLLMN framework. Given the severity of False Base Station (FBS) attacks in B5G and 6G communication systems and the difficulty of addressing such threats through traditional modeling techniques, we present FBS defense as a case study to empirically validate the effectiveness of our approach. Finally, we outline promising future research directions in this emerging area.

[AI-40] A study on audio synchronous steganography detection and distributed guide inference model based on sliding spectral features and intelligent inference drive

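[Quick read]: This paper addresses the limitations of traditional techniques in detecting synchronized steganography, specifically the detection and reconstruction of covert communication that embeds steganographic data in the audio synchronization streams of short-video platforms. The key to the solution is a detection and distributed guidance reconstruction model based on short-video "Yupan" samples, which combines sliding spectral feature extraction with an intelligent inference mechanism. A 25 ms sliding window with the short-time Fourier transform (STFT) extracts the main frequency trajectory and is used to build a synchronization frame detection model (M1); a structured model (M2) then decodes the 32-byte payload to infer distributed guidance commands, enabling identification and analysis of synchronized steganographic content.

Link: https://arxiv.org/abs/2505.03193
Authors: Wei Meng
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
Comments: This paper proposes a novel framework for detecting steganographic content in short video audio streams using sliding spectral features and distributed inference models, combining STFT analysis, entropy-based synchronization, and deep learning-driven decoding strategies

Click to view abstract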

Abstract:With the rise of short video platforms in global communication, embedding steganographic data in audio synchronization streams has emerged as a new covert communication method. To address the limitations of traditional techniques in detecting synchronized steganography, this paper proposes a detection and distributed guidance reconstruction model based on short video “Yupan” samples released by China’s South Sea Fleet on TikTok. The method integrates sliding spectrum feature extraction and intelligent inference mechanisms. A 25 ms sliding window with short-time Fourier transform (STFT) is used to extract the main frequency trajectory and construct the synchronization frame detection model (M1), identifying a frame flag “FFFFFFFFFFFFFFFFFF80”. The subsequent 32-byte payload is decoded by a structured model (M2) to infer distributed guidance commands. Analysis reveals a low-entropy, repetitive byte sequence in the 36 to 45 second audio segment with highly concentrated spectral energy, confirming the presence of synchronization frames. Although plaintext semantics are not restored, the consistency in command field layout suggests features of military communication protocols. The multi-segment splicing model further shows cross-video embedding and centralized decoding capabilities. The proposed framework validates the effectiveness of sliding spectral features for synchronized steganography detection and builds an extensible inference model for covert communication analysis and tactical guidance simulation on open platforms.
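
The feature-extraction step mentioned above (a 25 ms sliding window with an STFT, followed by picking the main frequency per frame) might look roughly like the sketch below. It is an illustration only: the hop size, the Hann window, the 8 kHz sample rate, and the function name `dominant_frequency_track` are assumptions, and a plain DFT loop is used so the example needs nothing beyond the standard library.

```python
import cmath
import math

def dominant_frequency_track(samples, sample_rate, window_s=0.025, hop_s=0.0125):
    """Return the strongest frequency (Hz) in each sliding window."""
    win = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    # Hann window to reduce spectral leakage at the frame edges.
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)]
    track = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = [samples[start + n] * hann[n] for n in range(win)]
        best_bin, best_mag = 0, -1.0
        # Naive DFT over the positive-frequency bins (DC bin skipped).
        for k in range(1, win // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win)
                      for n in range(win))
            if abs(acc) > best_mag:
                best_bin, best_mag = k, abs(acc)
        track.append(best_bin * sample_rate / win)
    return track

# Synthetic test signal: 0.1 s of a pure 1000 Hz tone sampled at 8 kHz.
RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / RATE) for n in range(800)]
print(dominant_frequency_track(tone, RATE))
```

At 8 kHz a 25 ms window holds 200 samples, so the frequency resolution is 40 Hz per bin. A production pipeline would normally use an FFT library rather than this quadratic-time loop.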

[AI-41] Patterns and Mechanisms of Contrastive Activation Engineering ICLR2025

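[Quick read]: This paper addresses the problem of controlling the behavior of Large Language Models (LLMs), which is especially hard because of their complexity and opacity. Traditional methods such as fine-tuning can change model behavior but usually require substantial computational resources. The approach studied here is Contrastive Activation Engineering (CAE), whose key idea is to steer LLM outputs at inference time, at zero cost, through targeted modifications to the model's internal representations, enabling flexible, task-specific behavior adjustment.

Link: https://arxiv.org/abs/2505.03189
Authors: Yixiong Hao,Ayush Panda,Stepan Shabalin,Sheikh Abdur Raheem Ali
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Published at the ICLR 2025 Bi-Align, HAIC, and Building Trust workshops

Click to view abstract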

Abstract:Controlling the behavior of Large Language Models (LLMs) remains a significant challenge due to their inherent complexity and opacity. While techniques like fine-tuning can modify model behavior, they typically require extensive computational resources. Recent work has introduced a class of contrastive activation engineering (CAE) techniques as promising approaches for steering LLM outputs through targeted modifications to their internal representations. Applied at inference time at zero cost, CAE has the potential to introduce a new paradigm of flexible, task-specific LLM behavior tuning. We analyze the performance of CAE in in-distribution and out-of-distribution settings, evaluate drawbacks, and begin to develop comprehensive guidelines for its effective deployment. We find that (1) CAE is only reliably effective when applied to in-distribution contexts; (2) increasing the number of samples used to generate steering vectors has diminishing returns at around 80 samples; (3) steering vectors are susceptible to adversarial inputs that reverse the behavior being steered for; (4) steering vectors harm the overall model perplexity; and (5) larger models are more resistant to steering-induced degradation.

[AI-42] Null Counterfactual Factor Interactions for Goal-Conditioned Reinforcement Learning ICLR2025

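[Quick read]: This paper addresses the low sample efficiency caused by goal-space sparsity in goal-conditioned reinforcement learning (GCRL), especially in object-centric domains, where conventional hindsight relabeling can produce misleading reward signals that hamper learning. The key to the solution is Hindsight Relabeling using Interactions (HInt), which improves sample efficiency by defining and identifying interactions between objects. At its core is Null Counterfactual Interaction Inference (NCII), which uses a "nulling" operation with a learned model to accurately capture causal interactions between objects.

Link: https://arxiv.org/abs/2505.03172
Authors: Caleb Chuck,Fan Feng,Carl Qi,Chang Shi,Siddhant Agarwal,Amy Zhang,Scott Niekum
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published at ICLR 2025

Click to view abstract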

Abstract:Hindsight relabeling is a powerful tool for overcoming sparsity in goal-conditioned reinforcement learning (GCRL), especially in certain domains such as navigation and locomotion. However, hindsight relabeling can struggle in object-centric domains. For example, suppose that the goal space consists of a robotic arm pushing a particular target block to a goal location. In this case, hindsight relabeling will give high rewards to any trajectory that does not interact with the block. However, these behaviors are only useful when the object is already at the goal, an extremely rare case in practice. A dataset dominated by these kinds of trajectories can complicate learning and lead to failures. In object-centric domains, one key intuition is that meaningful trajectories are often characterized by object-object interactions such as pushing the block with the gripper. To leverage this intuition, we introduce Hindsight Relabeling using Interactions (HInt), which combines interactions with hindsight relabeling to improve the sample efficiency of downstream RL. However, because interactions do not have a consensus statistical definition tractable for downstream GCRL, we propose a definition of interactions based on the concept of null counterfactual: a cause object is interacting with a target object if, in a world where the cause object did not exist, the target object would have different transition dynamics. We leverage this definition to infer interactions in Null Counterfactual Interaction Inference (NCII), which uses a "nulling" operation with a learned model to infer interactions. NCII is able to achieve significantly improved interaction inference accuracy in both simple linear dynamics domains and dynamic robotic domains in Robosuite, Robot Air Hockey, and Franka Kitchen, and HInt improves sample efficiency by up to 4x.
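
The null-counterfactual definition above lends itself to a small worked example. The sketch below is a toy, not the paper's method: `toy_model` is a hand-written stand-in for the learned dynamics model, the one-dimensional world is invented for illustration, and the filtering rule in `hint_relabel` (relabel only trajectories that contain an interaction) is an assumption about how interactions might gate relabeling.

```python
def toy_model(block, gripper):
    # Hypothetical stand-in for a learned dynamics model in a 1-D world:
    # the block moves right by 1 only when the gripper sits directly to its left.
    # Passing gripper=None "nulls" the gripper, i.e. predicts without it.
    if gripper is not None and gripper == block - 1:
        return block + 1
    return block

def interacts(model, block, gripper, tol=1e-6):
    # Null counterfactual test: does removing the cause object change
    # the predicted next state of the target object?
    return abs(model(block, gripper) - model(block, None)) > tol

def hint_relabel(trajectory, model):
    # trajectory: list of (block_position, gripper_position) states.
    # Relabel with the achieved block position only if some interaction occurred.
    if not any(interacts(model, b, g) for b, g in trajectory[:-1]):
        return None  # skip trajectories that never touch the block
    return trajectory[-1][0]

pushing = [(5, 3), (5, 4), (6, 5)]   # gripper reaches the block and pushes it
wandering = [(5, 0), (5, 1), (5, 2)]  # gripper never reaches the block
print(hint_relabel(pushing, toy_model), hint_relabel(wandering, toy_model))
```

The `wandering` trajectory is exactly the kind that plain hindsight relabeling would reward despite the block never being touched, which is the failure mode the abstract describes.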

[AI-43] CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics

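[Quick read]: This paper aims to address the lack of suitable benchmarks and theorem libraries for combinatorics, and thereby advance the use of formal reasoning and Large Language Models (LLMs) in this area. The key to the solution is CombiBench, a comprehensive benchmark of 100 combinatorial problems, all formalized in Lean 4 and paired with informal statements, covering difficulty levels from middle school to the International Mathematical Olympiad (IMO) and university level. The paper also proposes the Fine-Eval evaluation framework for standardized evaluation of formal mathematics problems, including, for the first time, support for fill-in-the-blank questions.

Link: https://arxiv.org/abs/2505.03171
Authors: Junqi Liu,Xiaohan Lin,Jonas Bayer,Yael Dillies,Weijie Jiang,Xiaodan Liang,Roman Soletskyi,Haiming Wang,Yunzhou Xie,Beibei Xiong,Zhengfeng Yang,Jujian Zhang,Lihong Zhi,Jia Li,Zhengying Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract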

Abstract:Neurosymbolic approaches integrating large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry and number theory. In comparison, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive benchmark comprising 100 combinatorial problems, each formalized in Lean 4 and paired with its corresponding informal statement. The problem set covers a wide spectrum of difficulty levels, ranging from middle school to IMO and university level, and spans over ten combinatorial topics. CombiBench is suitable for testing IMO solving capabilities since it includes all IMO combinatorial problems since 2000 (except IMO 2004 P3, as its statement contains an image). Furthermore, we provide a comprehensive and standardized evaluation framework, dubbed Fine-Eval (for Fill-in-the-blank in Lean Evaluation), for formal mathematics. It accommodates not only proof-based problems but also, for the first time, the evaluation of fill-in-the-blank questions. Using Fine-Eval as the evaluation method and Kimina Lean Server as the backend, we benchmark several LLMs on CombiBench and observe that their capabilities for formally solving combinatorial problems remain limited. Among all models tested (none of which has been trained for this particular task), Kimina-Prover attains the best results, solving 7 problems (out of 100) under both "with solution" and "without solution" scenarios. We open source the benchmark dataset alongside the code of the proposed evaluation method at this https URL.

[AI-44] Soft Best-of-n Sampling for Model Alignment

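[Quick read]: This paper aims to align language model outputs with human preferences without expensive fine-tuning. The key to the solution is Soft Best-of-n sampling, which introduces a temperature parameter λ to interpolate smoothly between the original distribution and the reward-maximizing distribution, raising reward while keeping distortion low.

Link: https://arxiv.org/abs/2505.03156
Authors: Claudio Mayrink Verdun,Alex Oesterling,Himabindu Lakkaraju,Flavio P. Calmon
Affiliation: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at the 2025 IEEE International Symposium on Information Theory (ISIT 2025)

Click to view abstract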

Abstract:Best-of-n (BoN) sampling is a practical approach for aligning language model outputs with human preferences without expensive fine-tuning. BoN sampling is performed by generating n responses to a prompt and then selecting the sample that maximizes a reward function. BoN yields high reward values in practice at a distortion cost, as measured by the KL-divergence between the sampled and original distribution. This distortion is coarsely controlled by varying the number of samples: larger n yields a higher reward at a higher distortion cost. We introduce Soft Best-of-n sampling, a generalization of BoN that allows for smooth interpolation between the original distribution and reward-maximizing distribution through a temperature parameter λ. We establish theoretical guarantees showing that Soft Best-of-n sampling converges sharply to the optimal tilted distribution at a rate of O(1/n) in KL and the expected (relative) reward. For sequences of discrete outputs, we analyze an additive reward model that reveals the fundamental limitations of blockwise sampling.
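
One plausible reading of the selection step is sketched below: instead of always taking the highest-reward candidate, pick among the n candidates with probability proportional to exp(reward / λ). This softmax form is an assumption consistent with the abstract's description of a temperature parameter; the function name `soft_best_of_n` and the toy reward (string length) are hypothetical.

```python
import math
import random

def soft_best_of_n(candidates, reward, lam, rng=random):
    """Pick one of n candidates with probability proportional to exp(reward / lam)."""
    if lam == 0:
        # Limit case: plain Best-of-n, always the highest-reward candidate.
        return max(candidates, key=reward)
    rewards = [reward(c) for c in candidates]
    top = max(rewards)
    # Subtract the max reward before exponentiating for numerical stability.
    weights = [math.exp((r - top) / lam) for r in rewards]
    return rng.choices(candidates, weights=weights, k=1)[0]

responses = ["a", "bbb", "cc"]  # stand-ins for n sampled model responses
print(soft_best_of_n(responses, len, lam=0.5, rng=random.Random(0)))
```

A very small `lam` recovers ordinary Best-of-n, while a very large `lam` makes the choice nearly uniform over the candidates, which corresponds to sampling from the original distribution.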

[AI-45] Holmes: Automated Fact Check with Large Language Models

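[Quick read]: This paper aims to address the difficulty traditional deep learning models have in handling complex combinations of information when detecting multimodal disinformation. The core challenge is that existing methods cannot fully capture the relationships between modalities such as text and images, which limits detection performance. The key to the proposed solution is Holmes, a framework driven by generative AI that combines Large Language Models (LLMs) with a new evidence retrieval mechanism to improve the accuracy of disinformation verification. Its core innovations are LLM-based summarization to extract key information, and a new algorithm and metrics for evaluating evidence quality, which together significantly improve fact-checking accuracy.

Link: https://arxiv.org/abs/2505.03135
Authors: Haoran Ou,Gelei Deng,Xingshuo Han,Jie Zhang,Xinlei He,Han Qiu,Shangwei Guo,Tianwei Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract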

Abstract:The rise of Internet connectivity has accelerated the spread of disinformation, threatening societal trust, decision-making, and national security. Disinformation has evolved from simple text to complex multimodal forms combining images and text, challenging existing detection methods. Traditional deep learning models struggle to capture the complexity of multimodal disinformation. Inspired by advances in AI, this study explores using Large Language Models (LLMs) for automated disinformation detection. The empirical study shows that (1) LLMs alone cannot reliably assess the truthfulness of claims; (2) providing relevant evidence significantly improves their performance; (3) however, LLMs cannot autonomously search for accurate evidence. To address this, we propose Holmes, an end-to-end framework featuring a novel evidence retrieval method that assists LLMs in collecting high-quality evidence. Our approach uses (1) LLM-powered summarization to extract key information from open sources and (2) a new algorithm and metrics to evaluate evidence quality. Holmes enables LLMs to verify claims and generate justifications effectively. Experiments show Holmes achieves 88.3% accuracy on two open-source datasets and 90.2% in real-time verification tasks. Notably, our improved evidence retrieval boosts fact-checking accuracy by 30.8% over existing methods.

[AI-46] Is AI currently capable of identifying wild oysters? A comparison of human annotators against the AI model ODYSSEE

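[Quick read]: This paper aims to address the destructive sampling and high labor cost of traditional oyster reef monitoring methods, which make them ill-suited to small-scale or sensitive environments. The key to the solution is the ODYSSEE model, which uses deep learning to identify live oysters in video or images taken in the field in order to assess abundance. Although the model makes inferences faster than expert and non-expert annotators, its accuracy in identifying live oysters (63%) is lower than that of experts (74%) and non-experts (75%), and image quality significantly affects the accuracy of both the model and the human annotators. The study argues that higher-quality images, additional live imagery, and additional annotation training classes could greatly improve the model's predictive power.

Link: https://arxiv.org/abs/2505.03108
Authors: Brendan Campbell,Alan Williams,Kleio Baxevani,Alyssa Campbell,Rushabh Dhoke,Rileigh E. Hudock,Xiaomin Lin,Vivek Mange,Bernhard Neuberger,Arjun Suresh,Alhim Vera,Arthur Trembanis,Herbert G. Tanner,Edward Hale
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract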

Abstract:Oysters are ecologically and commercially important species that require frequent monitoring to track population demographics (e.g. abundance, growth, mortality). Current methods of monitoring oyster reefs often require destructive sampling methods and extensive manual effort. Therefore, they are suboptimal for small-scale or sensitive environments. A recent alternative, the ODYSSEE model, was developed to use deep learning techniques to identify live oysters using video or images taken in the field of oyster reefs to assess abundance. The validity of this model in identifying live oysters on a reef was compared to expert and non-expert annotators. In addition, we identified potential sources of prediction error. Although the model can make inferences significantly faster than expert and non-expert annotators (39.6 s, 2.34 ± 0.61 h, 4.50 ± 1.46 h, respectively), the model overpredicted the number of live oysters, achieving lower accuracy (63%) in identifying live oysters compared to experts (74%) and non-experts (75%) alike. Image quality was an important factor in determining the accuracy of the model and the annotators. Better quality images improved human accuracy and worsened model accuracy. Although ODYSSEE was not sufficiently accurate, we anticipate that future training on higher-quality images, utilizing additional live imagery, and incorporating additional annotation training classes will greatly improve the model’s predictive power based on the results of this analysis. Future research should address methods that improve the detection of living vs. dead oysters.

[AI-47] Cognitio Emergens: Agency Dimensions and Dynamics in Human-AI Knowledge Co-Creation

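[Quick read]: This paper addresses the limitations of existing models of scientific knowledge co-creation between humans and artificial intelligence (AI): these models tend to focus on static roles or narrow metrics and fail to capture the dynamics by which scientific understanding emerges through recursive human-AI interaction over time. The key to the solution is the Cognitio Emergens (CE) framework, which integrates three core components: Agency Configurations, describing how the distribution of authority between humans and AI changes dynamically; Epistemic Dimensions, capturing six specific capabilities that emerge through collaboration; and Partnership Dynamics, identifying the forces that shape how the relationship evolves, in particular the risk of epistemic alienation. Drawing on autopoiesis theory, social systems theory, and organizational modularity, CE reveals how knowledge co-creation emerges through continuous negotiation of roles, values, and organizational structures.

Link: https://arxiv.org/abs/2505.03105
Authors: Xule Lin
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 62 pages (31 appendix pages for guidance), 2 figures

Click to view abstract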

Abstract:Scientific knowledge creation is fundamentally transforming as humans and AI systems evolve beyond tool-user relationships into co-evolutionary epistemic partnerships. When AlphaFold revolutionized protein structure prediction, researchers described engaging with an epistemic partner that reshaped how they conceptualized fundamental relationships. This article introduces Cognitio Emergens (CE), a framework addressing critical limitations in existing models that focus on static roles or narrow metrics while failing to capture how scientific understanding emerges through recursive human-AI interaction over time. CE integrates three components addressing these limitations: Agency Configurations describing how authority distributes between humans and AI (Directed, Contributory, Partnership), with partnerships dynamically oscillating between configurations rather than following linear progression; Epistemic Dimensions capturing six specific capabilities emerging through collaboration across Discovery, Integration, and Projection axes, creating distinctive “capability signatures” that guide development; and Partnership Dynamics identifying forces shaping how these relationships evolve, particularly the risk of epistemic alienation where researchers lose interpretive control over knowledge they formally endorse. Drawing from autopoiesis theory, social systems theory, and organizational modularity, CE reveals how knowledge co-creation emerges through continuous negotiation of roles, values, and organizational structures. By reconceptualizing human-AI scientific collaboration as fundamentally co-evolutionary, CE offers a balanced perspective that neither uncritically celebrates nor unnecessarily fears AI’s evolving role, instead providing conceptual tools for cultivating partnerships that maintain meaningful human participation while enabling transformative scientific breakthroughs.

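[AI-48] Assessing and Enhancing the Robustness of LLM-based Multi-Agent Systems Through Chaos Engineering

[Quick read]: This paper addresses the robustness of Large Language Model-Based Multi-Agent Systems (LLM-MAS) in real-world settings, in particular the emergent errors or disruptions they may face in production or preproduction environments, such as hallucinations, agent failures, and agent communication failures. The key to the solution is a chaos engineering framework that proactively identifies vulnerabilities in LLM-MAS, assesses and builds resilience against them, and thereby ensures reliable performance in critical applications.

Link: https://arxiv.org/abs/2505.03096
Authors: Joshua Owotogbe
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract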

Abstract:This study explores the application of chaos engineering to enhance the robustness of Large Language Model-Based Multi-Agent Systems (LLM-MAS) in production-like environments under real-world conditions. LLM-MAS can potentially improve a wide range of tasks, from answering questions and generating content to automating customer support and improving decision-making processes. However, LLM-MAS in production or preproduction environments can be vulnerable to emergent errors or disruptions, such as hallucinations, agent failures, and agent communication failures. This study proposes a chaos engineering framework to proactively identify such vulnerabilities in LLM-MAS, assess and build resilience against them, and ensure reliable performance in critical applications.
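
As a rough illustration of what a chaos experiment on a multi-agent pipeline could involve, the sketch below injects agent failures at a configurable rate and measures how often a request still succeeds when a fallback agent is available. Everything here is hypothetical: the abstract does not describe the framework's design, and `with_chaos`, `call_with_fallback`, and `resilience_score` are names invented for this example.

```python
import random

class AgentFailure(Exception):
    """Raised when an agent fails, here only through injected faults."""

def with_chaos(agent, failure_rate, rng):
    # Wrap an agent so that it fails randomly at the given rate.
    def wrapped(message):
        if rng.random() < failure_rate:
            raise AgentFailure("injected fault")
        return agent(message)
    return wrapped

def call_with_fallback(primary, fallback, message):
    # Simple resilience pattern: retry the request on a fallback agent.
    try:
        return primary(message)
    except AgentFailure:
        return fallback(message)

def resilience_score(primary, fallback, messages):
    # Fraction of requests that still succeed under injected faults.
    succeeded = 0
    for message in messages:
        try:
            call_with_fallback(primary, fallback, message)
            succeeded += 1
        except AgentFailure:
            pass
    return succeeded / len(messages)

def echo_agent(message):
    return f"handled: {message}"  # stand-in for an LLM-backed agent

rng = random.Random(0)
flaky = with_chaos(echo_agent, failure_rate=0.3, rng=rng)
print(resilience_score(flaky, echo_agent, ["q1", "q2", "q3", "q4"]))
```

Comparing the score with and without the fallback, or with a fallback that is itself wrapped in `with_chaos`, gives a simple before-and-after resilience measurement.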

[AI-49] Latent Adaptive Planner for Dynamic Manipulation

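[Quick read]: This paper addresses key challenges in visuomotor policy learning for dynamic nonprehensile manipulation tasks, in particular real-time adaptability and trajectory smoothness under environmental change. The key to the solution is the Latent Adaptive Planner (LAP), a planning method based on latent space inference that learns effectively from human demonstration videos, maintains temporal consistency through a variational replanning framework, and incrementally refines plans via Bayesian updating in latent space, balancing computational efficiency against real-time adaptability.

Link: https://arxiv.org/abs/2505.03077
Authors: Donghun Noh,Deqian Kong,Minglu Zhao,Andrew Lizarraga,Jianwen Xie,Ying Nian Wu,Dennis Hong
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract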

Abstract:This paper presents Latent Adaptive Planner (LAP), a novel approach for dynamic nonprehensile manipulation tasks that formulates planning as latent space inference, effectively learned from human demonstration videos. Our method addresses key challenges in visuomotor policy learning through a principled variational replanning framework that maintains temporal consistency while efficiently adapting to environmental changes. LAP employs Bayesian updating in latent space to incrementally refine plans as new observations become available, striking an optimal balance between computational efficiency and real-time adaptability. We bridge the embodiment gap between humans and robots through model-based proportional mapping that regenerates accurate kinematic-dynamic joint states and object positions from human demonstrations. Experimental evaluations across multiple complex manipulation benchmarks demonstrate that LAP achieves state-of-the-art performance, outperforming existing approaches in success rate, trajectory smoothness, and energy efficiency, particularly in dynamic adaptation scenarios. Our approach enables robots to perform complex interactions with human-like adaptability while providing an expandable framework applicable to diverse robotic platforms using the same human demonstration videos.

[AI-50] MORE: Mobile Manipulation Rearrangement Through Grounded Language Reasoning

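[Quick read]: This paper aims to address the challenges of autonomous long-horizon mobile manipulation, including scene dynamics, unexplored areas, and error recovery. Existing methods degrade when dealing with many objects and large-scale environments, so the paper proposes MORE, a new approach that enhances the ability of language models to solve zero-shot mobile manipulation planning. The key to MORE is to represent the environment as a scene graph, incorporate instance differentiation, and introduce an active filtering scheme that extracts task-relevant subgraphs of object and region instances, turning the task into a bounded planning problem that mitigates hallucinations and improves reliability.

Link: https://arxiv.org/abs/2505.03035
Authors: Mohammad Mohammadi,Daniel Honerkamp,Martin Büchner,Matteo Cassinelli,Tim Welschehold,Fabien Despinoy,Igor Gilitschenski,Abhinav Valada
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract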

Abstract:Autonomous long-horizon mobile manipulation encompasses a multitude of challenges, including scene dynamics, unexplored areas, and error recovery. Recent works have leveraged foundation models for scene-level robotic reasoning and planning. However, the performance of these methods degrades when dealing with a large number of objects and large-scale environments. To address these limitations, we propose MORE, a novel approach for enhancing the capabilities of language models to solve zero-shot mobile manipulation planning for rearrangement tasks. MORE leverages scene graphs to represent environments, incorporates instance differentiation, and introduces an active filtering scheme that extracts task-relevant subgraphs of object and region instances. These steps yield a bounded planning problem, effectively mitigating hallucinations and improving reliability. Additionally, we introduce several enhancements that enable planning across both indoor and outdoor environments. We evaluate MORE on 81 diverse rearrangement tasks from the BEHAVIOR-1K benchmark, where it becomes the first approach to successfully solve a significant share of the benchmark, outperforming recent foundation model-based approaches. Furthermore, we demonstrate the capabilities of our approach in several complex real-world tasks, mimicking everyday activities. We make the code publicly available at this https URL.
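
The filtering step, which reduces a large scene graph to the task-relevant subgraph before planning, can be illustrated with a small sketch. The data layout (a mapping from region to object instance ids such as `cup_1`) and the function `filter_scene_graph` are assumptions made for this example. In the paper the relevance decision is presumably made with a language model, whereas here a fixed set of relevant object classes stands in for it.

```python
def filter_scene_graph(scene_graph, relevant_classes):
    """Keep only regions and object instances whose class is task-relevant."""
    subgraph = {}
    for region, instances in scene_graph.items():
        # Instance ids look like "<class>_<index>", e.g. "cup_2".
        kept = [inst for inst in instances
                if inst.rsplit("_", 1)[0] in relevant_classes]
        if kept:
            subgraph[region] = kept
    return subgraph

# Hypothetical scene graph: region -> object instances found there.
scene = {
    "kitchen": ["cup_1", "cup_2", "plate_1", "fridge_1"],
    "living_room": ["sofa_1", "book_1"],
    "bedroom": ["bed_1", "cup_3"],
}

# Task: "put all cups in the fridge" -> only cups and the fridge matter.
print(filter_scene_graph(scene, {"cup", "fridge"}))
```

Passing only the filtered subgraph to the planner bounds the number of objects it has to reason about, which is the effect the abstract credits with reducing hallucinations.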

[AI-51] Evaluating the Impact of AI-Powered Audiovisual Personalization on Learner Emotion Focus and Learning Outcomes

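[Quick read]: This paper addresses the difficulty independent learners have in sustaining attention and regulating emotion in unstructured or distracting environments, as well as the neglect of emotional and sensory context in existing educational technology. The key to the solution is an AI system based on Large Language Models (LLMs) that generates personalized multisensory study environments: users customize visual themes and auditory elements to build immersive study settings that reduce distraction and improve emotional stability.

Link: https://arxiv.org/abs/2505.03033
Authors: George Xi Wang,Jingying Deng,Safinah Ali
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: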

Abstract:Independent learners often struggle with sustaining focus and emotional regulation in unstructured or distracting settings. Although some rely on ambient aids such as music, ASMR, or visual backgrounds to support concentration, these tools are rarely integrated into cohesive, learner-centered systems. Moreover, existing educational technologies focus primarily on content adaptation and feedback, overlooking the emotional and sensory context in which learning takes place. Large language models have demonstrated powerful multimodal capabilities including the ability to generate and adapt text, audio, and visual content. Educational research has yet to fully explore their potential in creating personalized audiovisual learning environments. To address this gap, we introduce an AI-powered system that uses LLMs to generate personalized multisensory study environments. Users select or generate customized visual themes (e.g., abstract vs. realistic, static vs. animated) and auditory elements (e.g., white noise, ambient ASMR, familiar vs. novel sounds) to create immersive settings aimed at reducing distraction and enhancing emotional stability. Our primary research question investigates how combinations of personalized audiovisual elements affect learner cognitive load and engagement. Using a mixed-methods design that incorporates biometric measures and performance outcomes, this study evaluates the effectiveness of LLM-driven sensory personalization. The findings aim to advance emotionally responsive educational technologies and extend the application of multimodal LLMs into the sensory dimension of self-directed learning.

[AI-52] The Multimodal Paradox: How Added and Missing Modalities Shape Bias and Performance in Multimodal AI CVPR2025

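[Quick read]: This paper addresses the insufficient attention paid to bias and robustness, as opposed to performance gains, in multimodal learning systems. In particular, it examines how adding modalities affects model fairness and how missing modalities at inference time affect performance and fairness. The key to the approach is an experimental analysis of how adding and removing modalities changes model performance and fairness measures, which reveals the potential benefits and limits of multimodal fusion in practice.

Link: https://arxiv.org/abs/2505.03020
Authors: Kishore Sampath,Pratheesh,Ayaazuddin Mohammad,Resmi Ramachandranpillai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: CVPR 2025 Second Workshop on Responsible Generative AI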

Abstract:Multimodal learning, which integrates diverse data sources such as images, text, and structured data, has proven superior to unimodal counterparts in high-stakes decision-making. However, while performance gains remain the gold standard for evaluating multimodal systems, concerns around bias and robustness are frequently overlooked. In this context, this paper explores two key research questions (RQs): (i) RQ1 examines whether adding a modality consistently enhances performance and investigates its role in shaping fairness measures, assessing whether it mitigates or amplifies bias in multimodal models; (ii) RQ2 investigates the impact of missing modalities at inference time, analyzing how multimodal models generalize in terms of both performance and fairness. Our analysis reveals that incorporating new modalities during training consistently enhances the performance of multimodal models, while fairness trends exhibit variability across different evaluation measures and datasets. Additionally, the absence of modalities at inference degrades performance and fairness, raising concerns about the robustness of these models in real-world deployment. We conduct extensive experiments using multimodal healthcare datasets containing images, time series, and structured information to validate our findings.

[AI-53] The Cognitive Foundations of Economic Exchange: A Modular Framework Grounded in Behavioral Evidence

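[Quick read]: This paper addresses a key problem in multi-agent AI: modeling social cooperation under realistic behavioral constraints. Concepts from traditional economics and ethics such as “trust” or “morality” usually lack operational criteria or cognitive grounding, which limits their testability and implementation in artificial agents. The key to the proposed solution is a conceptual framework built from three cognitively minimal mechanisms: individual recognition, reciprocal credence, and cost-return sensitivity. The framework redefines trust as a graded cognitive expectation, provides a simulatable basis for reciprocal exchange among artificial agents, and supports the bottom-up emergence of scalable cooperation and institutional dynamics.

Link: https://arxiv.org/abs/2505.02945
Authors: Egil Diau
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: This is a position paper. Theoretical framework is finalized; minor language revisions are ongoing. Submitted for feedback and public discussion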

Abstract:A key challenge in multi-agent AI is modeling social cooperation under realistic behavioral constraints. Many foundational concepts in economics and ethics, such as “trust” or “morality”, are often defined informally, without operational criteria or cognitive grounding, which limits their testability and implementation in artificial agents. Drawing on converging empirical evidence from primate behavior, infant cognition, and economic anthropology, we propose a conceptual framework composed of three cognitively minimal mechanisms: individual recognition, reciprocal credence, and cost-return sensitivity. This framework reframes trust as a graded cognitive expectation, providing a simulatable basis for reciprocal exchange in artificial agents, and enabling the bottom-up emergence of scalable cooperation and institutional dynamics.

[AI-54] Early Prediction of Sepsis: Feature-Aligned Transfer Learning

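[Quick read]: This paper addresses a shortcoming of current diagnostic methods for early sepsis detection: they usually identify sepsis only after the body has already suffered significant damage, which delays treatment. The key to the solution is a method called Feature Aligned Transfer Learning (FATL). It identifies and focuses on the most important clinical features that are commonly reported across multiple studies, which keeps the model consistent and clinically relevant. It also combines knowledge from models trained on diverse populations using a weighted approach, which reduces population bias and improves generalization and effectiveness across patient groups and clinical settings.

Link: https://arxiv.org/abs/2505.02889
Authors: Oyindolapo O. Komolafe,Zhimin Mei,David Morales Zarate,Gregory William Spangenberg
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: A project implemented for MACHINE LEARNING IN HEALTH AND BIOMEDICAL SCIENCE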

Abstract:Sepsis is a life-threatening medical condition that occurs when the body has an extreme response to infection, leading to widespread inflammation, organ failure, and potentially death. Because sepsis can worsen rapidly, early detection is critical to saving lives. However, current diagnostic methods often identify sepsis only after significant damage has already occurred. Our project aims to address this challenge by developing a machine-learning-based system to predict sepsis in its early stages, giving healthcare providers more time to intervene. A major problem with existing models is the wide variability in the patient information or features they use, such as heart rate, temperature, and lab results. This inconsistency makes models difficult to compare and limits their ability to work across different hospitals and settings. To solve this, we propose a method called Feature Aligned Transfer Learning (FATL), which identifies and focuses on the most important and commonly reported features across multiple studies, ensuring the model remains consistent and clinically relevant. Most existing models are trained on narrow patient groups, leading to population bias. FATL addresses this by combining knowledge from models trained on diverse populations, using a weighted approach that reflects each model's contribution. This makes the system more generalizable and effective across different patient demographics and clinical environments. FATL offers a practical and scalable solution for early sepsis detection, particularly in hospitals with limited resources, and has the potential to improve patient outcomes, reduce healthcare costs, and support more equitable healthcare delivery.
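
The weighted combination mentioned in the abstract can be pictured with a very small sketch. It assumes each source model outputs a sepsis risk probability for the same patient and that the weights are already known; the `weighted_ensemble` helper and all numbers are hypothetical, and the abstract does not specify how FATL actually derives or applies its weights.

```python
def weighted_ensemble(probabilities, weights):
    """Combine per-model risk estimates into one weighted average."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total


# Hypothetical outputs of three models trained on different populations.
model_probs = [0.80, 0.60, 0.30]
# Hypothetical weights reflecting each model's contribution.
model_weights = [0.5, 0.3, 0.2]

risk = weighted_ensemble(model_probs, model_weights)
print(round(risk, 2))
```

Normalizing by the weight total means the weights do not have to sum to one in advance.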

[AI-55] Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?

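[Quick read]: This paper addresses unlearning in Large Language Models (LLMs): how to effectively remove information about specific individuals from a model to meet data privacy, compliance, and ethical requirements. Existing methods often obfuscate knowledge by injecting incorrect or irrelevant information, but this amounts to adding knowledge rather than truly deleting it, and it leaves the model vulnerable to probing. The key contributions are a probing-based evaluation framework for verifying whether existing methods genuinely remove the targeted information, and DF-MCQ, a method that uses KL-divergence to flatten the predictive distribution over automatically generated multiple-choice questions, which removes knowledge about the target individuals and triggers appropriate refusal behavior.

Link: https://arxiv.org/abs/2505.02884
Authors: Guangzhi Sun,Potsawee Manakul,Xiao Zhan,Mark Gales
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: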

Abstract:Unlearning has emerged as a critical capability for large language models (LLMs) to support data privacy, regulatory compliance, and ethical AI deployment. Recent techniques often rely on obfuscation by injecting incorrect or irrelevant information to suppress knowledge. Such methods effectively constitute knowledge addition rather than true removal, often leaving models vulnerable to probing. In this paper, we formally distinguish unlearning from obfuscation and introduce a probing-based evaluation framework to assess whether existing approaches genuinely remove targeted information. Moreover, we propose DF-MCQ, a novel unlearning method that flattens the model predictive distribution over automatically generated multiple-choice questions using KL-divergence, effectively removing knowledge about target individuals and triggering appropriate refusal behaviour. Experimental results demonstrate that DF-MCQ achieves unlearning with over 90% refusal rate and a random choice-level uncertainty that is much higher than obfuscation on probing questions.
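
To make the "flattening" objective concrete, the sketch below measures how far a model's distribution over multiple-choice options is from uniform. It is an illustration only: the `kl_to_uniform` helper, the toy logits, and the choice of KL direction (uniform against the model distribution) are assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of option logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_uniform(logits):
    """KL(uniform || p), where p = softmax(logits) over the answer options.

    The value is zero exactly when p is uniform, so minimizing it pushes
    the model toward random-choice-level uncertainty. The KL direction is
    an assumption of this sketch.
    """
    p = softmax(logits)
    u = 1.0 / len(p)
    return sum(u * math.log(u / p_i) for p_i in p)


confident = [6.0, 0.0, 0.0, 0.0]   # model strongly prefers option A
flat = [1.0, 1.0, 1.0, 1.0]        # model has no preference

print(kl_to_uniform(confident) > kl_to_uniform(flat))
```

A training loop would use a quantity like this as the loss on questions about the target individual, so that no option remains more likely than the others.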

[AI-56] Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

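[Quick read]: This paper addresses the limited performance of Large Language Models (LLMs) on program synthesis and mathematical reasoning, which it attributes to the insufficient quality of pre-training corpora. The key to the solution is systematically rewriting public data to raise its quality, which significantly improves LLM performance. Specifically, the authors introduce two open datasets, SwallowCode and SwallowMath, optimized for code and math respectively. SwallowCode refines Python snippets from The-Stack-v2 through a four-stage pipeline (syntax validation, style filtering, and a two-stage LLM rewrite) to improve code quality and utility. SwallowMath improves the readability and accuracy of math problems by removing boilerplate, restoring context, and reformatting solutions. Compared with earlier methods that rely on exclusionary filtering or limited transformations, this "transform-and-retain" approach extracts more value from low-quality data.

Link: https://arxiv.org/abs/2505.02881
Authors: Kazuki Fujii,Yukito Tajima,Sakae Mizuki,Hinari Shimada,Taihei Shiotani,Koshiro Saito,Masanari Ohi,Masaki Kawamura,Taishi Nakamura,Takumi Okamoto,Shigeki Ishida,Kakeru Hattori,Youmi Ma,Hiroya Takamura,Rio Yokota,Naoaki Okazaki
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages (including appendix), 10 figures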

Abstract:The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode (approximately 16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach upgrades low-quality code, maximizing data utility. SwallowMath (approximately 2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by +17.0 on HumanEval and +17.7 on HumanEval+ compared to Stack-Edu, surpassing the baseline model’s code generation capabilities. Similarly, substituting SwallowMath yields +12.4 accuracy on GSM8K and +7.6 on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting delivering the largest gains. All datasets, prompts, and checkpoints are publicly available, enabling reproducible research and advancing LLM pre-training for specialized domains.
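
The first pipeline stage named in the abstract, syntax validation, can be sketched with the standard library alone. This is an assumed minimal version: the `is_valid_python` and `syntax_filter` helpers are hypothetical names, and the abstract does not say which tool the authors used for this check. The later stages (pylint-based filtering and LLM rewriting) are not shown.

```python
import ast


def is_valid_python(snippet: str) -> bool:
    """Return True if the snippet parses under the running interpreter."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as source containing null bytes
        # on some Python versions.
        return False


def syntax_filter(snippets):
    """Keep only the snippets that pass the syntax check."""
    return [s for s in snippets if is_valid_python(s)]


corpus = [
    "def add(a, b):\n    return a + b\n",
    "def broken(:\n    pass\n",          # malformed parameter list
    "print('hello')\n",
]

kept = syntax_filter(corpus)
print(len(kept))
```

Note that `ast.parse` accepts only the grammar of the interpreter it runs on, so the Python version used for filtering determines which snippets survive.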

[AI-57] Uncertainty Quantification for Machine Learning in Healthcare: A Survey

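[Quick read]: This paper addresses the lack of systematic evaluation and effective integration of Uncertainty Quantification (UQ) in Machine Learning (ML) for healthcare, especially across the stages of model development such as data processing, training, and evaluation. The key to the solution is a comprehensive UQ analysis framework that clarifies how different methods fit into each stage of the ML pipeline, and that highlights both the methods most commonly used in healthcare today and novel approaches from other domains, with the aim of advancing practical adoption of UQ in healthcare applications.

Link: https://arxiv.org/abs/2505.02874
Authors: L. Julián Lechuga López,Shaza Elsharief,Dhiyaa Al Jorf,Firas Darwish,Congbo Ma,Farah E. Shamout
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 46 pages, 3 figures, 2 tables, AHLI Conference on Health, Inference, and Learning (CHIL)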

Abstract:Uncertainty Quantification (UQ) is pivotal in enhancing the robustness, reliability, and interpretability of Machine Learning (ML) systems for healthcare, optimizing resources and improving patient care. Despite the emergence of ML-based clinical decision support tools, the lack of principled quantification of uncertainty in ML models remains a major challenge. Current reviews have a narrow focus on analyzing the state-of-the-art UQ in specific healthcare domains without systematically evaluating method efficacy across different stages of model development, and despite a growing body of research, its implementation in healthcare applications remains limited. Therefore, in this survey, we provide a comprehensive analysis of current UQ in healthcare, offering an informed framework that highlights how different methods can be integrated into each stage of the ML pipeline, including data processing, training, and evaluation. We also highlight the most popular methods used in healthcare and novel approaches from other domains that hold potential for future adoption in the medical context. We expect this study will provide a clear overview of the challenges and opportunities of implementing UQ in the ML pipeline for healthcare, guiding researchers and practitioners in selecting suitable techniques to enhance the reliability and safety of ML-driven healthcare solutions and the trust that patients and clinicians place in them.

[AI-58] Understanding University Students' Use of Generative AI: The Roles of Demographics and Personality Traits

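[Quick read]: This paper addresses the shortage of empirical research on university students' use of Generative AI (GAI), and examines how students use GAI and which factors influence that use. The key to the approach is a survey of 363 undergraduate and graduate students in the United States, analyzing how GAI use relates to demographic variables and the Big Five personality traits (extraversion, agreeableness, conscientiousness, emotional stability, and intellect/imagination), which reveals differences in GAI use across groups and the psychological factors behind them.

Link: https://arxiv.org/abs/2505.02863
Authors: Newnew Deng,Edward Jiusi Liu,Xiaoming Zhai
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: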

Abstract:The use of generative AI (GAI) among university students is rapidly increasing, yet empirical research on students’ GAI use and the factors influencing it remains limited. To address this gap, we surveyed 363 undergraduate and graduate students in the United States, examining their GAI usage and how it relates to demographic variables and personality traits based on the Big Five model (i.e., extraversion, agreeableness, conscientiousness, emotional stability, and intellect/imagination). Our findings reveal: (a) Students in higher academic years are more inclined to use GAI and prefer it over traditional resources. (b) Non-native English speakers use and adopt GAI more readily than native speakers. (c) Compared to White students, Asian students report higher GAI usage, perceive greater academic benefits, and express a stronger preference for it. Similarly, Black students report a more positive impact of GAI on their academic performance. Personality traits also play a significant role in shaping perceptions and usage of GAI. After controlling for demographic factors, we found that personality still significantly predicts GAI use and attitudes: (a) Students with higher conscientiousness use GAI less. (b) Students who are higher in agreeableness perceive a less positive impact of GAI on academic performance and express more ethical concerns about using it for academic work. (c) Students with higher emotional stability report a more positive impact of GAI on learning and fewer concerns about its academic use. (d) Students with higher extraversion show a stronger preference for GAI over traditional resources. (e) Students with higher intellect/imagination tend to prefer traditional resources. These insights highlight the need for universities to provide personalized guidance to ensure students use GAI effectively, ethically, and equitably in their academic pursuits.

[AI-59] Neural Orchestration for Multi-Agent Systems: A Deep Learning Framework for Optimal Agent Selection in Multi-Domain Task Environments

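[Quick read]: This paper addresses the rigid coordination mechanisms and poor adaptation to dynamic tasks of traditional architectures for Multi-agent systems (MAS). The key to the solution is MetaOrch, a neural orchestration framework based on supervised learning that selects the best agent by modeling task context, agent histories, and expected response quality. The framework introduces a novel fuzzy evaluation module that scores agent responses along completeness, relevance, and confidence dimensions, producing soft supervision labels for training the orchestrator. This enables dynamic agent selection together with an estimate of selection confidence.

Link: https://arxiv.org/abs/2505.02861
Authors: Kushagra Agrawal,Nisharg Nargund
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: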

Abstract:Multi-agent systems (MAS) are foundational in simulating complex real-world scenarios involving autonomous, interacting entities. However, traditional MAS architectures often suffer from rigid coordination mechanisms and difficulty adapting to dynamic tasks. We propose MetaOrch, a neural orchestration framework for optimal agent selection in multi-domain task environments. Our system implements a supervised learning approach that models task context, agent histories, and expected response quality to select the most appropriate agent for each task. A novel fuzzy evaluation module scores agent responses along completeness, relevance, and confidence dimensions, generating soft supervision labels for training the orchestrator. Unlike previous methods that hard-code agent-task mappings, MetaOrch dynamically predicts the most suitable agent while estimating selection confidence. Experiments in simulated environments with heterogeneous agents demonstrate that our approach achieves 86.3% selection accuracy, significantly outperforming baseline strategies including random selection and round-robin scheduling. The modular architecture emphasizes extensibility, allowing agents to be registered, updated, and queried independently. Results suggest that neural orchestration offers a powerful approach to enhancing the autonomy, interpretability, and adaptability of multi-agent systems across diverse task domains.
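
One way to picture how per-agent quality scores could become soft supervision labels is sketched below. Everything here is assumed for illustration: a plain weighted average stands in for the paper's fuzzy evaluation module, and the `fuzzy_score` and `soft_labels` helpers, the weights, and the softmax temperature are hypothetical.

```python
import math


def fuzzy_score(completeness, relevance, confidence, weights=(0.4, 0.4, 0.2)):
    """Aggregate three quality dimensions in [0, 1] into a single score.

    The weights are hypothetical, not values from the paper.
    """
    w_comp, w_rel, w_conf = weights
    return w_comp * completeness + w_rel * relevance + w_conf * confidence


def soft_labels(scores, temperature=0.1):
    """Turn per-agent scores into a soft target distribution via softmax."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (completeness, relevance, confidence) ratings for three agents.
ratings = [(0.9, 0.8, 0.7), (0.5, 0.6, 0.9), (0.2, 0.3, 0.4)]
scores = [fuzzy_score(*r) for r in ratings]
labels = soft_labels(scores)

# The agent with the largest soft label is the preferred choice.
best_agent = max(range(len(labels)), key=labels.__getitem__)
print(best_agent, [round(p, 3) for p in labels])
```

An orchestrator network could then be trained against `labels` with a cross-entropy loss, and the peak of its predicted distribution would double as a selection-confidence estimate.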

[AI-60] AI Education in a Mirror: Challenges Faced by Academic and Industry Experts

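[Quick read]: This paper addresses the gap between Artificial Intelligence (AI) education and real-world industry challenges. Based on semi-structured interviews with 14 AI experts (eight from industry and six from academia), it offers preliminary insights into key challenges AI professionals face in academic and industry settings, such as data quality and availability, model scalability, practical constraints, user behavior, and explainability. Industry professionals focus more on deployment constraints, resource limitations, and external dependencies, whereas academics emphasize theoretical adaptation and standardization issues. The key to the solution is improving AI curricula so that they better integrate real-world complexity, software engineering principles, and interdisciplinary learning, while still building foundational and ethical reasoning skills.

Link: https://arxiv.org/abs/2505.02856
Authors: Mahir Akgun,Hadi Hosseini
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: To appear in AIED 2025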

Abstract:As Artificial Intelligence (AI) technologies continue to evolve, the gap between academic AI education and real-world industry challenges remains an important area of investigation. This study provides preliminary insights into challenges AI professionals encounter in both academia and industry, based on semi-structured interviews with 14 AI experts - eight from industry and six from academia. We identify key challenges related to data quality and availability, model scalability, practical constraints, user behavior, and explainability. While both groups experience data and model adaptation difficulties, industry professionals more frequently highlight deployment constraints, resource limitations, and external dependencies, whereas academics emphasize theoretical adaptation and standardization issues. These exploratory findings suggest that AI curricula could better integrate real-world complexities, software engineering principles, and interdisciplinary learning, while recognizing the broader educational goals of building foundational and ethical reasoning skills.

[AI-61] A Computational Model of Inclusive Pedagogy: From Understanding to Application

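[Quick read]: This paper addresses the lack of computational models of co-adaptive teacher-student interactions (T-SI) in Educational Science. This gap limits the testing and scaling of educational insights across contexts, and constrains the ability of Machine Learning systems to emulate and adaptively support human learning. The key to the solution is a testable computational T-SI model that integrates contextual insights on human education. By evaluating multiple T-SI strategies in a realistic synthetic classroom setting, the model shows that strategies incorporating co-adaptation principles (e.g., bidirectional agency) outperform unilateral approaches and improve learning outcomes for all learning types.

Link: https://arxiv.org/abs/2505.02853
Authors: Francesco Balzan,Pedro P. Santos,Maurizio Gabbrielli,Mahault Albarracin,Manuel Lopes
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: This is a preprint version of a manuscript intended for submission to the International Journal of Artificial Intelligence in Education (IJAIED)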

Abstract:Human education transcends mere knowledge transfer; it relies on co-adaptation dynamics: the mutual adjustment of teaching and learning strategies between agents. Despite its centrality, computational models of co-adaptive teacher-student interactions (T-SI) remain underdeveloped. We argue that this gap impedes Educational Science in testing and scaling contextual insights across diverse settings, and limits the potential of Machine Learning systems, which struggle to emulate and adaptively support human learning processes. To address this, we present a computational T-SI model that integrates contextual insights on human education into a testable framework. We use the model to evaluate diverse T-SI strategies in a realistic synthetic classroom setting, simulating student groups with unequal access to sensory information. Results show that strategies incorporating co-adaptation principles (e.g., bidirectional agency) outperform unilateral approaches (i.e., where only the teacher or the student is active), improving the learning outcomes for all learning types. Beyond the testing and scaling of context-dependent educational insights, our model enables hypothesis generation in controlled yet adaptable environments. This work bridges non-computational theories of human education with scalable, inclusive AI in Education systems, providing a foundation for equitable technologies that dynamically adapt to learner needs.

[AI-62] Enhancing tutoring systems by leveraging tailored promptings and domain knowledge with Large Language Models

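[Quick read]: This paper addresses the shortage of personalized feedback in Computer-based Learning (CBL) and the challenge of accommodating different learning styles, in particular the lack of real-time, context-aware feedback in programming education. The key to the solution is integrating skill-aligned feedback into prompt engineering for Large Language Models (LLMs) via Retrieval Augmented Generation (RAG), and developing an application for personalized programming tutoring. By making feedback more targeted and adaptive, the approach performs better than general-purpose methods.

Link: https://arxiv.org/abs/2505.02849
Authors: Mohsen Balavar,Wenli Yang,David Herbert,Soonja Yeom
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: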

Abstract:Recent advancements in artificial intelligence (AI) and machine learning have reignited interest in their impact on Computer-based Learning (CBL). AI-driven tools like ChatGPT and Intelligent Tutoring Systems (ITS) have enhanced learning experiences through personalisation and flexibility. ITSs can adapt to individual learning needs and provide customised feedback based on a student’s performance, cognitive state, and learning path. Despite these advances, challenges remain in accommodating diverse learning styles and delivering real-time, context-aware feedback. Our research aims to address these gaps by integrating skill-aligned feedback via Retrieval Augmented Generation (RAG) into prompt engineering for Large Language Models (LLMs) and developing an application to enhance learning through personalised tutoring in a computer science programming context. The pilot study evaluated a proposed system using three quantitative metrics: readability score, response time, and feedback depth, across three programming tasks of varying complexity. The system successfully sorted simulated students into three skill-level categories and provided context-aware feedback. This targeted approach demonstrated better effectiveness and adaptability compared to general methods.

[AI-63] The Precautionary Principle and the Innovation Principle: Incompatible Guides for AI Innovation Governance?

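[Quick read]: This paper asks whether, in policy debates on the governance and regulation of Artificial Intelligence (AI), the Precautionary Principle (PP) and the Innovation Principle (IP) are fundamentally in conflict. It argues that if attention is restricted to the weak forms of PP and IP, the two are neither wholly incompatible nor mutually negating. The key to the solution is a Signal Detection Theory (SDT) model that balances type-I error costs (from erroneously preventing an innovation's diffusion) against type-II error costs (from erroneously allowing it), and determines the optimal regulatory policy from the expected cost ratio: a red-light (PP) decision when the ratio is small, a green-light (IP) decision when it is large, and an amber-light "wait-and-monitor" policy for intermediate ratios. Regulatory sandboxes allow AI testing and experimentation in a structured environment of limited duration and scale, which helps regulators and innovating firms estimate the expected cost ratio more accurately and adapt regulation, technology, or business models to stay out of the red-light zone.

Link: https://arxiv.org/abs/2505.02846
Authors: Kim Kaivanto
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: 47 pages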

Abstract:In policy debates concerning the governance and regulation of Artificial Intelligence (AI), both the Precautionary Principle (PP) and the Innovation Principle (IP) are advocated by their respective interest groups. Do these principles offer wholly incompatible and contradictory guidance? Does one necessarily negate the other? I argue here that, provided attention is restricted to weak-form PP and IP, the answer to both of these questions is “No.” The essence of these weak formulations is the requirement to fully account for type-I error costs arising from erroneously preventing the innovation’s diffusion through society (i.e. mistaken regulatory red-lighting) as well as the type-II error costs arising from erroneously allowing the innovation to diffuse through society (i.e. mistaken regulatory green-lighting). Within the Signal Detection Theory (SDT) model developed here, weak-PP red-light (weak-IP green-light) determinations are optimal for sufficiently small (large) ratios of expected type-I to type-II error costs. For intermediate expected cost ratios, an amber-light ‘wait-and-monitor’ policy is optimal. Regulatory sandbox instruments allow AI testing and experimentation to take place within a structured environment of limited duration and societal scale, whereby the expected cost ratio falls within the ‘wait-and-monitor’ range. Through sandboxing, regulators and innovating firms learn more about the expected cost ratio, and what respective adaptations (of regulation, of technical solution, of business model, or a combination thereof, if any) are needed to keep the ratio out of the weak-PP red-light zone.
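
The three-way rule stated in the abstract can be written down as a small decision function. This is a sketch under stated assumptions: the `regulatory_light` name and the threshold values `lower` and `upper` are hypothetical placeholders, since the abstract says only that red is optimal for sufficiently small ratios, green for sufficiently large ones, and amber in between.

```python
def regulatory_light(type1_cost, type2_cost, lower=0.5, upper=2.0):
    """Map the expected type-I / type-II error cost ratio to a decision.

    type1_cost: expected cost of wrongly blocking the innovation.
    type2_cost: expected cost of wrongly allowing the innovation.
    lower, upper: hypothetical thresholds; the paper derives its own
    cut-offs from the SDT model, which this sketch does not reproduce.
    """
    if type1_cost < 0 or type2_cost <= 0:
        raise ValueError("type1_cost must be >= 0 and type2_cost must be > 0")
    if lower >= upper:
        raise ValueError("lower threshold must be below upper threshold")
    ratio = type1_cost / type2_cost
    if ratio < lower:
        return "red"      # weak-PP: block diffusion
    if ratio > upper:
        return "green"    # weak-IP: allow diffusion
    return "amber"        # wait and monitor, e.g. in a regulatory sandbox


print(regulatory_light(1.0, 10.0))   # blocking is cheap relative to harm
print(regulatory_light(10.0, 1.0))   # blocking is costly relative to harm
print(regulatory_light(1.0, 1.0))    # costs are comparable
```

In this picture a sandbox is a way to refine the two cost estimates, which may move a case out of the amber band in either direction.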

[AI-64] Snakemaker: Seamlessly transforming ad-hoc analyses into sustainable Snakemake workflows with generative AI

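[Quick read]: This paper addresses reproducibility and sustainability in bioinformatics software development, where rapidly evolving tools and complex workflows often produce short-lived or hard-to-adapt analysis pipelines. The key to the solution is Snakemaker, a tool that uses generative AI to convert unstructured code into well-defined Snakemake workflows, helping researchers build sustainable data analysis pipelines. Snakemaker does this by non-invasively tracking terminal activity, analyzing execution patterns, and generating Snakemake workflows that can be integrated into existing pipelines. It also supports converting monolithic IPython Notebooks into modular Snakemake pipelines, and provides an integrated chat assistant for fine-grained control through natural language instructions.

Link: https://arxiv.org/abs/2505.02841
Authors: Marco Masera,Alessandro Leone,Johannes Köster,Ivan Molineris
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: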

Abstract:Reproducibility and sustainability present significant challenges in bioinformatics software development, where rapidly evolving tools and complex workflows often result in short-lived or difficult-to-adapt pipelines. This paper introduces Snakemaker, a tool that leverages generative AI to help researchers build sustainable data analysis pipelines by converting unstructured code into well-defined Snakemake workflows. Snakemaker non-invasively tracks the work performed in the terminal by the researcher, analyzes execution patterns, and generates Snakemake workflows that can be integrated into existing pipelines. Snakemaker also supports the transformation of monolithic IPython Notebooks into modular Snakemake pipelines, resolving the global state of the notebook into discrete, file-based interactions between rules. An integrated chat assistant provides users with fine-grained control through natural language instructions. Snakemaker generates high-quality Snakemake workflows by adhering to best practices, including Conda environment tracking, generic rule generation, and loop unrolling. By lowering the barrier between prototype and production-quality code, Snakemaker addresses a critical gap in computational reproducibility for bioinformatics research.

[AI-65] Actor-Critics Can Achieve Optimal Sample Efficiency ICML2025

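[Quick read]: This paper addresses an open problem: when strategic exploration is necessary, no existing work learns an ε-optimal policy with a sample complexity of O(1/ε²) trajectories. The key to the solution is a novel actor-critic algorithm that combines optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. It achieves a sample complexity of O(dH⁵ log|A|/ε² + dH⁴ log|F|/ε²) trajectories, with accompanying √T regret when the Bellman eluder dimension d grows no faster than log T. The algorithm is also extended to the Hybrid RL setting, where offline data improves sample efficiency.

Link: https://arxiv.org/abs/2505.03710
Authors: Kevin Tan,Wei Fan,Yuting Wei
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ICML 2025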

Abstract:Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $\epsilon$-optimal policy with a sample complexity of $O(1/\epsilon^2)$ trajectories with general function approximation when strategic exploration is necessary. We address this open problem by introducing a novel actor-critic algorithm that attains a sample complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + dH^4 \log|\mathcal{F}|/\epsilon^2)$ trajectories, and accompanying $\sqrt{T}$ regret when the Bellman eluder dimension $d$ does not increase with $T$ at more than a $\log T$ rate. Here, $\mathcal{F}$ is the critic function class, $\mathcal{A}$ is the action space, and $H$ is the horizon in the finite horizon MDP setting. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. We extend this to the setting of Hybrid RL, showing that initializing the critic with offline data yields sample efficiency gains compared to purely offline or online RL. Further, utilizing access to offline data, we provide a non-optimistic provably efficient actor-critic algorithm that only additionally requires $N_{\text{off}} \geq c_{\text{off}}^* dH^4/\epsilon^2$ in exchange for omitting optimism, where $c_{\text{off}}^*$ is the single-policy concentrability coefficient and $N_{\text{off}}$ is the number of offline samples. This addresses another open problem in the literature. We further provide numerical experiments to support our theoretical findings.
Submission history: From Kevin Tan, [v1] Tue, 6 May 2025 17:32:39 UTC (748 KB)

[AI-66] Binding threshold units with artificial oscillatory neurons

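[Quick read]: This paper addresses the weaker performance of conventional threshold units on some tasks, particularly unsupervised object discovery and certain reasoning problems. The key to the solution is to introduce oscillatory neurons and couple them with threshold units through a heterogeneous coupling mechanism. The mechanism combines a generalized Kuramoto equation with the standard coupling methods used for threshold units, and uses dynamical systems theory to construct models that admit Lyapunov functions, which yields a well-defined interaction between oscillatory neurons and threshold units. The result is a Hopfield-Kuramoto associative memory model in which oscillatory neurons can act as a low-rank correction to the weight matrix of a Hopfield network. This correction can be viewed either as a form of Hebbian learning or as the LoRA method commonly used to fine-tune large language models.

Link: https://arxiv.org/abs/2505.03648
Authors: Vladimir Fanaskov,Ivan Oseledets
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: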

Abstract:Artificial Kuramoto oscillatory neurons were recently introduced as an alternative to threshold units. Empirical evidence suggests that oscillatory units outperform threshold units in several tasks including unsupervised object discovery and certain reasoning problems. The proposed coupling mechanism for these oscillatory neurons is heterogeneous, combining a generalized Kuramoto equation with standard coupling methods used for threshold units. In this research note, we present a theoretical framework that clearly distinguishes oscillatory neurons from threshold units and establishes a coupling mechanism between them. We argue that, from a biological standpoint, oscillatory and threshold units realise distinct aspects of neural coding: roughly, threshold units model intensity of neuron firing, while oscillatory units facilitate information exchange by frequency modulation. To derive interaction between these two types of units, we constrain their dynamics by focusing on dynamical systems that admit Lyapunov functions. For threshold units, this leads to the Hopfield associative memory model, and for oscillatory units it yields a specific form of the generalized Kuramoto model. The resulting dynamical systems can be naturally coupled to form a Hopfield-Kuramoto associative memory model, which also admits a Lyapunov function. Various forms of coupling are possible. Notably, oscillatory neurons can be employed to implement a low-rank correction to the weight matrix of a Hopfield network. This correction can be viewed either as a form of Hebbian learning or as the popular LoRA method used for fine-tuning of large language models. We demonstrate the practical realization of this particular coupling through illustrative toy experiments.
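
The remark that a low-rank correction to a Hopfield weight matrix can be read as Hebbian learning can be checked on a toy network. The sketch below is an assumption-laden illustration: it does not model the Kuramoto oscillators at all, and only shows the algebraic point that adding one Hebbian outer product is a rank-one (LoRA-like) update that stores a new pattern. The patterns and helper names are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors as a list-of-lists matrix."""
    return [[ui * vj for vj in v] for ui in u]


def add_scaled(a, b, scale):
    """Element-wise a + scale * b for two equally sized matrices."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def update(weights, state):
    """One synchronous threshold-unit update: s <- sign(W s)."""
    fields = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return [1 if h >= 0 else -1 for h in fields]


stored = [1, -1, 1, -1, 1, -1]     # pattern already in memory
new = [1, 1, 1, -1, -1, -1]        # pattern to add via the correction
n = len(stored)

# Hebbian weights for the stored pattern (diagonal kept for simplicity).
zero = [[0.0] * n for _ in range(n)]
W = add_scaled(zero, outer(stored, stored), 1.0 / n)

# Rank-one correction: a single extra outer product, i.e. one Hebbian step.
W_corrected = add_scaled(W, outer(new, new), 1.0 / n)

print(update(W, new) == stored)            # before: new collapses to stored
print(update(W_corrected, new) == new)     # after: new is a fixed point
print(update(W_corrected, stored) == stored)
```

Before the correction the network maps `new` onto the only memory it has; after it, both patterns are fixed points, which is the sense in which the rank-one term adds a memory without retraining the full matrix.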

[AI-67] CreoPep: A Universal Deep Learning Framework for Target-Specific Peptide Design and Optimization

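[Quick read]: This paper addresses the underused therapeutic potential of target-specific peptides (such as conotoxins), which is held back by the limited diversity of natural variants and by time-consuming traditional optimization strategies. The key to its solution is CreoPep, a deep learning-based conditional generative framework that combines masked language modeling with a progressive masking scheme to design high-affinity peptide mutants and uncover novel structural motifs. The method integrates FoldX-based energy screening with temperature-controlled multinomial sampling to generate structurally and functionally diverse peptides that retain key pharmacological properties.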

Link: https://arxiv.org/abs/2505.02887
Authors: Cheng Ge,Han-Shen Tae,Zhenqiang Zhang,Lu Lu,Zhijie Huang,Yilin Wang,Tao Jiang,Wenqing Cai,Shan Chang,David J. Adams,Rilei Yu
Affiliations: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures

Abstract:Target-specific peptides, such as conotoxins, exhibit exceptional binding affinity and selectivity toward ion channels and receptors. However, their therapeutic potential remains underutilized due to the limited diversity of natural variants and the labor-intensive nature of traditional optimization strategies. Here, we present CreoPep, a deep learning-based conditional generative framework that integrates masked language modeling with a progressive masking scheme to design high-affinity peptide mutants while uncovering novel structural motifs. CreoPep employs an integrative augmentation pipeline, combining FoldX-based energy screening with temperature-controlled multinomial sampling, to generate structurally and functionally diverse peptides that retain key pharmacological properties. We validate this approach by designing conotoxin inhibitors targeting the α7 nicotinic acetylcholine receptor, achieving submicromolar potency in electrophysiological assays. Structural analysis reveals that CreoPep-generated variants engage in both conserved and novel binding modes, including disulfide-deficient forms, thus expanding beyond conventional design paradigms. Overall, CreoPep offers a robust and generalizable platform that bridges computational peptide design with experimental validation, accelerating the discovery of next-generation peptide therapeutics.

[AI-68] Taskmaster Deconstructed: A Quantitative Look at Tension Volatility and Viewer Ratings

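[Quick read]: This paper asks whether the scoring mechanism of Taskmaster, a British television show that combines comedic performance with a formal scoring system, has a material effect on audience engagement. The core of the study is a statistical analysis of 162 episodes across 18 series, using fifteen episode-level metrics to quantify rank volatility, point spread, lead changes, and winner dominance, and then evaluating how these metrics relate to IMDb ratings in order to judge whether scoring dynamics actually influence viewer interest.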

Link: https://arxiv.org/abs/2505.02886
Authors: David H. Silver
Affiliations: Unknown
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI)
Comments: 29 pages, includes 5 figures and 18 supplementary visualizations. Submitted as a preprint. Code and data available at github dot com slash silverdavi slash taskmaster-stats

Abstract:Taskmaster is a British television show that combines comedic performance with a formal scoring system. Despite the appearance of structured competition, it remains unclear whether scoring dynamics contribute meaningfully to audience engagement. We conducted a statistical analysis of 162 episodes across 18 series, using fifteen episode-level metrics to quantify rank volatility, point spread, lead changes, and winner dominance. None of these metrics showed a significant association with IMDb ratings, even after controlling for series effects. Long-term trends suggest that average points have increased over time, while volatility has slightly declined and rank spread has remained stable. These patterns indicate an attempt to enhance competitive visibility without altering the show’s structural equilibrium. We also analyzed contestant rank trajectories and identified five recurring archetypes describing performance styles. These patterns suggest that viewer interest is shaped more by contestant behavior than by game mechanics.

Machine Learning

[LG-0] Sustainable Smart Farm Networks: Enhancing Resilience and Efficiency with Decision Theory-Guided Deep Reinforcement Learning

Link: https://arxiv.org/abs/2505.03721
Authors: Dian Chen,Zelin Wan,Dong Sam Ha,Jin-Hee Cho
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Solar sensor-based monitoring systems have become a crucial agricultural innovation, advancing farm management and animal welfare through integrating sensor technology, Internet-of-Things, and edge and cloud computing. However, the resilience of these systems to cyber-attacks and their adaptability to dynamic and constrained energy supplies remain largely unexplored. To address these challenges, we propose a sustainable smart farm network designed to maintain high-quality animal monitoring under various cyber and adversarial threats, as well as fluctuating energy conditions. Our approach utilizes deep reinforcement learning (DRL) to devise optimal policies that maximize both monitoring effectiveness and energy efficiency. To overcome DRL’s inherent challenge of slow convergence, we integrate transfer learning (TL) and decision theory (DT) to accelerate the learning process. By incorporating DT-guided strategies, we optimize monitoring quality and energy sustainability, significantly reducing training time while achieving comparable performance rewards. Our experimental results prove that DT-guided DRL outperforms TL-enhanced DRL models, improving system performance and reducing training runtime by 47.5%.

[LG-1] Learning Survival Distributions with the Asymmetric Laplace Distribution ICML2025

Link: https://arxiv.org/abs/2505.03712
Authors: Deming Sheng,Ricardo Henao
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: Accepted to ICML 2025

Abstract:Probabilistic survival analysis models seek to estimate the distribution of the future occurrence (time) of an event given a set of covariates. In recent years, these models have preferred nonparametric specifications that avoid directly estimating survival distributions via discretization. Specifically, they estimate the probability of an individual event at fixed times or the time of an event at fixed probabilities (quantiles), using supervised learning. Borrowing ideas from the quantile regression literature, we propose a parametric survival analysis method based on the Asymmetric Laplace Distribution (ALD). This distribution allows for closed-form calculation of popular event summaries such as mean, median, mode, variation, and quantiles. The model is optimized by maximum likelihood to learn, at the individual level, the parameters (location, scale, and asymmetry) of the ALD distribution. Extensive results on synthetic and real-world data demonstrate that the proposed method outperforms parametric and nonparametric approaches in terms of accuracy, discrimination and calibration.
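The closed-form summaries mentioned in the abstract can be sketched for one common parameterization of the Asymmetric Laplace Distribution. The parameterization below (location `theta`, scale `sigma`, asymmetry `tau` in (0, 1), as used in the quantile regression literature) is an assumption; the paper may use a different but equivalent form, and the function names are hypothetical.

```python
import math


def ald_logpdf(y, theta, sigma, tau):
    """Log density: log(tau * (1 - tau) / sigma) - rho_tau((y - theta) / sigma)."""
    u = (y - theta) / sigma
    check_loss = u * (tau - (1.0 if u < 0 else 0.0))  # quantile "check" function
    return math.log(tau * (1.0 - tau) / sigma) - check_loss


def ald_cdf(y, theta, sigma, tau):
    """Closed-form CDF; by construction the CDF at theta equals tau."""
    u = (y - theta) / sigma
    if u <= 0:
        return tau * math.exp((1.0 - tau) * u)
    return 1.0 - (1.0 - tau) * math.exp(-tau * u)


def ald_quantile(q, theta, sigma, tau):
    """Closed-form inverse CDF for q in (0, 1)."""
    if q <= tau:
        return theta + sigma / (1.0 - tau) * math.log(q / tau)
    return theta - sigma / tau * math.log((1.0 - q) / (1.0 - tau))


# Illustrative parameter values (assumed, not taken from the paper).
theta, sigma, tau = 2.0, 1.5, 0.3
median = ald_quantile(0.5, theta, sigma, tau)
print(median, ald_cdf(median, theta, sigma, tau))
```

In a maximum-likelihood setup of the kind the abstract describes, a network would output the three parameters per individual and the training loss would be the negative of `ald_logpdf`; handling of censored observations is omitted here.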

[LG-2] Neural Integral Operators for Inverse problems in Spectroscopy

Link: https://arxiv.org/abs/2505.03677
Authors: Emanuele Zappala,Alice Giola,Andreas Kramer,Enrico Greco
Categories: Machine Learning (cs.LG)
Comments: 13 pages. Codes available upon request

Abstract:Deep learning has shown high performance on spectroscopic inverse problems when sufficient data is available. However, it is often the case that data in spectroscopy is scarce, and this usually causes severe overfitting problems with deep learning methods. Traditional machine learning methods are viable when datasets are smaller, but the accuracy and applicability of these methods is generally more limited. We introduce a deep learning method for classification of molecular spectra based on learning integral operators via integral equations of the first kind, which results in an algorithm that is less affected by overfitting issues on small datasets, compared to other deep learning models. The problem formulation of the deep learning approach is based on inverse problems, which have traditionally found important applications in spectroscopy. We perform experiments on real world data to showcase our algorithm. It is seen that the model outperforms traditional machine learning approaches such as decision tree and support vector machine, and for small datasets it outperforms other deep learning models. Therefore, our methodology leverages the power of deep learning, still maintaining the performance when the available data is very limited, which is one of the main issues that deep learning faces in spectroscopy, where datasets are often small.

[LG-3] Mitigating mode collapse in normalizing flows by annealing with an adaptive schedule: Application to parameter estimation

Link: https://arxiv.org/abs/2505.03652
Authors: Yihang Wang,Chris Chi,Aaron R. Dinner
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments: 19 pages, 10 figures

Abstract:Normalizing flows (NFs) provide uncorrelated samples from complex distributions, making them an appealing tool for parameter estimation. However, the practical utility of NFs remains limited by their tendency to collapse to a single mode of a multimodal distribution. In this study, we show that annealing with an adaptive schedule based on the effective sample size (ESS) can mitigate mode collapse. We demonstrate that our approach can converge the marginal likelihood for a biochemical oscillator model fit to time-series data in ten-fold less computation time than a widely used ensemble Markov chain Monte Carlo (MCMC) method. We show that the ESS can also be used to reduce variance by pruning the samples. We expect these developments to be of general use for sampling with NFs and discuss potential opportunities for further improvements.
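An ESS-driven annealing schedule of the kind the abstract mentions can be sketched as follows. This is a generic adaptive-tempering recipe borrowed from sequential Monte Carlo practice, not the authors' exact procedure: the next inverse temperature is chosen by bisection so that the effective sample size of the incremental importance weights stays at a target fraction. The names `next_beta`, `log_ratio`, and `target_frac` are assumptions.

```python
import math


def ess(log_w):
    """Effective sample size (sum w)^2 / sum w^2, computed stably from log-weights."""
    m = max(log_w)
    w = [math.exp(l - m) for l in log_w]
    s = sum(w)
    return s * s / sum(x * x for x in w)


def next_beta(log_ratio, beta, target_frac=0.5, tol=1e-6):
    """Pick the largest beta' in (beta, 1] whose incremental weights keep ESS/n >= target.

    log_ratio[i] is assumed to be log p_target(x_i) - log p_base(x_i) for sample x_i,
    so moving from beta to beta' reweights sample i by exp((beta' - beta) * log_ratio[i]).
    """
    n = len(log_ratio)

    def frac(b):
        return ess([(b - beta) * r for r in log_ratio]) / n

    if frac(1.0) >= target_frac:
        return 1.0  # can jump straight to the target distribution
    lo, hi = beta, 1.0
    while hi - lo > tol:  # ESS is non-increasing in the step size, so bisection applies
        mid = 0.5 * (lo + hi)
        if frac(mid) >= target_frac:
            lo = mid
        else:
            hi = mid
    return lo


# Toy log-ratios with a wide spread (assumed values), forcing a small first step.
log_ratio = [10.0 * i for i in range(20)]
beta1 = next_beta(log_ratio, beta=0.0)
print(beta1, ess([beta1 * r for r in log_ratio]) / len(log_ratio))
```

The same `ess` function is the quantity one would monitor when pruning low-weight samples, which the abstract mentions as a variance-reduction step.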

[LG-4] Understand the Effect of Importance Weighting in Deep Learning on Dataset Shift

Link: https://arxiv.org/abs/2505.03617
Authors: Thien Nhan Vo,Thanh Xuan Truong
Categories: Machine Learning (cs.LG)

Abstract:We evaluate the effectiveness of importance weighting in deep neural networks under label shift and covariate shift. On synthetic 2D data (linearly separable and moon-shaped) using logistic regression and MLPs, we observe that weighting strongly affects decision boundaries early in training but fades with prolonged optimization. On CIFAR-10 with various class imbalances, only L2 regularization (not dropout) helps preserve weighting effects. In a covariate-shift experiment, importance weighting yields no significant performance gain, highlighting challenges on complex data. Our results call into question the practical utility of importance weighting for real-world distribution shifts.
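For readers unfamiliar with the technique being evaluated, a minimal sketch of importance weighting under label shift is given below. It is an illustration under assumed toy data rather than the paper's experimental code: each class is weighted by the ratio of its target prior to its empirical source frequency, and the weights rescale a negative log-likelihood loss. The helper names are hypothetical.

```python
import math
from collections import Counter


def label_shift_weights(source_labels, target_prior):
    """w(y) = p_target(y) / p_source(y), with p_source estimated from label counts."""
    n = len(source_labels)
    counts = Counter(source_labels)
    return {c: target_prior[c] / (counts[c] / n) for c in counts}


def weighted_nll(probs, labels, weights):
    """Importance-weighted negative log-likelihood, normalised by the total weight."""
    total = sum(weights[y] for y in labels)
    loss = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return loss / total


# Toy data (assumed): class 1 is rare in the source but balanced in the target.
labels = [0, 0, 0, 1]
weights = label_shift_weights(labels, {0: 0.5, 1: 0.5})
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]

uniform = {0: 1.0, 1: 1.0}
print(weights)
print(weighted_nll(probs, labels, uniform), weighted_nll(probs, labels, weights))
```

The abstract's finding is that the effect of such weights on a deep network's decision boundary tends to fade with prolonged optimization, so a sketch like this shows the mechanism, not a guarantee of benefit.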

[LG-5] Anant-Net: Breaking the Curse of Dimensionality with Scalable and Interpretable Neural Surrogate for High-Dimensional PDEs

Link: https://arxiv.org/abs/2505.03595
Authors: Sidharth S. Menon,Ameya D. Jagtap
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 13 figures

Abstract:High-dimensional partial differential equations (PDEs) arise in diverse scientific and engineering applications but remain computationally intractable due to the curse of dimensionality. Traditional numerical methods struggle with the exponential growth in computational complexity, particularly on hypercubic domains, where the number of required collocation points increases rapidly with dimensionality. Here, we introduce Anant-Net, an efficient neural surrogate that overcomes this challenge, enabling the solution of PDEs in high dimensions. Unlike hyperspheres, where the internal volume diminishes as dimensionality increases, hypercubes retain or expand their volume (for unit or larger length), making high-dimensional computations significantly more demanding. Anant-Net efficiently incorporates high-dimensional boundary conditions and minimizes the PDE residual at high-dimensional collocation points. To enhance interpretability, we integrate Kolmogorov-Arnold networks into the Anant-Net architecture. We benchmark Anant-Net’s performance on several linear and nonlinear high-dimensional equations, including the Poisson, Sine-Gordon, and Allen-Cahn equations, demonstrating high accuracy and robustness across randomly sampled test points from high-dimensional space. Importantly, Anant-Net achieves these results with remarkable efficiency, solving 300-dimensional problems on a single GPU within a few hours. We also compare Anant-Net’s results for accuracy and runtime with other state-of-the-art methods. Our findings establish Anant-Net as an accurate, interpretable, and scalable framework for efficiently solving high-dimensional PDEs.
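The contrast the abstract draws between hyperspheres and hypercubes can be checked numerically. The sketch below uses the standard formula for the volume of a d-dimensional ball, V = pi^(d/2) r^d / Gamma(d/2 + 1), and compares it with a cube of unit side; the function names are illustrative and the computation is independent of Anant-Net itself.

```python
import math


def ball_volume(d, radius=1.0):
    """Volume of a d-dimensional ball, evaluated in log space to avoid overflow."""
    log_v = (0.5 * d * math.log(math.pi)
             - math.lgamma(0.5 * d + 1.0)
             + d * math.log(radius))
    return math.exp(log_v)


def cube_volume(d, side=1.0):
    """Volume of a d-dimensional hypercube with the given side length."""
    return side ** d


for d in (2, 3, 10, 100, 300):
    print(d, ball_volume(d), cube_volume(d))
```

The unit-radius ball's volume peaks at low dimension and then shrinks toward zero, while the unit cube keeps volume 1, which is why uniform collocation over a hypercube becomes so demanding as the dimension grows.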

[LG-6] Efficient Training of Physics-enhanced Neural ODEs via Direct Collocation and Nonlinear Programming

Link: https://arxiv.org/abs/2505.03552
Authors: Linus Langenkamp,Philip Hannebohm,Bernhard Bachmann
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
Comments: 16 pages, 9 figures, submitted to 16th International Modelica FMI Conference

Abstract:We propose a novel approach for training Physics-enhanced Neural ODEs (PeNODEs) by expressing the training process as a dynamic optimization problem. The full model, including neural components, is discretized using a high-order implicit Runge-Kutta method with flipped Legendre-Gauss-Radau points, resulting in a large-scale nonlinear program (NLP) efficiently solved by state-of-the-art NLP solvers such as Ipopt. This formulation enables simultaneous optimization of network parameters and state trajectories, addressing key limitations of ODE solver-based training in terms of stability, runtime, and accuracy. Extending on a recent direct collocation-based method for Neural ODEs, we generalize to PeNODEs, incorporate physical constraints, and present a custom, parallelized, open-source implementation. Benchmarks on a Quarter Vehicle Model and a Van-der-Pol oscillator demonstrate superior accuracy, speed, and generalization with smaller networks compared to other training techniques. We also outline a planned integration into OpenModelica to enable accessible training of Neural DAEs.

[LG-7] Small-Scale-Fading-Aware Resource Allocation in Wireless Federated Learning

Link: https://arxiv.org/abs/2505.03533
Authors: Jiacheng Wang,Le Liang,Hao Ye,Chongtao Guo,Shi Jin
Categories: Machine Learning (cs.LG)

Abstract:Judicious resource allocation can effectively enhance federated learning (FL) training performance in wireless networks by addressing both system and statistical heterogeneity. However, existing strategies typically rely on block fading assumptions, which overlooks rapid channel fluctuations within each round of FL gradient uploading, leading to a degradation in FL training performance. Therefore, this paper proposes a small-scale-fading-aware resource allocation strategy using a multi-agent reinforcement learning (MARL) framework. Specifically, we establish a one-step convergence bound of the FL algorithm and formulate the resource allocation problem as a decentralized partially observable Markov decision process (Dec-POMDP), which is subsequently solved using the QMIX algorithm. In our framework, each client serves as an agent that dynamically determines spectrum and power allocations within each coherence time slot, based on local observations and a reward derived from the convergence analysis. The MARL setting reduces the dimensionality of the action space and facilitates decentralized decision-making, enhancing the scalability and practicality of the solution. Experimental results demonstrate that our QMIX-based resource allocation strategy significantly outperforms baseline methods across various degrees of statistical heterogeneity. Additionally, ablation studies validate the critical importance of incorporating small-scale fading dynamics, highlighting its role in optimizing FL performance.

[LG-8] Causal Intervention Framework for Variational Auto Encoder Mechanistic Interpretability

Link: https://arxiv.org/abs/2505.03530
Authors: Dip Roy
Categories: Machine Learning (cs.LG)

Abstract:Mechanistic interpretability of deep learning models has emerged as a crucial research direction for understanding the functioning of neural networks. While significant progress has been made in interpreting discriminative models like transformers, understanding generative models such as Variational Autoencoders (VAEs) remains challenging. This paper introduces a comprehensive causal intervention framework for mechanistic interpretability of VAEs. We develop techniques to identify and analyze “circuit motifs” in VAEs, examining how semantic factors are encoded, processed, and disentangled through the network layers. Our approach uses targeted interventions at different levels: input manipulations, latent space perturbations, activation patching, and causal mediation analysis. We apply our framework to both synthetic datasets with known causal relationships and standard disentanglement benchmarks. Results show that our interventions can successfully isolate functional circuits, map computational graphs to causal graphs of semantic factors, and distinguish between polysemantic and monosemantic units. Furthermore, we introduce metrics for causal effect strength, intervention specificity, and circuit modularity that quantify the interpretability of VAE components. Experimental results demonstrate clear differences between VAE variants, with FactorVAE achieving higher disentanglement scores (0.084) and effect strengths (mean 4.59) compared to standard VAE (0.064, 3.99) and Beta-VAE (0.051, 3.43). Our framework advances the mechanistic understanding of generative models and provides tools for more transparent and controllable VAE architectures.

[LG-9] Uncovering the Limitations of Model Inversion Evaluation: Benchmarks and Connection to Type-I Adversarial Attacks

Link: https://arxiv.org/abs/2505.03519
Authors: Sy-Tuyen Ho,Koh Jun Hao,Ngoc-Bao Nguyen,Alexander Binder,Ngai-Man Cheung
Categories: Machine Learning (cs.LG)
Comments: Our dataset and code are available in the Supp

Abstract:Model Inversion (MI) attacks aim to reconstruct information of private training data by exploiting access to machine learning models. The most common evaluation framework for MI attacks/defenses relies on an evaluation model that has been utilized to assess progress across almost all MI attacks and defenses proposed in recent years. In this paper, for the first time, we present an in-depth study of MI evaluation. Firstly, we construct the first comprehensive human-annotated dataset of MI attack samples, based on 28 setups of different MI attacks, defenses, private and public datasets. Secondly, using our dataset, we examine the accuracy of the MI evaluation framework and reveal that it suffers from a significant number of false positives. These findings raise questions about the previously reported success rates of SOTA MI attacks. Thirdly, we analyze the causes of these false positives, design controlled experiments, and discover the surprising effect of Type I adversarial features on MI evaluation, as well as adversarial transferability, highlighting a relationship between two previously distinct research areas. Our findings suggest that the performance of SOTA MI attacks has been overestimated, with the actual privacy leakage being significantly less than previously reported. In conclusion, we highlight critical limitations in the widely used MI evaluation framework and present our methods to mitigate false positive rates. We remark that prior research has shown that Type I adversarial attacks are very challenging, with no existing solution. Therefore, we urge to consider human evaluation as a primary MI evaluation framework rather than merely a supplement as in previous MI research. We also encourage further work on developing more robust and reliable automatic evaluation frameworks.

[LG-10] AnomalyMatch: Discovering Rare Objects of Interest with Semi-supervised and Active Learning

Link: https://arxiv.org/abs/2505.03509
Authors: Pablo Gómez,David O’Ryan
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments: Journal submission in preparation

Abstract:Anomaly detection in large datasets is essential in fields such as astronomy and computer vision; however, supervised methods typically require extensive anomaly labelling, which is often impractical. We present AnomalyMatch, an anomaly detection framework combining the semi-supervised FixMatch algorithm using EfficientNet classifiers with active learning. By treating anomaly detection as a semi-supervised binary classification problem, we efficiently utilise limited labelled and abundant unlabelled images. We allow iterative model refinement in a user interface for expert verification of high-confidence anomalies and correction of false positives. Built for astronomical data, AnomalyMatch generalises readily to other domains facing similar data challenges. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance (1% anomalies for miniImageNet) display strong performance: starting from five to ten labelled anomalies and after three active learning cycles, we achieve an average AUROC of 0.95 (miniImageNet) and 0.86 (GalaxyMNIST), with respective AUPRC of 0.77 and 0.71. After active learning cycles, anomalies are ranked with 71% (miniImageNet) to 93% precision in the 1% of the highest-ranked images. AnomalyMatch is tailored for large-scale applications, efficiently processing predictions for 100 million images within three days on a single GPU. Integrated into ESA's Datalabs platform, AnomalyMatch facilitates targeted discovery of scientifically valuable anomalies in vast astronomical datasets. Our results underscore the exceptional utility and scalability of this approach for anomaly discovery, highlighting the value of specialised approaches for domains characterised by severe label scarcity.

[LG-11] Modeling Musical Genre Trajectories through Pathlet Learning

Link: https://arxiv.org/abs/2505.03480
Authors: Lilian Marey,Charlotte Laclau,Bruno Sguerra,Tiphaine Viard,Manuel Moussallam
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization (UMAP Adjunct '25)

Abstract:The increasing availability of user data on music streaming platforms opens up new possibilities for analyzing music consumption. However, understanding the evolution of user preferences remains a complex challenge, particularly as their musical tastes change over time. This paper uses the dictionary learning paradigm to model user trajectories across different musical genres. We define a new framework that captures recurring patterns in genre trajectories, called pathlets, enabling the creation of comprehensible trajectory embeddings. We show that pathlet learning reveals relevant listening patterns that can be analyzed both qualitatively and quantitatively. This work improves our understanding of users’ interactions with music and opens up avenues of research into user behavior and fostering diversity in recommender systems. A dataset of 2000 user histories tagged by genre over 17 months, supplied by Deezer (a leading music streaming company), is also released with the code.

[LG-12] Knowledge Distillation for Speech Denoising by Latent Representation Alignment with Cosine Distance

Link: https://arxiv.org/abs/2505.03442
Authors: Diep Luong,Mikko Heikkinen,Konstantinos Drossos,Tuomas Virtanen
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:Speech denoising is a generally adopted and impactful task, appearing in many common and everyday-life use cases. Although there are very powerful methods published, most of those are too complex for deployment in everyday and low-resource computational environments, like hand-held devices, intelligent glasses, hearing aids, etc. Knowledge distillation (KD) is a prominent way for alleviating this complexity mismatch and is based on the transferring/distilling of knowledge from a pre-trained complex model, the teacher, to another less complex one, the student. Existing KD methods for speech denoising are based on processes that potentially hamper the KD by bounding the learning of the student to the distribution, information ordering, and feature dimensionality learned by the teacher. In this paper, we present and assess a method that tries to treat this issue, by exploiting the well-known denoising-autoencoder framework, the linear inverted bottlenecks, and the properties of the cosine similarity. We use a public dataset and conduct repeated experiments with different mismatching scenarios between the teacher and the student, reporting the mean and standard deviation of the metrics of our method and another, state-of-the-art method that is used as a baseline. Our results show that with the proposed method, the student can perform better and can also retain greater mismatching conditions compared to the teacher.
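The property of cosine similarity that makes it attractive for aligning latent representations is its invariance to vector magnitude: only direction is matched. The sketch below shows a cosine-distance alignment loss between student and teacher latents. It is a simplified illustration that assumes both latents already share the same length; the paper's actual architecture, including how mismatched dimensionalities are handled, is not reproduced, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b, eps=1e-12):
    """1 - cos(a, b); 0 for parallel vectors, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm_a * norm_b, eps)


def alignment_loss(student_latents, teacher_latents):
    """Mean cosine distance over paired latent vectors (one pair per frame)."""
    pairs = list(zip(student_latents, teacher_latents))
    return sum(cosine_distance(s, t) for s, t in pairs) / len(pairs)


# Toy latents (assumed values): the first pair differs only in scale.
student = [[1.0, 2.0, 3.0], [1.0, 0.0, 0.0]]
teacher = [[2.0, 4.0, 6.0], [0.0, 1.0, 0.0]]
print(alignment_loss(student, teacher))
```

Because a rescaled copy of the teacher's latent incurs zero loss, the student is not forced to reproduce the teacher's activation magnitudes, only their direction.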

[LG-13] Wasserstein Convergence of Score-based Generative Models under Semiconvexity and Discontinuous Gradients

Link: https://arxiv.org/abs/2505.03432
Authors: Stefano Bruno,Sotirios Sabanis
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)

Abstract:Score-based Generative Models (SGMs) approximate a data distribution by perturbing it with Gaussian noise and subsequently denoising it via a learned reverse diffusion process. These models excel at modeling complex data distributions and generating diverse samples, achieving state-of-the-art performance across domains such as computer vision, audio generation, reinforcement learning, and computational biology. Despite their empirical success, existing Wasserstein-2 convergence analyses typically assume strong regularity conditions, such as smoothness or strict log-concavity of the data distribution, that are rarely satisfied in practice. In this work, we establish the first non-asymptotic Wasserstein-2 convergence guarantees for SGMs targeting semiconvex distributions with potentially discontinuous gradients. Our upper bounds are explicit and sharp in key parameters, achieving optimal dependence of O(\sqrt{d}) on the data dimension d and convergence rate of order one. The framework accommodates a wide class of practically relevant distributions, including symmetric modified half-normal distributions, Gaussian mixtures, double-well potentials, and elastic net potentials. By leveraging semiconvexity without requiring smoothness assumptions on the potential such as differentiability, our results substantially broaden the theoretical foundations of SGMs, bridging the gap between empirical success and rigorous guarantees in non-smooth, complex data regimes.

[LG-14] Knowledge Augmented Complex Problem Solving with Large Language Models : A Survey

Link: https://arxiv.org/abs/2505.03418
Authors: Da Zheng,Lun Du,Junwei Su,Yuchen Tian,Yuqi Zhu,Jintian Zhang,Lanning Wei,Ningyu Zhang,Huajun Chen
Categories: Machine Learning (cs.LG)

Abstract:Problem-solving has been a fundamental driver of human progress in numerous domains. With advancements in artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools capable of tackling complex problems across diverse domains. Unlike traditional computational systems, LLMs combine raw computational power with an approximation of human reasoning, allowing them to generate solutions, make inferences, and even leverage external computational tools. However, applying LLMs to real-world problem-solving presents significant challenges, including multi-step reasoning, domain knowledge integration, and result verification. This survey explores the capabilities and limitations of LLMs in complex problem-solving, examining techniques including Chain-of-Thought (CoT) reasoning, knowledge augmentation, and various LLM-based and tool-based verification techniques. Additionally, we highlight domain-specific challenges in various domains, such as software engineering, mathematical reasoning and proving, data analysis and modeling, and scientific research. The paper further discusses the fundamental limitations of the current LLM solutions and the future directions of LLM-based complex problem solving from the perspective of multi-step reasoning, domain knowledge integration and result verification.

[LG-15] Prediction Models That Learn to Avoid Missing Values

Link: https://arxiv.org/abs/2505.03393
Authors: Lena Stempfle,Anton Matsson,Newton Mwai,Fredrik D. Johansson
Categories: Machine Learning (cs.LG)

Abstract:Handling missing values at test time is challenging for machine learning models, especially when aiming for both high accuracy and interpretability. Established approaches often add bias through imputation or excessive model complexity via missingness indicators. Moreover, either method can obscure interpretability, making it harder to understand how the model utilizes the observed variables in predictions. We propose missingness-avoiding (MA) machine learning, a general framework for training models to rarely require the values of missing (or imputed) features at test time. We create tailored MA learning algorithms for decision trees, tree ensembles, and sparse linear models by incorporating classifier-specific regularization terms in their learning objectives. The tree-based models leverage contextual missingness by reducing reliance on missing values based on the observed context. Experiments on real-world datasets demonstrate that MA-DT, MA-LASSO, MA-RF, and MA-GBT effectively reduce the reliance on features with missing values while maintaining predictive performance competitive with their unregularized counterparts. This shows that our framework gives practitioners a powerful tool to maintain interpretability in predictions with test-time missing values.

[LG-16] Concept Factorization via Self-Representation and Adaptive Graph Structure Learning

Link: https://arxiv.org/abs/2505.03390
Authors: Zhengqin Yang,Di Wu,Jia Chen,Xin Luo
Categories: Machine Learning (cs.LG)

Abstract:Concept Factorization (CF) models have attracted widespread attention due to their excellent performance in data clustering. In recent years, many variant models based on CF have achieved great success in clustering by taking into account the internal geometric manifold structure of the dataset and using graph regularization techniques. However, their clustering performance depends greatly on the construction of the initial graph structure. In order to enable adaptive learning of the graph structure of the data, we propose a Concept Factorization Based on Self-Representation and Adaptive Graph Structure Learning (CFSRAG) Model. CFSRAG learns the affinity relationship between data through a self-representation method, and uses the learned affinity matrix to implement dynamic graph regularization constraints, thereby ensuring dynamic learning of the internal geometric structure of the data. Finally, we give the CFSRAG update rule and convergence analysis, and conduct comparative experiments on four real datasets. The results show that our model outperforms other state-of-the-art models.

[LG-17] Improving Omics-Based Classification: The Role of Feature Selection and Synthetic Data Generation

Link: https://arxiv.org/abs/2505.03387
Authors: Diego Perazzolo,Pietro Fanton,Ilaria Barison,Marny Fedrigo,Annalisa Angelini,Chiara Castellani,Enrico Grisan
Categories: Machine Learning (cs.LG)
Comments: Paper accepted at the 47th Annual International Conference IEEE EMBC 2025 (Engineering in Medicine and Biology Society), Copenhagen, Denmark

Abstract:Given the increasing complexity of omics datasets, a key challenge is not only improving classification performance but also enhancing the transparency and reliability of model decisions. Effective model performance and feature selection are fundamental for explainability and reliability. In many cases, high-dimensional omics datasets suffer from a limited number of samples due to clinical constraints, patient conditions, phenotype rarity and other conditions. Current omics-based classification models often suffer from narrow interpretability, making it difficult to discern meaningful insights where trust and reproducibility are critical. This study presents a machine learning based classification framework that integrates feature selection with data augmentation techniques to achieve high standard classification accuracy while ensuring better interpretability. Using the publicly available dataset (E-MTAB-8026), we explore a bootstrap analysis in six binary classification scenarios to evaluate the proposed model’s behaviour. We show that the proposed pipeline yields cross-validated performance on a small dataset that is conserved when the trained classifier is applied to a larger test set. Our findings emphasize the fundamental balance between accuracy and feature selection, highlighting the positive effect of introducing synthetic data for better generalization, even in scenarios with very limited sample availability.

[LG-18] Physics-informed neural network estimation of active material properties in time-dependent cardiac biomechanical models

Link: https://arxiv.org/abs/2505.03382
Authors: Matthias Höfler,Francesco Regazzoni,Stefano Pagani,Elias Karabelas,Christoph Augustin,Gundolf Haase,Gernot Plank,Federica Caforio
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Active stress models in cardiac biomechanics account for the mechanical deformation caused by muscle activity, thus providing a link between the electrophysiological and mechanical properties of the tissue. The accurate assessment of active stress parameters is fundamental for a precise understanding of myocardial function but remains difficult to achieve in a clinical setting, especially when only displacement and strain data from medical imaging modalities are available. This work investigates, through an in-silico study, the application of physics-informed neural networks (PINNs) for inferring active contractility parameters in time-dependent cardiac biomechanical models from these types of imaging data. In particular, by parametrising the sought state and parameter field with two neural networks, respectively, and formulating an energy minimisation problem to search for the optimal network parameters, we are able to reconstruct in various settings active stress fields in the presence of noise and with a high spatial resolution. To this end, we also advance the vanilla PINN learning algorithm with the use of adaptive weighting schemes, ad-hoc regularisation strategies, Fourier features, and suitable network architectures. In addition, we thoroughly analyse the influence of the loss weights in the reconstruction of active stress parameters. Finally, we apply the method to the characterisation of tissue inhomogeneities and detection of fibrotic scars in myocardial tissue. This approach opens a new pathway to significantly improve the diagnosis, treatment planning, and management of heart conditions associated with cardiac fibrosis.

[LG-19] Geospatial Mechanistic Interpretability of Large Language Models

Link: https://arxiv.org/abs/2505.03368
Authors: Stef De Sabbata,Stefano Mizzaro,Kevin Roitero
Categories: Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) have demonstrated unprecedented capabilities across various natural language processing tasks. Their ability to process and generate viable text and code has made them ubiquitous in many fields, while their deployment as knowledge bases and “reasoning” tools remains an area of ongoing research. In geography, a growing body of literature has been focusing on evaluating LLMs’ geographical knowledge and their ability to perform spatial reasoning. However, very little is still known about the internal functioning of these models, especially about how they process geographical information. In this chapter, we establish a novel framework for the study of geospatial mechanistic interpretability - using spatial analysis to reverse engineer how LLMs handle geographical information. Our aim is to advance our understanding of the internal representations that these complex models generate while processing geographical information - what one might call “how LLMs think about geographic information” if such phrasing was not an undue anthropomorphism. We first outline the use of probing in revealing internal structures within LLMs. We then introduce the field of mechanistic interpretability, discussing the superposition hypothesis and the role of sparse autoencoders in disentangling polysemantic internal representations of LLMs into more interpretable, monosemantic features. In our experiments, we use spatial autocorrelation to show how features obtained for placenames display spatial patterns related to their geographic location and can thus be interpreted geospatially, providing insights into how these models process geographical information. We conclude by discussing how our framework can help shape the study and use of foundation models in geography. 

[LG-20] RIFT: Closed-Loop RL Fine-Tuning for Realistic and Controllable Traffic Simulation

Link: https://arxiv.org/abs/2505.03344
Authors: Keyu Chen,Wenchao Sun,Hao Cheng,Sifa Zheng
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Achieving both realism and controllability in interactive closed-loop traffic simulation remains a key challenge in autonomous driving. Data-driven simulation methods reproduce realistic trajectories but suffer from covariate shift in closed-loop deployment, compounded by simplified dynamics models that further reduce reliability. Conversely, physics-based simulation methods enhance reliable and controllable closed-loop interactions but often lack expert demonstrations, compromising realism. To address these challenges, we introduce a dual-stage AV-centered simulation framework that conducts open-loop imitation learning pre-training in a data-driven simulator to capture trajectory-level realism and multimodality, followed by closed-loop reinforcement learning fine-tuning in a physics-based simulator to enhance controllability and mitigate covariate shift. In the fine-tuning stage, we propose RIFT, a simple yet effective closed-loop RL fine-tuning strategy that preserves the trajectory-level multimodality through a GRPO-style group-relative advantage formulation, while enhancing controllability and training stability by replacing KL regularization with the dual-clip mechanism. Extensive experiments demonstrate that RIFT significantly improves the realism and controllability of generated traffic scenarios, providing a robust platform for evaluating autonomous vehicle performance in diverse and interactive scenarios.
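The two ingredients named in the abstract, a group-relative advantage and a dual-clip surrogate, can be illustrated with a minimal sketch. This is not the RIFT authors' code: the function names, the normalization by the group standard deviation, and the values `clip_eps=0.2` and `dual_clip=3.0` are assumptions chosen for illustration, following the commonly described GRPO and dual-clip PPO formulations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (GRPO-style).

    Each rollout's advantage is its reward relative to the group mean,
    scaled by the group standard deviation (assumed normalization).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dual_clip_surrogate(ratio, advantage, clip_eps=0.2, dual_clip=3.0):
    """Per-sample dual-clip PPO surrogate (a quantity to be maximized).

    The standard clipped term is min(ratio * A, clip(ratio) * A).
    For negative advantages the result is additionally bounded from
    below by dual_clip * A, which limits the penalty when the
    probability ratio becomes very large.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    standard = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0:
        return max(standard, dual_clip * advantage)
    return standard


# Hypothetical rewards for three rollouts sampled from the same scenario.
rewards = [1.0, 2.0, 3.0]
advantages = group_relative_advantages(rewards)
```

In this sketch the advantages within a group sum to approximately zero, so rollouts are only ranked against their siblings, and the lower bound in `dual_clip_surrogate` takes the place of an explicit KL penalty as the stabilizing term.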

[LG-21] Unraveling the Rainbow: can value-based methods schedule?

Link: https://arxiv.org/abs/2505.03323
Authors: Arthur Corrêa,Alexandre Jesus,Cristóvão Silva,Samuel Moniz
Categories: Machine Learning (cs.LG)

Abstract:Recently, deep reinforcement learning has emerged as a promising approach for solving complex combinatorial optimization problems. Broadly, deep reinforcement learning methods fall into two categories: policy-based and value-based. While value-based approaches have achieved notable success in domains such as the Arcade Learning Environment, the combinatorial optimization community has predominantly favored policy-based methods, often overlooking the potential of value-based algorithms. In this work, we conduct a comprehensive empirical evaluation of value-based algorithms, including the deep q-network and several of its advanced extensions, within the context of two complex combinatorial problems: the job-shop and the flexible job-shop scheduling problems, two fundamental challenges with multiple industrial applications. Our results challenge the assumption that policy-based methods are inherently superior for combinatorial optimization. We show that several value-based approaches can match or even outperform the widely adopted proximal policy optimization algorithm, suggesting that value-based strategies deserve greater attention from the combinatorial optimization community. Our code is openly available at: this https URL.

[LG-22] MDPs with a State Sensing Cost

Link: https://arxiv.org/abs/2505.03280
Authors: Vansh Kapoor,Jayakrishnan Nair
Categories: Machine Learning (cs.LG)
Comments: 14 pages

Abstract:In many practical sequential decision-making problems, tracking the state of the environment incurs a sensing/communication/computation cost. In these settings, the agent’s interaction with its environment includes the additional component of deciding when to sense the state, in a manner that balances the value associated with optimal (state-specific) actions and the cost of sensing. We formulate this as an expected discounted cost Markov Decision Process (MDP), wherein the agent incurs an additional cost for sensing its next state, but has the option to take actions while remaining ‘blind’ to the system state. We pose this problem as a classical discounted cost MDP with an expanded (countably infinite) state space. While computing the optimal policy for this MDP is intractable in general, we bound the sub-optimality gap associated with optimal policies in a restricted class, where the number of consecutive non-sensing (a.k.a., blind) actions is capped. We also design a computationally efficient heuristic algorithm based on policy improvement, which in practice performs close to the optimal policy. Finally, we benchmark against the state of the art via a numerical case study.

[LG-23] Joint Resource Management for Energy-efficient UAV-assisted SWIPT-MEC: A Deep Reinforcement Learning Approach

Link: https://arxiv.org/abs/2505.03230
Authors: Yue Chen,Hui Kang,Jiahui Li,Geng Su,Boxiong Wang,Jiacheng Wang,Cong Liang,Shuang Liang,Dusit Niyato
Categories: Machine Learning (cs.LG)

Abstract:The integration of simultaneous wireless information and power transfer (SWIPT) technology in 6G Internet of Things (IoT) networks faces significant challenges in remote areas and disaster scenarios where ground infrastructure is unavailable. This paper proposes a novel unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) system enhanced by directional antennas to provide both computational resources and energy support for ground IoT terminals. However, such systems require multiple trade-off policies to balance UAV energy consumption, terminal battery levels, and computational resource allocation under various constraints, including limited UAV battery capacity, non-linear energy harvesting characteristics, and dynamic task arrivals. To address these challenges comprehensively, we formulate a bi-objective optimization problem that simultaneously considers system energy efficiency and terminal battery sustainability. We then reformulate this non-convex problem with a hybrid solution space as a Markov decision process (MDP) and propose an improved soft actor-critic (SAC) algorithm with an action simplification mechanism to enhance its convergence and generalization capabilities. Simulation results have demonstrated that our proposed approach outperforms various baselines in different scenarios, achieving efficient energy management while maintaining high computational performance. Furthermore, our method shows strong generalization ability across different scenarios, particularly in complex environments, validating the effectiveness of our designed boundary penalty and charging reward mechanisms.

[LG-24] DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning

Link: https://arxiv.org/abs/2505.03209
Authors: Borui Wang,Kathleen McKeown,Rex Ying
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement learning from expert demonstrations has long remained a challenging research problem, and existing state-of-the-art methods using behavioral cloning plus further RL training often suffer from poor generalization, low sample efficiency, and poor model interpretability. Inspired by the strong reasoning abilities of large language models (LLMs), we propose a novel strategy-based reinforcement learning framework integrated with LLMs called DYnamic STrategy Induction with Llms for reinforcement learning (DYSTIL) to overcome these limitations. DYSTIL dynamically queries a strategy-generating LLM to induce textual strategies based on advantage estimations and expert demonstrations, and gradually internalizes induced strategies into the RL agent through policy optimization to improve its performance through boosting policy generalization and enhancing sample efficiency. It also provides a direct textual channel to observe and interpret the evolution of the policy’s underlying strategies during training. We test DYSTIL over challenging RL environments from Minigrid and BabyAI, and empirically demonstrate that DYSTIL significantly outperforms state-of-the-art baseline methods by 17.75% in average success rate while also enjoying higher sample efficiency during the learning process.

[LG-25] Partial Label Clustering IJCAI2025

Link: https://arxiv.org/abs/2505.03207
Authors: Yutong Xie,Fuchao Yang,Yuheng Jia
Categories: Machine Learning (cs.LG)
Comments: Accepted by IJCAI 2025

Abstract:Partial label learning (PLL) is a significant weakly supervised learning framework, where each training example corresponds to a set of candidate labels and only one label is the ground-truth label. For the first time, this paper investigates the partial label clustering problem, which takes advantage of the limited available partial labels to improve the clustering performance. Specifically, we first construct a weight matrix of examples based on their relationships in the feature space and disambiguate the candidate labels to estimate the ground-truth label based on the weight matrix. Then, we construct a set of must-link and cannot-link constraints based on the disambiguation results. Moreover, we propagate the initial must-link and cannot-link constraints based on an adversarial prior promoted dual-graph learning approach. Finally, we integrate weight matrix construction, label disambiguation, and pairwise constraint propagation into a joint model to achieve mutual enhancement. We also theoretically prove that a better disambiguated label matrix can help improve clustering performance. Comprehensive experiments demonstrate that our method achieves superior performance compared with state-of-the-art constrained clustering methods, and outperforms PLL and semi-supervised PLL methods when only limited samples are annotated. The code is publicly available at this https URL.

[LG-26] Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights

Link: https://arxiv.org/abs/2505.03205
Authors: Zhaiming Shen,Alex Havrilla,Rongjie Lai,Alexander Cloninger,Wenjing Liao
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST)

Abstract:Transformers serve as the foundational architecture for large language and video generation models, such as GPT, BERT, SORA and their successors. Empirical studies have demonstrated that real-world data and learning tasks exhibit low-dimensional structures, along with some noise or measurement error. The performance of transformers tends to depend on the intrinsic dimension of the data/tasks, though their theoretical understanding remains largely unexplored. This work establishes a theoretical foundation by analyzing the performance of transformers for regression tasks involving noisy input data on a manifold. Specifically, the input data are in a tubular neighborhood of a manifold, while the ground truth function depends on the projection of the noisy data onto the manifold. We prove approximation and generalization errors which crucially depend on the intrinsic dimension of the manifold. Our results demonstrate that transformers can leverage low-complexity structures in learning tasks even when the input data are perturbed by high-dimensional noise. Our novel proof technique constructs representations of basic arithmetic operations by transformers, which may hold independent interest.

[LG-27] Convergence Of Consistency Model With Multistep Sampling Under General Data Assumptions

Link: https://arxiv.org/abs/2505.03194
Authors: Yiding Chen,Yiyi Zhang,Owen Oertell,Wen Sun
Categories: Machine Learning (cs.LG)

Abstract:Diffusion models have achieved remarkable success in data generation tasks across various domains. However, the iterative sampling process is computationally expensive. Consistency models are proposed to learn consistency functions to map from noise to data directly, which allows one-step fast data generation and multistep sampling to improve sample quality. In this paper, we study the convergence of consistency models when the self-consistency property holds approximately under the training distribution. Our analysis requires only mild data assumptions and applies to a family of forward processes. When the target data distribution has bounded support or has tails that decay sufficiently fast, we show that the samples generated by the consistency model are close to the target distribution in Wasserstein distance; when the target distribution satisfies some smoothness assumption, we show that with an additional perturbation step for smoothing, the generated samples are close to the target distribution in total variation distance. We provide two case studies with commonly chosen forward processes to demonstrate the benefit of multistep sampling.

[LG-28] VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making ICLR2025

Link: https://arxiv.org/abs/2505.03181
Authors: Jake Grigsby,Yuke Zhu,Michael Ryoo,Juan Carlos Niebles
Categories: Machine Learning (cs.LG)
Comments: SSI-FM Workshop ICLR 2025

Abstract:Recent research looks to harness the general knowledge and reasoning of large language models (LLMs) into agents that accomplish user-specified goals in interactive environments. Vision-language models (VLMs) extend LLMs to multi-modal data and provide agents with the visual reasoning necessary for new applications in areas such as computer automation. However, agent tasks emphasize skills where accessible open-weight VLMs lag behind their LLM equivalents. For example, VLMs are less capable of following an environment’s strict output syntax requirements and are more focused on open-ended question answering. Overcoming these limitations requires supervised fine-tuning (SFT) on task-specific expert demonstrations. Our work approaches these challenges from an offline-to-online reinforcement learning (RL) perspective. RL lets us fine-tune VLMs to agent tasks while learning from the unsuccessful decisions of our own model or more capable (larger) models. We explore an off-policy RL solution that retains the stability and simplicity of the widely used SFT workflow while allowing our agent to self-improve and learn from low-quality datasets. We demonstrate this technique with two open-weight VLMs across three multi-modal agent domains.

[LG-29] RADE: Learning Risk-Adjustable Driving Environment via Multi-Agent Conditional Diffusion

Link: https://arxiv.org/abs/2505.03178
Authors: Jiawei Wang,Xintao Yan,Yao Mu,Haowei Sun,Zhong Cao,Henry X. Liu
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Generating safety-critical scenarios in high-fidelity simulations offers a promising and cost-effective approach for efficient testing of autonomous vehicles. Existing methods typically rely on manipulating a single vehicle’s trajectory through elaborately designed objectives to induce adversarial interactions, often at the cost of realism and scalability. In this work, we propose the Risk-Adjustable Driving Environment (RADE), a simulation framework that generates statistically realistic and risk-adjustable traffic scenes. Built upon a multi-agent diffusion architecture, RADE jointly models the behavior of all agents in the environment and conditions their trajectories on a surrogate risk measure. Unlike traditional adversarial methods, RADE learns risk-conditioned behaviors directly from data, preserving naturalistic multi-agent interactions with controllable risk levels. To ensure physical plausibility, we incorporate a tokenized dynamics check module that efficiently filters generated trajectories using a motion vocabulary. We validate RADE on the real-world rounD dataset, demonstrating that it preserves statistical realism across varying risk levels and naturally increases the likelihood of safety-critical events as the desired risk level increases. Our results highlight RADE’s potential as a scalable and realistic tool for AV safety evaluation.

[LG-30] Improving the Reproducibility of Deep Learning Software: An Initial Investigation through a Case Study Analysis

Link: https://arxiv.org/abs/2505.03165
Authors: Nikita Ravi,Abhinav Goel,James C. Davis,George K. Thiruvathukal
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:The field of deep learning has witnessed significant breakthroughs, spanning various applications, and fundamentally transforming current software capabilities. However, alongside these advancements, there have been increasing concerns about reproducing the results of these deep learning methods. This is significant because reproducibility is the foundation of reliability and validity in software development, particularly in the rapidly evolving domain of deep learning. The difficulty of reproducibility may arise for several reasons, including differences from the original execution environment, incompatible software libraries, proprietary data and source code, lack of transparency, and the stochastic nature of some software. A study conducted by the journal Nature reveals that more than 70% of researchers failed to reproduce other researchers’ experiments and over 50% failed to reproduce their own experiments. Irreproducibility of deep learning poses significant challenges for researchers and practitioners. To address these concerns, this paper presents a systematic approach to analyzing and improving the reproducibility of deep learning models by demonstrating these guidelines using a case study. We illustrate the patterns and anti-patterns involved with these guidelines for improving the reproducibility of deep learning models. These guidelines encompass establishing a methodology to replicate the original software environment, implementing end-to-end training and testing algorithms, disclosing architectural designs, and enhancing transparency in data processing and training pipelines. We also conduct a sensitivity analysis to understand the model performance across diverse conditions. By implementing these strategies, we aim to bridge the gap between research and practice, so that innovations in deep learning can be effectively reproduced and deployed within software.

[LG-31] Systematic Evaluation of Initial States and Exploration-Exploitation Strategies in PID Auto-Tuning: A Framework-Driven Approach Applied on Mobile Robots

Link: https://arxiv.org/abs/2505.03159
Authors: Zaid Ghazal,Ali Al-Bustami,Khouloud Gaaloul,Jaerock Kwon
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:PID controllers are widely used in control systems because of their simplicity and effectiveness. Although advanced optimization techniques such as Bayesian Optimization and Differential Evolution have been applied to address the challenges of automatic tuning of PID controllers, the influence of initial system states on convergence and the balance between exploration and exploitation remains underexplored. Moreover, evaluating this influence directly on real cyber-physical systems such as mobile robots is crucial for deriving realistic insights. In the present paper, a novel framework is introduced to evaluate the impact of systematically varying these factors on the PID auto-tuning processes that utilize Bayesian Optimization and Differential Evolution. Testing was conducted on two distinct PID-controlled robotic platforms, an omnidirectional robot and a differential drive mobile robot, to assess the effects on convergence rate, settling time, rise time, and overshoot percentage. As a result, the experimental outcomes yield evidence on the effects of the systematic variations, thereby providing an empirical basis for future research studies in the field.

[LG-32] Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation

Link: https://arxiv.org/abs/2505.03155
Authors: Max Qiushi Lin,Jincheng Mei,Matin Aghaei,Michael Lu,Bo Dai,Alekh Agarwal,Dale Schuurmans,Csaba Szepesvari,Sharan Vaswani
Categories: Machine Learning (cs.LG)
Comments: 75 pages

Abstract:Policy gradient (PG) methods have played an essential role in the empirical successes of reinforcement learning. In order to handle large state-action spaces, PG methods are typically used with function approximation. In this setting, the approximation error in modeling problem-dependent quantities is a key notion for characterizing the global convergence of PG methods. We focus on Softmax PG with linear function approximation (referred to as Lin-SPG) and demonstrate that the approximation error is irrelevant to the algorithm’s global convergence even for the stochastic bandit setting. Consequently, we first identify the necessary and sufficient conditions on the feature representation that can guarantee the asymptotic global convergence of Lin-SPG. Under these feature conditions, we prove that T iterations of Lin-SPG with a problem-specific learning rate result in an O(1/T) convergence to the optimal policy. Furthermore, we prove that Lin-SPG with any arbitrary constant learning rate can ensure asymptotic global convergence to the optimal policy.
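For readers unfamiliar with the setting, softmax policy gradient with linear features on a bandit can be sketched in a few lines. This is an illustrative sketch, not the paper's algorithm or analysis: it uses the exact (non-stochastic) gradient, and the one-hot features, the reward vector, the learning rate and the step count are all assumptions. The policy is pi(a) proportional to exp(x_a . theta), and the gradient of the expected reward J with respect to theta_j is the sum over arms of pi(a) * (r_a - J) * x_a[j].

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(features, theta):
    """Softmax policy whose logits are linear in the arm features."""
    logits = [sum(x_j * t_j for x_j, t_j in zip(x, theta)) for x in features]
    return softmax(logits)


def lin_spg(features, rewards, lr=0.5, steps=2000):
    """Exact-gradient softmax PG with linear function approximation."""
    dim = len(features[0])
    theta = [0.0] * dim
    for _ in range(steps):
        pi = policy(features, theta)
        # Expected reward under the current policy, used as the baseline.
        expected = sum(p * r for p, r in zip(pi, rewards))
        # dJ/dtheta_j = sum_a pi(a) * (r_a - J) * x_a[j]
        grad = [
            sum(p * (r - expected) * x[j] for p, r, x in zip(pi, rewards, features))
            for j in range(dim)
        ]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return policy(features, theta)


# Assumed toy problem: three arms with one-hot (tabular) features.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rewards = [1.0, 0.5, 0.0]
pi = lin_spg(features, rewards)
```

With one-hot features this reduces to the tabular softmax case, where the policy concentrates on the best arm. The abstract's contribution concerns which non-tabular feature matrices still guarantee that behaviour, which this sketch does not attempt to reproduce.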

[LG-33] Learn to Swim: Data-Driven LSTM Hydrodynamic Model for Quadruped Robot Gait Optimization ICRA

Link: https://arxiv.org/abs/2505.03146
Authors: Fei Han,Pengming Guo,Hao Chen,Weikun Li,Jingbo Ren,Naijun Liu,Ning Yang,Dixia Fan
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: This work has been accepted for publication in the IEEE International Conference on Robotics and Automation (ICRA) 2025. The final version will be available in IEEE Xplore (DOI to be assigned upon publication)

Abstract:This paper presents a Long Short-Term Memory network-based Fluid Experiment Data-Driven model (FED-LSTM) for predicting unsteady, nonlinear hydrodynamic forces on the underwater quadruped robot we constructed. Trained on experimental data from leg force and body drag tests conducted in both a recirculating water tank and a towing tank, FED-LSTM outperforms traditional Empirical Formulas (EF) commonly used for flow prediction over flat surfaces. The model demonstrates superior accuracy and adaptability in capturing complex fluid dynamics, particularly in straight-line and turning-gait optimizations via the NSGA-II algorithm. FED-LSTM reduces deflection errors during straight-line swimming and improves turn times without increasing the turning radius. Hardware experiments further validate the model’s precision and stability over EF. This approach provides a robust framework for enhancing the swimming performance of legged robots, laying the groundwork for future advances in underwater robotic locomotion.

[LG-34] Adversarial Sample Generation for Anomaly Detection in Industrial Control Systems

Link: https://arxiv.org/abs/2505.03120
Authors: Abdul Mustafa,Muhammad Talha Khan,Muhammad Azmi Umer,Zaki Masood,Chuadhry Mujeeb Ahmed
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted in the 1st Workshop on Modeling and Verification for Secure and Performant Cyber-Physical Systems in conjunction with Cyber-Physical Systems and Internet-of-Things Week, Irvine, USA, May 6-9, 2025

Abstract:Machine learning (ML)-based intrusion detection systems (IDS) are vulnerable to adversarial attacks. It is crucial for an IDS to learn to recognize adversarial examples before malicious entities exploit them. In this paper, we generated adversarial samples using the Jacobian Saliency Map Attack (JSMA). We validate the generalization and scalability of the adversarial samples to tackle a broad range of real attacks on Industrial Control Systems (ICS). We evaluated the impact by assessing multiple attacks generated using the proposed method. The model trained with adversarial samples detected attacks with 95% accuracy on real-world attack data not used during training. The study was conducted using an operational secure water treatment (SWaT) testbed.

[LG-35] Adaptive Thresholding for Multi-Label Classification via Global-Local Signal Fusion

Link: https://arxiv.org/abs/2505.03118
Authors: Dmytro Shamatrin
Categories: Machine Learning (cs.LG)

Abstract:Multi-label classification (MLC) requires predicting multiple labels per sample, often under heavy class imbalance and noisy conditions. Traditional approaches apply fixed thresholds or treat labels independently, overlooking context and global rarity. We introduce an adaptive thresholding mechanism that fuses global (IDF-based) and local (KNN-based) signals to produce per-label, per-instance thresholds. Instead of applying these as hard cutoffs, we treat them as differentiable penalties in the loss, providing smooth supervision and better calibration. Our architecture is lightweight, interpretable, and highly modular. On the AmazonCat-13K benchmark, it achieves a macro-F1 of 0.1712, substantially outperforming tree-based and pretrained transformer-based methods. We release full code for reproducibility and future extensions.

[LG-36] Plug-and-Play AMC: Context Is King in Training-Free Open-Set Modulation with LLMs

Link: https://arxiv.org/abs/2505.03112
Authors: Mohammad Rostami,Atik Faysal,Reihaneh Gh. Roshan,Huaxia Wang,Nikhil Muralidhar,Yu-Dong Yao
Categories: Machine Learning (cs.LG)

Abstract:Automatic Modulation Classification (AMC) is critical for efficient spectrum management and robust wireless communications. However, AMC remains challenging due to the complex interplay of signal interference and noise. In this work, we propose an innovative framework that integrates traditional signal processing techniques with Large-Language Models (LLMs) to address AMC. Our approach leverages higher-order statistics and cumulant estimation to convert quantitative signal features into structured natural language prompts. By incorporating exemplar contexts into these prompts, our method exploits the LLM’s inherent familiarity with classical signal processing, enabling effective one-shot classification without additional training or preprocessing (e.g., denoising). Experimental evaluations on synthetically generated datasets, spanning both noiseless and noisy conditions, demonstrate that our framework achieves competitive performance across diverse modulation schemes and Signal-to-Noise Ratios (SNRs). Moreover, our approach paves the way for robust foundation models in wireless communications across varying channel conditions, significantly reducing the expense associated with developing channel-specific models. This work lays the foundation for scalable, interpretable, and versatile signal classification systems in next-generation wireless networks. The source code is available at this https URL
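The idea of turning higher-order cumulants into a text prompt can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the choice of the fourth-order cumulants C40 and C42, the helper names `cumulant_features` and `build_prompt`, and the prompt wording are all assumptions, and the formulas are the standard zero-mean definitions C40 = M40 - 3*M20^2 and C42 = M42 - |M20|^2 - 2*M21^2.

```python
import cmath
import math


def cumulant_features(samples):
    """Estimate fourth-order cumulants of a zero-mean complex baseband signal."""
    n = len(samples)
    m20 = sum(s ** 2 for s in samples) / n        # E[x^2]
    m21 = sum(abs(s) ** 2 for s in samples) / n   # E[|x|^2]
    m40 = sum(s ** 4 for s in samples) / n        # E[x^4]
    m42 = sum(abs(s) ** 4 for s in samples) / n   # E[|x|^4]
    c40 = m40 - 3 * m20 ** 2
    c42 = m42 - abs(m20) ** 2 - 2 * m21 ** 2
    return {"|C40|": abs(c40), "C42": c42}


def build_prompt(features):
    """Render the statistics as a natural-language prompt (hypothetical wording)."""
    stats = "\n".join(f"{name} = {value:.3f}" for name, value in features.items())
    return (
        "Signal statistics:\n"
        + stats
        + "\nWhich modulation scheme is most consistent with these values?"
    )


# Noiseless, unit-power example constellations (assumed test signals).
bpsk = [1, -1, 1, -1]
qpsk = [cmath.exp(1j * (math.pi / 4 + k * math.pi / 2)) for k in range(4)]

bpsk_features = cumulant_features(bpsk)
qpsk_features = cumulant_features(qpsk)
prompt = build_prompt(qpsk_features)
```

For these noiseless unit-power constellations the statistics differ between BPSK and QPSK, which is the kind of separation an in-context exemplar could expose to the language model. Noise and unequal power would require normalization that this sketch omits.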

[LG-37] Deep Learning in Renewable Energy Forecasting: A Cross-Dataset Evaluation of Temporal and Spatial Models

Link: https://arxiv.org/abs/2505.03109
Authors: Lutfu Sua,Haibo Wang,Jun Huang
Categories: Machine Learning (cs.LG); General Economics (econ.GN)
Comments: 34 pages, 16 figures

Abstract:The unpredictability of renewable energy sources, coupled with the complexity of the methods used for various purposes in this area, calls for the development of robust methods such as deep learning (DL) models within the renewable energy domain. Given the nonlinear relationships among variables in renewable energy datasets, DL models are preferred over traditional machine learning (ML) models because they can effectively capture and model complex interactions between variables. This research aims to identify the factors responsible for the accuracy of DL techniques, such as sampling, stationarity, linearity, and hyperparameter optimization for different algorithms. The proposed DL framework compares various methods and alternative training/test ratios. Seven ML methods, namely Long Short-Term Memory (LSTM), Stacked LSTM, Convolutional Neural Network (CNN), CNN-LSTM, Deep Neural Network (DNN), Multilayer Perceptron (MLP), and Encoder-Decoder (ED), were evaluated on two different datasets. The first dataset contains weather and power generation data. It encompasses two distinct datasets, hourly energy demand data and hourly weather data in Spain, while the second dataset includes power output generated by photovoltaic panels at 12 locations. This study deploys regularization approaches, including early stopping, neuron dropping, and L2 regularization, to reduce the overfitting problem associated with DL models. The LSTM and MLP models show superior performance. Their validation data exhibit exceptionally low root mean square error values.

[LG-38] Adversarial Attacks in Multimodal Systems: A Practitioners Survey

Link: https://arxiv.org/abs/2505.03084
Authors: Shashank Kapoor, Sanjay Surendranath Girija, Lakshit Arora, Dipen Pradhan, Ankit Shetgaonkar, Aman Raj
Categories: Machine Learning (cs.LG)
Comments: Accepted in IEEE COMPSAC 2025

Abstract:The introduction of multimodal models is a huge step forward in Artificial Intelligence. A single model is trained to understand multiple modalities: text, image, video, and audio. Open-source multimodal models have made these breakthroughs more accessible. However, considering the vast landscape of adversarial attacks across these modalities, these models also inherit vulnerabilities of all the modalities, and ultimately, the adversarial threat amplifies. While broad research is available on possible attacks within or across these modalities, a practitioner-focused view that outlines attack types remains absent in the multimodal world. As more Machine Learning Practitioners adopt, fine-tune, and deploy open-source models in real-world applications, it’s crucial that they can view the threat landscape and take the preventive actions necessary. This paper addresses the gap by surveying adversarial attacks targeting all four modalities: text, image, video, and audio. This survey provides a view of the adversarial attack landscape and presents how multimodal adversarial threats have evolved. To the best of our knowledge, this survey is the first comprehensive summarization of the threat landscape in the multimodal world.

[LG-39] Robustly Invertible Nonlinear Dynamics and the BiLipREN: Contracting Neural Models with Contracting Inverses

Link: https://arxiv.org/abs/2505.03069
Authors: Yurui Zhang, Ruigang Wang, Ian R. Manchester
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:We study the invertibility of nonlinear dynamical systems from the perspective of contraction and incremental stability analysis and propose a new invertible recurrent neural model: the BiLipREN. In particular, we consider a nonlinear state space model to be robustly invertible if an inverse exists with a state space realisation, and both the forward model and its inverse are contracting, i.e. incrementally exponentially stable, and Lipschitz, i.e. have bounded incremental gain. This property of bi-Lipschitzness implies both robustness in the sense of sensitivity to input perturbations, as well as robust distinguishability of different inputs from their corresponding outputs, i.e. the inverse model robustly reconstructs the input sequence despite small perturbations to the initial conditions and measured output. Building on this foundation, we propose a parameterization of neural dynamic models: bi-Lipschitz recurrent equilibrium networks (biLipREN), which are robustly invertible by construction. Moreover, biLipRENs can be composed with orthogonal linear systems to construct more general bi-Lipschitz dynamic models, e.g., a nonlinear analogue of minimum-phase/all-pass (inner/outer) factorization. We illustrate the utility of our proposed approach with numerical examples.
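The bi-Lipschitz property mentioned in this abstract can be written out as a two-sided incremental bound. The display below is a sketch in assumed notation (two input sequences $u, u'$ with outputs $y, y'$ from the same initial condition, a truncated $\ell_2$ norm $\|\cdot\|_T$, and constants $0 < \mu \le \nu$); the paper's exact definition may differ.

```latex
\[
  \mu \, \| u - u' \|_T \;\le\; \| y - y' \|_T \;\le\; \nu \, \| u - u' \|_T
  \qquad \text{for all horizons } T .
\]
% Upper bound: the forward model is Lipschitz with constant \nu.
% Lower bound: distinct inputs stay distinguishable from their outputs,
% i.e. the inverse map y \mapsto u is Lipschitz with constant 1/\mu.
```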

[LG-40] 34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery

Link: https://arxiv.org/abs/2505.03049
Authors: Yoel Zimmermann, Adib Bazgir, Alexander Al-Feghali, Mehrad Ansari, L. Catherine Brinson, Yuan Chiang, Defne Circi, Min-Hsueh Chiu, Nathan Daelman, Matthew L. Evans, Abhijeet S. Gangan, Janine George, Hassan Harb, Ghazal Khalighinejad, Sartaaj Takrim Khan, Sascha Klawohn, Magdalena Lederbauer, Soroush Mahjoubi, Bernadette Mohr, Seyed Mohamad Moosavi, Aakash Naik, Aleyna Beste Ozhan, Dieter Plessers, Aritra Roy, Fabian Schöppach, Philippe Schwaller, Carla Terboven, Katharina Ueltzen, Shang Zhu, Jan Janssen, Calvin Li, Ian Foster, Ben Blaiszik
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: arXiv admin note: substantial text overlap with arXiv:2411.15221

Abstract:Large Language Models (LLMs) are reshaping many aspects of materials science and chemistry research, enabling advances in molecular property prediction, materials design, scientific automation, knowledge extraction, and more. Recent developments demonstrate that the latest class of models are able to integrate structured and unstructured data, assist in hypothesis generation, and streamline research workflows. To explore the frontier of LLM capabilities across the research lifecycle, we review applications of LLMs through 34 total projects developed during the second annual Large Language Model Hackathon for Applications in Materials Science and Chemistry, a global hybrid event. These projects spanned seven key research areas: (1) molecular and material property prediction, (2) molecular and material design, (3) automation and novel interfaces, (4) scientific communication and education, (5) research data management and automation, (6) hypothesis generation and evaluation, and (7) knowledge extraction and reasoning from the scientific literature. Collectively, these applications illustrate how LLMs serve as versatile predictive models, platforms for rapid prototyping of domain-specific tools, and much more. In particular, improvements in both open source and proprietary LLM performance through the addition of reasoning, additional training data, and new techniques have expanded effectiveness, particularly in low-data environments and interdisciplinary research. As LLMs continue to improve, their integration into scientific workflows presents both new opportunities and new challenges, requiring ongoing exploration, continued refinement, and further research to address reliability, interpretability, and reproducibility.

[LG-41] A New Perspective To Understanding Multi-resolution Hash Encoding For Neural Fields

Link: https://arxiv.org/abs/2505.03042
Authors: Steven Tin Sui Luo
Categories: Machine Learning (cs.LG)

Abstract:Instant-NGP has been the state-of-the-art architecture of neural fields in recent years. Its incredible signal-fitting capabilities are generally attributed to its multi-resolution hash grid structure and have been used and improved in numerous following works. However, it is unclear how and why such a hash grid structure improves the capabilities of a neural network by such great margins. A lack of principled understanding of the hash grid also implies that the large set of hyperparameters accompanying Instant-NGP could only be tuned empirically without much heuristics. To provide an intuitive explanation of the working principle of the hash grid, we propose a novel perspective, namely domain manipulation. This perspective provides a ground-up explanation of how the feature grid learns the target signal and increases the expressivity of the neural field by artificially creating multiples of pre-existing linear segments. We conducted numerous experiments on carefully constructed 1-dimensional signals to support our claims empirically and aid our illustrations. While our analysis mainly focuses on 1-dimensional signals, we show that the idea is generalizable to higher dimensions.
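For readers unfamiliar with the structure being analyzed, here is a minimal 1-dimensional sketch of a multi-resolution feature grid with linear interpolation. It is an illustration under stated assumptions, not the paper's code: the function names, the scalar features per level, the growth factor, and the use of modulo indexing as a stand-in for the spatial hash are all choices made here.

```python
import random


def make_grids(num_levels=4, base_res=4, growth=2, table_size=16, seed=0):
    """Build one small feature table per resolution level (hypothetical sizes)."""
    rng = random.Random(seed)
    grids = []
    for level in range(num_levels):
        res = base_res * growth ** level
        # Coarse levels store one entry per vertex; fine levels share entries.
        size = min(res + 1, table_size)
        table = [rng.uniform(-1e-2, 1e-2) for _ in range(size)]
        grids.append((res, table))
    return grids


def encode(x, grids):
    """Encode x in [0, 1] as one linearly interpolated feature per level."""
    features = []
    for res, table in grids:
        pos = x * res
        i0 = min(int(pos), res - 1)      # left vertex of the enclosing cell
        w = pos - i0                     # interpolation weight in [0, 1]
        left = table[i0 % len(table)]    # modulo stands in for the hash
        right = table[(i0 + 1) % len(table)]
        features.append((1 - w) * left + w * right)
    return features


grids = make_grids()
features = encode(0.3, grids)
```

The resulting feature vector would then be fed to a small MLP; in this sketch each level contributes a piecewise-linear function of `x`, with finer levels reusing table entries once their vertex count exceeds `table_size`.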

[LG-42] More Optimal Fractional-Order Stochastic Gradient Descent for Non-Convex Optimization Problems

Link: https://arxiv.org/abs/2505.02985
Authors: Mohammad Partohaghighi, Roummel Marcia, YangQuan Chen
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 8 pages, submitted to IEEE CDC 2025. arXiv admin note: substantial text overlap with arXiv:2503.13764

Abstract:Fractional-order stochastic gradient descent (FOSGD) leverages fractional exponents to capture long-memory effects in optimization. However, its utility is often limited by the difficulty of tuning and stabilizing these exponents. We propose 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), which integrates the Two-Scale Effective Dimension (2SED) algorithm with FOSGD to adapt the fractional exponent in a data-driven manner. By tracking model sensitivity and effective dimensionality, 2SEDFOSGD dynamically modulates the exponent to mitigate oscillations and hasten convergence. Theoretically, for nonconvex optimization problems, this approach preserves the advantages of fractional memory without the sluggish or unstable behavior observed in naïve fractional SGD. Empirical evaluations in Gaussian and $\alpha$-stable noise scenarios using an autoregressive (AR) model highlight faster convergence and more robust parameter estimates compared to baseline methods, underscoring the potential of dimension-aware fractional techniques for advanced modeling and estimation tasks.

[LG-43] Physics-Learning AI Datamodel (PLAID) datasets: a collection of physics simulations for machine learning

Link: https://arxiv.org/abs/2505.02974
Authors: Fabien Casenave, Xavier Roynard, Brian Staber, Nissrine Akkari, William Piat, Michele Alessandro Bucci, Abbas Kabalan, Xuan Minh Vuong Nguyen, Luca Saverio, Raphaël Carpintero Perez, Anthony Kalaydjian, Samy Fouché, Thierry Gonon, Ghassan Najjar, Emmanuel Menier, Matthieu Nastorg, Christian Rey
Categories: Machine Learning (cs.LG)

Abstract:Machine learning-based surrogate models have emerged as a powerful tool to accelerate simulation-driven scientific workflows. However, their widespread adoption is hindered by the lack of large-scale, diverse, and standardized datasets tailored to physics-based simulations. While existing initiatives provide valuable contributions, many are limited in scope, focusing on specific physics domains, relying on fragmented tooling, or adhering to overly simplistic datamodels that restrict generalization. To address these limitations, we introduce PLAID (Physics-Learning AI Datamodel), a flexible and extensible framework for representing and sharing datasets of physics simulations. PLAID defines a unified standard for describing simulation data and is accompanied by a library for creating, reading, and manipulating complex datasets across a wide range of physical use cases (this http URL). We release six carefully crafted datasets under the PLAID standard, covering structural mechanics and computational fluid dynamics, and provide baseline benchmarks using representative learning methods. Benchmarking tools are made available on Hugging Face, enabling direct participation by the community and contribution to ongoing evaluation efforts (this http URL).

[LG-44] Single-Sample and Robust Online Resource Allocation

Link: https://arxiv.org/abs/2505.02963
Authors: Rohan Ghuge, Sahil Singla, Yifan Wang
Categories: Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: Full version of STOC 2025 paper

Abstract:Online Resource Allocation is a central problem in many areas of Computer Science, Operations Research, and Economics. In this problem, we sequentially receive $n$ stochastic requests for $m$ kinds of shared resources, where each request can be satisfied in multiple ways, consuming different amounts of resources and generating different values. The goal is to achieve a $(1-\epsilon)$-approximation to the hindsight optimum, where $\epsilon > 0$ is a small constant, assuming each resource has a large budget. In this paper, we investigate the learnability and robustness of online resource allocation. Our primary contribution is a novel Exponential Pricing algorithm with the following properties: 1. It requires only a single sample from each of the $n$ request distributions to achieve a $(1-\epsilon)$-approximation for online resource allocation with large budgets. Such an algorithm was previously unknown, even with access to polynomially many samples, as prior work either assumed full distributional knowledge or was limited to i.i.d. or random-order arrivals. 2. It is robust to corruptions in the outliers model and the value augmentation model. Specifically, it maintains its $(1-\epsilon)$-approximation guarantee under both these robustness models, resolving the open question posed in Argue, Gupta, Molinaro, and Singla (SODA’22). 3. It operates as a simple item-pricing algorithm that ensures incentive compatibility. The intuition behind our Exponential Pricing algorithm is that the price of a resource should adjust exponentially as it is overused or underused. It differs from conventional approaches that use an online learning algorithm for item pricing. This departure guarantees that the algorithm will never run out of any resource, but loses the usual no-regret properties of online learning algorithms, necessitating a new analytical approach.
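The pricing intuition in this abstract (a resource's price moves exponentially with over- or under-use) can be illustrated with a toy item-pricing loop. This is a generic sketch and not the paper's algorithm: the pacing target `t * budget / n`, the step size `eta`, the base price, and the explicit feasibility check are all assumptions introduced here, and no approximation guarantee is claimed for it.

```python
import math


def exponential_pricing(requests, budgets, eta=0.5, base_price=1.0):
    """Toy item pricing: each request picks its best option at current prices.

    requests: list of option lists; each option is (value, consumption_vector).
    budgets:  per-resource capacities.
    """
    n = len(requests)
    used = [0.0] * len(budgets)
    total_value = 0.0
    for t, options in enumerate(requests):
        # Price grows exponentially when usage runs ahead of an even pace.
        prices = [base_price * math.exp(eta * (u - t * b / n))
                  for u, b in zip(used, budgets)]
        best, best_utility = None, 0.0
        for value, consumption in options:
            # Safety check added for this sketch: never exceed a budget.
            feasible = all(u + c <= b
                           for u, c, b in zip(used, consumption, budgets))
            utility = value - sum(p * c for p, c in zip(prices, consumption))
            if feasible and utility > best_utility:
                best, best_utility = (value, consumption), utility
        if best is not None:
            total_value += best[0]
            used = [u + c for u, c in zip(used, best[1])]
    return total_value, used


value, usage = exponential_pricing([[(5.0, [1.0])]] * 4, budgets=[2.0])
```

Because each request simply maximizes value minus posted cost, the loop has the item-pricing shape the abstract describes; how prices must be set to obtain the stated guarantees is the subject of the paper itself.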

[LG-45] Smooth Quadratic Prediction Markets

Link: https://arxiv.org/abs/2505.02959
Authors: Enrique Nueve, Bo Waggoner
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)

Abstract:When agents trade in a Duality-based Cost Function prediction market, they collectively implement the learning algorithm Follow-The-Regularized-Leader. We ask whether other learning algorithms could be used to inspire the design of prediction markets. By decomposing and modifying the Duality-based Cost Function Market Maker’s (DCFMM) pricing mechanism, we propose a new prediction market, called the Smooth Quadratic Prediction Market, that incentivizes agents to collectively implement general steepest gradient descent. Relative to the DCFMM, the Smooth Quadratic Prediction Market has a better worst-case monetary loss for AD securities while preserving axiom guarantees such as the existence of instantaneous price, information incorporation, expressiveness, no arbitrage, and a form of incentive compatibility. To motivate the application of the Smooth Quadratic Prediction Market, we independently examine agents’ trading behavior under two realistic constraints: bounded budgets and buy-only securities. Finally, we provide an introductory analysis of an approach to facilitate adaptive liquidity using the Smooth Quadratic AD Prediction Market. Our results suggest future designs where the price update rule is separate from the fee structure, yet guarantees are preserved.

[LG-46] RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

Link: https://arxiv.org/abs/2505.02922
Authors: Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
Categories: Machine Learning (cs.LG)
Comments: 16 pages

Abstract:The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index, an Attention-aWare VEctor index that enables efficient and accurate retrieval of critical tokens through techniques such as tripartite attention approximation, accuracy-bounded attention estimation, and segmented clustering. Complementing this is the wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across GPU and CPU to sustain high throughput. Unlike prior sparsity-based methods that struggle with token selection and hardware coordination, RetroInfer delivers robust performance without compromising model accuracy. Experiments on long-context benchmarks show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when KV cache is extended to CPU memory, all while preserving full-attention-level accuracy.
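The attention sparsity that this abstract relies on can be seen in a small example: when a few keys dominate the softmax, attending only to the top-scoring keys gives nearly the same output as full attention. The sketch below uses exact top-k selection as a stand-in and is not RetroInfer's wave index; the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values, top_k=None):
    """Scaled dot-product attention for one query, optionally over top-k keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    indices = list(range(len(keys)))
    if top_k is not None:
        # Keep only the highest-scoring keys (exact selection, for illustration).
        indices = sorted(indices, key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices))
            for d in range(dim)]


query = [4.0, 0.0]
keys = [[4.0, 0.0], [0.0, 1.0], [-4.0, 0.0], [0.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
full = attention(query, keys, values)
sparse = attention(query, keys, values, top_k=1)
```

A real system would replace the exact sort with an approximate index so that low-scoring keys never need to be read, which is where the memory and bandwidth savings come from.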

[LG-47] LLM4FTS: Enhancing Large Language Models for Financial Time Series Prediction

Link: https://arxiv.org/abs/2505.02880
Authors: Zian Liu, Renjun Jia
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 9 figures

Abstract:Predicting financial time series presents significant challenges due to inherent low signal-to-noise ratios and intricate temporal patterns. Traditional machine learning models exhibit limitations in this forecasting task, constrained by their restricted model capacity. Recent advances in large language models (LLMs), with their greatly expanded parameter spaces, demonstrate promising potential for modeling complex dependencies in temporal sequences. However, existing LLM-based approaches typically focus on fixed-length patch analysis due to the Transformer architecture, ignoring market data’s multi-scale pattern characteristics. In this study, we propose LLM4FTS, a novel framework that enhances LLM capabilities for temporal sequence modeling through learnable patch segmentation and dynamic wavelet convolution modules. Specifically, we first employ K-means++ clustering based on DTW distance to identify scale-invariant patterns in market data. Building upon pattern recognition results, we introduce adaptive patch segmentation that partitions temporal sequences while preserving maximal pattern integrity. To accommodate time-varying frequency characteristics, we devise a dynamic wavelet convolution module that emulates discrete wavelet transformation with enhanced flexibility in capturing time-frequency features. These three modules work together to improve the large language model’s ability to handle scale-invariant patterns in financial time series. Extensive experiments on real-world financial datasets substantiate the framework’s efficacy, demonstrating superior performance in capturing complex market patterns and achieving state-of-the-art results in stock return prediction. The successful deployment in practical trading systems confirms its real-world applicability, representing a significant advancement in LLM applications for financial forecasting.
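The DTW distance used for clustering in this abstract is the classic dynamic-programming alignment cost. A minimal sketch follows, assuming scalar series and an absolute-difference local cost; the function name is chosen here, and the paper's exact variant (windowing, normalization) is not specified in the abstract.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

Because the alignment may repeat samples, a pattern and a time-stretched copy of it have zero distance, which is why DTW suits patterns that recur at different durations.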

[LG-48] Feature Staleness Aware Incremental Learning for CTR Prediction ATC

Link: https://arxiv.org/abs/2505.02844
Authors: Zhikai Wang, Yanyan Shen, Zibin Zhang, Kangyi Lin
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: The code can be found at this https URL

Abstract:Click-through Rate (CTR) prediction in real-world recommender systems often deals with billions of user interactions every day. To improve the training efficiency, it is common to update the CTR prediction model incrementally using the new incremental data and a subset of historical data. However, the feature embeddings of a CTR prediction model often get stale when the corresponding features do not appear in current incremental data. In the next period, the model would have a performance degradation on samples containing stale features, which we call the feature staleness problem. To mitigate this problem, we propose a Feature Staleness Aware Incremental Learning method for CTR prediction (FeSAIL) which adaptively replays samples containing stale features. We first introduce a staleness aware sampling algorithm (SAS) to sample a fixed number of stale samples with high sampling efficiency. We then introduce a staleness aware regularization mechanism (SAR) for a fine-grained control of the feature embedding updating. We instantiate FeSAIL with a general deep learning-based CTR prediction model and the experimental results demonstrate FeSAIL outperforms various state-of-the-art methods on four benchmark datasets.

[LG-49] Nonnegative Low-rank Matrix Recovery Can Have Spurious Local Minima

Link: https://arxiv.org/abs/2505.03717
Authors: Richard Y. Zhang
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The classical low-rank matrix recovery problem is well-known to exhibit benign nonconvexity under the restricted isometry property (RIP): local optimization is guaranteed to converge to the global optimum, where the ground truth is recovered. We investigate whether benign nonconvexity continues to hold when the factor matrices are constrained to be elementwise nonnegative – a common practical requirement. In the simple setting of a rank-1 nonnegative ground truth, we confirm that benign nonconvexity holds in the fully-observed case with RIP constant $\delta = 0$. Surprisingly, however, this property fails to extend to the partially-observed case with any arbitrarily small RIP constant $\delta \to 0^+$, irrespective of rank overparameterization. This finding exposes a critical theoretical gap: the continuity argument widely used to explain the empirical robustness of low-rank matrix recovery fundamentally breaks down once nonnegative constraints are imposed.
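For orientation, the problem class and the RIP condition referred to in this abstract are commonly written as follows. The notation is assumed here (a linear measurement operator $\mathcal{A}$, ground truth $M^\star$, factors $U, V$ with $r$ columns), and the paper's exact formulation, including the rank used in the RIP, may differ.

```latex
\[
  \min_{U \ge 0,\; V \ge 0} \;
  \bigl\| \mathcal{A}(U V^{\top}) - \mathcal{A}(M^\star) \bigr\|_2^2 ,
\]
\[
  (1 - \delta)\, \| M \|_F^2 \;\le\; \| \mathcal{A}(M) \|_2^2 \;\le\;
  (1 + \delta)\, \| M \|_F^2
  \qquad \text{for all low-rank } M .
\]
% \delta = 0 means \mathcal{A} preserves the Frobenius norm exactly
% (the fully-observed case); small \delta > 0 models partial observation.
```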

[LG-50] Multi-modal cascade feature transfer for polymer property prediction

Link: https://arxiv.org/abs/2505.03704
Authors: Kiichi Obuchi, Yuta Yahagi, Kiyohiko Toyama, Shukichi Tanaka, Kota Matsui
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:In this paper, we propose a novel transfer learning approach called multi-modal cascade model with feature transfer for polymer property prediction. Polymers are characterized by a composite of data in several different formats, including molecular descriptors and additive information as well as chemical structures. However, in conventional approaches, prediction models were often constructed using each type of data separately. Our model enables more accurate prediction of physical properties for polymers by combining features extracted from the chemical structure by graph convolutional neural networks (GCN) with features such as molecular descriptors and additive information. The predictive performance of the proposed method is empirically evaluated using several polymer datasets. We report that the proposed method shows high predictive performance compared to the baseline conventional approach using a single feature.

[LG-51] Vector valued optimal transport: from dynamic to static formulations

Link: https://arxiv.org/abs/2505.03670
Authors: Katy Craig, Nicolás García Trillos, Đorđe Nikolić
Categories: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Metric Geometry (math.MG)

Abstract:Motivated by applications in classification of vector valued measures and multispecies PDE, we develop a theory that unifies existing notions of vector valued optimal transport, from dynamic formulations (à la Benamou-Brenier) to static formulations (à la Kantorovich). In our framework, vector valued measures are modeled as probability measures on a product space $\mathbb{R}^d \times G$, where $G$ is a weighted graph over a finite set of nodes and the graph geometry strongly influences the associated dynamic and static distances. We obtain sharp inequalities relating four notions of vector valued optimal transport and prove that the distances are mutually bi-Hölder equivalent. We discuss the theoretical and practical advantages of each metric and indicate potential applications in multispecies PDE and data analysis. In particular, one of the static formulations discussed in the paper is amenable to linearization, a technique that has been explored in recent years to accelerate the computation of pairwise optimal transport distances.
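As background for the two families of formulations named in this abstract, the classical scalar 2-Wasserstein distance between probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$ has both a static and a dynamic form. These are the standard textbook statements, given here only for reference; the vector valued generalizations on $\mathbb{R}^d \times G$ studied in the paper are not reproduced.

```latex
% Static (Kantorovich): optimize over couplings \gamma of \mu and \nu.
\[
  W_2^2(\mu, \nu) \;=\; \min_{\gamma \in \Gamma(\mu, \nu)}
  \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^2 \, d\gamma(x, y) .
\]
% Dynamic (Benamou-Brenier): optimize over curves of measures and velocities.
\[
  W_2^2(\mu, \nu) \;=\; \min_{(\rho_t, v_t)}
  \int_0^1 \!\! \int_{\mathbb{R}^d} |v_t(x)|^2 \, d\rho_t(x) \, dt
  \quad \text{s.t.} \quad
  \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \;\;
  \rho_0 = \mu, \;\; \rho_1 = \nu .
\]
```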

[LG-52] Weighted Random Dot Product Graphs

Link: https://arxiv.org/abs/2505.03649
Authors: Bernardo Marenco, Paola Bermolen, Marcelo Fiori, Federico Larroca, Gonzalo Mateos
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Combinatorics (math.CO); Probability (math.PR)
Comments: 30 pages, 12 figures, code to generate Figures 3 to 12 available at this https URL

Abstract:Modeling of intricate relational patterns in network data has become a cornerstone of contemporary statistical research and related data science fields. Networks, represented as graphs, offer a natural framework for this analysis. This paper extends the Random Dot Product Graph (RDPG) model to accommodate weighted graphs, markedly broadening the model’s scope to scenarios where edges exhibit heterogeneous weight distributions. We propose a nonparametric weighted (W)RDPG model that assigns a sequence of latent positions to each node. Inner products of these nodal vectors specify the moments of their incident edge weights’ distribution via moment-generating functions. In this way, and unlike prior art, the WRDPG can discriminate between weight distributions that share the same mean but differ in other higher-order moments. We derive statistical guarantees for an estimator of the nodes’ latent positions adapted from the workhorse adjacency spectral embedding, establishing its consistency and asymptotic normality. We also contribute a generative framework that enables sampling of graphs that adhere to a (prescribed or data-fitted) WRDPG, facilitating, e.g., the analysis and testing of observed graph metrics using judicious reference distributions. The paper is organized to formalize the model’s definition, the estimation (or nodal embedding) process and its guarantees, as well as the methodologies for generating weighted graphs, all complemented by illustrative and reproducible examples showcasing the WRDPG’s effectiveness in various network analytic applications.
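The adjacency spectral embedding that this abstract builds on can be sketched in its simplest form: a one-dimensional embedding obtained from the leading eigenpair of a symmetric matrix. The code below is an illustration under assumptions made here (power iteration, embedding dimension 1, a noise-free matrix of inner products with positive entries); it is not the paper's WRDPG estimator.

```python
import math


def spectral_embedding_1d(matrix, iterations=200):
    """Return sqrt(|lambda_1|) * v_1 for a symmetric matrix via power iteration.

    Assumes the matrix is nonzero and has a dominant positive eigenvalue.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient of the unit vector v estimates the leading eigenvalue.
    eigenvalue = sum(v[i] * matrix[i][j] * v[j]
                     for i in range(n) for j in range(n))
    scale = math.sqrt(abs(eigenvalue))
    return [scale * c for c in v]


latent = [0.5, 0.8, 0.3]
expected = [[a * b for b in latent] for a in latent]  # matrix of inner products
estimate = spectral_embedding_1d(expected)
```

On the exact inner-product matrix the embedding recovers the latent positions; with an observed noisy adjacency or weight matrix it would only approximate them, and quantifying that approximation is what consistency results address.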

[LG-53] Physics-Informed Sylvester Normalizing Flows for Bayesian Inference in Magnetic Resonance Spectroscopy

Link: https://arxiv.org/abs/2505.03590
Authors: Julian P. Merkofer, Dennis M. J. van de Sande, Alex A. Bhogal, Ruud J. G. van Sloun
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Preprint submitted to IEEE MLSP 2025

Abstract:Magnetic resonance spectroscopy (MRS) is a non-invasive technique to measure the metabolic composition of tissues, offering valuable insights into neurological disorders, tumor detection, and other metabolic dysfunctions. However, accurate metabolite quantification is hindered by challenges such as spectral overlap, low signal-to-noise ratio, and various artifacts. Traditional methods like linear-combination modeling are susceptible to ambiguities and commonly only provide a theoretical lower bound on estimation accuracy in the form of the Cramér-Rao bound. This work introduces a Bayesian inference framework using Sylvester normalizing flows (SNFs) to approximate posterior distributions over metabolite concentrations, enhancing quantification reliability. A physics-based decoder incorporates prior knowledge of MRS signal formation, ensuring realistic distribution representations. We validate the method on simulated 7T proton MRS data, demonstrating accurate metabolite quantification, well-calibrated uncertainties, and insights into parameter correlations and multi-modal distributions.

[LG-54] Decision Making under Model Misspecification: DRO with Robust Bayesian Ambiguity Sets

Link: https://arxiv.org/abs/2505.03585
Authors: Charita Dellaporta, Patrick O’Hara, Theodoros Damoulas
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Distributionally Robust Optimisation (DRO) protects risk-averse decision-makers by considering the worst-case risk within an ambiguity set of distributions based on the empirical distribution or a model. To further guard against finite, noisy data, model-based approaches admit Bayesian formulations that propagate uncertainty from the posterior to the decision-making problem. However, when the model is misspecified, the decision maker must stretch the ambiguity set to contain the data-generating process (DGP), leading to overly conservative decisions. We address this challenge by introducing DRO with Robust, to model misspecification, Bayesian Ambiguity Sets (DRO-RoBAS). These are Maximum Mean Discrepancy ambiguity sets centred at a robust posterior predictive distribution that incorporates beliefs about the DGP. We show that the resulting optimisation problem obtains a dual formulation in the Reproducing Kernel Hilbert Space and we give probabilistic guarantees on the tolerance level of the ambiguity set. Our method outperforms other Bayesian and empirical DRO approaches in out-of-sample performance on the Newsvendor and Portfolio problems with various cases of model misspecification.
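The Maximum Mean Discrepancy that defines the ambiguity sets in this abstract can be estimated from two samples with a kernel. The sketch below uses the biased (V-statistic) estimator of the squared MMD with a Gaussian kernel on scalar data; the function names and bandwidth are choices made here, and it does not implement the DRO-RoBAS optimisation itself.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two scalar samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


near = mmd_squared([0.0, 0.1], [0.05, 0.15])
far = mmd_squared([0.0, 0.1], [5.0, 5.1])
```

An ambiguity set of the kind described would contain every distribution whose MMD to a chosen centre is below a tolerance, so samples like `far` fall outside a small ball while samples like `near` stay inside.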

[LG-55] Quantum Feature Space of a Qubit Coupled to an Arbitrary Bath

Link: https://arxiv.org/abs/2505.03397
Authors: Chris Wise, Akram Youssry, Alberto Peruzzo, Jo Plested, Matt Woolley
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 19 pages, 3 figures, 4 tables

Abstract:Qubit control protocols have traditionally leveraged a characterisation of the qubit-bath coupling via its power spectral density. Previous work proposed the inference of noise operators that characterise the influence of a classical bath using a grey-box approach that combines deep neural networks with physics-encoded layers. This overall structure is complex and poses challenges in scaling and real-time operations. Here, we show that no expensive neural networks are needed and that this noise operator description admits an efficient parameterisation. We refer to the resulting parameter space as the quantum feature space of the qubit dynamics resulting from the coupled bath. We show that the Euclidean distance defined over the quantum feature space provides an effective method for classifying noise processes in the presence of a given set of controls. Using the quantum feature space as the input space for a simple machine learning algorithm (random forest, in this case), we demonstrate that it can effectively classify the stationarity and the broad class of noise processes perturbing a qubit. Finally, we explore how control pulse parameters map to the quantum feature space.

[LG-56] Solar Flare Forecast: A Comparative Analysis of Machine Learning Algorithms for Solar Flare Class Prediction

Link: https://arxiv.org/abs/2505.03385
Authors: Julia Bringewald
Categories: Solar and Stellar Astrophysics (astro-ph.SR); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)

Abstract:Solar flares are among the most powerful and dynamic events in the solar system, resulting from the sudden release of magnetic energy stored in the Sun’s atmosphere. These energetic bursts of electromagnetic radiation can release up to $10^{32}$ erg of energy, impacting space weather and posing risks to technological infrastructure, which makes accurate forecasting of solar flare occurrences and intensities essential. This study evaluates the predictive performance of three machine learning algorithms: Random Forest, k-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGBoost) for classifying solar flares into 4 categories (B, C, M, X). Using the dataset of 13 SHARP parameters, the effectiveness of the models was evaluated in binary and multiclass classification tasks. The analysis utilized 8 principal components (PC), capturing 95% of data variance, and 100 PCs, capturing 97.5% of variance. Our approach uniquely combines binary and multiclass classification with different levels of dimensionality reduction, an innovative methodology not previously explored in the context of solar flare prediction. Employing a 10-fold stratified cross-validation and grid search for hyperparameter tuning ensured robust model evaluation. Our findings indicate that Random Forest and XGBoost consistently demonstrate strong performance across all metrics, benefiting significantly from increased dimensionality. The insights of this study enhance future research by optimizing dimensionality reduction techniques and informing model selection for astrophysical tasks. By integrating this newly acquired knowledge into future research, more accurate space weather forecasting systems can be developed, along with a deeper understanding of solar physics.

[LG-57] An Active Inference perspective on Neurofeedback Training

Link: https://arxiv.org/abs/2505.03308
Authors: Côme Annicchiarico, Fabien Lotte, Jérémie Mattout
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: Preprint, 43 pages, 14 figures

Abstract:Neurofeedback training (NFT) aims to teach self-regulation of brain activity through real-time feedback, but suffers from highly variable outcomes and poorly understood mechanisms, hampering its validation. To address these issues, we propose a formal computational model of the NFT closed loop. Using Active Inference, a Bayesian framework modelling perception, action, and learning, we simulate agents interacting with an NFT environment. This enables us to test the impact of design choices (e.g., feedback quality, biomarker validity) and subject factors (e.g., prior beliefs) on training. Simulations show that training effectiveness is sensitive to feedback noise or bias, and to prior beliefs (highlighting the importance of guiding instructions), but also reveal that perfect feedback is insufficient to guarantee high performance. This approach provides a tool for assessing and predicting NFT variability, interpreting empirical data, and potentially developing personalized training protocols.

[LG-58] Lower Bounds for Greedy Teaching Set Constructions

Link: https://arxiv.org/abs/2505.03223
Authors: Spencer Compton, Chirag Pabbaraju, Nikita Zhivotovskiy
Categories: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Combinatorics (math.CO)

Abstract:A fundamental open problem in learning theory is to characterize the best-case teaching dimension $\operatorname{TS}_{\min}$ of a concept class $\mathcal{C}$ with finite VC dimension $d$. Resolving this problem will, in particular, settle the conjectured upper bound on Recursive Teaching Dimension posed by [Simon and Zilles; COLT 2015]. Prior work used a natural greedy algorithm to construct teaching sets recursively, thereby proving upper bounds on $\operatorname{TS}_{\min}$, with the best known bound being $O(d^2)$ [Hu, Wu, Li, and Wang; COLT 2017]. In each iteration, this greedy algorithm chooses to add to the teaching set the $k$ labeled points that restrict the concept class the most. In this work, we prove lower bounds on the performance of this greedy approach for small $k$. Specifically, we show that for $k = 1$, the algorithm does not improve upon the halving-based bound of $O(\log(|\mathcal{C}|))$. Furthermore, for $k = 2$, we complement the upper bound of $O\left(\log(\log(|\mathcal{C}|))\right)$ from [Moran, Shpilka, Wigderson, and Yehudayoff; FOCS 2015] with a matching lower bound. Most consequentially, our lower bound extends up to $k \le \lceil c d \rceil$ for small constant $c > 0$: suggesting that studying higher-order interactions may be necessary to resolve the conjecture that $\operatorname{TS}_{\min} = O(d)$.

[LG-59] Weighted Average Gradients for Feature Attribution

Link: https://arxiv.org/abs/2505.03201
Authors: Kien Tran Duc Tuan, Tam Nguyen Trong, Son Nguyen Hoang, Khoat Than, Anh Nguyen Duc
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:In explainable AI, Integrated Gradients (IG) is a widely adopted technique for assessing the significance of feature attributes of the input on model outputs by evaluating contributions from a baseline input to the current input. The choice of the baseline input significantly influences the resulting explanation. While the traditional Expected Gradients (EG) method assumes baselines can be uniformly sampled and averaged with equal weights, this study argues that baselines should not be treated equivalently. We introduce Weighted Average Gradients (WG), a novel approach that unsupervisedly evaluates baseline suitability and incorporates a strategy for selecting effective baselines. Theoretical analysis demonstrates that WG satisfies essential explanation method criteria and offers greater stability than prior approaches. Experimental results further confirm that WG outperforms EG across diverse scenarios, achieving an improvement of 10-35% on main metrics. Moreover, by evaluating baselines, our method can filter a subset of effective baselines for each input to calculate explanations, maintaining high accuracy while reducing computational cost. The code is available at: this https URL.
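To make the difference between equal and unequal baseline weighting concrete, here is a small sketch of Integrated Gradients on a toy model, followed by an average over several baselines. The toy `model`, its analytic `grad`, and the example weights are assumptions for illustration only; the abstract does not say how WG computes its weights, so `baseline_averaged_ig` simply accepts them as an input (equal weights correspond to the EG-style average described above).

```python
def model(x):
    """Toy differentiable model (illustrative, not from the paper)."""
    return x[0] * x[1] + x[0] ** 2


def grad(x):
    """Analytic gradient of the toy model."""
    return [x[1] + 2.0 * x[0], x[0]]


def integrated_gradients(x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of IG along the straight path."""
    attr = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(len(x)):
            attr[i] += g[i] * (x[i] - baseline[i]) / steps
    return attr


def baseline_averaged_ig(x, baselines, weights=None):
    """Average IG over baselines; equal weights unless others are supplied."""
    if weights is None:
        weights = [1.0] * len(baselines)
    total = sum(weights)
    attr = [0.0] * len(x)
    for b, w in zip(baselines, weights):
        ig = integrated_gradients(x, b)
        for i in range(len(x)):
            attr[i] += (w / total) * ig[i]
    return attr


x = [1.0, 2.0]
baselines = [[0.0, 0.0], [1.0, 0.0]]
print(baseline_averaged_ig(x, baselines))              # equal weights
print(baseline_averaged_ig(x, baselines, [1.0, 3.0]))  # hypothetical weights
```

For a single baseline the attributions sum to `model(x) - model(baseline)` (the completeness property of IG), which makes a convenient sanity check for any implementation.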

[LG-60] A Symbolic and Statistical Learning Framework to Discover Bioprocessing Regulatory Mechanism: Cell Culture Example

Link: https://arxiv.org/abs/2505.03177
Authors: Keilung Choy, Wei Xie, Keqi Wang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 11 pages, 2 figures

Abstract:Bioprocess mechanistic modeling is essential for advancing intelligent digital twin representation of biomanufacturing, yet challenges persist due to complex intracellular regulation, stochastic system behavior, and limited experimental data. This paper introduces a symbolic and statistical learning framework to identify key regulatory mechanisms and quantify model uncertainty. Bioprocess dynamics is formulated with stochastic differential equations characterizing intrinsic process variability, with a predefined set of candidate regulatory mechanisms constructed from biological knowledge. A Bayesian learning approach is developed, which is based on a joint learning of kinetic parameters and regulatory structure through a formulation of the mixture model. To enhance computational efficiency, a Metropolis-adjusted Langevin algorithm with adjoint sensitivity analysis is developed for posterior exploration. Compared to state-of-the-art Bayesian inference approaches, the proposed framework achieves improved sample efficiency and robust model selection. An empirical study demonstrates its ability to recover missing regulatory mechanisms and improve model fidelity under data-limited conditions.
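As background for the Metropolis-adjusted Langevin algorithm (MALA) named above, the following is a generic one-dimensional sketch. It is not the paper's sampler: the Gaussian `log_target` stands in for a real posterior, and the analytic `grad_log_target` stands in for the gradient that the paper obtains through adjoint sensitivity analysis. Step size and seed are arbitrary choices.

```python
import math
import random


def log_target(theta):
    """Stand-in log-posterior: a normal density with mean 2 and variance 1."""
    return -0.5 * (theta - 2.0) ** 2


def grad_log_target(theta):
    """Gradient of the stand-in log-posterior."""
    return -(theta - 2.0)


def mala(n_samples, step=1.0, theta0=0.0, seed=0):
    """Langevin proposal followed by a Metropolis-Hastings correction."""
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_samples):
        # Drift the proposal mean along the gradient, then add Gaussian noise.
        mean_fwd = theta + 0.5 * step ** 2 * grad_log_target(theta)
        proposal = mean_fwd + step * rng.gauss(0.0, 1.0)
        # The proposal is asymmetric, so both directions enter the ratio.
        mean_bwd = proposal + 0.5 * step ** 2 * grad_log_target(proposal)
        log_q_fwd = -((proposal - mean_fwd) ** 2) / (2.0 * step ** 2)
        log_q_bwd = -((theta - mean_bwd) ** 2) / (2.0 * step ** 2)
        log_alpha = (log_target(proposal) - log_target(theta)
                     + log_q_bwd - log_q_fwd)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            theta = proposal
        samples.append(theta)
    return samples


draws = mala(20000)[1000:]  # discard a short burn-in
print(sum(draws) / len(draws))  # should land near the target mean of 2
```

The gradient term is what distinguishes MALA from a plain random-walk Metropolis sampler: proposals are pulled toward high-probability regions, which is why an efficient gradient computation matters for posterior exploration.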

[LG-61] HMAE: Self-Supervised Few-Shot Learning for Quantum Spin Systems

Link: https://arxiv.org/abs/2505.03140
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)

Abstract:Quantum machine learning for spin and molecular systems faces critical challenges of scarce labeled data and computationally expensive simulations. To address these limitations, we introduce Hamiltonian-Masked Autoencoding (HMAE), a novel self-supervised framework that pre-trains transformers on unlabeled quantum Hamiltonians, enabling efficient few-shot transfer learning. Unlike random masking approaches, HMAE employs a physics-informed strategy based on quantum information theory to selectively mask Hamiltonian terms based on their physical significance. Experiments on 12,500 quantum Hamiltonians (60% real-world, 40% synthetic) demonstrate that HMAE achieves 85.3% ± 1.5% accuracy in phase classification and 0.15 ± 0.02 eV MAE in ground state energy prediction with merely 10 labeled examples - a statistically significant improvement (p < 0.01) over classical graph neural networks (78.1% ± 2.1%) and quantum neural networks (76.8% ± 2.3%). Our method’s primary advantage is exceptional sample efficiency - reducing required labeled examples by 3-5x compared to baseline methods - though we emphasize that ground truth values for fine-tuning and evaluation still require exact diagonalization or tensor networks. We explicitly acknowledge that our current approach is limited to small quantum systems (specifically limited to 12 qubits during training, with limited extension to 16-20 qubits in testing) and that, while promising within this regime, this size restriction prevents immediate application to larger systems of practical interest in materials science and quantum chemistry.

[LG-62] Modeling Spatial Extremes using Non-Gaussian Spatial Autoregressive Models via Convolutional Neural Networks

Link: https://arxiv.org/abs/2505.03034
Authors: Sweta Rai, Douglas W. Nychka, Soutir Bandyopadhyay
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Data derived from remote sensing or numerical simulations often have a regular gridded structure and are large in volume, making it challenging to find accurate spatial models that can fill in missing grid cells or simulate the process effectively, especially in the presence of spatial heterogeneity and heavy-tailed marginal distributions. To overcome this issue, we present a spatial autoregressive modeling framework, which maps observations at a location and its neighbors to independent random variables. This is a highly flexible modeling approach and well-suited for non-Gaussian fields, providing simpler interpretability. In particular, we consider the SAR model with Generalized Extreme Value distribution innovations to combine the observation at a central grid location with its neighbors, capturing extreme spatial behavior based on the heavy-tailed innovations. While these models are fast to simulate by exploiting the sparsity of the key matrices in the computations, the maximum likelihood estimation of the parameters is prohibitive due to the intractability of the likelihood, making optimization challenging. To overcome this, we train a convolutional neural network on a large training set that covers a useful parameter space, and then use the trained network for fast parameter estimation. Finally, we apply this model to analyze annual maximum precipitation data from ERA-Interim-driven Weather Research and Forecasting (WRF) simulations, allowing us to explore its spatial extreme behavior across North America.

[LG-63] New affine invariant ensemble samplers and their dimensional scaling

Link: https://arxiv.org/abs/2505.02987
Authors: Yifan Chen
Categories: Computation (stat.CO); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: Any feedback welcome!

Abstract:We introduce some new affine invariant ensemble samplers that are easy to construct and improve upon existing widely used algorithms, especially for high-dimensional problems. Specifically, we propose a derivative-free ensemble side move sampler that performs favorably compared to popular samplers in the emcee package. Additionally, we develop a class of derivative-based ensemble Hamiltonian Monte Carlo (HMC) samplers with affine invariance, which outperform standard HMC without affine invariance when sampling highly skewed distributions. We provide asymptotic scaling analysis for high-dimensional Gaussian targets to further elucidate the properties of these affine invariant ensemble samplers. In particular, with derivative information, the affine invariant ensemble HMC can scale much better with dimension compared to derivative-free ensemble samplers.
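For context on what an affine invariant ensemble sampler looks like, below is a sketch of the classic Goodman-Weare stretch move, the kind of existing derivative-free sampler the abstract compares against. It is not the side move or the ensemble HMC proposed in the paper, whose details the abstract does not give. The skewed Gaussian `log_target`, the walker count, and the seed are illustrative assumptions.

```python
import math
import random


def log_target(x):
    """Skewed 2D Gaussian: std 10 along x[0], std 1 along x[1] (illustrative)."""
    return -0.5 * (x[0] ** 2 / 100.0 + x[1] ** 2)


def stretch_move_sampler(n_steps, n_walkers=10, a=2.0, seed=0):
    """Update each walker along the line through a randomly chosen partner."""
    rng = random.Random(seed)
    dim = 2
    walkers = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
               for _ in range(n_walkers)]
    chain = []
    for _ in range(n_steps):
        for k in range(n_walkers):
            # Pick a partner walker j different from k.
            j = rng.randrange(n_walkers - 1)
            if j >= k:
                j += 1
            # Draw z with density proportional to 1/sqrt(z) on [1/a, a].
            z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
            proposal = [walkers[j][d] + z * (walkers[k][d] - walkers[j][d])
                        for d in range(dim)]
            log_accept = ((dim - 1) * math.log(z)
                          + log_target(proposal) - log_target(walkers[k]))
            if rng.random() < math.exp(min(0.0, log_accept)):
                walkers[k] = proposal
        chain.extend(list(w) for w in walkers)
    return chain


chain = stretch_move_sampler(3000)
kept = chain[500 * 10:]  # drop the first 500 sweeps as burn-in
print(sum(p[1] for p in kept) / len(kept))  # mean of x[1], expected near 0
```

Because proposals are built only from differences between walkers, rescaling or rotating the target changes nothing about how the sampler behaves, which is the sense in which it is affine invariant and why no step size has to be tuned to the skewed direction.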

[LG-64] Parameter estimation for land-surface models using machine learning libraries

Link: https://arxiv.org/abs/2505.02979
Authors: Ruiyue Huang, Claire E. Heaney, Maarten van Reeuwijk
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures, 3 tables

Abstract:The Neural Networks for Partial Differential Equations (NN4PDEs) approach is used to determine the parameters of a simple land-surface model using PyTorch’s backpropagation engine. In order to test the inverse model, a synthetic dataset is created by running the model in forward mode with known parameter values to create soil temperature time series that can be used as observations for the inverse model. We show that it is not possible to obtain a reliable parameter estimation using a single observed soil temperature time series. Using measurements at two depths, reliable parameter estimates can be obtained although it is not possible to differentiate between latent and sensible heat fluxes. We apply the inverse model to urban flux tower data in Phoenix, United States, and show that the thermal conductivity, volumetric heat capacity, and the combined sensible-latent heat transfer coefficient can be reliably estimated using an observed value for the effective surface albedo. The resulting model accurately predicts the outgoing longwave radiation, conductive soil fluxes and the combined sensible-latent heat fluxes.

[LG-65] GeoERM: Geometry-Aware Multi-Task Representation Learning on Riemannian Manifolds

Link: https://arxiv.org/abs/2505.02972
Authors: Aoran Chen, Yang Feng
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Multi-Task Learning (MTL) seeks to boost statistical power and learning efficiency by discovering structure shared across related tasks. State-of-the-art MTL representation methods, however, usually treat the latent representation matrix as a point in ordinary Euclidean space, ignoring its often non-Euclidean geometry, thus sacrificing robustness when tasks are heterogeneous or even adversarial. We propose GeoERM, a geometry-aware MTL framework that embeds the shared representation on its natural Riemannian manifold and optimizes it via explicit manifold operations. Each training cycle performs (i) a Riemannian gradient step that respects the intrinsic curvature of the search space, followed by (ii) an efficient polar retraction to remain on the manifold, guaranteeing geometric fidelity at every iteration. The procedure applies to a broad class of matrix-factorized MTL models and retains the same per-iteration cost as Euclidean baselines. Across a set of synthetic experiments with task heterogeneity and on a wearable-sensor activity-recognition benchmark, GeoERM consistently improves estimation accuracy, reduces negative transfer, and remains stable under adversarial label noise, outperforming leading MTL and single-task alternatives.
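The two-step cycle described above (a Riemannian gradient step, then a retraction back onto the manifold) can be illustrated in the simplest possible setting: the unit sphere, where the polar retraction of a vector reduces to dividing by its norm. This is a toy sketch, not GeoERM itself. The objective (maximizing x^T A x), the matrix `A`, and the learning rate are assumptions made for the example.

```python
import math


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(x):
    norm = math.sqrt(dot(x, x))
    return [xi / norm for xi in x]


def riemannian_ascent(A, x0, lr=0.1, n_iter=200):
    """Maximize x^T A x over the unit sphere with projected gradient steps."""
    x = normalize(x0)
    for _ in range(n_iter):
        egrad = [2.0 * v for v in matvec(A, x)]  # Euclidean gradient
        radial = dot(x, egrad)
        # (i) Riemannian gradient: drop the component normal to the sphere.
        rgrad = [g - radial * xi for g, xi in zip(egrad, x)]
        # (ii) Retraction: step in the tangent direction, then renormalize.
        x = normalize([xi + lr * g for xi, g in zip(x, rgrad)])
    return x


A = [[2.0, 1.0], [1.0, 2.0]]
x = riemannian_ascent(A, [1.0, 0.0])
print(x, dot(x, matvec(A, x)))  # x should align with the leading eigenvector
```

Every iterate stays exactly on the sphere, so no constraint violation accumulates; in the matrix case the same pattern applies with a tangent-space projection and a matrix polar factor in place of the normalization.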

Information Retrieval

[IR-0] Familiarizing with Music: Discovery Patterns for Different Music Discovery Needs

Link: https://arxiv.org/abs/2505.03568
Authors: Marta Moscati, Darius Afchar, Markus Schedl, Bruno Sguerra
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

Abstract:Humans have the tendency to discover and explore. This natural tendency is reflected in data from streaming platforms as the amount of previously unknown content accessed by users. Additionally, in domains such as that of music streaming there is evidence that recommending novel content improves users’ experience with the platform. Therefore, understanding users’ discovery patterns, such as the amount to which and the way users access previously unknown content, is a topic of relevance for both the scientific community and the streaming industry, particularly the music one. Previous works studied how music consumption differs for users of different traits and looked at diversity, novelty, and consistency over time of users’ music preferences. However, very little is known about how users discover and explore previously unknown music, and how this behavior differs for users of varying discovery needs. In this paper we bridge this gap by analyzing data from a survey answered by users of the major music streaming platform Deezer in combination with their streaming data. We first address questions regarding whether users who declare a higher interest in unfamiliar music listen to more diverse music, have more stable music preferences over time, and explore more music within a same time window, compared to those who declare a lower interest. We then investigate which type of music tracks users choose to listen to when they explore unfamiliar music, identifying clear patterns of popularity and genre representativeness that vary for users of different discovery needs. Our findings open up possibilities to infer users’ interest in unfamiliar music from streaming data as well as possibilities to develop recommender systems that guide users in exploring music in a more natural way. 

[IR-1] 1st Place Solution of WWW 2025 EReL@MIR Workshop Multimodal CTR Prediction Challenge WWW2025

Link: https://arxiv.org/abs/2505.03543
Authors: Junwei Xu, Zehao Zhao, Xiaoyu Hu, Zhenjie Song
Categories: Information Retrieval (cs.IR)
Comments: Technical report for the 1st place solution of WWW 2025 EReL@MIR Workshop Multimodal CTR Prediction Challenge

Abstract:The WWW 2025 EReL@MIR Workshop Multimodal CTR Prediction Challenge focuses on effectively applying multimodal embedding features to improve click-through rate (CTR) prediction in recommender systems. This technical report presents our 1st place winning solution for Task 2, combining sequential modeling and feature interaction learning to effectively capture user-item interactions. For multimodal information integration, we simply append the frozen multimodal embeddings to each item embedding. Experiments on the challenge dataset demonstrate the effectiveness of our method, achieving superior performance with a 0.9839 AUC on the leaderboard, much higher than the baseline model. Code and configuration are available in our GitHub repository and the checkpoint of our model can be found in HuggingFace.

[IR-2] STAR-Rec: Making Peace with Length Variance and Pattern Diversity in Sequential Recommendation SIGIR2025

Link: https://arxiv.org/abs/2505.03484
Authors: Maolin Wang, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Xuetao Wei, Zitao Liu, Hongzhi Yin, Yi Chang, Xiangyu Zhao
Categories: Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025

Abstract:Recent deep sequential recommendation models often struggle to effectively model key characteristics of user behaviors, particularly in handling sequence length variations and capturing diverse interaction patterns. We propose STAR-Rec, a novel architecture that synergistically combines preference-aware attention and state-space modeling through a sequence-level mixture-of-experts framework. STAR-Rec addresses these challenges by: (1) employing preference-aware attention to capture both inherently similar item relationships and diverse preferences, (2) utilizing state-space modeling to efficiently process variable-length sequences with linear complexity, and (3) incorporating a mixture-of-experts component that adaptively routes different behavioral patterns to specialized experts, handling both focused category-specific browsing and diverse category exploration patterns. We theoretically demonstrate how the state space model and attention mechanisms can be naturally unified in recommendation scenarios, where SSM captures temporal dynamics through state compression while attention models both similar and diverse item relationships. Extensive experiments on four real-world datasets demonstrate that STAR-Rec consistently outperforms state-of-the-art sequential recommendation methods, particularly in scenarios involving diverse user behaviors and varying sequence lengths.

[IR-3] Advancing Remote and Continuous Cardiovascular Patient Monitoring through a Novel and Resource-efficient IoT-Driven Framework

Link: https://arxiv.org/abs/2505.03409
Authors: Sanam Nayab, Sohail Raza Chohan, Aqsa Jameel, Syed Rehan Shah, Syed Ahsan Masud Zaidi, Aditya Nath Jha, Kamran Siddique
Categories: Networking and Internet Architecture (cs.NI); Information Retrieval (cs.IR)
Comments: 20 pages, and 8063 words and 14 figures

Abstract:Cardiovascular diseases are a leading cause of fatalities worldwide, often occurring suddenly with limited time for intervention. Current healthcare monitoring systems for cardiac patients rely heavily on hospitalization, which can be impractical for continuous monitoring. This paper presents a novel IoT-based solution for remote, real-time tracking of critical cardiac metrics, addressing the pressing need for accessible and continuous healthcare, particularly for the aging population in Pakistan. The proposed IoT kit measures essential parameters such as body temperature, heart rate (HR), blood pressure (BP), oxygen saturation (SPO2), and electrocardiography (ECG). A key innovation of the system is its integration with a cloud-based application, enabling constant remote monitoring and incorporating an alarm mechanism to alert medical professionals for timely intervention, reducing the risk of catastrophic incidents. The system was tested in a clinical environment with 20 participants, demonstrating results closely aligned with those obtained using standard medical devices. The findings validate the system’s potential for reliable remote monitoring, offering a significant step forward in proactive cardiac healthcare management. This novel approach combines IoT technology with cloud-based applications to provide a cost-effective and efficient solution for reducing unexpected fatalities among cardiac patients.

[IR-4] CB-cPIR: Code-Based Computational Private Information Retrieval

Link: https://arxiv.org/abs/2505.03407
Authors: Camilla Hollanti, Neehar Verma
Categories: Information Retrieval (cs.IR); Information Theory (cs.IT)
Comments: This paper builds on the work done in arXiv: 2402.02871v1 (IEEE ISIT24) and arXiv: 2001.07049 (IEEE ISIT20)

Abstract:A private information retrieval (PIR) scheme is a protocol that allows a user to retrieve a file from a database without revealing the identity of the desired file to a curious database. Given a distributed data storage system, efficient PIR can be achieved by making assumptions about the colluding capabilities of the storage servers holding the database. If these assumptions turn out to be incorrect, privacy is lost. In this work, we focus on the worst-case assumption: full collusion or, equivalently, viewing the storage system virtually as a single honest-but-curious server. We present CB-cPIR, a single-server code-based computational private information retrieval (cPIR) scheme that derives security from code-based cryptography. Specifically, the queries are protected by the hardness of decoding a random linear code. The scheme is heavily inspired by the pioneering code-based cPIR scheme proposed by Holzbaur, Hollanti, and Wachter-Zeh in [Holzbaur et al., “Computational Code-Based Single-Server Private Information Retrieval”, 2020 IEEE ISIT] and fixes the vulnerabilities of the original scheme arising from highly probable rank differences in submatrices of the user’s query. For further validation, we draw comparisons to the state-of-the-art lattice-based cPIR schemes.

[IR-5] Me the Good Stuff: User Preferences in Movie Recommendation Explanations

Link: https://arxiv.org/abs/2505.03376
Authors: Juan Ahmad, Jonas Hellgren, Alan Said
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Comments: Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization

Abstract:Recommender systems play a vital role in helping users discover content in streaming services, but their effectiveness depends on users understanding why items are recommended. In this study, explanations were based solely on item features rather than personalized data, simulating recommendation scenarios. We compared user perceptions of one-sided (purely positive) and two-sided (positive and negative) feature-based explanations for popular movie recommendations. Through an online study with 129 participants, we examined how explanation style affected perceived trust, transparency, effectiveness, and satisfaction. One-sided explanations consistently received higher ratings across all dimensions. Our findings suggest that in low-stakes entertainment domains such as popular movie recommendations, simpler positive explanations may be more effective. However, the results should be interpreted with caution due to potential confounding factors such as item familiarity and the placement of negative information in explanations. This work provides practical insights for explanation design in recommender interfaces and highlights the importance of context in shaping user preferences.

[IR-6] Soft Reasoning Paths for Knowledge Graph Completion

Link: https://arxiv.org/abs/2505.03285
Authors: Yanning Hou, Sihang Zhou, Ke Liang, Lingyuan Meng, Xiaoshu Chen, Ke Xu, Siwei Wang, Xinwang Liu, Jian Huang
Categories: Information Retrieval (cs.IR)

Abstract:Reasoning paths are reliable information in knowledge graph completion (KGC) in which algorithms can find strong clues of the actual relation between entities. However, in real-world applications, it is difficult to guarantee that computationally affordable paths exist toward all candidate entities. According to our observation, the prediction accuracy drops significantly when paths are absent. To make the proposed algorithm more stable against the missing path circumstances, we introduce soft reasoning paths. Concretely, a specific learnable latent path embedding is concatenated to each relation to help better model the characteristics of the corresponding paths. The combination of the relation and the corresponding learnable embedding is termed a soft path in our paper. By aligning the soft paths with the reasoning paths, a learnable embedding is guided to learn a generalized path representation of the corresponding relation. In addition, we introduce a hierarchical ranking strategy to make full use of information about the entity, relation, path, and soft path to help improve both the efficiency and accuracy of the model. Extensive experimental results illustrate that our algorithm outperforms the compared state-of-the-art algorithms by a notable margin. The code will be made publicly available after the paper is officially accepted.

[IR-7] Characterising Topic Familiarity and Query Specificity Using Eye-Tracking Data

Link: https://arxiv.org/abs/2505.03136
Authors: Jiaman He, Zikang Leng, Dana McKay, Johanne R. Trippas, Damiano Spina
Categories: Information Retrieval (cs.IR)

Abstract:Eye-tracking data has been shown to correlate with a user’s knowledge level and query formulation behaviour. While previous work has focused primarily on eye gaze fixations for attention analysis, often requiring additional contextual information, our study investigates the memory-related cognitive dimension by relying solely on pupil dilation and gaze velocity to infer users’ topic familiarity and query specificity without needing any contextual information. Using eye-tracking data collected via a lab user study (N=18), we achieved a Macro F1 score of 71.25% for predicting topic familiarity with a Gradient Boosting classifier, and a Macro F1 score of 60.54% with a k-nearest neighbours (KNN) classifier for query specificity. Furthermore, we developed a novel annotation guideline – specifically tailored for question answering – to manually classify queries as Specific or Non-specific. This study demonstrates the feasibility of eye-tracking to better understand topic familiarity and query specificity in search.
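Since both headline results above are reported as Macro F1, here is a short sketch of how that metric is computed: F1 is calculated per class and then averaged with equal weight, so a rare class counts as much as a common one. The label names and predictions below are invented for illustration and have nothing to do with the study's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four queries.
y_true = ["familiar", "familiar", "familiar", "unfamiliar"]
y_pred = ["familiar", "familiar", "unfamiliar", "unfamiliar"]
print(macro_f1(y_true, y_pred))  # per-class F1 of 0.8 and 2/3, averaged
```

With a small study (N=18) and possibly imbalanced classes, macro averaging is a reasonable choice because plain accuracy could be dominated by the majority class.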

[IR-8] Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models

Link: https://arxiv.org/abs/2505.03075
Authors: Zhengliang Shi, Lingyong Yan, Weiwei Sun, Yue Feng, Pengjie Ren, Xinyu Ma, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Zhaochun Ren
Categories: Information Retrieval (cs.IR)

Abstract:Retrieval-augmented generation (RAG) integrates large language models (LLMs) with retrievers to access external knowledge, improving the factuality of LLM generation in knowledge-grounded tasks. To optimize the RAG performance, most previous work independently fine-tunes the retriever to adapt to frozen LLMs or trains the LLMs to use documents retrieved by off-the-shelf retrievers, lacking end-to-end training supervision. Recent work addresses this limitation by jointly training these two components but relies on overly simplifying assumptions of document independence, which has been criticized for being far from real-world scenarios. Thus, effectively optimizing the overall RAG performance remains a critical challenge. We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components: (i) a generative knowledge selection model and (ii) an LLM generator. DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted maximization, progressively improving RAG components through a variational approach. In the estimation step, we treat document permutation as a latent variable and directly estimate its distribution from the selection model by applying an importance sampling strategy. In the maximization step, we calibrate the optimization expectation using importance weights and jointly train the selection model and LLM generator. Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning. Extensive experiments conducted on five datasets illustrate that DRO outperforms the best baseline with 5%-15% improvements in EM and F1. We also provide in-depth experiments to qualitatively analyze the stability, convergence, and variance of DRO.
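The importance-sampling idea behind the estimation and re-weighting steps can be shown with a generic self-normalized estimator: samples are drawn from a proposal distribution and re-weighted so that their average approximates an expectation under a different target distribution. This is a textbook sketch, not DRO's actual estimator. The three "permutations", their probabilities, and the scores are all hypothetical.

```python
import random


def snis_estimate(f, target, proposal, n_samples, seed=0):
    """Self-normalized importance sampling estimate of E_target[f]."""
    rng = random.Random(seed)
    items = list(proposal)
    draws = rng.choices(items, weights=[proposal[i] for i in items],
                        k=n_samples)
    # Importance weight: how much more likely the target makes each draw.
    weights = [target[z] / proposal[z] for z in draws]
    total = sum(weights)
    return sum(w * f(z) for w, z in zip(weights, draws)) / total


# Hypothetical document permutations with made-up probabilities and scores.
target = {"perm_a": 0.6, "perm_b": 0.3, "perm_c": 0.1}
proposal = {"perm_a": 1 / 3, "perm_b": 1 / 3, "perm_c": 1 / 3}
score = {"perm_a": 1.0, "perm_b": 0.5, "perm_c": 0.0}

estimate = snis_estimate(lambda z: score[z], target, proposal, 20000)
print(estimate)  # exact expectation under the target is 0.75
```

Normalizing by the sum of the weights means the target only needs to be known up to a constant, at the price of a small bias that shrinks as the number of samples grows.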

Attachment download

Click to download the full list of today's papers